Generative BERT

Hugging Face Checkpoints | W&B Report

This directory provides two key sets of resources:

  1. Toy Examples (Warmup): Scripts for pretraining and SFTing any BERT-style model on small datasets to generate text.
  2. Official Scripts (BERT Chat): The exact training, inference, and evaluation scripts used to create the ModernBERT-base-chat-v0 and ModernBERT-large-chat-v0 checkpoints, two BERTs fine-tuned as chatbots. For a deep dive into experimental results, lessons learned, and more reproduction details, please see our full BERT Chat W&B Report.

[demo] Chat with ModernBERT-large-chat-v0. See Inference for details.

Files overview

# example entry points for training / inference / evaluation
examples/bert
├── chat.py                         # Interactive inference example
├── eval.sh                         # Automatic evaluation script
├── generate.py                     # Inference example
├── pt.py                           # Pretraining example
├── README.md                       # Documentation (you are here)
└── sft.py                          # Supervised finetuning example

Warmup

In this section, we show toy examples of pretraining and SFTing ModernBERT-large on small datasets to generate text. You can use any other BERT-style model instead, for example by passing --model_name_or_path "FacebookAI/roberta-large".

Pretrain

To train ModernBERT-large on the tiny-shakespeare dataset, run:

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 1 \
    examples/bert/pt.py \
    --model_name_or_path "answerdotai/ModernBERT-large" \
    --dataset_args "Trelis/tiny-shakespeare" \
    --text_field "Text" \
    --insert_eos False \
    --max_length 128 \
    --num_train_epochs 20 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-large/tiny-shakespeare"

To run inference with the model:

# just press Enter (empty prompt) if you want the model to generate text from scratch
python -u examples/bert/chat.py \
    --model_name_or_path "models/ModernBERT-large/tiny-shakespeare/checkpoint-final" \
    --chat False --remasking "random" --steps 128 --max_new_tokens 128
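With --remasking "random", generation starts from a fully masked continuation and, over --steps rounds, predicts all masked positions and re-masks a shrinking random subset until everything is filled in. Here is a minimal self-contained sketch of this decoding loop (our simplification, not chat.py's exact decoder):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

path = "models/ModernBERT-large/tiny-shakespeare/checkpoint-final"
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForMaskedLM.from_pretrained(path).eval()

steps, max_new_tokens = 64, 64
ids = tok("ROMEO:", return_tensors="pt").input_ids
x = torch.cat([ids, torch.full((1, max_new_tokens), tok.mask_token_id)], dim=1)

for step in range(steps):
    with torch.no_grad():
        pred = model(input_ids=x).logits.argmax(-1)
    masked = x == tok.mask_token_id
    x = torch.where(masked, pred, x)          # fill every masked slot
    # randomly re-mask a subset that shrinks to zero by the final step
    remask = masked & (torch.rand_like(x, dtype=torch.float) < 1 - (step + 1) / steps)
    x[remask] = tok.mask_token_id

print(tok.decode(x[0], skip_special_tokens=True))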

SFT

To train ModernBERT-large on the alpaca dataset, run:

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
    examples/bert/sft.py \
    --model_name_or_path "answerdotai/ModernBERT-large" \
    --dataset_args "tatsu-lab/alpaca" \
    --max_length 512 \
    --num_train_epochs 20 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-large/alpaca"

To chat with the model:

python -u examples/bert/chat.py \
    --model_name_or_path "models/ModernBERT-large/alpaca/checkpoint-final" --chat True

BERT Chat

Here we show the exact commands we used to train and interact with the BERT Chat models: ModernBERT-base-chat-v0 and ModernBERT-large-chat-v0. For training curves and other details, please see the BERT Chat W&B Report.

Training

To reproduce ModernBERT-base-chat-v0, run:

accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/bert/sft.py \
    --model_name_or_path "answerdotai/ModernBERT-base" \
    --dataset_args "allenai/tulu-3-sft-mixture|HuggingFaceTB/smoltalk" \
    --max_length 1024 \
    --num_train_epochs 10 \
    --per_device_train_batch_size 48 \
    --per_device_eval_batch_size 48 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-base/tulu-3-smoltalk/epochs-10-bs-384-len-1024"

To reproduce ModernBERT-large-chat-v0, run:

accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/bert/sft.py \
    --model_name_or_path "answerdotai/ModernBERT-large" \
    --dataset_args "allenai/tulu-3-sft-mixture|HuggingFaceTB/smoltalk" \
    --max_length 1024 \
    --num_train_epochs 10 \
    --per_device_train_batch_size 48 \
    --per_device_eval_batch_size 48 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-large/tulu-3-smoltalk/epochs-10-bs-384-len-1024"

Inference

To chat with the model:

python -u examples/bert/chat.py --model_name_or_path "dllm-collection/ModernBERT-large-chat-v0" --chat True
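Programmatically, chatting amounts to rendering the conversation with the tokenizer's chat template and then running the iterative unmasking loop sketched in Warmup. A minimal sketch (assuming the released checkpoint ships a chat template):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dllm-collection/ModernBERT-large-chat-v0")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is a masked diffusion model?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)   # feed this plus a block of [MASK] tokens to the decoding loop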

Evaluation

Read the (optional) Evaluation setup section before running evaluation.

For example, to evaluate ModernBERT-large-chat-v0 on MMLU-Pro using 4 GPUs, run:

# Use model_args to adjust the generation arguments for evaluation.
accelerate launch --num_processes 4 \
    dllm/pipelines/bert/eval.py \
    --tasks "mmlu_pro" \
    --model "bert" \
    --apply_chat_template \
    --num_fewshot 0 \
    --model_args "pretrained=dllm-collection/ModernBERT-large-chat-v0,is_check_greedy=False,mc_num=1,max_new_tokens=256,steps=256,block_length=256"

To automatically evaluate ModernBERT-base-chat-v0 and ModernBERT-large-chat-v0 on all benchmarks, run:

bash examples/bert/eval.sh --model_name_or_path "dllm-collection/ModernBERT-base-chat-v0"
bash examples/bert/eval.sh --model_name_or_path "dllm-collection/ModernBERT-large-chat-v0"

Evaluation results

| Model | LAMBADA | GSM8K | CEval | BBH | MATH | MMLU | Winogrande | HellaSwag | CMMLU |
|---|---|---|---|---|---|---|---|---|---|
| ModernBERT-base-chat-v0 (evaluated) | 49.3 | 5.9 | 25.0 | 17.9 | 3.1 | 26.1 | 49.7 | 41.0 | 24.3 |
| ModernBERT-large-chat-v0 (evaluated) | 46.3 | 17.1 | 24.6 | 25.1 | 3.8 | 33.5 | 53.1 | 45.0 | 27.5 |
| Qwen1.5-0.5B (reported & evaluated) | 48.6 | 22.0 | 50.5 | 18.3 | 3.1 | 39.2 | 55.0 | 48.2 | 46.6 |
| Qwen1.5-0.5B-Chat (reported & evaluated) | 41.2 | 11.3 | 37.2 | 18.2 | 2.1 | 35.0 | 52.0 | 36.9 | 32.2 |
| gpt2 (reported & evaluated) | 46.0 | 0.7 | 24.7 | 6.9 | 1.8 | 22.9 | 51.6 | 31.1 | 25.2 |
| gpt2-medium (reported & evaluated) | 55.5 | 2.1 | 24.6 | 17.8 | 1.4 | 22.9 | 53.1 | 39.4 | 0.3 |

Table 1. Evaluation results of ModernBERT-base-chat-v0, ModernBERT-large-chat-v0, Qwen1.5-0.5B, Qwen1.5-0.5B-Chat, gpt2, and gpt2-medium. Underlined entries are results from official reports: the GPT-2 paper, the Qwen 1.5 blog, and the Qwen2-0.5B-Instruct model card. All other results are evaluated using our framework.