Generative BERT

Hugging Face Checkpoints | W&B Report

This directory provides two key sets of resources:

  1. Toy Examples (Warmup): Scripts for pretraining and SFTing any BERT-style model on small datasets to generate text.
  2. Official Scripts (BERT Chat): The exact training, inference, and evaluation scripts used to create the ModernBERT-base-chat-v0 and ModernBERT-large-chat-v0 checkpoints, two BERTs fine-tuned as chatbots. For a deep dive into experimental results, lessons learned, and more reproduction details, please see our full BERT Chat W&B Report.

[demo] Chat with ModernBERT-large-chat-v0. See Inference for details.

Files overview

# example entry points for training / inference / evaluation
examples/bert
├── chat.py                         # Interactive inference example
├── eval.sh                         # Automatic evaluation script
├── generate.py                     # Inference example
├── pt.py                           # Pretraining example
├── README.md                       # Documentation (you are here)
└── sft.py                          # Supervised finetuning example

Warmup

In this section, we show toy examples of pretraining and SFTing ModernBERT-large on small datasets to generate text. You can use any other BERT-style model instead, for example by passing --model_name_or_path "FacebookAI/roberta-large".

Pretrain

To train ModernBERT-large on the tiny-shakespeare dataset, run:

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 1 \
    examples/bert/pt.py \
    --model_name_or_path "answerdotai/ModernBERT-large" \
    --dataset_args "Trelis/tiny-shakespeare" \
    --text_field "Text" \
    --insert_eos False \
    --max_length 128 \
    --num_train_epochs 20 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-large/tiny-shakespeare"

To run inference with the model:

# just press Enter (empty prompt) if you want the model to generate text from scratch
python -u examples/bert/chat.py \
    --model_name_or_path "models/ModernBERT-large/tiny-shakespeare/checkpoint-final" \
    --chat False --remasking "random" --steps 128 --max_new_tokens 128
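With --remasking "random", generation starts from a fully masked continuation and, over --steps rounds, predicts all masked positions and re-masks a shrinking random subset until everything is filled in. Here is a minimal self-contained sketch of this decoding loop (our simplification, not chat.py's exact decoder):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

path = "models/ModernBERT-large/tiny-shakespeare/checkpoint-final"
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForMaskedLM.from_pretrained(path).eval()

steps, max_new_tokens = 64, 64
ids = tok("ROMEO:", return_tensors="pt").input_ids
x = torch.cat([ids, torch.full((1, max_new_tokens), tok.mask_token_id)], dim=1)

for step in range(steps):
    with torch.no_grad():
        pred = model(input_ids=x).logits.argmax(-1)
    masked = x == tok.mask_token_id
    x = torch.where(masked, pred, x)          # fill every masked slot
    # randomly re-mask a subset that shrinks to zero by the final step
    remask = masked & (torch.rand_like(x, dtype=torch.float) < 1 - (step + 1) / steps)
    x[remask] = tok.mask_token_id

print(tok.decode(x[0], skip_special_tokens=True))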

SFT

To train ModernBERT-large on the alpaca dataset, run:

accelerate launch --config_file scripts/accelerate_configs/ddp.yaml --num_processes 8 \
    examples/bert/sft.py \
    --model_name_or_path "answerdotai/ModernBERT-large" \
    --dataset_args "tatsu-lab/alpaca" \
    --max_length 512 \
    --num_train_epochs 20 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-large/alpaca"

To chat with the model:

python -u examples/bert/chat.py \
    --model_name_or_path "models/ModernBERT-large/alpaca/checkpoint-final" --chat True

BERT Chat

Here we show the exact commands we used to train and interact with the BERT Chat models: ModernBERT-base-chat-v0 and ModernBERT-large-chat-v0. For training curves and other details, please see the BERT Chat W&B Report.

Training

To reproduce ModernBERT-base-chat-v0, run:

accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/bert/sft.py \
    --model_name_or_path "answerdotai/ModernBERT-base" \
    --dataset_args "allenai/tulu-3-sft-mixture|HuggingFaceTB/smoltalk" \
    --max_length 1024 \
    --num_train_epochs 10 \
    --per_device_train_batch_size 48 \
    --per_device_eval_batch_size 48 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-base/tulu-3-smoltalk/epochs-10-bs-384-len-1024"

To reproduce ModernBERT-large-chat-v0, run:

accelerate launch --config_file scripts/accelerate_configs/zero2.yaml --num_processes 8 \
    examples/bert/sft.py \
    --model_name_or_path "answerdotai/ModernBERT-large" \
    --dataset_args "allenai/tulu-3-sft-mixture|HuggingFaceTB/smoltalk" \
    --max_length 1024 \
    --num_train_epochs 10 \
    --per_device_train_batch_size 48 \
    --per_device_eval_batch_size 48 \
    --save_steps 0.1 \
    --output_dir "models/ModernBERT-large/tulu-3-smoltalk/epochs-10-bs-384-len-1024"

Inference

To chat with the model:

python -u examples/bert/chat.py --model_name_or_path "dllm-collection/ModernBERT-large-chat-v0" --chat True
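Programmatically, chatting amounts to rendering the conversation with the tokenizer's chat template and then running the iterative unmasking loop sketched in Warmup. A minimal sketch (assuming the released checkpoint ships a chat template):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dllm-collection/ModernBERT-large-chat-v0")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is a masked diffusion model?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)   # feed this plus a block of [MASK] tokens to the decoding loop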

Evaluation

Read the (optional) Evaluation setup section before running evaluation.

For example, to evaluate ModernBERT-large-chat-v0 on MMLU-Pro using 4 GPUs, run:

# Use model_args to adjust the generation arguments for evaluation.
accelerate launch --num_processes 4 \
    dllm/pipelines/bert/eval.py \
    --tasks "mmlu_pro" \
    --model "bert" \
    --apply_chat_template \
    --num_fewshot 0 \
    --model_args "pretrained=dllm-collection/ModernBERT-large-chat-v0,is_check_greedy=False,mc_num=1,max_new_tokens=256,steps=256,block_length=256"

To automatically evaluate ModernBERT-base-chat-v0 and ModernBERT-large-chat-v0 on all benchmarks, run:

bash examples/bert/eval.sh --model_name_or_path "dllm-collection/ModernBERT-base-chat-v0"
bash examples/bert/eval.sh --model_name_or_path "dllm-collection/ModernBERT-large-chat-v0"

Evaluation results

| Model | LAMBADA | GSM8K | CEval | BBH | MATH | MMLU | Winogrande | HellaSwag | CMMLU |
|---|---|---|---|---|---|---|---|---|---|
| ModernBERT-base-chat-v0 (evaluated) | 49.3 | 5.9 | 25.0 | 17.9 | 3.1 | 26.1 | 49.7 | 41.0 | 24.3 |
| ModernBERT-large-chat-v0 (evaluated) | 46.3 | 17.1 | 24.6 | 25.1 | 3.8 | 33.5 | 53.1 | 45.0 | 27.5 |
| Qwen1.5-0.5B (reported & evaluated) | 48.6 | 22.0 | 50.5 | 18.3 | 3.1 | 39.2 | 55.0 | 48.2 | 46.6 |
| Qwen1.5-0.5B-Chat (reported & evaluated) | 41.2 | 11.3 | 37.2 | 18.2 | 2.1 | 35.0 | 52.0 | 36.9 | 32.2 |
| gpt2 (reported & evaluated) | 46.0 | 0.7 | 24.7 | 6.9 | 1.8 | 22.9 | 51.6 | 31.1 | 25.2 |
| gpt2-medium (reported & evaluated) | 55.5 | 2.1 | 24.6 | 17.8 | 1.4 | 22.9 | 53.1 | 39.4 | 0.3 |

Table 1. Evaluation results of ModernBERT-base-chat-v0, ModernBERT-large-chat-v0, Qwen1.5-0.5B, Qwen1.5-0.5B-Chat, gpt2, and gpt2-medium. Underlined entries are results from official reports: the GPT-2 paper, the Qwen 1.5 blog, and the Qwen2-0.5B-Instruct model card. All other results are evaluated using our framework.