SpanBERTa: How We Trained RoBERTa Language Model for Spanish from Scratch

Originally published by Skim AI’s Machine Learning Research Intern, Chris Tran.



Self-training methods with transformer models have achieved state-of-the-art performance on most NLP tasks. However, because training them is computationally expensive, most currently available pretrained transformer models are only for English. Therefore, to improve performance in NLP tasks in our projects on Spanish, my team at Skim AI decided to train a RoBERTa language model for Spanish from scratch and call it SpanBERTa.

SpanBERTa has the same size as RoBERTa-base. We followed RoBERTa's training schema to train the model on 18 GB of OSCAR's Spanish corpus in 8 days using 4 Tesla P100 GPUs.

In this blog post, we will walk through an end-to-end process to train a BERT-like language model from scratch using the transformers and tokenizers libraries by Hugging Face. There is also a Google Colab notebook to run the code in this article directly. You can also modify the notebook to train a BERT-like model for other languages or fine-tune it on your own dataset.

Before moving on, I want to say a huge thanks to the Hugging Face team for making state-of-the-art NLP models accessible to everyone.


1. Install Dependencies

In [0]:
!pip uninstall -y tensorflow
!pip install transformers==2.8.0

2. Data

We pretrained SpanBERTa on OSCAR's Spanish corpus. The full size of the dataset is 150 GB and we used a portion of 18 GB to train.

In this example, for simplicity, we will use a dataset of Spanish movie subtitles from OpenSubtitles. This dataset has a size of 5.4 GB and we will train on a subset of ~300 MB.

In [0]:
import os

# Download and unzip movie subtitle dataset
if not os.path.exists('data/dataset.txt'):
  !wget "" -O dataset.txt.gz
  !gzip -d dataset.txt.gz
  !mkdir data
  !mv dataset.txt data
--2020-04-06 15:53:04--
Resolving (,
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1859673728 (1.7G) [application/gzip]
Saving to: ‘dataset.txt.gz’

dataset.txt.gz      100%[===================>]   1.73G  17.0MB/s    in 1m 46s  

2020-04-06 15:54:51 (16.8 MB/s) - ‘dataset.txt.gz’ saved [1859673728/1859673728]

In [0]:
# Total number of lines and some random lines
!wc -l data/dataset.txt
!shuf -n 5 data/dataset.txt
179287150 data/dataset.txt
Sabes, pensé que tenías más pelotas que para enfrentarme a través de mi hermano.
Supe todos los encantamientos en todas las lenguas de los Elfos hombres y Orcos.
Anteriormente en Blue Bloods:
Y quiero que prometas que no habrá ningún trato con Daniel Stafford.
Fue comiquísimo.
In [0]:
# Get a subset of first 1,000,000 lines for training
TRAIN_SIZE = 1000000 #@param {type:"integer"}
!(head -n $TRAIN_SIZE data/dataset.txt) > data/train.txt
In [0]:
# Get a subset of next 10,000 lines for validation
VAL_SIZE = 10000 #@param {type:"integer"}
!(sed -n {TRAIN_SIZE + 1},{TRAIN_SIZE + VAL_SIZE}p data/dataset.txt) > data/dev.txt
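For readers working outside a notebook, the same split can be sketched in plain Python. The function name `split_head` and the tiny demo file are hypothetical; only the idea (the first N lines for training, the next M lines for validation) comes from the cells above.

```python
import itertools
import os
import tempfile


def split_head(src_path, train_path, dev_path, train_size, val_size):
    """Write the first `train_size` lines to train_path and the next
    `val_size` lines to dev_path, streaming so the file is never fully loaded."""
    with open(src_path, encoding="utf-8") as src:
        with open(train_path, "w", encoding="utf-8") as train:
            train.writelines(itertools.islice(src, train_size))
        with open(dev_path, "w", encoding="utf-8") as dev:
            dev.writelines(itertools.islice(src, val_size))


# Tiny demonstration on a throwaway file (not the real dataset).
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "dataset.txt")
with open(src_file, "w", encoding="utf-8") as f:
    f.writelines(f"line {i}\n" for i in range(10))

train_file = os.path.join(workdir, "train.txt")
dev_file = os.path.join(workdir, "dev.txt")
split_head(src_file, train_file, dev_file, train_size=6, val_size=3)

with open(train_file, encoding="utf-8") as f:
    train_lines = f.read().splitlines()
with open(dev_file, encoding="utf-8") as f:
    dev_lines = f.read().splitlines()
print(len(train_lines), len(dev_lines))
```

Because the validation lines are read from the same open file handle, they start exactly where the training lines end, which mirrors the `head` and `sed` commands.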

3. Train a Tokenizer

The original BERT implementation uses a WordPiece tokenizer with a vocabulary of about 30K subword units. This method, however, can introduce "unknown" tokens when processing rare words.

In this implementation, we use a byte-level BPE tokenizer with a vocabulary of 50,265 subword units (same as RoBERTa-base). Using byte-level BPE makes it possible to learn a subword vocabulary of modest size that can encode any input without getting "unknown" tokens.
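To see why a byte-level vocabulary never needs an "unknown" token, here is a minimal sketch in plain Python. It does not use the tokenizers library and does not learn any merges; it only shows that every string falls back to a sequence of 256 possible byte values. The toy vocabulary and both function names are hypothetical.

```python
def byte_level_symbols(text):
    """Return the UTF-8 bytes of `text` as integers in range(256).

    A byte-level BPE starts from these 256 base symbols and learns merges
    on top of them, so every string has at least this fallback encoding.
    """
    return list(text.encode("utf-8"))


def word_level_ids(text, vocab, unk="<unk>"):
    """Toy word-level lookup that falls back to `unk` for unseen words."""
    return [vocab.get(word, vocab[unk]) for word in text.split()]


toy_vocab = {"<unk>": 0, "agua": 1, "y": 2}  # hypothetical tiny vocabulary

sentence = "agua y jabón"
word_ids = word_level_ids(sentence, toy_vocab)
byte_ids = byte_level_symbols("jabón")

print(word_ids)   # [1, 2, 0]: "jabón" is unseen, so it becomes <unk>
print(byte_ids)   # [106, 97, 98, 195, 179, 110]: "ó" takes two bytes
```

The accented "ó" is simply two bytes, so the word can always be represented, even before any merges are learned.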

Note that ByteLevelBPETokenizer produces 2 files ["vocab.json", "merges.txt"] while BertWordPieceTokenizer produces only 1 file, vocab.txt. Loading the output of a BPE tokenizer with BertWordPieceTokenizer will therefore cause an error.

In [0]:
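%%time
from tokenizers import ByteLevelBPETokenizer

path = "data/train.txt"

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training: learn a 50,265-token vocabulary (same as RoBERTa-base)
tokenizer.train(files=path,
                vocab_size=50265,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# Save vocab.json and merges.txt to disk
# (newer tokenizers releases use tokenizer.save_model(directory) instead)
!mkdir -p "models/roberta"
tokenizer.save("models/roberta")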
CPU times: user 1min 37s, sys: 1.02 s, total: 1min 38s
Wall time: 1min 38s

Super fast! It takes only 2 minutes to train on 10 million lines.

Training a Language Model from Scratch

1. Model Architecture

RoBERTa has exactly the same architecture as BERT. The only differences are:

  • RoBERTa uses a byte-level BPE tokenizer with a larger subword vocabulary (50K vs 30K).
  • RoBERTa implements dynamic word masking and drops the next sentence prediction task.
  • RoBERTa uses different training hyperparameters.

Other architecture configurations can be found in the documentation (RoBERTa, BERT).
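To make the dynamic masking point concrete, here is a minimal sketch in plain Python: the mask is re-sampled every time a sequence is served instead of being fixed once during preprocessing. The 15% selection rate and the 80/10/10 replacement split are the standard BERT recipe and are assumed here. The function name, the toy token IDs, and `mask_id=4` are hypothetical.

```python
import random


def dynamic_mask(token_ids, mask_id, vocab_size, rng, mlm_prob=0.15):
    """Return (inputs, labels) with a freshly sampled mask.

    Labels are -100 at positions that are not selected, a value commonly
    used as the ignore index of the cross-entropy loss.
    """
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_prob:
            continue                               # position not selected
        labels[i] = tok                            # model must predict the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


rng = random.Random(0)
tokens = list(range(100, 140))                     # hypothetical token IDs
epoch_1 = dynamic_mask(tokens, mask_id=4, vocab_size=50265, rng=rng)
epoch_2 = dynamic_mask(tokens, mask_id=4, vocab_size=50265, rng=rng)
print(epoch_1[1])
print(epoch_2[1])
```

Calling the function twice on the same sequence generally selects different positions, which is the whole point: over many epochs the model sees many different masked versions of each sentence.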

In [0]:
import json
config = {
	"architectures": [
		"RobertaForMaskedLM"
	],
	"attention_probs_dropout_prob": 0.1,
	"hidden_act": "gelu",
	"hidden_dropout_prob": 0.1,
	"hidden_size": 768,
	"initializer_range": 0.02,
	"intermediate_size": 3072,
	"layer_norm_eps": 1e-05,
	"max_position_embeddings": 514,
	"model_type": "roberta",
	"num_attention_heads": 12,
	"num_hidden_layers": 12,
	"type_vocab_size": 1,
	"vocab_size": 50265
}

with open("models/roberta/config.json", 'w') as fp:
    json.dump(config, fp)

tokenizer_config = {"max_len": 512}

with open("models/roberta/tokenizer_config.json", 'w') as fp:
    json.dump(tokenizer_config, fp)
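As a sanity check on the configuration, the number of weights it implies can be worked out by hand. The sketch below counts the embeddings and the 12 transformer layers only (no pooler and no language modeling head), assuming the standard BERT-style layer layout; the function name is hypothetical.

```python
def count_encoder_params(cfg):
    """Count weights in the embeddings and transformer layers (no pooler, no LM head)."""
    h, ff = cfg["hidden_size"], cfg["intermediate_size"]
    layer_norm = 2 * h                                  # scale + bias
    embeddings = (
        cfg["vocab_size"] * h                           # token embeddings
        + cfg["max_position_embeddings"] * h            # position embeddings
        + cfg["type_vocab_size"] * h                    # segment embeddings
        + layer_norm
    )
    attention = 4 * (h * h + h) + layer_norm            # Q, K, V, output projections
    feed_forward = (h * ff + ff) + (ff * h + h) + layer_norm
    return embeddings + cfg["num_hidden_layers"] * (attention + feed_forward)


cfg = {
    "hidden_size": 768, "intermediate_size": 3072, "num_hidden_layers": 12,
    "max_position_embeddings": 514, "type_vocab_size": 1, "vocab_size": 50265,
}
total = count_encoder_params(cfg)
print(f"{total:,} parameters")   # 124,055,040 parameters
```

Roughly 31% of these weights sit in the token embedding matrix (50,265 x 768), so the vocabulary size has a visible effect on model size.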

2. Training Hyperparameters

Hyperparameter        BERT-base    RoBERTa-base
Sequence Length       128, 512     512
Batch Size            256          8K
Peak Learning Rate    1e-4         6e-4
Max Steps             1M           500K
Warmup Steps          10K          24K
Weight Decay          0.01         0.01
Adam $\epsilon$       1e-6         1e-6
Adam $\beta_1$        0.9          0.9
Adam $\beta_2$        0.999        0.98
Gradient Clipping     0.0          0.0

Note that the batch size when training RoBERTa is 8K sequences. Therefore, although RoBERTa-base was trained for only 500K steps, half as many as BERT-base, its batches are 32 times larger, so its training computational cost is about 16 times that of BERT-base. The RoBERTa paper shows that training with large batches improves perplexity for the masked language modeling objective, as well as end-task accuracy. A larger effective batch size can be obtained by increasing gradient_accumulation_steps.
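The factor of 16 is simple arithmetic on the table: batch size times number of steps. The sketch below treats "8K" as either 8,000 or 8,192 sequences, since the exact value is an assumption on our side, and it ignores differences in sequence length, so it is only a rough proxy for compute.

```python
def sequences_seen(batch_size, steps):
    """Total number of training sequences processed."""
    return batch_size * steps


bert = sequences_seen(batch_size=256, steps=1_000_000)
roberta_8000 = sequences_seen(batch_size=8_000, steps=500_000)
roberta_8192 = sequences_seen(batch_size=8_192, steps=500_000)

print(roberta_8000 / bert)   # 15.625
print(roberta_8192 / bert)   # 16.0
```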

Due to computational constraints, we followed BERT-base's training schema and trained our SpanBERTa model on 4 Tesla P100 GPUs for 200K steps in 8 days.

3. Start Training

We will train our model from scratch using a script provided by Hugging Face, which preprocesses and tokenizes the corpus and trains the model on the masked language modeling task. The script is optimized to train on a single big corpus. Therefore, if your dataset is large and you want to split it to train sequentially, you will need to modify the script or be ready to get a monster machine with a lot of memory.

In [0]:
# Update April 22, 2020: Hugging Face updated the script.
# Please use the version from before the update.
!wget -c
--2020-04-24 02:28:21--
Resolving (,,, ...
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34328 (34K) [text/plain]
Saving to: ‘’

run_language_modeli 100%[===================>]  33.52K  --.-KB/s    in 0.003s  

2020-04-24 02:28:21 (10.1 MB/s) - ‘’ saved [34328/34328]

Important Arguments

  • --line_by_line Whether distinct lines of text in the dataset are to be handled as distinct sequences. If each line in your dataset is long and has ~512 tokens or more, you should use this setting. If each line is short, the default text preprocessing will concatenate all lines, tokenize them, and split the tokenized output into blocks of 512 tokens. You can also split your dataset into small chunks and preprocess them separately. Processing 3 GB of text takes ~50 minutes with the default TextDataset class.
  • --should_continue Whether to continue from latest checkpoint in output_dir.
  • --model_name_or_path The model checkpoint for weights initialization. Leave None if you want to train a model from scratch.
  • --mlm Train with masked-language modeling loss instead of language modeling.
  • --config_name, --tokenizer_name Optional pretrained config and tokenizer name or path if not the same as model_name_or_path. If both are None, initialize a new config.
  • --per_gpu_train_batch_size Batch size per GPU/CPU for training. Choose the largest number you can fit on your GPUs. You will see an error if your batch size is too large.
  • --gradient_accumulation_steps Number of update steps to accumulate before performing a backward/update pass. You can use this trick to increase the effective batch size. For example, if per_gpu_train_batch_size = 16 and gradient_accumulation_steps = 4, your total train batch size on a single GPU will be 64.
  • --overwrite_output_dir Overwrite the content of the output directory.
  • --no_cuda, --fp16, --fp16_opt_level Arguments for training on GPU/CPU.
  • Other arguments are model paths and training hyperparameters.
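The gradient accumulation argument in the list above can be illustrated without any deep learning library. The sketch below first reproduces the batch-size arithmetic, then checks on a toy one-parameter regression that averaging the gradients of 4 equally sized micro-batches gives the same gradient as one large batch. All names and the toy data are hypothetical.

```python
import math


def effective_batch_size(per_gpu_batch_size, gradient_accumulation_steps, n_gpu=1):
    """Number of sequences that contribute to one optimizer update."""
    return per_gpu_batch_size * gradient_accumulation_steps * n_gpu


def mean_gradient(batch, w):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


print(effective_batch_size(16, 4))   # 64, the example from the list above
print(effective_batch_size(4, 4))    # 16, the settings used later in this post

data = [(float(x), 3.0 * x) for x in range(1, 9)]    # 8 toy examples
w = 0.5
full_batch_grad = mean_gradient(data, w)

accum_steps = 4
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]
accumulated_grad = sum(mean_gradient(mb, w) / accum_steps for mb in micro_batches)

print(math.isclose(full_batch_grad, accumulated_grad))
```

The trade-off is time: each optimizer update now needs several forward and backward passes, but memory use stays at the level of one small micro-batch.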

It's highly recommended to include the model type (e.g., "roberta", "bert", "gpt2") in the model path because the script uses the Auto model classes to guess the model's configuration by pattern matching on the provided path.

In [0]:
# Model paths
MODEL_TYPE = "roberta" #@param ["roberta", "bert"]
MODEL_DIR = "models/roberta" #@param {type: "string"}
OUTPUT_DIR = "models/roberta/output" #@param {type: "string"}
TRAIN_PATH = "data/train.txt" #@param {type: "string"}
EVAL_PATH = "data/dev.txt" #@param {type: "string"}

For this example, we will train for only 25 steps on a Tesla P4 GPU provided by Colab.

In [0]:
!nvidia-smi
Mon Apr  6 15:59:35 2020       
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No running processes found                                                 |
In [0]:
# Command line
# Script name inferred from the truncated download log above ("run_language_modeli...");
# adjust it if you saved the script under a different name.
cmd = """python run_language_modeling.py \
    --output_dir {output_dir} \
    --model_type {model_type} \
    --mlm \
    --config_name {config_name} \
    --tokenizer_name {tokenizer_name} \
    {line_by_line} \
    {should_continue} \
    {model_name_or_path} \
    --train_data_file {train_path} \
    --eval_data_file {eval_path} \
    --do_train \
    {do_eval} \
    {evaluate_during_training} \
    --overwrite_output_dir \
    --block_size 512 \
    --max_steps 25 \
    --warmup_steps 10 \
    --learning_rate 5e-5 \
    --per_gpu_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --max_grad_norm 100.0 \
    --save_total_limit 10 \
    --save_steps 10 \
    --logging_steps 2 \
    --seed 42
"""
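The placeholders in braces are meant to be filled from a dictionary with `str.format`, and an optional flag is switched off by passing an empty string. The sketch below shows the mechanism on a shortened template. The script name `train.py` is hypothetical, and `shlex.split` is used only to show which arguments the shell would finally see.

```python
import shlex

# Shortened, hypothetical template in the same style as the command above.
# The trailing backslashes join the lines, so the result is a single line.
template = """python train.py \
    --output_dir {output_dir} \
    {line_by_line} \
    {do_eval} \
    --seed 42
"""

params = {
    "output_dir": "models/roberta/output",
    "line_by_line": "",          # empty string: the flag is simply left out
    "do_eval": "--do_eval",      # non-empty string: the flag is passed through
}

command = template.format(**params)
args = shlex.split(command)
print(args)
```

An empty placeholder only leaves extra whitespace behind, which the shell ignores, so the same template covers training from scratch, resuming, and evaluation.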
In [0]:
# Arguments for training from scratch. I turn off evaluate_during_training,
#   line_by_line, should_continue, and model_name_or_path.
train_params = {
    "output_dir": OUTPUT_DIR,
    "model_type": MODEL_TYPE,
    "config_name": MODEL_DIR,
    "tokenizer_name": MODEL_DIR,
    "train_path": TRAIN_PATH,
    "eval_path": EVAL_PATH,
    "do_eval": "--do_eval",
    "evaluate_during_training": "",
    "line_by_line": "",
    "should_continue": "",
    "model_name_or_path": "",
}

If you are training on a virtual machine, you can install TensorBoard to monitor the training process. Here is our TensorBoard for training SpanBERTa.

pip install tensorboard==2.1.0
tensorboard dev upload --logdir runs

After 200k steps, the loss reached 1.8 and the perplexity reached 5.2.
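Loss and perplexity are linked by an exponential: perplexity is the exponential of the mean cross-entropy loss. A quick check, sketched below, shows that a loss of 1.8 corresponds to a perplexity of about 6.05, while a perplexity of 5.2 corresponds to a loss of about 1.65. Our reading, which is an assumption, is that the two quoted numbers come from different measurements, for instance training loss versus held-out evaluation loss.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(ppl):
    """Inverse mapping: the mean cross-entropy that yields a given perplexity."""
    return math.log(ppl)


print(round(perplexity(1.8), 2))              # 6.05
print(round(loss_from_perplexity(5.2), 2))    # 1.65
```

For scale, a model that guesses uniformly over a 50,265-token vocabulary would have a perplexity of 50,265.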

Now let's start training!

In [ ]:
!{cmd.format(**train_params)}
    04/06/2020 15:59:55 - INFO - __main__ -   Creating features from dataset file at data
    04/06/2020 16:04:43 - INFO - __main__ -   Saving features into cached file data/roberta_cached_lm_510_train.txt
    04/06/2020 16:04:46 - INFO - __main__ -   ***** Running training *****
    04/06/2020 16:04:46 - INFO - __main__ -     Num examples = 165994
    04/06/2020 16:04:46 - INFO - __main__ -     Num Epochs = 1
    04/06/2020 16:04:46 - INFO - __main__ -     Instantaneous batch size per GPU = 4
    04/06/2020 16:04:46 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 16
    04/06/2020 16:04:46 - INFO - __main__ -     Gradient Accumulation steps = 4
    04/06/2020 16:04:46 - INFO - __main__ -     Total optimization steps = 25
    Epoch:   0% 0/1 [00:00<?, ?it/s]
    Iteration:   0% 0/41499 [00:00<?, ?it/s]
    Iteration:   0% 1/41499 [00:01<13:18:02,  1.15s/it]
    Iteration:   0% 2/41499 [00:01<11:26:47,  1.01it/s]
    Iteration:   0% 3/41499 [00:02<10:10:30,  1.13it/s]
    Iteration:   0% 4/41499 [00:03<9:38:10,  1.20it/s] 
    Iteration:   0% 5/41499 [00:03<8:52:44,  1.30it/s]
    Iteration:   0% 6/41499 [00:04<8:22:47,  1.38it/s]
    Iteration:   0% 7/41499 [00:04<8:00:55,  1.44it/s]
    Iteration:   0% 8/41499 [00:05<8:03:40,  1.43it/s]
    Iteration:   0% 9/41499 [00:06<7:46:57,  1.48it/s]
    Iteration:   0% 10/41499 [00:06<7:35:35,  1.52it/s]
    Iteration:   0% 11/41499 [00:07<7:28:29,  1.54it/s]
    Iteration:   0% 12/41499 [00:08<7:41:41,  1.50it/s]
    Iteration:   0% 13/41499 [00:08<7:34:28,  1.52it/s]
    Iteration:   0% 14/41499 [00:09<7:28:46,  1.54it/s]
    Iteration:   0% 15/41499 [00:10<7:23:29,  1.56it/s]
    Iteration:   0% 16/41499 [00:10<7:38:06,  1.51it/s]
    Iteration:   0% 17/41499 [00:11<7:29:13,  1.54it/s]
    Iteration:   0% 18/41499 [00:12<7:24:04,  1.56it/s]
    Iteration:   0% 19/41499 [00:12<7:21:59,  1.56it/s]
    Iteration:   0% 20/41499 [00:13<7:38:06,  1.51it/s]
    04/06/2020 16:06:23 - INFO - __main__ -   ***** Running evaluation  *****
    04/06/2020 16:06:23 - INFO - __main__ -     Num examples = 156
    04/06/2020 16:06:23 - INFO - __main__ -     Batch size = 4
    Evaluating: 100% 39/39 [00:08<00:00,  4.41it/s]
    04/06/2020 16:06:32 - INFO - __main__ -   ***** Eval results  *****
    04/06/2020 16:06:32 - INFO - __main__ -     perplexity = tensor(6077.6812)
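The numbers in this log can be related to the preprocessing described earlier. The sketch below assumes that the default preprocessing cuts the tokenized corpus into full blocks and drops the short tail, and that the 510 in the cache file name is the block size of 512 minus two special tokens; both points are our reading of the log, not something stated in it. The function names are hypothetical.

```python
import math


def chunk_into_blocks(token_ids, block_size):
    """Split a long token stream into full blocks, dropping the short tail."""
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)]


def num_batches(num_examples, batch_size):
    """Number of iterations needed to see every example once."""
    return math.ceil(num_examples / batch_size)


stream = list(range(1100))                 # stand-in for a tokenized corpus
blocks = chunk_into_blocks(stream, block_size=510)
print(len(blocks), len(blocks[0]))         # 2 510

print(num_batches(165994, 4))              # 41499 training iterations per epoch
print(num_batches(156, 4))                 # 39 evaluation batches
print(25 * 4)                              # 100 iterations for 25 accumulated steps
```

With 4 accumulation steps, the 25 optimization steps requested here use only 100 of the 41,499 available iterations, which is why the perplexity is still in the thousands.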

4. Predict Masked Words

After training your language model, you can upload and share your model with the community. We have uploaded our SpanBERTa model to Hugging Face's server. Before evaluating the model on downstream tasks, let's see how it has learned to fill masked words given a context.

In [0]:
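from transformers import pipeline

# The model ID is missing here; substitute the uploaded SpanBERTa checkpoint.
MODEL_PATH = "models/roberta/output"  # stand-in: local training output
fill_mask = pipeline("fill-mask", model=MODEL_PATH, tokenizer=MODEL_PATH)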

I pick a sentence from Wikipedia's article about COVID-19.

The original sentence is "Lavarse frecuentemente las manos con agua y jabón," meaning "Frequently wash your hands with soap and water."

The masked word is "jabón" (soap) and the top 5 predictions are soap, salt, steam, lemon and vinegar. It is interesting that the model somehow learns that we should wash our hands with things that can kill bacteria or contain acid.

In [0]:
fill_mask("Lavarse frecuentemente las manos con agua y <mask>.")
[{'score': 0.6469631195068359,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y jabón.</s>',
  'token': 18493},
 {'score': 0.06074320897459984,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y sal.</s>',
  'token': 619},
 {'score': 0.029787985607981682,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y vapor.</s>',
  'token': 11079},
 {'score': 0.026410052552819252,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y limón.</s>',
  'token': 12788},
 {'score': 0.017029203474521637,
  'sequence': '<s> Lavarse frecuentemente las manos con agua y vinagre.</s>',
  'token': 18424}]
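The pipeline returns a plain list of dictionaries, so it is easy to post-process. As a small sketch, the snippet below takes the output shown above (copied by hand, with the token IDs left out), extracts the predicted word from each sequence, and sums the probabilities of the top 5 candidates. The helper name `filled_word` is hypothetical.

```python
predictions = [
    {"score": 0.6469631195068359,
     "sequence": "<s> Lavarse frecuentemente las manos con agua y jabón.</s>"},
    {"score": 0.06074320897459984,
     "sequence": "<s> Lavarse frecuentemente las manos con agua y sal.</s>"},
    {"score": 0.029787985607981682,
     "sequence": "<s> Lavarse frecuentemente las manos con agua y vapor.</s>"},
    {"score": 0.026410052552819252,
     "sequence": "<s> Lavarse frecuentemente las manos con agua y limón.</s>"},
    {"score": 0.017029203474521637,
     "sequence": "<s> Lavarse frecuentemente las manos con agua y vinagre.</s>"},
]

PREFIX = "<s> Lavarse frecuentemente las manos con agua y "
SUFFIX = ".</s>"


def filled_word(sequence):
    """Strip the fixed context around the mask to recover the predicted word."""
    return sequence[len(PREFIX):-len(SUFFIX)]


words = [filled_word(p["sequence"]) for p in predictions]
top5_mass = sum(p["score"] for p in predictions)

print(words)                 # ['jabón', 'sal', 'vapor', 'limón', 'vinagre']
print(round(top5_mass, 3))   # 0.781
```

The top 5 candidates account for about 78% of the probability mass, and "jabón" alone for about 65%, so the model is fairly confident about this sentence.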


We have walked through how to train a BERT language model for Spanish from scratch and seen that the model has learned properties of the language by trying to predict masked words given a context. You can also follow this article to fine-tune a pretrained BERT-like model on your customized dataset.

Next, we will apply the pretrained model to downstream tasks including sequence classification, NER, POS tagging, and NLI, and compare its performance with some non-BERT models.

Stay tuned for our next posts!