NLLB Tatoeba finetune Gronings - Dutch v1

Moi!

This is a finetuned NLLB model for Gronings (gos) trained on sentence pairs from Tatoeba. Consider this an early beta release!

Quality / evaluation

The GitHub repo contains some BLEU and ChrF score plots, but I haven't analyzed them thoroughly, so I'm hesitant to claim any particular general translation performance for this version. Fortunately, as a linguist and speaker of Gronings, I could evaluate the output by expert eyeball. The model generally produces acceptable Gronings from Dutch input, as long as the language is basic. I consider this interesting enough for a public proof of concept, so I decided to publish. A better model could certainly be built through (synthetic) data additions and hyperparameter optimization; I also haven't employed backtranslation yet.
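For intuition about what the ChrF scores in the repo's plots measure: ChrF compares character n-gram overlap between a hypothesis and a reference translation. The sketch below is a simplified pure-Python illustration of the idea (real evaluations use sacrebleu, which also supports word n-grams via chrF++); it is not the exact implementation behind the plots.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of a string (whitespace collapsed)."""
    s = " ".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified ChrF: average char n-gram precision/recall, combined as F_beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        if hyp and ref:
            precisions.append(overlap / sum(hyp.values()))
            recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta=2 weights recall twice as heavily as precision, as in standard ChrF.
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

An identical hypothesis and reference score 1.0; completely disjoint strings score 0.0, with partial overlap falling in between.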

Updates

Update 10 September 2025: I've updated the code to the latest version of transformers, so the model can be used immediately by anyone without any tokenizer black magic. I also added about 500 more parallel nld-gos sentences to the training data. Only the additional Gronings language token needs to be added to the tokenizer at initialization; after that, everything should work.

Update 21 November 2025: I added another ~450 sentences and shortened training a little bit to avoid overfitting.

Update 28 November 2025: Added another ~90 sentences. Upon further inspection, and inspired by https://www.youtube.com/watch?v=z64a7USuGX0, I decided to train for much longer, and this version seems to perform quite a bit better than the previous one.

Update 16 January 2026: Added a few hundred more sentences and allowed longer sentences in the training loop (at the cost of longer training time).

Links

See https://github.com/tom9358/nllb-tryout for everything (code, more documentation, and references) except the model itself and the training data.

Check out the dedicated Huggingface space to try out the model! Find it here: https://huggingface.co/spaces/Tom9358/gos_gronings_translate

Here is a minimal example snippet to get the model up and running:
```python
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

MODEL_URL = 'Tom9358/nllb-tatoeba-gos-nld-v1'
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
# The Gronings language token is not in the stock NLLB vocabulary,
# so it must be registered as an additional special token.
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True,
                                          additional_special_tokens=["gos_Latn"])

def translate(text, src_lang: str = "nld_Latn", tgt_lang: str = "gos_Latn", **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(
        text,
        return_tensors='pt',
        padding='longest',
        truncation=True,
        max_length=500
    )
    result = model.generate(
        **inputs.to(model.device),
        # Force the decoder to start generating in the target language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Output budget heuristic: a small constant plus ~1.6x the input length.
        max_new_tokens=int(20 + 1.6 * inputs.input_ids.shape[1]),
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

translate("Dit is een testzin om te kijken of de code werkt.")
```

In case the HF space stops working for some reason, here is another (albeit slower) way to try out the model: https://colab.research.google.com/drive/1b5dn3VT4fvOBKly1CIx4Qwo59GDM1H-M

Don't hesitate to contact me if anything comes up!

Model size: 0.6B parameters (BF16, safetensors).