Model Card for Latxa-Llama-3.1-8B-Instruct-Multimodal
⚠️ DEPRECATION NOTICE: This model is deprecated. Please use the updated models available in the HiTZ/latxa-vl collection.
This model is an open Multimodal Large Language Model (MLLM) specifically developed for the Basque language. It adapts the Basque-instructed Latxa backbone to process both image and text inputs, enabling multimodal capabilities for a low-resource language.
Model Details
- Developed by: HiTZ Basque Center for Language Technology - Ixa NLP Group, University of the Basque Country UPV/EHU
- Model type: Multimodal Large Language Model (Late-fusion architecture)
- Language(s) (NLP): Basque (eu), English (en)
- Backbone LLM: Latxa-Llama-3.1-8B-Instruct
- Vision Encoder: CLIP (clip-vit-large-patch14-336)
- Vision-Language Connector: A single fully connected (linear) projection layer
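The late-fusion design listed above can be sketched as follows. This is an illustrative NumPy sketch, not the actual implementation: the dimensions are toy values (the real CLIP ViT-L/14-336 encoder produces 576 patch tokens of width 1024, and the Llama 3.1 8B backbone uses a 4096-dimensional embedding space), and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions only (not the model's actual sizes).
NUM_PATCHES = 4   # CLIP ViT-L/14-336 actually yields 576 patch tokens
VISION_DIM = 8    # CLIP hidden size (1024 in the real encoder)
LLM_DIM = 16      # Latxa/Llama hidden size (4096 in the real backbone)

# The trainable vision-language connector: one linear layer.
W = rng.normal(size=(VISION_DIM, LLM_DIM))
b = np.zeros(LLM_DIM)

def connect(patch_features: np.ndarray) -> np.ndarray:
    """Project frozen CLIP patch features into the LLM embedding space."""
    return patch_features @ W + b

# Stand-ins for the frozen CLIP output of one image and the embedded text prompt.
image_feats = rng.normal(size=(NUM_PATCHES, VISION_DIM))
text_embeds = rng.normal(size=(5, LLM_DIM))  # 5 prompt tokens

# Late fusion: projected image tokens are concatenated with the text token
# sequence, and the combined sequence is fed to the LLM backbone.
fused = np.concatenate([connect(image_feats), text_embeds], axis=0)
print(fused.shape)  # (9, 16)
```

The key point of this architecture is that the connector is the only new component: the LLM sees image patches as if they were extra tokens in its own embedding space.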
Uses
Direct Use
This model is intended for general-purpose multimodal understanding and generation tasks in Basque and English. Typical use cases include:
- Image captioning
- Visual Question Answering (VQA)
- Open-ended text generation from visual inputs
Out-of-Scope Use
- The model is not optimized for specialized multimodal skills such as Optical Character Recognition (OCR) or complex table/chart understanding.
Training Details
The model was developed using a two-stage training procedure specifically adapted for low-resource language constraints.
Stage 1: Vision-Language Alignment
- Goal: Align the visual representations generated by the CLIP encoder with the embedding space of the Latxa backbone.
- Dataset: A mix of the original and translated Conceptual Captions dataset (CC3M and $CC3M_{Eus}$).
- Data Mixture: 80% Basque and 20% English samples.
- Trainable Parameters: Only the linear connector was trained; the vision encoder and LLM remained frozen.
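The selective-training setup in Stage 1 can be sketched framework-agnostically. In an actual PyTorch implementation this corresponds to setting `requires_grad=False` on the vision encoder and LLM parameters and passing only the connector's parameters to the optimizer; the parameter names and update rule below are purely illustrative.

```python
# Minimal sketch of Stage 1 selective training: only the connector updates.
# Parameter names and values are hypothetical stand-ins.
params = {
    "vision_encoder.layer0.weight": {"value": 1.0, "trainable": False},
    "llm.layer0.weight":            {"value": 1.0, "trainable": False},
    "connector.weight":             {"value": 1.0, "trainable": True},
}

def sgd_step(params, grads, lr=0.5):
    """Apply a gradient step, skipping frozen parameters."""
    for name, p in params.items():
        if p["trainable"]:
            p["value"] -= lr * grads[name]

# Pretend backprop produced a gradient of 0.5 for every parameter.
grads = {name: 0.5 for name in params}
sgd_step(params, grads)

print(params["connector.weight"]["value"])              # 0.75 (updated)
print(params["vision_encoder.layer0.weight"]["value"])  # 1.0 (frozen)
```

Freezing the encoder and LLM keeps Stage 1 cheap and stable: the only thing being learned is the mapping from CLIP's feature space into the backbone's embedding space.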
Stage 2: Multimodal Instruction Tuning
- Goal: Fine-tune both the connector and the backbone LLM to follow complex multimodal instructions.
- Dataset Composition: The model was trained on a final mixture of 173k samples, consisting of 83% multimodal data and 17% text-only data.
- Multimodal Data: Derived from the Pixmo-AMA dataset and its Basque machine translation ($Pixmo\text{-}AMA_{Eus}$). The mixture used was 80% Basque and 20% English multimodal instructions.
- Text-only Data: Augmented with 29k text-only instructions (80% Basque and 20% English) sampled from the Magpie-Llama-3.1-8B-Instruct-Filtered-1M dataset. Incorporating this text-only instruction dataset helps counteract the decline in text-only tasks often caused by multimodal training.
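The Stage 2 mixture figures above are mutually consistent; a quick arithmetic check using the numbers reported in this card:

```python
# Sanity-check the Stage 2 instruction-tuning mixture reported above.
TOTAL = 173_000      # final instruction-tuning samples
TEXT_ONLY = 29_000   # Magpie-derived text-only instructions

multimodal = TOTAL - TEXT_ONLY
text_share = TEXT_ONLY / TOTAL

print(multimodal)               # 144000 multimodal samples
print(round(text_share * 100))  # 17 -> matches the 83% / 17% split

# Both subsets use the same 80/20 Basque/English language split.
basque_text_only = round(TEXT_ONly := TEXT_ONLY * 0.8)
print(basque_text_only)         # 23200 Basque text-only instructions
```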
Limitations and Bias
- Cultural Knowledge: Because the multimodal training and evaluation data were machine-translated from English-centric datasets, the model does not inherently acquire Basque-specific multimodal cultural knowledge.
- Bias and Safety: Like other MLLMs, this model carries risks such as amplifying existing human biases; because it adapts an English-centric model to a low-resource language, these biases may also surface in culturally local contexts.
Citation
If you use this model, please cite the following paper:
@misc{arana2025multimodallargelanguagemodels,
  title={Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque},
  author={Lukas Arana and Julen Etxaniz and Ander Salaberria and Gorka Azkune},
  year={2025},
  eprint={2511.09396},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.09396},
}