---
language:
- ja
- en
base_model:
- sbintuitions/sarashina2.2-3b-instruct-v0.1
license: mit
tags:
- multimodal
- vision-language
pipeline_tag: image-to-text
library_name: transformers
---
# Sarashina2.2-Vision-3B
**Sarashina2.2-Vision-3B** is a Japanese Large Vision-Language Model trained by [SB Intuitions](https://www.sbintuitions.co.jp).
It is built on [Sarashina2.2-3B-Instruct](https://huggingface.co/sbintuitions/sarashina2.2-3b-instruct-v0.1) and the image encoder of [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).
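The overall composition of the checkpoint can be inspected from its configuration. The snippet below is a minimal sketch, assuming only that the repository ships custom modeling code loadable with `trust_remote_code=True`; the exact sub-config names for the language model and vision encoder are not guaranteed here.
```python
from transformers import AutoConfig

# Load the model configuration (the custom architecture requires trust_remote_code=True).
config = AutoConfig.from_pretrained("sbintuitions/sarashina2.2-vision-3b", trust_remote_code=True)

# Printing the config should show the language-model and vision-encoder settings;
# the exact field names depend on the repository's custom modeling code.
print(config)
```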
## Model Performance
### Japanese Performance
|Model|Params(B)|[BusinessSlide VQA](https://github.com/stockmarkteam/business-slide-questions)*1|[Heron-Bench](https://arxiv.org/abs/2404.07824)*1|[JDocQA](https://arxiv.org/abs/2403.19454)*1|[JMMMU](https://arxiv.org/abs/2410.17250)|
|-|-|-|-|-|-|
|[Sarashina2.2-Vision-3B](https://huggingface.co/sbintuitions/sarashina2.2-vision-3b)|3.8|3.932|3.214|3.327|0.486|
|[Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)|3.8|3.516|2.000|3.019|0.450|
|[Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)|4.4|4.105|2.330|3.596|0.493|
|[InternVL3_5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B)|4.7|3.311|1.893|2.626|0.437|
|[Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b)|14.4|3.110|2.184|-*2|0.432|
|[Stockmark-2-VL-100B-beta](https://huggingface.co/stockmark/Stockmark-2-VL-100B-beta)|96.5|3.973|2.563|3.168|-*2|
*1. [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) was used as the judge for LLM-as-a-Judge evaluation.
*2. These scores could not be measured because some of the input data exceeds the model's `max_position_embeddings`.
### English Performance
|Model|Params(B)|[DocVQA](https://arxiv.org/abs/2007.00398)|[InfoVQA](https://arxiv.org/abs/2104.12756)|[RealworldQA](https://huggingface.co/datasets/xai-org/RealworldQA)|
|-|-|-|-|-|
|[Sarashina2.2-Vision-3B](https://huggingface.co/sbintuitions/sarashina2.2-vision-3b)|3.8|0.831|0.567|0.625|
|[Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)|3.8|0.924|0.750|0.586|
|[Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)|4.4|0.948|0.798|0.712|
|[InternVL3_5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B)|4.7|0.823|0.541|0.553|
|[Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b)|14.4|0.729|0.490|0.519|
## How to use
### 1. Install dependencies
```sh
pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
```
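Optionally, a quick check such as the following confirms that the pinned `transformers` version and a CUDA-enabled PyTorch build are available:
```python
import torch
import transformers

# Print the installed versions and whether a CUDA device is visible.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```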
### 2. Inference
The following script loads the model and processor and runs inference on a sample image.
```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, set_seed

# Define model path
model_path = "sbintuitions/sarashina2.2-vision-3b"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
set_seed(42)

# Build a chat message with an image and a Japanese question
# ("これはどこで撮った写真ですか?" = "Where was this photo taken?")
image_url = "https://huggingface.co/sbintuitions/sarashina2.2-vision-3b/resolve/main/sample.jpg"
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_url,
            },
            {
                "type": "text",
                "text": "これはどこで撮った写真ですか?",
            },
        ],
    }
]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <|user|><|prefix|><|file|><|suffix|>これはどこで撮った写真ですか?<|assistant|>"""

# Download the sample image and prepare the model inputs
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.2,
)

# Strip the prompt tokens from each generated sequence and decode
generated_ids = [
    output_ids[len(input_ids) :] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""
この写真は、**道後温泉本館(どうごおんせんほんかん)** の入り口を夜景で撮影した写真です。
---
場所の詳細:
- **名称**:道後温泉本館(Dogo Onsen Honkan)
- **所在地**:〒790-0842 愛媛県松山市道後湯之町1丁目3番5号
- **アクセス**:JR松山駅から市内電車「道後温泉駅」下車すぐ
- **特徴**:日本最古の温泉の一つとして知られる「道後温泉」の中心的な施設。国の重要文化財にも指定されています。
---
写真の特徴から判断した理由:
- 建物の屋根や装飾が伝統的な和風建築で、「道後温泉」の看板が目立つ。
- 入口の垂れ幕には「道後」「道後」と書かれており、白い鳳凰の模様が描かれている → 道後温泉の象徴的デザイン。
- 夜の照明と石灯籠、提灯風の灯りが日本の温泉地らしい雰囲気を醸し出している。
- 看板に「道後温泉」の文字が明確に表示されている。
---
補足情報:
道後温泉本館は、夏目漱石の小説『坊っちゃん』の舞台としても有名で、多くの観光客が訪れる人気スポットです。また、2020年にリニューアルされ、現代的な設備も導入されていますが、外観は伝統を残しています。
---
よって、この写真は **愛媛県松山市にある「道後温泉本館」の夜景** です。
"""
```
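The script above samples with `temperature` and `top_p`, so outputs can vary between runs. As a minimal variant, reusing `processor`, `model`, and `text_prompt` from the script above, a local image (the path below is a placeholder) can be used together with greedy decoding for deterministic output:
```python
# Variant: read a local image instead of downloading one (placeholder path),
# and decode greedily so that the output is deterministic.
image = Image.open("path/to/your_image.jpg").convert("RGB")
inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
generated_ids = [o[len(i) :] for i, o in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```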
## Training
**Sarashina2.2-Vision-3B** was created through the following five-stage training process (a conceptual sketch of a stage-wise setup is given after the lists):
### Pre-training
1. Projector Warmup: To bridge the gap between the text and image embedding spaces within the LLM
2. Vision Encoder Pretraining: To enhance image comprehension, especially for understanding Japan-specific images and text
3. Full Model Pretraining: To enhance the model's unified understanding of images and language using interleaved data
### Post-training
1. Supervised Fine-Tuning (SFT): To improve the model's ability to follow instructions and respond appropriately to user prompts
2. Mixed Preference Optimization (MPO): To align the model's outputs with user preferences, ensuring it generates more desirable responses
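The training code is not published here; the sketch below is only a conceptual illustration of how such stage-wise training is commonly set up, using a toy module with hypothetical component names (`vision_encoder`, `projector`, `language_model`) that do not necessarily match this model's internals, and with stage-to-component assignments that are assumptions rather than the actual recipe.
```python
import torch.nn as nn


class ToyVLM(nn.Module):
    """Toy stand-in for a vision-language model; component names are hypothetical."""

    def __init__(self) -> None:
        super().__init__()
        self.vision_encoder = nn.Linear(16, 8)
        self.projector = nn.Linear(8, 8)
        self.language_model = nn.Linear(8, 8)


def set_trainable(model: nn.Module, trainable_prefixes: tuple) -> None:
    """Unfreeze only parameters whose names start with one of the given prefixes."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)


model = ToyVLM()

# Stage 1 (Projector Warmup): e.g., update only the projector.
set_trainable(model, ("projector",))
# Stage 2 (Vision Encoder Pretraining): e.g., also update the vision encoder.
set_trainable(model, ("vision_encoder", "projector"))
# Stage 3 (Full Model Pretraining) and post-training (SFT, MPO): update all components.
set_trainable(model, ("vision_encoder", "projector", "language_model"))

print([name for name, param in model.named_parameters() if param.requires_grad])
```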
## Limitations
This model has undergone limited safety training. It may therefore generate meaningless sequences, inaccurate statements, or biased/objectionable outputs. Before using it, we ask developers to tune the model based on human preferences and safety considerations.
## License
[MIT License](./LICENSE)