---
language:
- ja
- en
base_model:
- sbintuitions/sarashina2.2-3b-instruct-v0.1
license: mit
tags:
- multimodal
- vision-language
pipeline_tag: image-to-text
library_name: transformers
---

# Sarashina2.2-Vision-3B

**Sarashina2.2-Vision-3B** is a Japanese Large Vision Language Model trained by [SB Intuitions](https://www.sbintuitions.co.jp).

This model is based on [Sarashina2.2-3B-Instruct](https://huggingface.co/sbintuitions/sarashina2.2-3b-instruct-v0.1) and the image encoder of [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).

## Model Performance

### Japanese Performance

|Model|Params (B)|[BusinessSlide VQA](https://github.com/stockmarkteam/business-slide-questions)*1|[Heron-Bench](https://arxiv.org/abs/2404.07824)*1|[JDocQA](https://arxiv.org/abs/2403.19454)*1|[JMMMU](https://arxiv.org/abs/2410.17250)|
|-|-|-|-|-|-|
|[Sarashina2.2-Vision-3B](https://huggingface.co/sbintuitions/sarashina2.2-vision-3b)|3.8|3.932|3.214|3.327|0.486|
|[Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)|3.8|3.516|2.000|3.019|0.450|
|[Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)|4.4|4.105|2.330|3.596|0.493|
|[InternVL3_5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B)|4.7|3.311|1.893|2.626|0.437|
|[Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b)|14.4|3.110|2.184|-*2|0.432|
|[Stockmark-2-VL-100B-beta](https://huggingface.co/stockmark/Stockmark-2-VL-100B-beta)|96.5|3.973|2.563|3.168|-*2|

*1. [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) was used for LLM-as-a-Judge.
*2. These scores could not be measured because some input data exceeds the model's `max_position_embeddings`.

### English Performance

|Model|Params (B)|[DocVQA](https://arxiv.org/abs/2007.00398)|[InfoVQA](https://arxiv.org/abs/2104.12756)|[RealworldQA](https://huggingface.co/datasets/xai-org/RealworldQA)|
|-|-|-|-|-|
|[Sarashina2.2-Vision-3B](https://huggingface.co/sbintuitions/sarashina2.2-vision-3b)|3.8|0.831|0.567|0.625|
|[Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)|3.8|0.924|0.750|0.586|
|[Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)|4.4|0.948|0.798|0.712|
|[InternVL3_5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B)|4.7|0.823|0.541|0.553|
|[Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b)|14.4|0.729|0.490|0.519|

## How to use

### 1. Install dependencies

```sh
pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
```

### 2. Inference

The following script loads the model and runs inference on a sample image.
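The script maps the model onto a CUDA device (`device_map="cuda"`). As an optional sanity check that is not part of the original example, you can first confirm that a GPU build of PyTorch is installed; running on CPU may work in principle by changing `device_map`, but this is an assumption and generation will be much slower.

```python
import torch

# Optional environment check (assumption: the example below expects a CUDA GPU).
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```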
```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, set_seed

# Define model path
model_path = "sbintuitions/sarashina2.2-vision-3b"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
set_seed(42)

image_url = "https://huggingface.co/sbintuitions/sarashina2.2-vision-3b/resolve/main/sample.jpg"
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_url,
            },
            {
                "type": "text",
                "text": "これはどこで撮った写真ですか?",  # "Where was this photo taken?"
            },
        ],
    }
]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <|user|><|prefix|><|file|><|suffix|>これはどこで撮った写真ですか?<|assistant|>"""

image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.2,
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""
この写真は、**道後温泉本館(どうごおんせんほんかん)** の入り口を夜景で撮影した写真です。

---

場所の詳細:
- **名称**:道後温泉本館(Dogo Onsen Honkan)
- **所在地**:〒790-0842 愛媛県松山市道後湯之町1丁目3番5号
- **アクセス**:JR松山駅から市内電車「道後温泉駅」下車すぐ
- **特徴**:日本最古の温泉の一つとして知られる「道後温泉」の中心的な施設。国の重要文化財にも指定されています。

---

写真の特徴から判断した理由:
- 建物の屋根や装飾が伝統的な和風建築で、「道後温泉」の看板が目立つ。
- 入口の垂れ幕には「道後」「道後」と書かれており、白い鳳凰の模様が描かれている → 道後温泉の象徴的デザイン。
- 夜の照明と石灯籠、提灯風の灯りが日本の温泉地らしい雰囲気を醸し出している。
- 看板に「道後温泉」の文字が明確に表示されている。

---

補足情報:
道後温泉本館は、夏目漱石の小説『坊っちゃん』の舞台としても有名で、多くの観光客が訪れる人気スポットです。また、2020年にリニューアルされ、現代的な設備も導入されていますが、外観は伝統を残しています。

---

よって、この写真は **愛媛県松山市にある「道後温泉本館」の夜景** です。
"""
```

## Training

**Sarashina2.2-Vision-3B** was created through the following five-stage training process:

### Pretraining

1. Projector Warmup: To bridge the gap between the text and image embedding spaces within the LLM
2. Vision Encoder Pretraining: To enhance image comprehension, especially for understanding Japan-specific images and text
3. Full Model Pretraining: To enhance the model's unified understanding of images and language using interleaved data

### Post-training

1. Supervised Fine-Tuning (SFT): To improve the model's ability to follow instructions and respond appropriately to user prompts
2. Mixed Preference Optimization (MPO): To align the model's outputs with user preferences, ensuring it generates more desirable responses

## Limitations

This model has undergone only limited safety training. It may therefore generate meaningless sequences, inaccurate content, or biased/objectionable outputs. Before using it, we ask developers to tune the model based on human preferences and safety considerations.

## LICENSE

[MIT License](./LICENSE)