---
language:
- ja
- en
base_model:
- sbintuitions/sarashina2.2-3b-instruct-v0.1
license: mit
tags:
- multimodal
- vision-language
pipeline_tag: image-to-text
library_name: transformers
---

# Sarashina2.2-Vision-3B

**Sarashina2.2-Vision-3B** is a Japanese Large Vision Language Model trained by [SB Intuitions](https://www.sbintuitions.co.jp).

This model is based on [Sarashina2.2-3B-Instruct](https://huggingface.co/sbintuitions/sarashina2.2-3b-instruct-v0.1) and the image encoder of [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).

## Model Performance

### Japanese Performance

|Model|Params (B)|[BusinessSlide VQA](https://github.com/stockmarkteam/business-slide-questions)*1|[Heron-Bench](https://arxiv.org/abs/2404.07824)*1|[JDocQA](https://arxiv.org/abs/2403.19454)*1|[JMMMU](https://arxiv.org/abs/2410.17250)|
|-|-|-|-|-|-|
|[Sarashina2.2-Vision-3B](https://huggingface.co/sbintuitions/sarashina2.2-vision-3b)|3.8|3.932|3.214|3.327|0.486|
|[Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)|3.8|3.516|2.000|3.019|0.450|
|[Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)|4.4|4.105|2.330|3.596|0.493|
|[InternVL3_5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B)|4.7|3.311|1.893|2.626|0.437|
|[Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b)|14.4|3.110|2.184|-*2|0.432|
|[Stockmark-2-VL-100B-beta](https://huggingface.co/stockmark/Stockmark-2-VL-100B-beta)|96.5|3.973|2.563|3.168|-*2|

*1. [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) was used for LLM-as-a-Judge.
*2. These scores could not be measured because some input data exceeds the model's `max_position_embeddings`.

### English Performance

|Model|Params (B)|[DocVQA](https://arxiv.org/abs/2007.00398)|[InfoVQA](https://arxiv.org/abs/2104.12756)|[RealworldQA](https://huggingface.co/datasets/xai-org/RealworldQA)|
|-|-|-|-|-|
|[Sarashina2.2-Vision-3B](https://huggingface.co/sbintuitions/sarashina2.2-vision-3b)|3.8|0.831|0.567|0.625|
|[Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)|3.8|0.924|0.750|0.586|
|[Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)|4.4|0.948|0.798|0.712|
|[InternVL3_5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B)|4.7|0.823|0.541|0.553|
|[Sarashina2-Vision-14B](https://huggingface.co/sbintuitions/sarashina2-vision-14b)|14.4|0.729|0.490|0.519|

## How to use

### 1. Install dependencies

```sh
pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
```

### 2. Inference

The following script loads the model and runs inference on a sample image.
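The script maps the model onto a CUDA device (`device_map="cuda"`). As an optional sanity check that is not part of the original example, you can first confirm that a GPU build of PyTorch is installed; running on CPU may work in principle by changing `device_map`, but this is an assumption and generation will be much slower.

```python
import torch

# Optional environment check (assumption: the example below expects a CUDA GPU).
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```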
```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, set_seed

# Define model path
model_path = "sbintuitions/sarashina2.2-vision-3b"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
set_seed(42)

image_url = "https://huggingface.co/sbintuitions/sarashina2.2-vision-3b/resolve/main/sample.jpg"
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_url,
            },
            {
                "type": "text",
                "text": "これはどこで撮った写真ですか?",  # "Where was this photo taken?"
            },
        ],
    }
]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <|user|><|prefix|><|file|><|suffix|>これはどこで撮った写真ですか?<|assistant|>"""

image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.2,
)
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""
この写真は、**道後温泉本館(どうごおんせんほんかん)** の入り口を夜景で撮影した写真です。

---

場所の詳細:
- **名称**:道後温泉本館(Dogo Onsen Honkan)
- **所在地**:〒790-0842 愛媛県松山市道後湯之町1丁目3番5号
- **アクセス**:JR松山駅から市内電車「道後温泉駅」下車すぐ
- **特徴**:日本最古の温泉の一つとして知られる「道後温泉」の中心的な施設。国の重要文化財にも指定されています。

---

写真の特徴から判断した理由:
- 建物の屋根や装飾が伝統的な和風建築で、「道後温泉」の看板が目立つ。
- 入口の垂れ幕には「道後」「道後」と書かれており、白い鳳凰の模様が描かれている → 道後温泉の象徴的デザイン。
- 夜の照明と石灯籠、提灯風の灯りが日本の温泉地らしい雰囲気を醸し出している。
- 看板に「道後温泉」の文字が明確に表示されている。

---

補足情報:
道後温泉本館は、夏目漱石の小説『坊っちゃん』の舞台としても有名で、多くの観光客が訪れる人気スポットです。また、2020年にリニューアルされ、現代的な設備も導入されていますが、外観は伝統を残しています。

---

よって、この写真は **愛媛県松山市にある「道後温泉本館」の夜景** です。
"""
```

## Training

**Sarashina2.2-Vision-3B** was created through the following five-stage training process:

### Pretraining

1. Projector Warmup: To bridge the gap between the text and image embedding spaces within the LLM
2. Vision Encoder Pretraining: To enhance image comprehension, especially for understanding Japan-specific images and text
3. Full Model Pretraining: To enhance the model's unified understanding of images and language using interleaved data

### Post-training

1. Supervised Fine-Tuning (SFT): To improve the model's ability to follow instructions and respond appropriately to user prompts
2. Mixed Preference Optimization (MPO): To align the model's outputs with user preferences, ensuring it generates more desirable responses

## Limitations

This model has undergone only limited safety training. It may therefore generate meaningless sequences, inaccurate content, or biased/objectionable outputs. Before using it, we ask developers to tune the model based on human preferences and safety considerations.

## LICENSE

[MIT License](./LICENSE)