SiQ-VL: A Vision-Language Model for Multimodal Understanding
Abstract
SiQ-VL is a vision-language model (VLM) that integrates a SigLIP-based vision encoder with a Qwen2.5 language model through a learnable projection module. The architecture employs a multi-stage training paradigm designed to progressively develop capabilities in multimodal understanding and text generation tasks.
Experiment Tracking
Training runs and experiments are tracked using Weights & Biases. View training metrics, model checkpoints, and experiment logs at: https://wandb.ai/ReproduceAI/siq-vl
Architecture Overview
The SiQ-VL architecture comprises three principal components (a minimal composition sketch follows the list):
- Vision Encoder: A SigLIP-based vision tower that remains frozen throughout the training process
- Projection Module: A learnable projector that transforms vision features into the language model embedding space, incorporating pixel shuffle operations for sequence length compression
- Language Model: A Qwen2.5 transformer-based model responsible for text generation, which remains frozen in Stage 1 and is fine-tuned in subsequent training stages
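As a rough illustration of how these three components fit together, the sketch below wires a frozen SigLIP vision tower and a Qwen2.5 decoder around a trainable projector. The class and attribute names are illustrative placeholders rather than the actual siq_vl API.

import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel

class ToyVLM(nn.Module):
    """Illustrative composition only; see siq_vl/model.py for the real SiQ_VLModel."""

    def __init__(self,
                 vision_id="google/siglip-so400m-patch14-384",
                 llm_id="Qwen/Qwen2.5-0.5B-Instruct",
                 shuffle_factor=3):
        super().__init__()
        self.vision_tower = SiglipVisionModel.from_pretrained(vision_id)  # frozen in every stage
        self.llm = AutoModelForCausalLM.from_pretrained(llm_id)           # frozen in Stage 1 only
        vis_dim = self.vision_tower.config.hidden_size                    # 1152 for SigLIP SO400M
        llm_dim = self.llm.config.hidden_size                             # 896 for Qwen2.5-0.5B
        # Trainable projector: pixel shuffle folds factor^2 patches into one token.
        self.projector = nn.Sequential(
            nn.Linear(vis_dim * shuffle_factor ** 2, llm_dim),
            nn.LayerNorm(llm_dim),
        )
        self.vision_tower.requires_grad_(False)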
Architectural Diagram
Model Architecture Diagram (Mermaid)
graph TB
Image[Input Image] --> IP[Image Processor<br/>SigLIP]
Text[Text Prompt] --> Tokenizer[Tokenizer<br/>Qwen2.5]
IP --> Vision[Vision Tower<br/>SigLIP<br/>FROZEN]
Tokenizer --> TextEmb[Text Embeddings]
Vision --> VisionFeat[Vision Features<br/>729×1152]
VisionFeat --> PixelShuffle[Pixel Shuffle<br/>Factor=3]
PixelShuffle --> Proj[Linear Projection<br/>10368→896]
Proj --> Norm[LayerNorm]
Norm --> VisionEmb[Vision Embeddings<br/>81×896]
VisionEmb --> Fusion[Embedding Fusion<br/>Splice Image Tokens]
TextEmb --> Fusion
Fusion --> LLM[Language Model<br/>Qwen2.5<br/>Frozen Stage 1 / Trainable Stage 2+]
LLM --> Output[Generated Text]
style Vision fill:#ffcccc
style LLM fill:#ccffcc
style PixelShuffle fill:#ffffcc
style Proj fill:#ffffcc
style Norm fill:#ffffcc
SiQ-VL Model Architecture (text overview)

Image path:
- Input Image (PIL) → Image Processor (SigLIP) → Vision Tower (SigLIP, FROZEN in all stages)
- Vision Tower output: [Batch, 729, 1152] for a 384×384 image with patch_size=14

Projector (SiQ_VLModalityProjector, TRAINABLE in all stages):
- Pixel Shuffle (factor=3): [729, 1152] → reshape → [81, 10368]
- Linear Projection: [81, 10368] → Linear(10368, 896) → [81, 896]
- LayerNorm: normalize to match the LLM embedding distribution
- Projector output: [Batch, 81, 896] (compressed vision tokens)

Text path:
- Text Prompt → Tokenizer (Qwen2.5) → text tokens + special tokens → Text Embeddings [Batch, Seq, 896]

Fusion and generation:
- Embedding Fusion splices the 81 vision tokens into the text embedding sequence.
- Language Model (Qwen2.5, FROZEN in Stage 1, TRAINABLE in Stage 2+) consumes the fused sequence and produces logits [Batch, Seq, Vocab] for next-token prediction → Generated Text

Key Dimensions:
- Vision Features: [Batch, 729, 1152] (SigLIP SO400M)
- After Pixel Shuffle: [Batch, 81, 10368]
- After Projection: [Batch, 81, 896] (Qwen2.5-0.5B hidden size)
- LLM Output: [Batch, Seq, Vocab]
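The projector described above is small enough to sketch directly in PyTorch. The following is a minimal re-implementation for illustration only; it assumes a square 27×27 patch grid and a shuffle factor of 3, and is not the exact SiQ_VLModalityProjector code.

import torch
import torch.nn as nn

class ProjectorSketch(nn.Module):
    """Pixel shuffle + linear projection + LayerNorm (illustrative sketch)."""

    def __init__(self, vision_dim=1152, llm_dim=896, shuffle_factor=3):
        super().__init__()
        self.factor = shuffle_factor
        self.proj = nn.Linear(vision_dim * shuffle_factor ** 2, llm_dim)  # 10368 -> 896
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, x):                        # x: [B, 729, 1152]
        b, n, d = x.shape
        side = int(n ** 0.5)                     # 27 patches per side
        f = self.factor
        x = x.reshape(b, side, side, d)          # [B, 27, 27, 1152]
        # Group each f x f neighbourhood of patches into a single token.
        x = x.reshape(b, side // f, f, side // f, f, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // f) ** 2, d * f * f)  # [B, 81, 10368]
        return self.norm(self.proj(x))           # [B, 81, 896]

# Shape check: [1, 729, 1152] -> [1, 81, 896]
print(ProjectorSketch()(torch.randn(1, 729, 1152)).shape)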
Forward Pass Data Flow
Input:
- Image: PIL.Image (384×384×3)
- Text: "Describe this image."

Step 1: Image Processing
Image (384×384×3)
→ [Image Processor] → Pixel Values [1, 3, 384, 384]
→ [Vision Tower - SigLIP] → Vision Features [1, 729, 1152]
Here 729 = 27 × 27 patches (384 // 14 = 27 patches per side) and 1152 is the SigLIP SO400M hidden size.
Step 2: Projection with Pixel Shuffle
Vision Features [1, 729, 1152]
→ [Reshape into a 27×27 patch grid] → [1, 27, 27, 1152]
→ [Pixel Shuffle: factor=3] → [1, 9, 9, 10368] (1152 × 3² = 10368)
→ [Reshape] → [1, 81, 10368]
→ [Linear Projection: 10368→896] → [1, 81, 896]
→ [LayerNorm] → Vision Embeddings [1, 81, 896]
Result: 81 tokens (compressed from 729), each of width 896, the Qwen2.5-0.5B hidden size.
Step 3: Text Processing
Text: "Describe this image."
→ [Tokenizer + Chat Template]
Input IDs: [151644, 77091, 198, ..., 151655, ..., 151645], which encode:
- <|im_start|>user\n
- <|vision_start|> + <|image_pad|> × 81 + <|vision_end|>
- Describe this image.
- <|im_end|>
→ [Text Embeddings] → Text Embeddings [1, Seq, 896]
Step 4: Embedding Fusion
Text Embeddings [1, Seq, 896]
→ find the <|image_pad|> positions and split the sequence into:
- Prefix: [1, prefix_len, 896]
- Image: [1, 81, 896] ← projected vision embeddings are inserted here
- Suffix: [1, suffix_len, 896]
→ [Concatenate] → Fused Embeddings [1, prefix_len + 81 + suffix_len, 896]
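A minimal sketch of this splicing step is shown below. It assumes a single image per sample, batch elements of equal length, and a known image_pad token id; the function name is illustrative and not the project's actual helper.

import torch

def splice_image_tokens(text_embeds, vision_embeds, input_ids, image_pad_id):
    """Replace the run of <|image_pad|> placeholder embeddings with projected
    vision embeddings (illustrative; assumes one image per sample)."""
    fused = []
    for b in range(text_embeds.size(0)):
        positions = (input_ids[b] == image_pad_id).nonzero(as_tuple=True)[0]
        start, end = positions[0].item(), positions[-1].item() + 1  # contiguous run of 81 pads
        fused.append(torch.cat([
            text_embeds[b, :start],   # prefix embeddings before the image span
            vision_embeds[b],         # [81, 896] projected vision tokens
            text_embeds[b, end:],     # suffix: prompt and answer embeddings
        ], dim=0))
    return torch.stack(fused)         # [B, prefix_len + 81 + suffix_len, 896]

Because the 81 placeholder positions are replaced by exactly 81 vision tokens, the overall sequence length is unchanged.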
Step 5: LLM Forward Pass
Fused Embeddings [1, Total_Seq, 896]
→ [Qwen2.5 Transformer] → Logits [1, Total_Seq, Vocab_Size]
→ [Generate/Decode] → Output: "The image depicts a beautiful sunset..."
Step 6: Loss Calculation (Training)
Logits [1, Total_Seq, Vocab_Size] are scored against Labels [1, Total_Seq], where labels contain:
- -100 (ignored): image tokens and prompt tokens
- token IDs: answer tokens only
→ [Cross-Entropy Loss] → Loss: scalar
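This masking is the standard ignore_index convention of PyTorch cross-entropy. A minimal sketch, including the usual next-token shift:

import torch.nn.functional as F

def masked_lm_loss(logits, labels):
    """Cross-entropy over answer tokens only; positions labelled -100
    (image tokens, prompt tokens, padding) are ignored."""
    shift_logits = logits[:, :-1, :].contiguous()   # predict token t+1 from positions <= t
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )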
Component Status by Stage
| Component | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|
| Vision Tower (SigLIP) | Frozen | Frozen | Frozen | Frozen |
| Projector | Train | Train | Train | Train |
| Language Model (Qwen2.5) | Frozen | Train | Train | Train |
| RL Components | N/A | N/A | N/A | Active |
Key Design Features
- Multi-Stage Training Paradigm: A progressive training strategy that transitions from projector alignment to comprehensive model fine-tuning
- Pixel Shuffle Compression: Implements spatial compression to reduce vision token sequence length, improving computational efficiency
- Automatic Configuration: Dynamically computes pixel shuffle factors based on vision encoder specifications (see the heuristic sketch after this list)
- Distributed Training Support: Facilitates multi-GPU training through the Accelerate framework
- Memory Optimization: Incorporates gradient checkpointing and optimized data loading strategies
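One way to derive the shuffle factor automatically, as referenced in the Automatic Configuration item above, is to read the patch grid size from the vision encoder and pick the largest factor that divides it evenly. This is a heuristic sketch; the project's actual rule may differ.

def auto_pixel_shuffle_factor(image_size, patch_size, max_factor=4):
    """Pick a shuffle factor that evenly divides the patch grid (heuristic sketch)."""
    patches_per_side = image_size // patch_size   # e.g. 384 // 14 -> 27
    for factor in range(max_factor, 1, -1):       # prefer stronger compression
        if patches_per_side % factor == 0:
            return factor
    return 1                                      # fall back to no compression

print(auto_pixel_shuffle_factor(384, 14))  # 27 patches per side -> factor 3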
Training Methodology
The SiQ-VL model is trained using a multi-stage approach designed to incrementally develop vision-language capabilities:
Stage 1: Projector Alignment
Objective: Establish alignment between vision encoder outputs and the language model embedding space through supervised training of the projection module exclusively.
- Frozen Components: Vision encoder (SigLIP) and language model (Qwen2.5)
- Trainable Parameters: Projection module only
- Training Dataset: FineVision multimodal instruction-following dataset
- Purpose: Initialize vision-language feature alignment
- Implementation Status: Fully implemented
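In code, the Stage 1 setup amounts to disabling gradients everywhere except the projector. A minimal sketch; the attribute names vision_tower, llm, and projector are assumptions about the model's structure, not the verified SiQ_VLModel attributes.

def freeze_for_stage1(model):
    """Stage 1: train the projector only (attribute names are assumed)."""
    model.vision_tower.requires_grad_(False)  # SigLIP stays frozen in every stage
    model.llm.requires_grad_(False)           # Qwen2.5 is frozen in Stage 1 only
    model.projector.requires_grad_(True)      # only the projector receives gradients

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {trainable / 1e6:.1f}M")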
Stage 2: Language Model Fine-tuning on Visual Question Answering
Objective: Fine-tune the language model component on large-scale visual question answering datasets to enhance visual comprehension and reasoning capabilities.
- Frozen Components: Vision encoder (SigLIP)
- Trainable Parameters: Projection module and language model
- Training Dataset: Large-scale VQA datasets including VQAv2, GQA, and TextVQA
- Purpose: Develop enhanced visual understanding and question-answering capabilities
- Implementation Status: Planned for future release
Stage 3: Supervised Fine-tuning with Chain-of-Thought Reasoning
Objective: Fine-tune the model on reasoning datasets annotated with chain-of-thought (CoT) demonstrations to improve step-by-step reasoning and explanatory capabilities.
- Frozen Components: Vision encoder (SigLIP)
- Trainable Parameters: Projection module and language model
- Training Dataset: Visual reasoning datasets with chain-of-thought annotations
- Purpose: Develop systematic reasoning and step-by-step explanation capabilities
- Implementation Status: Planned for future release
Stage 4: Reinforcement Learning-based Optimization
Objective: Enhance model performance through reinforcement learning techniques, such as reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), to better align outputs with human preferences.
- Training Method: Reinforcement learning-based optimization (specific methodology to be determined)
- Purpose: Improve output quality and alignment with human preferences
- Implementation Status: Planned for future release
Training Pipeline Flow Diagram
Training Pipeline Visualization (Mermaid)
graph TD
Start[Initialize Models<br/>SigLIP + Qwen2.5] --> Stage1[Stage 1: Projector Alignment - Implemented]
Stage1 --> |Train Projector Only| S1Checkpoint[Checkpoint: Stage 1<br/>Aligned Projector]
S1Checkpoint --> Stage2[Stage 2: LLM Fine-tuning - Planned]
Stage2 --> |Train Projector + LLM| S2Checkpoint[Checkpoint: Stage 2<br/>VQA Capable]
S2Checkpoint --> Stage3[Stage 3: SFT with CoT - Planned]
Stage3 --> |Train Projector + LLM| S3Checkpoint[Checkpoint: Stage 3<br/>Reasoning Capable]
S3Checkpoint --> Stage4[Stage 4: RL Training - Planned]
Stage4 --> |RL Optimization| Final[Final Model<br/>Production Ready]
Stage4 --> |RL Optimization| Final[Final Model<br/>Production Ready]
Stage1 -.->|Dataset: FineVision| D1[FineVision<br/>Multimodal Instructions]
Stage2 -.->|Dataset: VQA| D2[VQAv2, GQA, TextVQA]
Stage3 -.->|Dataset: CoT| D3[Reasoning with CoT]
Stage4 -.->|Dataset: Preferences| D4[Human Preferences]
style Stage1 fill:#90EE90
style Stage2 fill:#FFD700
style Stage3 fill:#FFD700
style Stage4 fill:#FFD700
style Final fill:#87CEEB
Training Pipeline Overview (text)

Initialization
- Load SigLIP (frozen)
- Load Qwen2.5 (frozen)
- Initialize Projector (random weights)

Stage 1: Projector Alignment [IMPLEMENTED]
- Vision Tower: FROZEN | Projector: TRAINABLE | LLM: FROZEN
- Dataset: FineVision (multimodal instruction-following, ~10 subsets such as coco_colors and sharegpt4v)
- Training: learning rate 1e-3, ~1000 steps; objective: align vision features with the LLM embedding space
- Output: Stage 1 checkpoint (aligned projector, frozen vision tower and LLM)

Stage 2: LLM Fine-tuning on VQA [PLANNED]
- Vision Tower: FROZEN | Projector: TRAINABLE (continued from Stage 1) | LLM: TRAINABLE (unfrozen)
- Dataset: large VQA datasets (VQAv2, GQA, TextVQA, etc.), focused on visual question answering
- Training: learning rate 1e-5 to 2e-5 (lower for the LLM), steps TBD; objective: improve VQA capabilities
- Output: Stage 2 checkpoint (VQA-capable model)

Stage 3: SFT with CoT Reasoning [PLANNED]
- Vision Tower: FROZEN | Projector: TRAINABLE (continued from Stage 2) | LLM: TRAINABLE (continued from Stage 2)
- Dataset: reasoning data with chain-of-thought annotations (step-by-step reasoning, visual reasoning tasks)
- Training: learning rate 1e-5 to 2e-5, steps TBD; objective: develop reasoning capabilities
- Output: Stage 3 checkpoint (reasoning-capable model)

Stage 4: Reinforcement Learning [PLANNED]
- Vision Tower: FROZEN | Projector: TRAINABLE (continued from Stage 3) | LLM: TRAINABLE (continued from Stage 3) | RL Components: ACTIVE
- Dataset: preference datasets (human feedback data, preference pairs)
- Training: method RLHF / DPO / etc. (TBD); objective: align with human preferences
- Output: Final Model (fully aligned VLM, production ready)
Training Stage Comparison
| Feature | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|
| Status | Implemented | Planned | Planned | Planned |
| Trainable Components | Projector only | Projector + LLM | Projector + LLM | Projector + LLM + RL |
| Frozen Components | Vision + LLM | Vision only | Vision only | Vision only |
| Learning Rate | 1e-3 | 1e-5 to 2e-5 | 1e-5 to 2e-5 | TBD |
| Training Steps | ~1000 | TBD | TBD | TBD |
| Primary Dataset | FineVision | VQA Datasets | CoT Reasoning | Preferences |
| Objective | Alignment | VQA | Reasoning | Alignment |
| Checkpoint Input | Base models | Stage 1 | Stage 2 | Stage 3 |
| Checkpoint Output | Stage 1 | Stage 2 | Stage 3 | Final Model |
Requirements
System Requirements
- Python 3.10 (Python >= 3.10 and < 3.11)
- PyTorch >= 2.9.1
- CUDA-capable GPU with at least 24GB VRAM (recommended for training)
- Package manager: uv (recommended) or pip
Installation
Installation via uv (Recommended)
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository
git clone <repository-url>
cd SiQ_VL
# Install dependencies
uv sync
Using pip
pip install -e .
Training Datasets
Stage 1: FineVision Dataset
Stage 1 training employs the FineVision dataset, available through HuggingFace, which comprises multiple data subsets:
- coco_colors
- densefusion_1m
- face_emotion
- google_landmarks
- laion_gpt4v
- sharegpt4o
- sharegpt4v(coco)
- sharegpt4v(llava)
- sharegpt4v(knowledge)
- sharegpt4v(sam)
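Individual subsets can be pulled with the datasets library; a minimal sketch, assuming the subset names above are valid FineVision configuration names:

from datasets import load_dataset

# Stream a single FineVision subset to avoid downloading the full dataset.
ds = load_dataset("HuggingFaceM4/FineVision", name="coco_colors", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # available fields (images, conversations, ...) vary by subset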
Future Training Stages
- Stage 2: Large-scale visual question answering datasets (VQAv2, GQA, TextVQA)
- Stage 3: Visual reasoning datasets annotated with chain-of-thought demonstrations
- Stage 4: Human preference datasets for reinforcement learning optimization
Training Instructions
Note: Presently, only Stage 1 (Projector Alignment) is fully implemented. Stages 2-4 are planned for future releases.
Stage 1: Projector Alignment Training
Quick Start
The easiest way to start Stage 1 training is using the provided shell script, which auto-detects your environment:
bash scripts/train_stage_1.sh
The script performs the following automatic configurations:
- Detects the computing environment (e.g., MacBook, AWS p4d instances)
- Sets appropriate hyperparameters for Stage 1 training
- Configures distributed training when multiple GPUs are available
- Freezes the language model and trains only the projection module
Manual Training
For more control, you can run the training script directly:
python scripts/train.py \
--vision_model_name_or_path "google/siglip-so400m-patch14-384" \
--llm_model_name_or_path "Qwen/Qwen2.5-0.5B-Instruct" \
--data_path "HuggingFaceM4/FineVision" \
--sub_sets "coco_colors,densefusion_1m,sharegpt4v(knowledge)" \
--freeze_llm \
--output_dir "./checkpoints/siq_vlm_stage1" \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 4 \
--max_steps 1000 \
--learning_rate 1e-3 \
--bf16
Important: Stage 1 training employs --freeze_llm by default, ensuring that only the projection module parameters are updated during this training phase.
Training Arguments
Model Configuration
- --vision_model_name_or_path: Path or HuggingFace model ID for the vision encoder (default: google/siglip-so400m-patch14-384)
- --llm_model_name_or_path: Path or HuggingFace model ID for the language model (default: Qwen/Qwen2.5-0.5B-Instruct)
- --freeze_llm: Freeze the LLM during training (default: True)
- --no_freeze_llm: Unfreeze the LLM for full fine-tuning
- --pixel_shuffle_factor: Manual pixel shuffle factor (auto-calculated if not specified)
Dataset Configuration
- --data_path: Path to dataset or HuggingFace dataset name (default: HuggingFaceM4/FineVision)
- --sub_sets: Comma-separated list of dataset subsets to use
- --max_samples: Limit dataset size for quick testing
- --num_proc: Number of processes for dataset loading (default: 96)
- --dataloader_num_workers: Number of dataloader workers (default: 4)
Training Hyperparameters
- --per_device_train_batch_size: Batch size per device (default: 8)
- --gradient_accumulation_steps: Gradient accumulation steps (default: 4)
- --max_steps: Maximum training steps (default: 1000)
- --learning_rate: Learning rate (default: 1e-3)
- --bf16: Use bfloat16 precision (default: True, recommended for Qwen)
- --fp16: Use float16 precision (alternative to bf16)
Output Configuration
- --output_dir: Directory to save checkpoints (default: ./checkpoints/siq_vlm_run1)
- --logging_steps: Steps between logging (default: 10)
- --save_steps: Steps between checkpoints (default: 500)
- --project: WandB project name (default: siq_vl_stage_1)
Distributed Training
- --use_distributed: Enable distributed training (auto-detected if multiple GPUs are available)
- --no_distributed: Disable distributed training
Distributed Training
For multi-GPU training, use Accelerate:
accelerate launch \
--dispatch_batches=false \
--split_batches=false \
scripts/train.py \
--freeze_llm \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 4 \
...
Publishing Checkpoints to the Hugging Face Hub
You can optionally publish trained checkpoints to the Hugging Face Hub so others can use the models without retraining.
- Naming convention: Repos are named as siq_vl_{vision_backbone}_{llm_backbone}_{stage}, for example siq_vl_siglip2-base-patch16-224_qwen2.5-0.5b-instruct_stage1.
- Stage inference: The stage suffix (e.g., stage1, stage2) is automatically inferred from your --project name and/or --output_dir.
  - Stage 1 runs launched via scripts/train_stage_1.sh will typically publish as ..._stage1.
  - Stage 2 runs launched via scripts/train_stage_2.sh will typically publish as ..._stage2.
- W&B integration:
  - The Hub commit message includes the W&B run URL (when available).
  - A lightweight Hub git tag of the form wandb-{run_id} is created, whose message contains the W&B run URL.
Example: Publish Stage 1 Model (MacBook quick run)
bash scripts/train_stage_1.sh \
--push_to_hub
This will:
- Train Stage 1 using the MacBook defaults.
- Save the final model under ./checkpoints/siq_vlm_stage1/{vision}__{llm}.
- Create (or reuse) a Hub repo named like siq_vl_siglip2-base-patch16-224_qwen2.5-0.5b-instruct_stage1.
- Upload all files from the final checkpoint directory.
- Add a Hub tag wandb-{run_id} with a message that includes the W&B run URL.
Example: Publish Stage 2 Model (AWS p4d full run)
STAGE=2 bash scripts/train_launch.sh \
--push_to_hub
This will:
- Train Stage 2 (full finetuning) using the AWS p4d defaults.
- Save the final model under ./checkpoints/siq_vlm_stage2/{vision}__{llm}.
- Create (or reuse) a Hub repo named like siq_vl_siglip2-so400m-patch16-512_qwen2.5-1.5b-instruct_stage2.
- Upload all files from the final checkpoint directory.
- Add a Hub tag wandb-{run_id} with a message that includes the W&B run URL.
To override the default repo id (for example to push under an organization), pass:
--hub_model_id your-org/siq_vl_siglip2-base-patch16-224_qwen2.5-0.5b-instruct_stage1.
Project Structure
SiQ_VL/
├── siq_vl/                 # Main package
│   ├── model.py            # SiQ_VLModel and Projector
│   ├── processing.py       # SiQ_VLProcessor for multimodal inputs
│   ├── dataset.py          # VQAIterableDataset for efficient data loading
│   ├── collator.py         # Data collator for batching
│   └── callbacks.py        # Training callbacks (metrics, GPU cleanup)
├── scripts/
│   ├── train.py            # Main training script (Stage 1)
│   └── train_stage_1.sh    # Convenience script for Stage 1 with auto-configuration
│                           # Future: train_stage_2.py, train_stage_3.py, train_rl.py
├── checkpoints/            # Saved model checkpoints
│   └── siq_vlm_stage1/     # Stage 1 checkpoints
└── lmms-eval/              # Evaluation framework (optional)
Development Roadmap
- Stage 1: Projector alignment training (Completed)
- Stage 2: Language model fine-tuning on large-scale VQA datasets
- Stage 3: Supervised fine-tuning with chain-of-thought reasoning
- Stage 4: Reinforcement learning-based training (RLHF/DPO)
- Evaluation scripts and benchmark integration
- Model inference and deployment utilities
Model Specifications
Vision Encoder Specifications
- Model Architecture: SigLIP (SigLIP 2 SO400M or base model variants)
- Training Status: Parameters remain frozen throughout all training stages
- Output Characteristics: Produces vision features with configurable patch size and image resolution settings
Projection Module Specifications
- Architecture Type: Linear projection layer with pixel shuffle operation
- Functional Role: Transforms vision encoder hidden dimensions to match language model embedding dimensions
- Compression Mechanism: Pixel shuffle operation reduces sequence length (e.g., 729 tokens β 81 tokens for 384Γ384 pixel images with shuffle factor of 3)
- Normalization: Layer normalization applied for distribution alignment
Language Model Specifications
- Model Architecture: Qwen2.5 (available in 0.5B, 1.5B, and larger parameter variants)
- Training Status:
- Stage 1: Parameters remain frozen; only projection module is trained
- Stage 2 and subsequent stages: Parameters are unfrozen for full fine-tuning
- Special Token Handling: Utilizes Qwen's native special tokens, including <|image_pad|>, <|vision_start|>, and <|vision_end|>
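For reference, the token layout from Step 3 of the forward-pass walkthrough can be written out by hand. This is an illustration of the placeholder span only, not the processor's actual chat-template code, and it assumes 81 vision tokens after pixel shuffle.

NUM_IMAGE_TOKENS = 81  # vision tokens after pixel shuffle

# Hand-built prompt mirroring the forward-pass walkthrough
# (the real SiQ_VLProcessor builds this through the tokenizer's chat template).
image_span = "<|vision_start|>" + "<|image_pad|>" * NUM_IMAGE_TOKENS + "<|vision_end|>"
prompt = (
    "<|im_start|>user\n"
    + image_span
    + "Describe this image.<|im_end|>\n"
    + "<|im_start|>assistant\n"
)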
Usage Examples
Loading a Stage 1 Checkpoint
The following code demonstrates how to load a trained Stage 1 checkpoint for inference:
from siq_vl.model import SiQ_VLModel
from siq_vl.processing import SiQ_VLProcessor
from transformers import AutoImageProcessor, AutoTokenizer
from PIL import Image
import torch
import json
import os
# Load checkpoint configuration
checkpoint_dir = "./checkpoints/siq_vlm_stage1"
with open(os.path.join(checkpoint_dir, "model_config.json"), "r") as f:
model_config = json.load(f)
# Load processor (saved with the model)
processor = SiQ_VLProcessor.from_pretrained(checkpoint_dir)
# Initialize model with saved configuration
model = SiQ_VLModel(
vision_model_path=model_config["vision_model_path"],
llm_model_path=model_config["llm_model_path"],
freeze_llm=True # Stage 1 uses frozen LLM
)
# Load the trained weights
model.load_state_dict(torch.load(
os.path.join(checkpoint_dir, "pytorch_model.bin"),
map_location="cpu"
))
model.eval()
# Prepare inputs
image = Image.open("path/to/image.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Describe this image."}
]
}
]
# Process and forward
inputs = processor(text=messages, images=image, return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs)
# Generate response (example)
# Note: Full generation code depends on your inference setup
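If the model exposes a HuggingFace-style generate method, decoding might look roughly as follows. Both the generate call and the processor.tokenizer attribute are hypothetical placeholders; the actual SiQ_VLModel inference API may differ.

# Hypothetical generation call; adapt to the actual SiQ_VLModel inference API.
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=128)  # assumes an HF-style generate()
response = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)  # tokenizer attribute assumed
print(response)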
Initializing Model from Base Architectures
The following example demonstrates model initialization from pre-trained base models for Stage 1 training:
model = SiQ_VLModel(
vision_model_path="google/siglip-so400m-patch14-384",
llm_model_path="Qwen/Qwen2.5-0.5B-Instruct",
freeze_llm=True # Stage 1: freeze LLM
)
Training Notes and Recommendations
Stage 1 Training Considerations
- Memory Requirements: Training requires substantial VRAM. For GPUs with 24GB VRAM, recommended batch sizes range from 4-8 with gradient accumulation enabled.
- Numerical Precision: Qwen models exhibit optimal performance with bfloat16 precision. The use of float16 precision is not recommended for Qwen architectures.
- Overfitting Behavior: Vision-language models may exhibit rapid overfitting. Approximately 1000 training steps typically suffice for projector alignment in Stage 1.
- Checkpoint Format: Models are saved in PyTorch format (.bin files) to circumvent potential safetensors compatibility issues.
- Learning Rate Selection: Stage 1 employs a learning rate of 1e-3 for projector alignment. Subsequent stages utilize lower learning rates (1e-5 to 2e-5) for language model fine-tuning.
Multi-Stage Training Considerations
- Progressive Checkpoint Loading: Each training stage builds upon checkpoints from previous stages. Stage 1 checkpoints must be loaded prior to initiating Stage 2 training.
- Parameter Freezing Strategy:
- Stage 1: Vision encoder and language model parameters remain frozen
- Stage 2 and subsequent stages: Only vision encoder parameters remain frozen
- Dataset Progression: Training stages employ increasingly specialized datasets designed to target specific model capabilities.
Contributing
Contributions to this project are welcome. Please submit pull requests for review.
License
This project is licensed under the MIT License:
MIT License
Copyright (c) 2025 SiQ-VL Contributors
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Acknowledgments
This work builds upon the following open-source contributions:
- SigLIP2 (Zhai et al., 2023): Vision encoder architecture implementation [GitHub]
- Qwen2.5 (Qwen Team, 2024): Language model architecture [GitHub]
- HuggingFace Transformers (Wolf et al., 2020): Deep learning framework [GitHub]
- FineVision Dataset (HuggingFace, 2025): open dataset for data-centric training of Vision Language Models [HuggingFace]