Regression via Transformer-Based Classification (RvTC)

  • Model: Qwen2-VL-2B-With-Titles


Model Description

Fine-tuned Qwen2-VL-2B-Instruct model for image aesthetic assessment using the RvTC (Regression via Transformer-Based Classification) framework. This checkpoint uses image-language training with AVA challenge titles as semantic prompts, demonstrating the importance of meaningful textual context for multimodal models.
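The "challenge title as semantic prompt" idea above can be sketched as a chat-style message that pairs the image with its AVA challenge title. The template text and function name below are illustrative assumptions, not the exact prompt released with the paper:

```python
# Hedged sketch: build a Qwen2-VL-style chat message that pairs an image
# with its AVA challenge title as textual context. The wording of the
# instruction is an assumption for illustration.
def build_prompt(challenge_title: str) -> list:
    return [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text",
             "text": f"Challenge: {challenge_title}. "
                     "Rate the aesthetic quality of this image."},
        ],
    }]
```

A message list in this shape is what `AutoProcessor.apply_chat_template` expects for Qwen2-VL-style models.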

Base Model

  • Architecture: Qwen2-VL-2B-Instruct
  • Source: Qwen/Qwen2-VL-2B-Instruct

Training Configuration

  • Dataset: AVA (Aesthetic Visual Analysis)
  • Training Mode: Image-language (with challenge title prompts)
  • Epochs: 3
  • Learning Rate: 1e-5
  • Batch Size: 128 (training)
  • Optimizer: AdamW with cosine scheduler
  • Warmup Ratio: 0.03
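The hyperparameters above map onto standard `transformers` training arguments roughly as follows. This is a hedged sketch, not the actual training script (the `output_dir` value is a placeholder):

```python
from transformers import TrainingArguments

# Approximate mapping of the listed hyperparameters onto standard
# TrainingArguments fields; the paper's actual training script is not
# part of this card.
args = TrainingArguments(
    output_dir="rvtc-qwen2vl-2b",       # placeholder path
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=128,    # effective batch size 128 on one device
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
)
```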

Binning Configuration

  • Number of Bins: 51
  • Value Range: [1.81, 8.60] (min and max scores of the training set)
  • Method: Uniform binning for regression via classification
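A minimal sketch of this binning scheme (the function names are mine, not from the released code): 51 uniform bins over the training-set range, with bin centers serving both as classification targets and for decoding a continuous score from the predicted class distribution.

```python
import numpy as np

# 51 uniform bins over the training-set score range, as configured above.
NUM_BINS = 51
LO, HI = 1.81, 8.60

edges = np.linspace(LO, HI, NUM_BINS + 1)
centers = (edges[:-1] + edges[1:]) / 2  # one center per bin

def score_to_bin(score: float) -> int:
    """Index of the uniform bin containing `score` (clipped to the range)."""
    idx = int(np.searchsorted(edges, np.clip(score, LO, HI), side="right")) - 1
    return min(max(idx, 0), NUM_BINS - 1)

def bins_to_score(probs: np.ndarray) -> float:
    """Decode a score as the probability-weighted mean of bin centers."""
    return float(probs @ centers)
```

During training, each ground-truth mean score becomes a class label via `score_to_bin`; at inference, the softmax over the 51 classes is collapsed back to a scalar with `bins_to_score`.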

Performance

Evaluated on AVA test set (19,930 samples):

  • Pearson Correlation (PLCC): 0.908
  • Spearman Correlation (SRCC): 0.906
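PLCC and SRCC are the standard Pearson and Spearman correlations between predicted and ground-truth mean scores. A self-contained sketch (in practice `scipy.stats.pearsonr`/`spearmanr` would be used; this version omits tie handling in the rank transform):

```python
import numpy as np

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float(xm @ ym / np.sqrt((xm @ xm) * (ym @ ym)))

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson on ranks (no tie handling)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(x)), rank(np.asarray(y)))
```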

Citation

@inproceedings{jennings2025language,
  title={Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression},
  author={Jennings, Roy H. and Paikin, Genady and Shaul, Roy and Soloveichik, Evgeny},
  booktitle={2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026},
  organization={IEEE}
}