Regression via Transformer-Based Classification (RvTC)

  • Model: Qwen2-VL-2B-With-Titles


Model Description

Fine-tuned Qwen2-VL-2B-Instruct model for image aesthetic assessment using the RvTC (Regression via Transformer-Based Classification) framework. This checkpoint uses image-language training with AVA challenge titles as semantic prompts, demonstrating the importance of meaningful textual context for multimodal models.
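The "challenge title as semantic prompt" idea above can be sketched as a chat-style message that pairs the image with its AVA challenge title. The template text and function name below are illustrative assumptions, not the exact prompt released with the paper:

```python
# Hedged sketch: build a Qwen2-VL-style chat message that pairs an image
# with its AVA challenge title as textual context. The wording of the
# instruction is an assumption for illustration.
def build_prompt(challenge_title: str) -> list:
    return [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text",
             "text": f"Challenge: {challenge_title}. "
                     "Rate the aesthetic quality of this image."},
        ],
    }]
```

A message list in this shape is what `AutoProcessor.apply_chat_template` expects for Qwen2-VL-style models.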

Base Model

  • Architecture: Qwen2-VL-2B-Instruct
  • Source: Qwen/Qwen2-VL-2B-Instruct

Training Configuration

  • Dataset: AVA (Aesthetic Visual Analysis)
  • Training Mode: Image-language (with challenge title prompts)
  • Epochs: 3
  • Learning Rate: 1e-5
  • Batch Size: 128 (training)
  • Optimizer: AdamW with cosine scheduler
  • Warmup Ratio: 0.03
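The hyperparameters above map onto standard `transformers` training arguments roughly as follows. This is a hedged sketch, not the actual training script (the `output_dir` value is a placeholder):

```python
from transformers import TrainingArguments

# Approximate mapping of the listed hyperparameters onto standard
# TrainingArguments fields; the paper's actual training script is not
# part of this card.
args = TrainingArguments(
    output_dir="rvtc-qwen2vl-2b",       # placeholder path
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=128,    # effective batch size 128 on one device
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
)
```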

Binning Configuration

  • Number of Bins: 51
  • Value Range: [1.81, 8.60] (min and max scores of the training set)
  • Method: Uniform binning for regression via classification
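A minimal sketch of this binning scheme (the function names are mine, not from the released code): 51 uniform bins over the training-set range, with bin centers serving both as classification targets and for decoding a continuous score from the predicted class distribution.

```python
import numpy as np

# 51 uniform bins over the training-set score range, as configured above.
NUM_BINS = 51
LO, HI = 1.81, 8.60

edges = np.linspace(LO, HI, NUM_BINS + 1)
centers = (edges[:-1] + edges[1:]) / 2  # one center per bin

def score_to_bin(score: float) -> int:
    """Index of the uniform bin containing `score` (clipped to the range)."""
    idx = int(np.searchsorted(edges, np.clip(score, LO, HI), side="right")) - 1
    return min(max(idx, 0), NUM_BINS - 1)

def bins_to_score(probs: np.ndarray) -> float:
    """Decode a score as the probability-weighted mean of bin centers."""
    return float(probs @ centers)
```

During training, each ground-truth mean score becomes a class label via `score_to_bin`; at inference, the softmax over the 51 classes is collapsed back to a scalar with `bins_to_score`.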

Performance

Evaluated on AVA test set (19,930 samples):

  • Pearson Correlation (PLCC): 0.908
  • Spearman Correlation (SRCC): 0.906
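PLCC and SRCC are the standard Pearson and Spearman correlations between predicted and ground-truth mean scores. A self-contained sketch (in practice `scipy.stats.pearsonr`/`spearmanr` would be used; this version omits tie handling in the rank transform):

```python
import numpy as np

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float(xm @ ym / np.sqrt((xm @ xm) * (ym @ ym)))

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson on ranks (no tie handling)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(x)), rank(np.asarray(y)))
```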

Citation

@inproceedings{jennings2025language,
  title={Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression},
  author={Jennings, Roy H. and Paikin, Genady and Shaul, Roy and Soloveichik, Evgeny},
  booktitle={2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026},
  organization={IEEE}
}