Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression
Paper: arXiv:2507.14997
A fine-tuned Qwen2-VL-2B-Instruct model for image aesthetic assessment, built on the RvTC (Regression via Transformer-Based Classification) framework. This checkpoint is trained on image-language inputs, with AVA challenge titles serving as semantic prompts, and demonstrates the importance of meaningful textual context when fine-tuning multimodal models for regression.
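The general idea of casting regression as classification can be sketched as follows. This is a minimal illustration, not the RvTC implementation: it assumes uniform bins over the 1 to 10 AVA rating scale, and the bin count `NUM_BINS` and the helpers `score_to_bin`, `bin_to_score`, and `decode_logits` are hypothetical names chosen for this example.

```python
# Illustrative sketch of regression via classification (not the paper's code).
# Assumption: uniform bins over the AVA rating scale of 1 to 10.

SCORE_MIN, SCORE_MAX = 1.0, 10.0
NUM_BINS = 9  # hypothetical bin count, chosen only for readability


def score_to_bin(score, num_bins=NUM_BINS):
    """Map a continuous score to the index of the bin that contains it."""
    width = (SCORE_MAX - SCORE_MIN) / num_bins
    idx = int((score - SCORE_MIN) / width)
    # Clamp so that SCORE_MAX falls into the last bin rather than out of range.
    return min(max(idx, 0), num_bins - 1)


def bin_to_score(idx, num_bins=NUM_BINS):
    """Map a bin index back to a score, using the bin center."""
    width = (SCORE_MAX - SCORE_MIN) / num_bins
    return SCORE_MIN + (idx + 0.5) * width


def decode_logits(logits, num_bins=NUM_BINS):
    """Turn per-bin classifier outputs into a score via the highest-scoring bin."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return bin_to_score(best, num_bins)
```

During training, the target score would be converted with `score_to_bin` and used as a class label; at inference, `decode_logits` recovers a continuous prediction. The decoded value can differ from the true score by up to half a bin width, so finer bins trade classification difficulty for lower quantization error.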
Evaluated on AVA test set (19,930 samples):
@inproceedings{jennings2025language,
title={Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression},
author={Jennings, Roy H. and Paikin, Genady and Shaul, Roy and Soloveichik, Evgeny},
booktitle={2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2026},
organization={IEEE}
}