DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset
Abstract
DanQing, a large-scale Chinese image-text dataset, is introduced to advance vision-language pre-training; continual pre-training of the SigLIP2 model on DanQing yields superior performance across various Chinese downstream tasks.
Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pre-training. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pre-training has lagged substantially behind, owing to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. The result is DanQing, a dataset of 100 million image-text pairs collected from Common Crawl. Unlike existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is built primarily from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continually pre-training the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.
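For readers who want to reproduce the general evaluation setting described in the abstract, the snippet below is a minimal sketch of SigLIP-style zero-shot classification with Chinese prompts. The checkpoint name `google/siglip2-base-patch16-224` and the example labels are illustrative assumptions; the continually pre-trained DanQing weights themselves are not linked on this page.

```python
# Minimal sketch: SigLIP-style zero-shot classification with Chinese labels.
# The checkpoint below is an assumed public stand-in, not the paper's model.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name = "google/siglip2-base-patch16-224"  # assumed public checkpoint
model = AutoModel.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

image = Image.open("example.jpg")               # any local test image
labels = ["一只猫", "一只狗", "一辆汽车"]         # candidate Chinese class names

inputs = processor(text=labels, images=image,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP-family models score each (image, text) pair independently with a
# sigmoid rather than a softmax over classes.
probs = torch.sigmoid(outputs.logits_per_image)
print(dict(zip(labels, probs[0].tolist())))
```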
Community
This is a strong and timely contribution to Chinese vision-language pre-training research. DanQing directly addresses the long-standing bottleneck of high-quality Chinese image–text data, and the scale combined with rigorous filtering clearly sets it apart from existing resources. Building the dataset primarily from 2024–2025 web data is especially valuable, as it allows models to better reflect evolving language usage and real-world semantics.
HuggingFace Dataset: https://huggingface.co/datasets/DeepGlint-AI/DanQing100M
ModelScope Dataset: https://www.modelscope.cn/datasets/deepglint/DanQing
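As a quick-start illustration, the sketch below streams a few samples from the Hugging Face repo linked above. The repo id is taken from that URL; the split name and column names are assumptions not confirmed on this page.

```python
# Minimal sketch: stream DanQing from the Hugging Face Hub without downloading
# the full 100M-pair set. Split and column names are assumed, not confirmed.
from datasets import load_dataset

ds = load_dataset("DeepGlint-AI/DanQing100M", split="train", streaming=True)

for example in ds.take(3):
    # Inspect a few image-text pairs to check the schema.
    print(example.keys())
```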
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MulCLIP: A Multi-level Alignment Framework for Enhancing Fine-grained Long-context CLIP (2025)
- SuperCLIP: CLIP with Simple Classification Supervision (2025)
- FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding (2025)
- $\beta$-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment (2025)
- Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation (2025)
- Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval (2025)
- ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend