Name: SII-Monument-Valley/CiQi-VQA
Creator: SII-Monument-Valley
Published: 2026-03-27 13:31:37
License: cc-by-nc-4.0

Download link:

https://hf-mirror.com/datasets/SII-Monument-Valley/CiQi-VQA

Resource description:

---
license: cc-by-nc-4.0
task_categories:
- question-answering
language:
- en
tags:
- art
- agent
size_categories:
- 10K<n<100K
---

# CiQi-Agent

📖 [Paper](https://arxiv.org/abs/<paper-id>) | 📊 [CiQi-VQA Dataset](https://huggingface.co/datasets/<org>/ciqi-vqa) | 🧪 [CiQi-Bench Benchmark](https://huggingface.co/datasets/<org>/ciqi-bench) | 🤗 [CiQi-Agent Model](https://huggingface.co/<org>/ciqi-agent)

---

**CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains**

---

[Paper Link](https://arxiv.org/abs/<paper-id>)

## 🎯 Overview

We present **CiQi-Agent**, a domain-specific multimodal agent for **antique Chinese porcelain connoisseurship**. The project is designed to combine **fine-grained visual perception**, **tool-augmented reasoning**, and **cultural-heritage knowledge grounding** for explainable porcelain analysis.

CiQi-Agent is built for **tool-augmented multimodal reasoning** on antique Chinese porcelains. During inference, it can inspect local visual evidence with an **image zoom-in tool**, retrieve visually similar examples with **image retrieval**, and access relevant domain knowledge with **text retrieval**, enabling more grounded and interpretable connoisseurship analysis.

Alongside the model, we release:

- **CiQi-VQA**, covering **29,596 porcelain specimens**, **51,553 images**, and **557,943 VQA pairs**
- **CiQi-Bench**, built from **775 porcelain specimens**, **878 images**, and **5,425 multiple-choice questions**

On CiQi-Bench, CiQi-Agent achieves **81.5%** average accuracy on multiple-choice evaluation and **66.7%** average score on free-form evaluation, outperforming strong open-source and closed-source multimodal baselines.
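The three tools named in the overview (zoom-in, image retrieval, text retrieval) can be pictured with a toy Python sketch. Nothing below comes from the CiQi-Agent release: the function names, the normalized bounding-box argument, the cosine and word-overlap ranking rules, and the `TOOLS` registry are all hypothetical stand-ins chosen for illustration, and the real tool-call format may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple


def zoom_in_box(width: int, height: int,
                bbox: Tuple[float, float, float, float]) -> Tuple[int, int, int, int]:
    """Map a normalized (x0, y0, x1, y1) region to a clamped pixel crop box."""
    # Clamp every coordinate to [0, 1] so the crop never leaves the image.
    x0, y0, x1, y1 = (min(max(v, 0.0), 1.0) for v in bbox)
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    return (int(left * width), int(top * height),
            int(right * width), int(bottom * height))


def retrieve_images(query: Sequence[float],
                    gallery: Dict[str, Sequence[float]], k: int = 1) -> List[str]:
    """Rank gallery ids by cosine similarity between feature vectors."""
    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:k]


def retrieve_text(query: str, notes: Dict[str, str], k: int = 1) -> List[str]:
    """Rank note titles by how many query words appear in the note body."""
    words = set(query.lower().split())
    ranked = sorted(notes,
                    key=lambda title: len(words & set(notes[title].lower().split())),
                    reverse=True)
    return ranked[:k]


# Hypothetical registry: the agent is assumed to emit a tool name plus keyword arguments.
TOOLS = {
    "zoom_in": zoom_in_box,
    "image_retrieval": retrieve_images,
    "text_retrieval": retrieve_text,
}


def call_tool(name: str, **kwargs):
    """Dispatch one tool call and return its observation."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


if __name__ == "__main__":
    # Zoom into the lower-middle region of a 1000x800 image; 1.2 is clamped to 1.0.
    print(call_tool("zoom_in", width=1000, height=800, bbox=(0.25, 0.5, 0.75, 1.2)))
    # Toy knowledge notes, invented for this example.
    notes = {"qinghua": "blue and white porcelain cobalt underglaze",
             "celadon": "green glaze longquan kiln"}
    print(call_tool("text_retrieval", query="cobalt blue decoration", notes=notes))
```

In this sketch each tool returns a plain Python value that an agent loop could append to its context as an observation. A real zoom-in tool would crop and re-encode the image, and real retrieval would query learned embeddings rather than toy vectors or word overlap.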
## 🤖 Model

- Built on **Qwen2.5-VL-7B-Instruct**
- Uses **tool-augmented reasoning** with zoom-in, image retrieval, and text retrieval
- Trained with a **two-phase supervised fine-tuning + reinforcement learning pipeline**
- Evaluated on **CiQi-Bench** with both **multiple-choice** and **free-form** protocols
- Achieves **81.5%** average accuracy on multiple-choice evaluation, exceeding **GPT-5 by 5.7 points** and the strongest listed open-source baseline **GLM-4.5V (72.6%)** by **8.9 points**
- Achieves **66.7%** average score on free-form evaluation, exceeding **GPT-5 by 18.7 points** and **Qwen2.5-VL-72B-Instruct by 23.7 points**

## 📊 Model Performance

### Multiple-Choice Accuracy (%) on CiQi-Bench

| Model                       | Dynasty  | Reign    | Kiln     | Color    | Motif    | Shape    | Naming   | Average  |
| --------------------------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| GPT-5                       | 65.7     | 61.4     | 79.6     | 86.5     | 69.3     | 83.8     | 84.3     | 75.8     |
| GPT-4.1                     | 59.3     | 68.3     | 71.1     | 85.0     | 62.2     | 81.8     | 77.9     | 72.2     |
| GPT-4o                      | 59.1     | 60.4     | 68.6     | 89.2     | 70.1     | 84.2     | 82.1     | 73.4     |
| o3                          | 57.6     | 57.4     | 72.2     | 82.6     | 62.4     | 76.8     | 76.6     | 69.4     |
| Qwen2.5-VL-72B-Instruct     | 57.6     | 34.7     | 69.2     | 86.7     | 71.7     | 84.1     | 80.3     | 69.2     |
| GLM-4.5V (106B)             | 58.3     | 59.4     | 75.8     | 82.3     | 70.4     | 81.8     | 80.6     | 72.6     |
| InternVL3.5-241B-A28B-Flash | 57.1     | 38.6     | 59.5     | 82.1     | 64.8     | 73.9     | 68.5     | 63.5     |
| Kimi-VL-A3B-Instruct (16B)  | 59.3     | 22.8     | 48.8     | 84.8     | 59.8     | 77.9     | 70.3     | 60.5     |
| **CiQi-Agent (Ours, 7B)**   | **77.6** | **70.3** | **81.8** | **91.4** | **75.7** | **88.1** | **85.2** | **81.5** |

### Free-Form Score (%) on CiQi-Bench

| Model                       | Dynasty  | Reign    | Kiln     | Color    | Motif    | Shape    | Average  |
| --------------------------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| GPT-5                       | 39.4     | 32.8     | 42.6     | 74.4     | 35.3     | 63.9     | 48.0     |
| GPT-4.1                     | 36.7     | 27.2     | 29.0     | 67.5     | 27.6     | 60.1     | 41.3     |
| GPT-4o                      | 26.9     | 13.4     | 15.1     | 53.9     | 21.1     | 47.6     | 29.7     |
| o3                          | 42.7     | 36.6     | 44.4     | 74.2     | 33.1     | 62.1     | 48.8     |
| Qwen2.5-VL-72B-Instruct     | 29.5     | 31.2     | 27.7     | 75.8     | 31.0     | 62.6     | 43.0     |
| GLM-4.5V (106B)             | 31.0     | 14.3     | 32.8     | 65.4     | 31.1     | 65.2     | 39.9     |
| InternVL3.5-241B-A28B-Flash | 42.4     | 31.6     | 36.9     | 52.6     | 19.6     | 41.5     | 37.4     |
| Kimi-VL-A3B-Instruct (16B)  | 17.3     | 23.7     | 16.2     | 69.5     | 26.5     | 61.3     | 35.7     |
| **CiQi-Agent (Ours, 7B)**   | **71.3** | **49.1** | **69.8** | **85.4** | **49.7** | **75.0** | **66.7** |

## 📦 Dataset & Benchmark

### 📊 CiQi-VQA

**CiQi-VQA** is a large-scale dataset for porcelain-centered multimodal training.

- 29,596 porcelain specimens
- 51,553 images
- 557,943 VQA pairs
- 38 dynasties
- 42 reign periods
- 246 glaze color categories
- 248 decorative motif categories
- 158 vessel shape categories

Link: [https://huggingface.co/datasets/<org>/ciqi-vqa](https://huggingface.co/datasets/<org>/ciqi-vqa)

### 🧪 CiQi-Bench

**CiQi-Bench** is an expert-aligned benchmark for evaluating porcelain connoisseurship ability.

- 775 porcelain specimens
- 878 images
- 5,425 multiple-choice questions
- Free-form evaluation with attribute-wise scoring

Link: [https://huggingface.co/datasets/<org>/ciqi-bench](https://huggingface.co/datasets/<org>/ciqi-bench)

### 📈 Dataset and Benchmark Statistics

| Split / Resource      | Porcelains | Images | VQA Questions | Multiple-Choice Questions | Attributes                                |
| --------------------- | ---------- | ------ | ------------- | ------------------------- | ----------------------------------------- |
| CiQi-VQA SFT          | 28,821     | 50,675 | 557,168       | ---                       | dynasty, reign, kiln, color, motif, shape |
| CiQi-VQA RL subset    | 10,275     | 10,275 | 10,275        | ---                       | dynasty, reign, kiln, color, motif, shape |
| CiQi-Bench Evaluation | 775        | 878    | 775           | 5,425                     | dynasty, reign, kiln, color, motif, shape |
| Total                 | 29,596     | 51,553 | 557,943       | 5,425                     | dynasty, reign, kiln, color, motif, shape |

## 📜 License

Licensing will be specified separately for the model and the dataset.

- **Model license**: TBD
- **Dataset license**: CC BY-NC 4.0

## 🤝 Acknowledgement

We thank **[Verl](https://github.com/volcengine/verl)** for providing an open-source reinforcement learning framework that supports this line of research. We also thank **[DeepEyes](https://github.com/Visual-Agent/DeepEyes)** for inspiring and informing our exploration of tool-augmented multimodal reasoning.

## 📜 Citation

```bibtex
@article{ciqiagent2026,
  title   = {CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains},
  author  = {Anonymous},
  journal = {arXiv},
  year    = {2026},
  note    = {Preprint, details to be updated}
}
```
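
As a quick sanity check on the figures quoted in this card, the short Python sketch below recomputes the "Total" row of the Dataset and Benchmark Statistics table and the point margins listed in the Model section. The numbers are copied from the card. The reading that the total is the SFT row plus the CiQi-Bench row, with the RL subset not added on top, is an inference from the arithmetic and is not stated by the authors. The variable and function names are illustrative only.

```python
# Rows copied from the "Dataset and Benchmark Statistics" table in this card.
sft = {"porcelains": 28_821, "images": 50_675, "vqa": 557_168}
rl_subset = {"porcelains": 10_275, "images": 10_275, "vqa": 10_275}
bench = {"porcelains": 775, "images": 878, "vqa": 775}
reported_total = {"porcelains": 29_596, "images": 51_553, "vqa": 557_943}

# Assumption inferred from the numbers: Total = SFT + Bench (RL subset not added).
computed_total = {key: sft[key] + bench[key] for key in sft}

# The RL subset is no larger than the SFT split in any column, which is
# consistent with (but does not prove) it being drawn from the same pool.
rl_fits_in_sft = all(rl_subset[key] <= sft[key] for key in sft)


def margin(ours: float, baseline: float) -> float:
    """Point difference between two percentage scores, rounded to one decimal."""
    return round(ours - baseline, 1)


# Average scores copied from the two results tables.
margins = {
    "multiple-choice vs GPT-5": margin(81.5, 75.8),
    "multiple-choice vs GLM-4.5V": margin(81.5, 72.6),
    "free-form vs GPT-5": margin(66.7, 48.0),
    "free-form vs Qwen2.5-VL-72B-Instruct": margin(66.7, 43.0),
}

if __name__ == "__main__":
    print("totals match:", computed_total == reported_total)
    print("RL subset fits inside SFT:", rl_fits_in_sft)
    for name, value in margins.items():
        print(f"{name}: {value} points")
```

Under these assumptions the recomputed totals and margins agree with the values reported in the card.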

Application scenarios: