SII-Monument-Valley/CiQi-VQA
Source: Hugging Face | Updated 2026-03-27 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/SII-Monument-Valley/CiQi-VQA
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- question-answering
language:
- en
tags:
- art
- agent
size_categories:
- 10K<n<100K
---
# CiQi-Agent
📖 [Paper](https://arxiv.org/abs/<paper-id>) | 📊 [CiQi-VQA Dataset](https://huggingface.co/datasets/<org>/ciqi-vqa) | 🧪 [CiQi-Bench Benchmark](https://huggingface.co/datasets/<org>/ciqi-bench) | 🤗 [CiQi-Agent Model](https://huggingface.co/<org>/ciqi-agent)
---
**CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains**
---
## 🎯 Overview
We present **CiQi-Agent**, a domain-specific multimodal agent for **antique Chinese porcelain connoisseurship**. It combines **fine-grained visual perception**, **tool-augmented reasoning**, and **cultural-heritage knowledge grounding** to produce explainable porcelain analysis.
During inference, CiQi-Agent can inspect local visual evidence with an **image zoom-in tool**, retrieve visually similar examples with **image retrieval**, and look up relevant domain knowledge with **text retrieval**. These tools ground its connoisseurship analysis in concrete evidence and make it easier to interpret.
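The zoom-in step can be pictured as cropping a region of the input image. The sketch below is only an illustration, not the CiQi-Agent implementation: the function name `clamp_zoom_box` and the pixel-coordinate box format are assumptions made for this example. It normalizes a requested region so that it stays inside the image before any cropping happens.

```python
def clamp_zoom_box(box, image_size):
    """Clamp a requested zoom region to the image bounds.

    box: (left, top, right, bottom) in pixels (hypothetical format).
    image_size: (width, height) in pixels.
    """
    width, height = image_size
    left, top, right, bottom = box
    # Accept corners given in either order.
    left, right = sorted((left, right))
    top, bottom = sorted((top, bottom))
    # Keep every coordinate inside the image.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if right <= left or bottom <= top:
        raise ValueError("zoom region does not overlap the image")
    return (left, top, right, bottom)


# A request that spills past the right edge of a 1024x768 image.
print(clamp_zoom_box((900, 100, 1200, 400), (1024, 768)))  # (900, 100, 1024, 400)
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so it could be handed to an image library directly.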
Alongside the model, we release:
- **CiQi-VQA**, covering **29,596 porcelain specimens**, **51,553 images**, and **557,943 VQA pairs**
- **CiQi-Bench**, built from **775 porcelain specimens**, **878 images**, and **5,425 multiple-choice questions**
On CiQi-Bench, CiQi-Agent achieves **81.5%** average accuracy on multiple-choice evaluation and **66.7%** average score on free-form evaluation, the highest average under both protocols among all open-source and closed-source multimodal baselines listed in the tables below.
## 🤖 Model
- Built on **Qwen2.5-VL-7B-Instruct**
- Uses **tool-augmented reasoning** with zoom-in, image retrieval, and text retrieval
- Trained with a **two-phase supervised fine-tuning + reinforcement learning pipeline**
- Evaluated on **CiQi-Bench** with both **multiple-choice** and **free-form** protocols
- Achieves **81.5%** average accuracy on multiple-choice evaluation, exceeding **GPT-5 by 5.7 points** and the strongest listed open-source baseline **GLM-4.5V (72.6%)** by **8.9 points**
- Achieves **66.7%** average score on free-form evaluation, exceeding **GPT-5 by 18.7 points** and **Qwen2.5-VL-72B-Instruct by 23.7 points**
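The margins quoted above follow directly from the reported averages. As a quick check, the sketch below recomputes them from values copied out of the result tables in this card; the helper name `margin` is introduced only for this example.

```python
# Reported CiQi-Bench averages (%), copied from the tables in this card.
multiple_choice = {"CiQi-Agent": 81.5, "GPT-5": 75.8, "GLM-4.5V": 72.6}
free_form = {
    "CiQi-Agent": 66.7,
    "GPT-5": 48.0,
    "o3": 48.8,
    "Qwen2.5-VL-72B-Instruct": 43.0,
}


def margin(scores, baseline, agent="CiQi-Agent"):
    """Percentage-point gap between the agent and one baseline."""
    return round(scores[agent] - scores[baseline], 1)


for name in ("GPT-5", "GLM-4.5V"):
    print(f"multiple-choice vs {name}: +{margin(multiple_choice, name)}")
for name in ("GPT-5", "o3", "Qwen2.5-VL-72B-Instruct"):
    print(f"free-form vs {name}: +{margin(free_form, name)}")
```

o3 is included because it has the highest free-form average among the listed baselines (48.8%), which gives a margin of 17.9 points.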
## 📊 Model Performance
### Multiple-Choice Accuracy (%) on CiQi-Bench
| Model | Dynasty | Reign | Kiln | Color | Motif | Shape | Naming | Average |
| --------------------------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| GPT-5 | 65.7 | 61.4 | 79.6 | 86.5 | 69.3 | 83.8 | 84.3 | 75.8 |
| GPT-4.1 | 59.3 | 68.3 | 71.1 | 85.0 | 62.2 | 81.8 | 77.9 | 72.2 |
| GPT-4o | 59.1 | 60.4 | 68.6 | 89.2 | 70.1 | 84.2 | 82.1 | 73.4 |
| o3 | 57.6 | 57.4 | 72.2 | 82.6 | 62.4 | 76.8 | 76.6 | 69.4 |
| Qwen2.5-VL-72B-Instruct | 57.6 | 34.7 | 69.2 | 86.7 | 71.7 | 84.1 | 80.3 | 69.2 |
| GLM-4.5V (106B) | 58.3 | 59.4 | 75.8 | 82.3 | 70.4 | 81.8 | 80.6 | 72.6 |
| InternVL3.5-241B-A28B-Flash | 57.1 | 38.6 | 59.5 | 82.1 | 64.8 | 73.9 | 68.5 | 63.5 |
| Kimi-VL-A3B-Instruct (16B) | 59.3 | 22.8 | 48.8 | 84.8 | 59.8 | 77.9 | 70.3 | 60.5 |
| **CiQi-Agent (Ours, 7B)** | **77.6** | **70.3** | **81.8** | **91.4** | **75.7** | **88.1** | **85.2** | **81.5** |
### Free-Form Score (%) on CiQi-Bench
| Model | Dynasty | Reign | Kiln | Color | Motif | Shape | Average |
| --------------------------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| GPT-5 | 39.4 | 32.8 | 42.6 | 74.4 | 35.3 | 63.9 | 48.0 |
| GPT-4.1 | 36.7 | 27.2 | 29.0 | 67.5 | 27.6 | 60.1 | 41.3 |
| GPT-4o | 26.9 | 13.4 | 15.1 | 53.9 | 21.1 | 47.6 | 29.7 |
| o3 | 42.7 | 36.6 | 44.4 | 74.2 | 33.1 | 62.1 | 48.8 |
| Qwen2.5-VL-72B-Instruct | 29.5 | 31.2 | 27.7 | 75.8 | 31.0 | 62.6 | 43.0 |
| GLM-4.5V (106B) | 31.0 | 14.3 | 32.8 | 65.4 | 31.1 | 65.2 | 39.9 |
| InternVL3.5-241B-A28B-Flash | 42.4 | 31.6 | 36.9 | 52.6 | 19.6 | 41.5 | 37.4 |
| Kimi-VL-A3B-Instruct (16B) | 17.3 | 23.7 | 16.2 | 69.5 | 26.5 | 61.3 | 35.7 |
| **CiQi-Agent (Ours, 7B)** | **71.3** | **49.1** | **69.8** | **85.4** | **49.7** | **75.0** | **66.7** |
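The card does not state how the Average column is computed. One plausible reading, which is an assumption here, is an unweighted mean over the six attribute columns. The sketch below measures how far that reading is from the reported free-form averages.

```python
from statistics import mean

# Free-form rows: six attribute scores followed by the reported average.
rows = {
    "GPT-5": (39.4, 32.8, 42.6, 74.4, 35.3, 63.9, 48.0),
    "GPT-4.1": (36.7, 27.2, 29.0, 67.5, 27.6, 60.1, 41.3),
    "GPT-4o": (26.9, 13.4, 15.1, 53.9, 21.1, 47.6, 29.7),
    "o3": (42.7, 36.6, 44.4, 74.2, 33.1, 62.1, 48.8),
    "Qwen2.5-VL-72B-Instruct": (29.5, 31.2, 27.7, 75.8, 31.0, 62.6, 43.0),
    "GLM-4.5V (106B)": (31.0, 14.3, 32.8, 65.4, 31.1, 65.2, 39.9),
    "InternVL3.5-241B-A28B-Flash": (42.4, 31.6, 36.9, 52.6, 19.6, 41.5, 37.4),
    "Kimi-VL-A3B-Instruct (16B)": (17.3, 23.7, 16.2, 69.5, 26.5, 61.3, 35.7),
    "CiQi-Agent (Ours, 7B)": (71.3, 49.1, 69.8, 85.4, 49.7, 75.0, 66.7),
}


def gap_to_reported(values):
    """Absolute gap between the unweighted attribute mean and the reported average."""
    *attributes, reported = values
    return abs(mean(attributes) - reported)


gaps = {model: gap_to_reported(values) for model, values in rows.items()}
for model, gap in gaps.items():
    print(f"{model}: gap = {gap:.3f}")
```

For every row the unweighted mean lands within 0.1 points of the reported average. That is consistent with the assumed reading up to rounding, but it does not confirm how the authors computed the column.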
## 📦 Dataset & Benchmark
### 📊 CiQi-VQA
**CiQi-VQA** is a large-scale dataset for porcelain-centered multimodal training.
- 29,596 porcelain specimens
- 51,553 images
- 557,943 VQA pairs
- 38 dynasties
- 42 reign periods
- 246 glaze color categories
- 248 decorative motif categories
- 158 vessel shape categories
Link: [https://huggingface.co/datasets/<org>/ciqi-vqa](https://huggingface.co/datasets/<org>/ciqi-vqa)
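A minimal loading sketch with the Hugging Face `datasets` library follows. The repository id is taken from the listing at the top of this page. The split names and column layout are not documented in this card, so the helper (a hypothetical name, `load_ciqi_vqa`) only loads the repository and leaves inspection to the caller. It assumes the repository loads with its default configuration.

```python
REPO_ID = "SII-Monument-Valley/CiQi-VQA"  # id shown in the page listing


def load_ciqi_vqa(repo_id: str = REPO_ID):
    """Load the dataset repository and return whatever splits it exposes."""
    # Imported lazily so this module can be read without `datasets` installed.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(repo_id)


# Example (requires network access and the `datasets` package):
#   ds = load_ciqi_vqa()
#   print(ds)  # shows the available splits and their columns
```

Printing the returned object is a reasonable first step, since it reveals the actual split and column names before any training code depends on them.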
### 🧪 CiQi-Bench
**CiQi-Bench** is an expert-aligned benchmark for evaluating porcelain connoisseurship ability.
- 775 porcelain specimens
- 878 images
- 5,425 multiple-choice questions
- Free-form evaluation with attribute-wise scoring
Link: [https://huggingface.co/datasets/<org>/ciqi-bench](https://huggingface.co/datasets/<org>/ciqi-bench)
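Multiple-choice results are reported per attribute, so a scoring script would typically group questions by attribute before computing accuracy. The sketch below shows that grouping under assumed record fields (`attribute`, `answer`, `prediction`); these field names are hypothetical and are not taken from the benchmark files. The free-form scoring rule is not described in this card and is not sketched here.

```python
from collections import defaultdict


def per_attribute_accuracy(records):
    """Return accuracy (%) per attribute for multiple-choice predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        attribute = record["attribute"]
        total[attribute] += 1
        # Compare option letters case-insensitively, ignoring stray spaces.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[attribute] += 1
    return {name: 100.0 * correct[name] / total[name] for name in total}


# Toy records, invented purely to exercise the function.
toy = [
    {"attribute": "dynasty", "answer": "A", "prediction": "a"},
    {"attribute": "dynasty", "answer": "C", "prediction": "B"},
    {"attribute": "kiln", "answer": "D", "prediction": " D "},
]
print(per_attribute_accuracy(toy))  # {'dynasty': 50.0, 'kiln': 100.0}
```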
### 📈 Dataset and Benchmark Statistics
| Split / Resource | Porcelains | Images | VQA Questions | Multiple-Choice Questions | Attributes |
| --------------------- | ---------- | ------ | ------------- | ------------------------- | ----------------------------------------- |
| CiQi-VQA SFT | 28,821 | 50,675 | 557,168 | --- | dynasty, reign, kiln, color, motif, shape |
| CiQi-VQA RL subset | 10,275 | 10,275 | 10,275 | --- | dynasty, reign, kiln, color, motif, shape |
| CiQi-Bench Evaluation | 775 | 878 | 775 | 5,425 | dynasty, reign, kiln, color, motif, shape |
| Total (SFT + CiQi-Bench) | 29,596 | 51,553 | 557,943 | 5,425 | dynasty, reign, kiln, color, motif, shape |
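The Total row can be cross-checked against the other rows. The sketch below, using only numbers copied from the table, shows that the totals equal the SFT row plus the CiQi-Bench row, so the RL subset is not added on top. It also derives two per-specimen ratios.

```python
# (porcelains, images, VQA questions) for each row of the statistics table.
sft = (28_821, 50_675, 557_168)
bench = (775, 878, 775)
reported_total = (29_596, 51_553, 557_943)

# Totals are the element-wise sum of the SFT and benchmark rows.
computed_total = tuple(a + b for a, b in zip(sft, bench))
assert computed_total == reported_total

mc_questions = 5_425
mc_per_specimen = mc_questions / bench[0]
vqa_per_sft_specimen = sft[2] / sft[0]

print(mc_per_specimen)  # 7.0
print(round(vqa_per_sft_specimen, 2))  # average VQA questions per SFT specimen
```

Seven multiple-choice questions per benchmark specimen matches the seven columns of the multiple-choice results table (six attributes plus Naming).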
## 📜 License
Licensing will be specified separately for the model and the dataset.
- **Model license**: TBD
- **Dataset license**: CC BY-NC 4.0
## 🤝 Acknowledgement
We thank **[Verl](https://github.com/volcengine/verl)** for providing an open-source reinforcement learning framework that supports this line of research.
We also thank **[DeepEyes](https://github.com/Visual-Agent/DeepEyes)** for inspiring and informing our exploration of tool-augmented multimodal reasoning.
## 📜 Citation
```bibtex
@article{ciqiagent2026,
title = {CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains},
author = {Anonymous},
journal = {arXiv},
year = {2026},
note = {Preprint, details to be updated}
}
```
Provided by:
SII-Monument-Valley



