syntheticDocQA_artificial_intelligence_test
Source: ModelScope community. Updated 2025-12-05; indexed 2025-06-07.
Download link:
https://modelscope.cn/datasets/vidore/syntheticDocQA_artificial_intelligence_test
## Dataset Description
This dataset is part of a topic-specific retrieval benchmark spanning multiple domains, which evaluates retrieval in more realistic industrial applications.
It includes documents about **Artificial Intelligence**.
### Data Collection
Using a web crawler (see below), we collected 1,000 PDFs from the Internet with the query "artificial intelligence". From these documents, we randomly sampled 1,000 pages.
We associated these pages with 100 questions and answers generated using Claude-3 Sonnet, a high-quality proprietary vision-language model.
**Web Crawler**
We implemented a web crawler to efficiently collect large volumes of documents related to a given topic.
The crawler is seeded with a user-defined query (e.g. "artificial intelligence") and then uses GPT-3.5 Turbo to brainstorm related topics and subtopics.
This query augmentation strategy aims to broaden and deepen the search. GPT-3.5 Turbo is further used to generate diverse search queries from each subtopic.
This query set is then consumed by a pool of parallel workers, each of which fetches the most relevant documents for its queries.
We use [SerpAPI](https://serpapi.com/) along with a filetype filter (PDF documents only) to programmatically scrape Google Search rankings.
Each file is hashed, and the hash is stored in a Bloom filter shared among workers, to avoid duplicate documents in the final corpus.
Unique scraped files are downloaded and inserted into a SQLite database along with additional metadata.
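
The deduplication and storage steps described above can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the `BloomFilter` class, the `open_store` and `store_if_new` helpers, the `documents` table layout, and the choice to hash file contents with SHA-256 are all assumptions made for the example. It runs in a single process; sharing the filter among parallel workers would additionally need a lock or a shared-memory structure.

```python
import hashlib
import sqlite3


class BloomFilter:
    """Fixed-size Bloom filter; bit positions come from double hashing a SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the SQLite database holding file metadata (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "sha256 TEXT PRIMARY KEY, url TEXT NOT NULL, num_bytes INTEGER NOT NULL)"
    )
    return conn


def store_if_new(conn: sqlite3.Connection, seen: BloomFilter, url: str, content: bytes) -> bool:
    """Record the file unless its content hash has probably been seen already."""
    file_hash = hashlib.sha256(content).hexdigest()
    key = file_hash.encode("ascii")
    if key in seen:
        # Bloom filters can return false positives, so a small fraction of
        # unique files may be skipped; they never miss a true duplicate.
        return False
    seen.add(key)
    conn.execute(
        "INSERT INTO documents (sha256, url, num_bytes) VALUES (?, ?, ?)",
        (file_hash, url, len(content)),
    )
    conn.commit()
    return True


conn = open_store(":memory:")
seen = BloomFilter()
print(store_if_new(conn, seen, "https://example.com/a.pdf", b"%PDF-1.4 first"))     # new content
print(store_if_new(conn, seen, "https://example.com/copy.pdf", b"%PDF-1.4 first"))  # same bytes again
```

The trade-off of a Bloom filter is constant memory and fast membership checks at the cost of occasional false positives, which here would mean dropping a file that was in fact unique.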
### Data Curation
As the queries (and answers) are generated using a Vision Language Model, human annotators extensively filtered them for quality and relevance.
### Load the dataset
```python
from datasets import load_dataset
ds = load_dataset("vidore/syntheticDocQA_artificial_intelligence_test", split="test")
```
### Dataset Structure
Here is an example of a dataset instance structure:
```yaml
features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_filename
    dtype: string
  - name: answer
    dtype: string
  - name: page
    dtype: string
  - name: model
    dtype: string
  - name: prompt
    dtype: string
  - name: source
    dtype: string
```
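
For retrieval evaluation, these fields can be turned into a mapping from each query to the page image it should retrieve. The sketch below assumes that each row pairs a `query` with the `image_filename` of its relevant page, and that rows without a query (if any) carry `None`; both are assumptions about the data, and `EXPECTED_FIELDS`, `missing_fields`, and `build_qrels` are hypothetical helper names. It works on plain dictionaries, so it can be tried without downloading the dataset.

```python
from collections import defaultdict

# Field names taken from the feature list above.
EXPECTED_FIELDS = (
    "query", "image", "image_filename", "answer",
    "page", "model", "prompt", "source",
)


def missing_fields(record: dict) -> list:
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def build_qrels(records) -> dict:
    """Map each query string to the set of image filenames annotated as relevant to it."""
    qrels = defaultdict(set)
    for record in records:
        query = record.get("query")
        if query is None:
            # Defensive assumption: skip corpus-only rows that have no query.
            continue
        qrels[query].add(record["image_filename"])
    return dict(qrels)


# Hand-written placeholder rows, not real dataset content.
example_rows = [
    {"query": "example question 1", "image_filename": "doc_a_page_3.jpg"},
    {"query": None, "image_filename": "doc_b_page_1.jpg"},
]
print(build_qrels(example_rows))
```

With the real dataset, the same function could be called on the object returned by `load_dataset`, since iterating it yields one dictionary per row.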
## Citation Information
If you use this dataset in your research, please cite the original dataset as follows:
```latex
@misc{faysse2024colpaliefficientdocumentretrieval,
title={ColPali: Efficient Document Retrieval with Vision Language Models},
author={Manuel Faysse and Hugues Sibille and Tony Wu and Gautier Viaud and Céline Hudelot and Pierre Colombo},
year={2024},
eprint={2407.01449},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2407.01449},
}
```
Provider:
maas
Created:
2025-06-04



