syntheticDocQA_government_reports_test
Source: ModelScope community. Updated 2025-12-05. Indexed 2025-06-07.
Download link:
https://modelscope.cn/datasets/vidore/syntheticDocQA_government_reports_test
Resource summary:
## Dataset Description
This dataset is part of a topic-specific retrieval benchmark spanning multiple domains, which evaluates retrieval in more realistic industrial applications.
It contains **Government Reports** documents, which let ViDoRe benchmark retrieval over administrative and legal documents.
### Data Collection
Using a web crawler (described below), we collected 1,000 PDFs from the Internet with the query "government reports". From these documents, we randomly sampled 1,000 pages.
We associated these pages with 100 question-answer pairs generated using Claude-3 Sonnet, a proprietary vision-language model.
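
The page-sampling step can be illustrated with a minimal sketch. It assumes sampling is uniform over all pages and without replacement, which the description above does not specify; `sample_pages`, the `page_counts` mapping, and the seed are hypothetical names introduced only for illustration.

```python
import random


def sample_pages(page_counts, k, seed=0):
    """Draw k distinct (document, page_index) pairs uniformly from all pages.

    page_counts maps a document name to its number of pages (assumed input).
    """
    # Enumerate every page of every document as one flat population.
    all_pages = [(doc, i) for doc, n in page_counts.items() for i in range(n)]
    if k > len(all_pages):
        raise ValueError("cannot sample more pages than exist")
    # A seeded generator keeps the draw reproducible.
    return random.Random(seed).sample(all_pages, k)
```

Sampling over the flattened page list means long documents contribute proportionally more pages than short ones; sampling a document first and then a page would weight documents equally instead.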
**Web Crawler**
We implemented a web crawler to efficiently collect large volumes of documents related to a given topic.
The crawler is seeded with a user-defined query (e.g. "artificial intelligence") and then uses GPT-3.5 Turbo to brainstorm related topics and subtopics.
This query augmentation strategy aims to broaden and deepen the search. GPT-3.5 Turbo is further used to generate diverse search queries from each subtopic.
This query set is then consumed by a pool of parallel workers, each of which fetches the most relevant documents for its queries.
We use [SerpAPI](https://serpapi.com/) along with a filetype filter (PDF documents only) to programmatically scrape Google Search rankings.
Each file is hashed, and the hash is stored in a Bloom filter shared among workers to avoid duplicate documents in the final corpus.
Unique scraped files are downloaded and inserted into a SQLite database along with additional metadata.
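
The deduplication and storage steps described above can be sketched as follows. This is not the authors' crawler: the `BloomFilter` class, the `fetch` stub, the `FAKE_WEB` table, and the `documents` schema are all hypothetical, and the sketch assumes that the hash is computed over the file content (the description does not say whether content or URL is hashed). Search and download calls are replaced by an in-memory stub so the example runs offline.

```python
import hashlib
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)
        self.lock = threading.Lock()  # shared by all workers

    def _positions(self, key):
        # Derive one bit position per hash function by salting SHA-256.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_if_new(self, key):
        """Atomically test and insert; True means the key was not seen before."""
        with self.lock:
            positions = list(self._positions(key))
            seen = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
            for p in positions:
                self.bits[p // 8] |= 1 << (p % 8)
            return not seen


# Hypothetical stand-in for the web: two URLs serve identical bytes.
FAKE_WEB = {
    "https://example.org/a.pdf": b"%PDF report A",
    "https://example.org/b.pdf": b"%PDF report B",
    "https://mirror.example.org/a.pdf": b"%PDF report A",
}


def fetch(url):
    """Stub download; a real crawler would issue an HTTP request here."""
    return FAKE_WEB[url]


def crawl(urls, db_path=":memory:"):
    seen = BloomFilter()
    conn = sqlite3.connect(db_path, check_same_thread=False)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(sha256 TEXT PRIMARY KEY, url TEXT, num_bytes INTEGER)"
    )
    db_lock = threading.Lock()  # serialize writes on the shared connection

    def worker(url):
        content = fetch(url)
        digest = hashlib.sha256(content).hexdigest()
        if not seen.add_if_new(digest.encode()):
            return  # probably a duplicate, skip it
        with db_lock:
            conn.execute(
                "INSERT OR IGNORE INTO documents VALUES (?, ?, ?)",
                (digest, url, len(content)),
            )

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(worker, urls))  # consume results so worker errors surface
    conn.commit()
    return conn
```

Calling `crawl(list(FAKE_WEB))` stores two rows, since the mirror URL serves the same bytes as the first one. A Bloom filter trades a small false-positive rate (a new file occasionally being skipped) for constant memory, which is why the sketch also keeps a primary key on the hash as an exact backstop.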
### Data Curation
As the queries (and answers) are generated using a Vision Language Model, human annotators extensively filtered them for quality and relevance.
### Load the dataset
```python
from datasets import load_dataset
ds = load_dataset("vidore/syntheticDocQA_government_reports_test", split="test")
```
### Dataset Structure
Each dataset instance has the following features:
```yaml
features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_filename
    dtype: string
  - name: answer
    dtype: string
  - name: page
    dtype: string
  - name: model
    dtype: string
  - name: prompt
    dtype: string
  - name: source
    dtype: string
```
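
For retrieval evaluation, the rows can be split into a page corpus and a query-to-page relevance mapping. The sketch below works on any iterable of dict-like rows with the `query`, `image`, and `image_filename` features listed above. It assumes that `image_filename` uniquely identifies a page and that rows without an associated question carry a `None` query; neither is stated in this card. `build_retrieval_views` is a hypothetical helper name.

```python
from collections import defaultdict


def build_retrieval_views(records):
    """Split QA rows into a page corpus and a query -> relevant-pages mapping."""
    corpus = {}               # image_filename -> image, one entry per unique page
    qrels = defaultdict(set)  # query -> set of relevant image_filename values
    for row in records:
        # Keep the first image seen for each page identifier.
        corpus.setdefault(row["image_filename"], row["image"])
        # Rows without a question still contribute a page to the corpus.
        if row["query"] is not None:
            qrels[row["query"]].add(row["image_filename"])
    return corpus, dict(qrels)
```

With the dataset loaded as `ds`, calling `build_retrieval_views(ds)` would return both views; iterating this way decodes every page image, so for large corpora it may be preferable to store row indices instead of images.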
## Citation Information
If you use this dataset in your research, please cite the original dataset as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
title={ColPali: Efficient Document Retrieval with Vision Language Models},
author={Manuel Faysse and Hugues Sibille and Tony Wu and Gautier Viaud and Céline Hudelot and Pierre Colombo},
year={2024},
eprint={2407.01449},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2407.01449},
}
```
Provider: maas
Created: 2025-06-04



