syntheticDocQA_healthcare_industry_test
Hosted on the ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-06-07.
Download link: https://modelscope.cn/datasets/vidore/syntheticDocQA_healthcare_industry_test
## Dataset Description
This dataset is part of a topic-specific retrieval benchmark spanning multiple domains, which evaluates retrieval in more realistic industrial applications.
It includes documents about the **Healthcare Industry**, allowing ViDoRe to benchmark retrieval on medical documents.
### Data Collection
Using a crawler (see below), we collected 1,000 PDFs from the Internet with the query "healthcare industry". From these documents, we randomly sampled 1,000 pages.
We associated these with 100 questions and answers generated using Claude-3 Sonnet, a high-quality proprietary vision-language model.
**Web Crawler**
We implemented a web crawler to efficiently collect large volumes of documents related to a given topic.
The crawler is seeded with a user-defined query (e.g. "artificial intelligence") and then uses GPT-3.5 Turbo to brainstorm related topics and subtopics.
This query augmentation strategy aims to broaden and deepen the search. GPT-3.5 Turbo is further used to generate diverse search queries from each subtopic.
This query set is then consumed by a pool of parallel workers, each of which fetches the most relevant documents for its queries.
We use [SerpAPI](https://serpapi.com/) along with a filetype filter (PDF documents only) to programmatically scrape Google Search rankings.
Each file is hashed and stored in a Bloom filter shared among workers to avoid duplicate documents in the final corpus.
Unique scraped files are downloaded and inserted into a SQLite database along with additional metadata.
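
The deduplication and storage steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' crawler: the `BloomFilter` class, the `build_pdf_query` and `store_if_new` helpers, the `documents` table schema, and the example URLs are all hypothetical, and the SerpAPI call and the parallel worker pool are left out. The only detail taken from outside the text is Google's `filetype:pdf` search operator.

```python
import hashlib
import sqlite3


def build_pdf_query(subtopic_query: str) -> str:
    """Append Google's filetype operator so that only PDF results match."""
    return f"{subtopic_query} filetype:pdf"


class BloomFilter:
    """Tiny Bloom filter: k salted hash positions over a fixed-size bit array."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive each position by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def store_if_new(conn, seen, url: str, content: bytes) -> bool:
    """Insert the file unless its content hash was (probably) seen before."""
    file_hash = hashlib.sha256(content).hexdigest()
    key = file_hash.encode("ascii")
    if key in seen:
        # Probable duplicate; a Bloom filter can report false positives.
        return False
    seen.add(key)
    conn.execute(
        "INSERT INTO documents (sha256, url, num_bytes) VALUES (?, ?, ?)",
        (file_hash, url, len(content)),
    )
    return True


# Hypothetical schema: the source only says files are stored with metadata.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (sha256 TEXT PRIMARY KEY, url TEXT, num_bytes INTEGER)"
)
seen = BloomFilter()

print(build_pdf_query("healthcare industry"))
print(store_if_new(conn, seen, "https://example.com/a.pdf", b"%PDF-1.4 a"))
print(store_if_new(conn, seen, "https://example.com/mirror.pdf", b"%PDF-1.4 a"))
```

In a real multi-worker setup the shared filter would need a lock or a process-safe store, and the occasional false positive means a small number of unique files could be skipped.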
### Data Curation
As the queries (and answers) are generated using a Vision Language Model, human annotators extensively filtered them for quality and relevance.
### Load the dataset
```python
from datasets import load_dataset
ds = load_dataset("vidore/syntheticDocQA_healthcare_industry_test", split="test")
```
### Dataset Structure
Here is an example of a dataset instance structure:
```yaml
features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_filename
    dtype: string
  - name: answer
    dtype: string
  - name: page
    dtype: string
  - name: model
    dtype: string
  - name: prompt
    dtype: string
  - name: source
    dtype: string
```
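
As an illustration of how these fields might be used in a retrieval evaluation, the sketch below builds a mapping from each query to the page image that answers it. The `build_qrels` helper and the toy rows are hypothetical and invented for this example. Skipping rows with an empty `query` is also an assumption about how pages without an associated question are stored, and should be checked against the actual data.

```python
from typing import Any, Dict, Iterable, List


def build_qrels(rows: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each non-empty query to the image filenames that answer it."""
    qrels: Dict[str, List[str]] = {}
    for row in rows:
        query = row.get("query")
        if not query:
            # Assumption: pages without a question carry an empty or missing query.
            continue
        qrels.setdefault(query, []).append(row["image_filename"])
    return qrels


# Toy rows that mimic the schema above; the values are invented.
example_rows = [
    {"query": "What does the chart show?", "image_filename": "doc_a_page_3.jpg"},
    {"query": None, "image_filename": "doc_b_page_1.jpg"},
    {"query": "Who published the report?", "image_filename": "doc_c_page_7.jpg"},
]

qrels = build_qrels(example_rows)
print(len(qrels))
```

Because iterating a `datasets.Dataset` yields one dict per row, passing the `ds` object from the loading snippet to `build_qrels` should work the same way.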
## Citation Information
If you use this dataset in your research, please cite the original dataset as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
Provider: maas
Created: 2025-06-04
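## Summary
This dataset is a synthetic document question-answering test set for the healthcare industry. It is part of a topic-specific, multi-domain retrieval benchmark used to evaluate retrieval performance in industrial applications. It contains 1,000 pages sampled from 1,000 PDF documents crawled from the Internet, paired with 100 question-answer pairs generated by Claude-3 Sonnet and filtered by human annotators for quality.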



