syntheticDocQA_healthcare_industry_test

Name: syntheticDocQA_healthcare_industry_test
Creator: maas
Published: 2025-12-05 16:37:18
License: no description available

ModelScope community. Updated 2025-12-05; indexed 2025-06-07.

Download link:

https://modelscope.cn/datasets/vidore/syntheticDocQA_healthcare_industry_test


Resource description:

## Dataset Description

This dataset is part of a topic-specific retrieval benchmark spanning multiple domains, which evaluates retrieval in more realistic industrial applications. It includes documents about the **Healthcare Industry** that allow ViDoRe to benchmark retrieval on medical documents.

### Data Collection

Using a crawler (described below), we collected 1,000 PDFs from the Internet with the query "healthcare industry". From these documents, we randomly sampled 1,000 pages. We associated these pages with 100 questions and answers generated using Claude-3 Sonnet, a high-quality proprietary vision-language model.

**Web Crawler**

We implemented a web crawler to efficiently collect large volumes of documents related to a given topic. The crawler is seeded with a user-defined query (e.g. "artificial intelligence") and then uses GPT-3.5 Turbo to brainstorm related topics and subtopics. This query augmentation strategy aims to broaden and deepen the search. GPT-3.5 Turbo is further used to generate diverse search queries from each subtopic. This query set is then consumed by a pool of parallel workers whose job is to fetch the most relevant documents for each query. We use [SerpAPI](https://serpapi.com/) along with a filetype filter (PDF documents only) to programmatically scrape Google Search rankings. Each file is hashed and stored in a Bloom filter shared among workers to avoid duplicate documents in the final corpus. Unique scraped files are downloaded and inserted into a SQLite database along with additional metadata.

### Data Curation

As the queries (and answers) are generated using a Vision Language Model, human annotators extensively filtered them for quality and relevance.
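The deduplication step described for the web crawler (hash each file, check a shared Bloom filter, store unique files in SQLite) can be sketched roughly as follows. This is a minimal single-process illustration, not the authors' crawler: the `BloomFilter` class, the `store_if_new` helper, the `documents` table and its columns, and the filter sizes are all assumptions made for the example.

```python
import hashlib
import sqlite3


class BloomFilter:
    """Tiny Bloom filter: each item sets num_hashes bits in a num_bits array."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive one bit position per hash index by salting SHA-256.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def store_if_new(conn: sqlite3.Connection, seen: BloomFilter, url: str, content: bytes) -> bool:
    """Insert the file's metadata unless its content hash was (probably) seen."""
    file_hash = hashlib.sha256(content).digest()
    if file_hash in seen:
        return False  # Bloom filters can give rare false positives
    seen.add(file_hash)
    conn.execute(
        "INSERT INTO documents (url, sha256, num_bytes) VALUES (?, ?, ?)",
        (url, file_hash.hex(), len(content)),
    )
    conn.commit()
    return True


# Hypothetical schema: the source only says files are stored "with metadata".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (url TEXT, sha256 TEXT, num_bytes INTEGER)")
seen = BloomFilter()
```

Because the filter is keyed on the content hash rather than the URL, the same PDF served from two different addresses is stored once. A Bloom filter never misses a true duplicate but may occasionally reject a new file, which is usually an acceptable trade-off for a crawler.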
### Load the dataset

```python
from datasets import load_dataset

ds = load_dataset("vidore/syntheticDocQA_healthcare_industry_test", split="test")
```

### Dataset Structure

Here is an example of a dataset instance structure:

```yaml
features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_filename
    dtype: string
  - name: answer
    dtype: string
  - name: page
    dtype: string
  - name: model
    dtype: string
  - name: prompt
    dtype: string
  - name: source
    dtype: string
```

## Citation Information

If you use this dataset in your research, please cite the original dataset as follows:

```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
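As a small illustration of how the `query` and `image_filename` fields listed under Dataset Structure might be used for retrieval evaluation, the sketch below builds a mapping from each question to the page it was generated from. It assumes that rows for pages without a generated question carry an empty or missing `query`; that is a guess about the data layout, not something the dataset card states. The `build_qrels` helper and the toy rows are hypothetical.

```python
from typing import Dict, Iterable, Mapping, Set


def build_qrels(records: Iterable[Mapping[str, object]]) -> Dict[str, Set[str]]:
    """Map each query string to the set of page filenames it is paired with."""
    qrels: Dict[str, Set[str]] = {}
    for row in records:
        query = row.get("query")
        if not query:
            # Assumption: pages without a generated question have no query.
            continue
        qrels.setdefault(str(query), set()).add(str(row["image_filename"]))
    return qrels


# Made-up rows that only mimic the feature names; not real dataset content.
toy_rows = [
    {"query": "What is shown on the chart?", "image_filename": "doc_a_page_3.jpg"},
    {"query": None, "image_filename": "doc_a_page_4.jpg"},
    {"query": "Which year is reported?", "image_filename": "doc_b_page_1.jpg"},
]

qrels = build_qrels(toy_rows)
```

With the real dataset, the same function can be called on `ds` directly, since iterating over a `datasets.Dataset` yields one dictionary per row. Selecting only the two text columns first (for example with `ds.select_columns(["query", "image_filename"])`) should avoid decoding every page image.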


Provider:

maas

Created:

2025-06-04
