syntheticDocQA_energy_test

ModelScope community, updated 2025-12-05, indexed 2025-06-07
Download link:
https://modelscope.cn/datasets/vidore/syntheticDocQA_energy_test
Resource description:
## Dataset Description

This dataset is part of a topic-specific retrieval benchmark spanning multiple domains, which evaluates retrieval in more realistic industrial applications. It includes documents about **Energy**, allowing ViDoRe to benchmark retrieval over technical documentation about energy.

### Data Collection

Using a web crawler (described below), we collected 1,000 PDFs from the Internet with the query "energy". From these documents, we randomly sampled 1,000 pages. We associated these pages with 100 questions and answers generated using Claude-3 Sonnet, a high-quality proprietary vision-language model.

**Web Crawler**

We implemented a web crawler to efficiently collect large volumes of documents related to a given topic. The crawler is seeded with a user-defined query (e.g. "artificial intelligence") and then uses GPT-3.5 Turbo to brainstorm related topics and subtopics. This query augmentation strategy aims to broaden and deepen the search. GPT-3.5 Turbo is further used to generate diverse search queries from each subtopic. This query set is then consumed by a pool of parallel workers that fetch the most relevant documents for each query. We use [SerpAPI](https://serpapi.com/) along with a filetype filter (PDF documents only) to programmatically scrape Google Search rankings. Each file is hashed, and the hash is stored in a Bloom filter shared among workers to avoid duplicate documents in the final corpus. Unique scraped files are downloaded and inserted into a SQLite database along with additional metadata.

### Data Curation

As the queries (and answers) are generated using a Vision Language Model, human annotators extensively filtered them for quality and relevance.
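
The deduplication step of the web crawler described above (hash each file, check a Bloom filter, store unique files in SQLite) can be sketched as follows. This is a minimal single-process illustration, not the authors' implementation: the `BloomFilter` class, the `store_if_new` helper, the `documents` table schema, and the filter sizes are all hypothetical, and the original crawler shares its filter among parallel workers, which this sketch does not attempt.

```python
import hashlib
import sqlite3
from typing import Iterator


class BloomFilter:
    """Tiny Bloom filter (hypothetical): bit positions come from salted SHA-256 digests."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes) -> Iterator[int]:
        # One salted digest per hash function, reduced to a bit index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(seed.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def create_store() -> sqlite3.Connection:
    """Create an in-memory SQLite database with an assumed metadata schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (sha256 TEXT PRIMARY KEY, url TEXT, num_bytes INTEGER)")
    return conn


def store_if_new(conn: sqlite3.Connection, seen: BloomFilter, url: str, content: bytes) -> bool:
    """Insert the file's metadata unless its content hash was probably seen before."""
    file_hash = hashlib.sha256(content).hexdigest()
    key = file_hash.encode("ascii")
    if key in seen:
        return False  # treated as a duplicate (rare false positives are possible)
    seen.add(key)
    conn.execute(
        "INSERT INTO documents (sha256, url, num_bytes) VALUES (?, ?, ?)",
        (file_hash, url, len(content)),
    )
    conn.commit()
    return True
```

A Bloom filter can report a false positive but never a false negative, so a sketch like this may occasionally skip a genuinely new file but will not store the same content twice.
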
### Load the dataset

```python
from datasets import load_dataset

ds = load_dataset("vidore/syntheticDocQA_energy_test", split="test")
```

### Dataset Structure

Here is an example of a dataset instance structure:

```yaml
features:
  - name: query
    dtype: string
  - name: image
    dtype: image
  - name: image_filename
    dtype: string
  - name: answer
    dtype: string
  - name: page
    dtype: string
  - name: model
    dtype: string
  - name: prompt
    dtype: string
  - name: source
    dtype: string
```

## Citation Information

If you use this dataset in your research, please cite the original dataset as follows:

```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
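
For retrieval evaluation, the `query` and `image_filename` fields listed above can be paired into a mapping from each question to its relevant page. The sketch below is illustrative only: `build_qrels` is a hypothetical helper, and it assumes that the page stored in the same row as a query is the relevant page for that query and that rows without a question carry `query = None`; neither assumption is stated in the dataset card.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping, Set


def build_qrels(records: Iterable[Mapping[str, Any]]) -> Dict[str, Set[str]]:
    """Map each query string to the set of page image filenames in its rows.

    Rows whose query is missing are skipped (assumption: such rows are
    corpus-only pages with no associated question).
    """
    qrels: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        query = record.get("query")
        if query is None:
            continue
        qrels[query].add(record["image_filename"])
    return dict(qrels)


# Hypothetical records that mimic the schema; not taken from the dataset.
sample_records = [
    {"query": "What is the installed wind capacity?", "image_filename": "report_a_page_3.jpg"},
    {"query": None, "image_filename": "report_b_page_1.jpg"},
    {"query": "Which fuel has the highest share?", "image_filename": "report_c_page_7.jpg"},
]
sample_qrels = build_qrels(sample_records)
```

With the dataset loaded as `ds`, iterating over a view that contains only the `query` and `image_filename` columns should avoid decoding every page image while building the mapping.
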

Provider: maas
Created: 2025-06-04