
vidore_v3_industrial

ModelScope community; updated 2026-01-02, indexed 2025-12-27
Download link:
https://modelscope.cn/datasets/vidore/vidore_v3_industrial
<center><h1>ViDoRe V3: Industrial reports</h1></center>

This dataset, `Industrial reports`, is a corpus of `technical documents` on military aircraft (fueling, mechanics...), intended for complex-document understanding tasks. It is one of the 10 corpora comprising the **ViDoRe v3 Benchmark**.

## About ViDoRe v3

ViDoRe V3 is our latest benchmark for RAG evaluation on visually-rich documents from real-world applications. It features 10 datasets with, in total, 26,000 pages and 3099 queries, translated into 6 languages. Each query comes with human-verified relevant pages, bounding box annotations for key elements, and a comprehensive combined answer from human annotations.

## Links

* **Homepage:** [https://huggingface.co/vidore](https://huggingface.co/vidore)
* **Collection:** [https://hf.co/collections/vidore/vidore-benchmark-v3](https://hf.co/collections/vidore/vidore-benchmark-v3)
* **Blogpost:** [https://huggingface.co/blog/QuentinJG/introducing-vidore-v3](https://huggingface.co/blog/QuentinJG/introducing-vidore-v3)
* **Leaderboard:** To come...

### Dataset Summary

Here is a description of the specific dataset (`Industrial reports`):

- Description: Consists of technical military documents on aircraft
- Language: en
- Domain: Industrial
- Document Types: Reports

### Dataset Statistics

- Total Documents: 27
- Total Pages: 5244
- Total Queries: 1698
- Queries without counting translations: 283
- Average number of pages per query: 1.8

### Languages

The documents in this dataset are in English.
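As a quick sanity check on the statistics above, the following sketch relates the total query count to the count without translations, and derives the average document length. It assumes every base query is translated into the same number of languages; the variable names are illustrative and not part of the dataset.

```python
# Figures copied from the Dataset Statistics list.
total_queries = 1698
base_queries = 283      # queries without counting translations
total_pages = 5244
total_documents = 27

# Assumption: each base query appears once per language.
languages_per_query = total_queries / base_queries
avg_pages_per_document = total_pages / total_documents

print(languages_per_query)               # 6.0
print(round(avg_pages_per_document, 1))  # 194.2
```

The ratio of 6 matches the 6 languages mentioned in the benchmark description.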
### Queries type

![military_technical_reports_query_types](https://cdn-uploads.huggingface.co/production/uploads/66e16a677c2eb2da5109fb5c/SRSAyB0YrTKfGUXevH6D3.png)

### Queries format

![military_technical_reports_query_formats](https://cdn-uploads.huggingface.co/production/uploads/66e16a677c2eb2da5109fb5c/_c-_DEWgdew-7JWoq9eFA.png)

### Content type

![military_technical_reports_content_types](https://cdn-uploads.huggingface.co/production/uploads/66e16a677c2eb2da5109fb5c/3Hg4PEb8aSihpJYyQvui4.png)

## Dataset Structure

### 1. Corpus

Contains the full collection of documents to be searched.

Data instance of a single item from the corpus subset:

```json
{
  "corpus_id": <int>,
  "image": <PIL.Image>,
  "doc_id": <str>,
  "markdown": <str>,
  "page_number_in_doc": <int>
}
```

- **corpus_id** `<int>`: A unique numerical identifier for the corresponding corpus document.
- **image** `<PIL.Image>`: The page image.
- **doc_id** `<str>`: Name of the document from which the image was extracted.
- **markdown** `<str>`: Text extracted from the image using an OCR pipeline.
- **page_number_in_doc** `<int>`: Original page number inside the document.

### 2. Queries

Contains the set of questions or search queries.

Data instance of a single item from the queries subset:

```json
{
  "query_id": <int>,
  "query": <str>,
  "language": <str>,
  "query_types": <List[str]>,
  "query_format": <str>,
  "content_type": <str>,
  "raw_answers": <List[str]>,
  "query_generator": <str>,
  "query_generation_pipeline": <str>,
  "source_type": <str>,
  "query_type_for_generation": <str>,
  "answer": <str>
}
```

- **query_id** `<int>`: A unique numerical identifier for the query.
- **query** `<str>`: The actual text of the search question or statement used for retrieval.
- **language** `<str>`: The language of the query text.
- **query_types** `<List[str]>`: A list of categories or labels describing the query's intent.
- **query_format** `<str>`: The syntactic format of the query ("instruction", "keyword" or "question").
- **content_type** `<str>`: The type of visual content present in the images relevant to the query.
- **raw_answers** `<List[str]>`: A list of reference answers written by human annotators.
- **query_generator** `<str>`: The source or method used to create the query ("human" or "sdg").
- **query_generation_pipeline** `<str>`: Type of SDG pipeline used to create the query (if it was not written by humans).
- **source_type** `<str>`: "summary" or "image"; metadata about the type of information used by the annotation pipeline to create the query.
- **query_type_for_generation** `<str>`: The specific type requested when the query was generated.
- **answer** `<str>`: The answer extracted from the source documents, merged from human annotations using an LLM.

### 3. Qrels

Maps queries to their corresponding relevant documents.

Data instance of a single item from the qrels subset:

```json
{
  "query_id": <int>,
  "corpus_id": <int>,
  "score": <int>,
  "content_type": <str>,
  "bounding_boxes": <List[Tuple[int]]>
}
```

- **query_id** `<int>`: A unique numerical identifier for the query.
- **corpus_id** `<int>`: A unique numerical identifier for the corresponding corpus document.
- **score** `<int>`: Relevance score for the pair `<query, corpus>`. Can be either 1 (Critically Relevant) or 2 (Fully Relevant):
  - Fully Relevant (2): The page contains the complete answer.
  - Critically Relevant (1): The page contains facts or information required to answer the query, though additional information is needed.
- **content_type** `<str>`: The type of visual content present in the images relevant to the query.
- **bounding_boxes** `<List[Tuple[int]]>`: Bounding boxes annotated by humans that indicate which part of the image is relevant to the query.

### 4. Original PDFs

All the original PDFs used to build the corpus are distributed in the "pdfs" folder of this directory.
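The corpus, queries, and qrels subsets described above are linked through `query_id` and `corpus_id`. The following is a minimal sketch of how qrels rows could be grouped per query and used to score a retriever's ranking. The sample records are hypothetical values that only mirror the documented field names, the helper functions are illustrative, and recall@k is chosen here for simplicity; it is not stated to be the benchmark's official metric.

```python
from collections import defaultdict

# Hypothetical qrels rows following the documented schema (values are made up).
qrels = [
    {"query_id": 0, "corpus_id": 10, "score": 2},  # Fully Relevant
    {"query_id": 0, "corpus_id": 11, "score": 1},  # Critically Relevant
]

def build_relevance(qrels):
    """Group qrels rows into {query_id: {corpus_id: score}}."""
    relevance = defaultdict(dict)
    for row in qrels:
        relevance[row["query_id"]][row["corpus_id"]] = row["score"]
    return dict(relevance)

def recall_at_k(ranked_corpus_ids, relevant, k):
    """Fraction of relevant pages (any score) found in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for cid in ranked_corpus_ids[:k] if cid in relevant)
    return hits / len(relevant)

relevance = build_relevance(qrels)

# Hypothetical retriever output for query 0: corpus_ids, best match first.
ranking = [10, 42, 11, 7]
print(recall_at_k(ranking, relevance[0], k=2))  # 0.5
print(recall_at_k(ranking, relevance[0], k=3))  # 1.0
```

Keeping the graded `score` in the mapping, rather than a plain set of relevant ids, leaves room for graded metrics that weight Fully Relevant pages above Critically Relevant ones.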
## License information

All annotations, query-document relevance judgments (qrels), and related metadata generated for this corpus are distributed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

The licensing status of the original source documents (the corpus) and any parsed text (`markdown` column in the corpus) is inherited from their respective publishers. The specific license governing each original document is provided in the `documents_metadata["license"]` field of that document's entry.

## Data Privacy and Removal Requests

While this dataset is released under open licenses, we respect the privacy of individuals and the ownership of source content. If you are a data subject, author, or publisher and are uncomfortable with the inclusion of your data or documents in this release, please contact us at gautier.viaud@illuin.tech and quentin.mace@illuin.tech. We will promptly review your request.

Provider: maas

Created: 2025-11-06