vdsid_french
ModelScope community. Updated 2025-12-05. Indexed 2025-06-07.
Download link:
https://modelscope.cn/datasets/vidore/vdsid_french
Resource description:
# VDSID-French: Vision Retrieval Dataset on French documents
## Overview
**VDSID-French** is a subset of the [`vidore/vdsid`](https://huggingface.co/datasets/vidore/vdsid) dataset. It contains 5000 document-question-answer triplets built from French documents, split into a train set of 4700 examples and a test set of 300 examples.
This dataset was created because ColPali was trained mainly on English documents; fine-tuning on French documents can help improve the model's multilingual capabilities.
## Data Fields
### Document Information
- `document_filename`: Filename of the document.
- `document_url`: Original URL of the document.
- `search_query`: The query used to fetch the document.
- `search_topic`: Topic related to the document.
- `search_subtopic`: Subtopic related to the document.
- `search_language`: Language specified for the search.
- `search_filetype`: Filetype filter applied during the search.
### Page Details
- `page_number`: The page's number within the document.
- `page_description`: A natural language description of the page.
- `page_language`: Language used on the page.
- `page_contains_table`: Boolean indicating the presence of tables.
- `page_contains_figure`: Boolean indicating the presence of figures.
- `page_contains_paragraph`: Boolean indicating the presence of paragraphs.
- `page_image`: Image of the page.
### Query Information
- `query_type`: Type of query (see below).
- `query_answerability`: Answerability level of the query (see below).
- `query_modality`: Modality used for query generation.
- `query_language`: Language of the query.
- `query_reasoning`: Reasoning traces used in query generation.
- `query`: The actual query text.
- `query_is_self_contained`: Boolean indicating if the query is self-contained.
- `query_is_self_contained_reasoning`: Reasoning traces for determining self-contained nature.
- `answer`: Expected answer.
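For reference, the fields above can be collected into a single typed record. This is only a sketch: the field names come from the lists above, but the Python types (and the names `VdsidRecord`, `FIELD_NAMES`, and `missing_fields`) are assumptions made for illustration, not the dataset's published schema.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class VdsidRecord:
    """Hypothetical typed view of one row; all types are assumed."""
    # Document information
    document_filename: str
    document_url: str
    search_query: str
    search_topic: str
    search_subtopic: str
    search_language: str
    search_filetype: str
    # Page details
    page_number: int
    page_description: str
    page_language: str
    page_contains_table: bool
    page_contains_figure: bool
    page_contains_paragraph: bool
    page_image: Any  # assumed: an image object or raw bytes
    # Query information
    query_type: str           # assumed to be a label string
    query_answerability: int  # the card filters on the value 2
    query_modality: str
    query_language: str
    query_reasoning: str
    query: str
    query_is_self_contained: bool
    query_is_self_contained_reasoning: str
    answer: str


# Names of all documented fields, in the order listed above.
FIELD_NAMES = [f.name for f in fields(VdsidRecord)]


def missing_fields(row: dict) -> list:
    """Return the documented field names that are absent from a row."""
    return [name for name in FIELD_NAMES if name not in row]
```

A helper like `missing_fields` can be used to sanity-check a loaded row before training, for example by asserting that it returns an empty list.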
## Query typology
Different question types and answerability levels were designed to distill fine-grained capabilities in retrieval and question-answering models.
### Question Types
- **Extractive:** A clear and specific question that can be answered using only a specific piece of information.
- **Open-ended:** A question that focuses on broad, qualitative aspects of a piece of information.
- **Boolean:** A yes/no question that may involve multiple steps of reasoning.
- **Compare-contrast:** A question that requires comparing and/or contrasting two entities or topics that are closely related to each other.
- **Enumerative:** A question that asks to list all examples that possess a common specific property, optionally requesting details about the specifics of each example.
- **Numerical:** A question about a specific piece of information that can be calculated using data from the page. The question should require more than simply reading numbers directly from the page.
### Answerability Levels
Each generated question has one of the following three answerability levels:
- **Fully answerable:** A question is said to be _fully answerable_ if the page contains a precise and complete answer to the question.
- **Partially answerable:** A question is said to be _partially answerable_ if the page contains relevant information that is directly related to the question, but some key information is missing and must be retrieved in other pages or documents in order to give a precise and complete answer.
- **Unanswerable:** A question is said to be _unanswerable_ if the page contains information related to the question's topic or domain but upon closer inspection does not contain information that is useful to answer the question. Those questions are tricky and are meant to test if the retrieval system and/or QA system is able to correctly filter the page when faced with such questions.
## Dataset Creation
This dataset was created by filtering from the [`vidore/vdsid`](https://huggingface.co/datasets/vidore/vdsid) dataset using the following steps:
- Shuffle VDSID and keep the first 5000 examples.
- Keep the documents with `search_language == "french"`.
- Keep the fully answerable examples (`query_answerability == 2`).
Finally, we split the 5000 resulting examples into:
- A train set: 4700 examples.
- A test set: 300 examples.
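The steps above can be sketched in plain Python on a list of dictionary records. This is a minimal illustration, not the authors' script: the function name `build_subset`, its parameters, and the seed are hypothetical, and the sketch applies the two filters before truncating so that the result holds exactly `n_total` examples, which is an assumption about the intended ordering.

```python
import random

FULLY_ANSWERABLE = 2  # value used by the filter described above


def build_subset(records, n_total=5000, n_test=300, seed=0):
    """Filter, shuffle, truncate, and split records into (train, test).

    Hypothetical helper: `records` is a list of dicts that carry at least
    the `search_language` and `query_answerability` fields.
    """
    # Keep French documents whose query is fully answerable.
    kept = [
        r for r in records
        if r["search_language"] == "french"
        and r["query_answerability"] == FULLY_ANSWERABLE
    ]
    # Shuffle reproducibly, then keep the first n_total examples.
    rng = random.Random(seed)
    rng.shuffle(kept)
    kept = kept[:n_total]
    # The last n_test examples form the test set, the rest the train set.
    n_train = max(len(kept) - n_test, 0)
    return kept[:n_train], kept[n_train:]
```

With the defaults and enough matching records, this yields 4700 training examples and 300 test examples, matching the split sizes stated above.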
Provider:
maas
Created:
2025-06-04



