
vidore_v3_finance_en

ModelScope Community | Updated 2026-01-02 | Indexed 2025-12-20
Download link:
https://modelscope.cn/datasets/vidore/vidore_v3_finance_en
Resource description:
<center><h1>ViDoRe V3 : Finance - EN</h1></center>

This dataset, `Financial_Bank_Reports`, is a corpus of `annual reports` from the `banking` sector, intended for long-document understanding tasks. It is one of the 10 corpora comprising the **ViDoRe v3 Benchmark**.

## About ViDoRe v3

ViDoRe V3 is our latest benchmark for RAG evaluation on visually-rich documents from real-world applications. It features 10 datasets with, in total, 26,000 pages and 3,099 queries, translated into 6 languages. Each query comes with human-verified relevant pages, bounding box annotations for key elements, and a comprehensive combined answer from human annotations.

## Links

* **Homepage:** [https://huggingface.co/vidore](https://huggingface.co/vidore)
* **Collection:** [https://hf.co/collections/vidore/vidore-benchmark-v3](https://hf.co/collections/vidore/vidore-benchmark-v3)
* **Blogpost:** [https://huggingface.co/blog/QuentinJG/introducing-vidore-v3](https://huggingface.co/blog/QuentinJG/introducing-vidore-v3)
* **Leaderboard:** To come...

### Dataset Summary

Summary of the specific dataset (`Financial - EN`):

- Description: Consists of six 10-K annual reports from major U.S. financial institutions for the fiscal year ended December 31, 2024.
- Language: en
- Domain: Finance
- Document Types: Reports

### Dataset Statistics

- Total Documents : 6
- Total Pages : 2942
- Total Queries : 1854
- Queries without counting translations : 309
- Average number of pages per query : 4.7

### Languages

The documents in this dataset are in `english`.
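
The two query counts under Dataset Statistics can be cross-checked against the six languages mentioned for the benchmark. The snippet below is only a quick sanity check, assuming each original query appears exactly once per language; the variable names are illustrative.

```python
# Figures copied from the Dataset Statistics list.
total_queries = 1854
queries_without_translations = 309

# Number of language versions per original query implied by the two counts.
# Assumption: every original query has the same number of versions.
versions_per_query, remainder = divmod(total_queries, queries_without_translations)

print(versions_per_query, remainder)
```

A remainder of zero means the total is an exact multiple of the untranslated count, which is what one would expect if every query is available in all 6 languages.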
### Query types

![finance_bank_reports_en_query_types](https://cdn-uploads.huggingface.co/production/uploads/66e16a677c2eb2da5109fb5c/2g31_jEyaevHc9qTcVI3z.png)

### Query formats

![finance_bank_reports_en_query_formats](https://cdn-uploads.huggingface.co/production/uploads/66e16a677c2eb2da5109fb5c/5HK-NS65sVe8gfqweNPeB.png)

### Content types

![finance_bank_reports_en_content_types](https://cdn-uploads.huggingface.co/production/uploads/66e16a677c2eb2da5109fb5c/8Gvct6HoUH32YSozY4bvo.png)

## Dataset Structure

### 1. Corpus

Contains the full collection of documents to be searched.

Data instance of a single item from the corpus subset:

```json
{
  "corpus_id": <int>,
  "image": <PIL.Image>,
  "doc_id": <str>,
  "markdown": <str>,
  "page_number_in_doc": <int>
}
```

- **corpus_id** <int> : A unique numerical identifier for the corresponding corpus document.
- **image** <PIL.Image> : The page image.
- **doc_id** <str> : Name of the document from which the image was extracted.
- **markdown** <str> : Text extracted from the image using an OCR pipeline.
- **page_number_in_doc** <int> : Original page number inside the document.

### 2. Queries

Contains the set of questions or search queries.

Data instance of a single item from the queries subset:

```json
{
  "query_id": <int>,
  "query": <str>,
  "language": <str>,
  "query_types": <List[str]>,
  "query_format": <str>,
  "content_type": <str>,
  "raw_answers": <List[str]>,
  "query_generator": <str>,
  "query_generation_pipeline": <str>,
  "source_type": <str>,
  "query_type_for_generation": <str>,
  "answer": <str>
}
```

- **query_id** <int> : A unique numerical identifier for the query.
- **query** <str> : The actual text of the search question or statement used for retrieval.
- **language** <str> : The language of the query text.
- **query_types** <List[str]> : A list of categories or labels describing the query's intent.
- **query_format** <str> : The syntactic format of the query ("instruction", "keyword" or "question").
- **content_type** <str> : The type of visual content present in the images relevant to the query.
- **raw_answers** <List[str]> : A list of reference answers written by human annotators.
- **query_generator** <str> : The source or method used to create the query ("human" or "sdg").
- **query_generation_pipeline** <str> : Type of SDG pipeline used to create the query (if it was not written by humans).
- **source_type** <str> : "summary" or "image"; metadata about the type of information used by the annotation pipeline to create the query.
- **query_type_for_generation** <str> : The specific type requested when the query was generated.
- **answer** <str> : The answer extracted from the source documents, merged from human annotations using an LLM.

### 3. Qrels

Maps queries to their corresponding relevant documents.

Data instance of a single item from the qrels subset:

```json
{
  "query_id": <int>,
  "corpus_id": <int>,
  "score": <int>,
  "content_type": <str>,
  "bounding_boxes": <List[Tuple[int]]>
}
```

- **query_id** <int> : A unique numerical identifier for the query.
- **corpus_id** <int> : A unique numerical identifier for the corresponding corpus document.
- **score** <int> : Relevance score for the pair `<query, corpus>`. Can be either 1 (Critically Relevant) or 2 (Fully Relevant):
  - Fully Relevant (2) - The page contains the complete answer.
  - Critically Relevant (1) - The page contains facts or information that are required to answer the query, though additional information is required.
- **content_type** <str> : The type of visual content present in the images relevant to the query.
- **bounding_boxes** <List[Tuple[int]]> : Bounding boxes annotated by humans that indicate which part of the image is relevant to the query.

### 4. Original PDFs

All the original PDFs used to build the corpus are distributed in the "pdfs" folder of this directory.
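
The qrels subset described above carries graded relevance (1 or 2), which fits a graded metric such as nDCG. The following is a minimal sketch, not the benchmark's official evaluation code: this card does not say which metric or gain function the benchmark uses, so linear gains are an assumption here. The rows are made-up toy data that follow the qrels schema, the `content_type` values are placeholders, and `build_relevance` and `ndcg_at_k` are hypothetical helper names.

```python
import math
from collections import defaultdict

# Toy qrels rows following the schema above (all values are made up).
qrels = [
    {"query_id": 0, "corpus_id": 10, "score": 2, "content_type": "table", "bounding_boxes": []},
    {"query_id": 0, "corpus_id": 11, "score": 1, "content_type": "text", "bounding_boxes": []},
    {"query_id": 1, "corpus_id": 42, "score": 2, "content_type": "chart", "bounding_boxes": []},
]


def build_relevance(qrels_rows):
    """Group qrels rows into {query_id: {corpus_id: score}}."""
    relevance = defaultdict(dict)
    for row in qrels_rows:
        relevance[row["query_id"]][row["corpus_id"]] = row["score"]
    return dict(relevance)


def ndcg_at_k(ranked_corpus_ids, relevant, k):
    """nDCG@k with linear gains; pages missing from `relevant` count as 0."""
    gains = [relevant.get(cid, 0) for cid in ranked_corpus_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


relevance = build_relevance(qrels)

# Hypothetical retriever output for query 0: the partially relevant page first,
# then an unrelated page (id 99), then the fully relevant page.
score = ndcg_at_k([11, 99, 10], relevance[0], k=3)
print(f"nDCG@3 for query 0: {score:.4f}")
```

In practice the rows would come from the real qrels subset rather than a hand-written list, and the ranked `corpus_id` values from a retriever run over the corpus pages. An exponential gain (`2**score - 1`) is a common alternative if fully relevant pages should weigh more heavily.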
## License information

All annotations, query-document relevance judgments (qrels), and related metadata generated for this corpus are distributed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

The licensing status of the original source documents (the corpus) and any parsed text (`markdown` column in the corpus) is inherited from their respective publishers. The specific license governing each original document is provided in the `documents_metadata["license"]` field of that document's entry.

## Data Privacy and Removal Requests

While this dataset is released under open licenses, we respect the privacy of individuals and the ownership of source content. If you are a data subject, author, or publisher and are uncomfortable with the inclusion of your data or documents in this release, please contact us at gautier.viaud@illuin.tech and quentin.mace@illuin.tech. We will promptly review your request.

Provider:
maas
Created:
2025-11-06