TheFinAI/MultiFinBen-GreekOCR
Source: Hugging Face. Updated 2026-04-18. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/TheFinAI/MultiFinBen-GreekOCR
Resource description:
---
dataset_info:
  features:
  - name: image
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 11860671915
    num_examples: 13320
  download_size: 11661151472
  dataset_size: 11860671915
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.parquet
license: apache-2.0
language:
- el
tags:
- finance
pretty_name: GreekOCR
size_categories:
- 10K<n<100K
task_categories:
- image-to-text
---
----------------------------------------------------------------
# Dataset Card for GreekOCR Dataset
### Dataset Summary
The GreekOCR dataset contains page images derived from annual company filings on the Athens Stock Exchange. It is used to benchmark and evaluate the ability of large language models to convert unstructured documents, such as PDFs and images, into machine-readable format, particularly in the finance domain, where the conversion task is both more complex and more valuable.
### Supported Tasks
- **Task:** Image-to-Text
- **Evaluation Metrics:** ROUGE-1
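
As a rough illustration of the metric, ROUGE-1 compares the unigrams of a model's output against those of the ground-truth text. The sketch below is a minimal version that assumes lowercased, whitespace-separated tokens; the function name `rouge1` is hypothetical, and the benchmark's actual scoring script may tokenize or normalize differently.

```python
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 between two strings.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Model output with one extra token compared to the ground truth.
scores = rouge1("Ετήσια Οικονομική Έκθεση 2023", "Ετήσια Οικονομική Έκθεση")
print(scores)
```

Because ROUGE-1 ignores word order, a high score does not by itself show that the page layout or reading order was preserved.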
### Languages
- Greek
## Dataset Structure
### Data Instances
Each instance in the GreekOCR dataset comprises two fields:
- **image**: an image of a regulatory document; each image represents one page of the source PDF
- **text**: the ground-truth text extracted from the regulatory document
### Data Fields
- **image**: string, a Base64-encoded PNG
- **text**: string, the text extracted from the PDF files
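
Because the `image` field is a Base64 string rather than raw bytes, it has to be decoded before use. A minimal sketch using only the standard library follows; the helper name `decode_png` is hypothetical, and it assumes the string carries no `data:` URI prefix.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png(image_b64: str) -> tuple[bytes, int, int]:
    """Decode a Base64 PNG string and return (raw_bytes, width, height)."""
    raw = base64.b64decode(image_b64)
    if raw[:8] != PNG_SIGNATURE:
        raise ValueError("decoded data is not a PNG")
    # The IHDR chunk follows the signature: 4-byte length, 4-byte type,
    # then big-endian width and height at byte offsets 16 and 20.
    width, height = struct.unpack(">II", raw[16:24])
    return raw, width, height
```

The returned bytes can be written to a `.png` file or handed to an image library. Rows can be read with the Hugging Face `datasets` library, for example `load_dataset("TheFinAI/MultiFinBen-GreekOCR", split="train", streaming=True)`, which avoids downloading the full archive (about 11.7 GB) up front.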
## Dataset Creation
### Curation Rationale
The GreekOCR dataset was curated to support research and development on information extraction and layout retention for unstructured documents in Greek. By providing real-world documents in unstructured format together with ground truth, the dataset seeks to address the challenges of extracting information and layout and converting them into machine-readable format.
### Source Data
#### Initial Data Collection and Normalization
- The source data are publicly available annual company filings on the Athens Stock Exchange.
- The PDF files of those documents are downloaded and split via API into one file per page, then converted into images.
#### Who are the Source Language Producers?
- The source data are annual company filings on the Athens Stock Exchange, collected from its official website: https://www.athexgroup.gr/en/market-data/issuers
### Annotations
#### Annotation Process
- The dataset was prepared by collecting, splitting, and converting regulatory documents in Greek.
- The ground-truth text was produced with the Python package `fitz` (PyMuPDF).
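
The preparation steps above can be sketched with PyMuPDF under stated assumptions: the rendering resolution (`dpi=150`), the helper names `make_record` and `pdf_to_records`, and the use of `page.get_text("text")` are illustrative choices, not the curators' documented settings. Note that `get_text` reads the text layer embedded in the PDF rather than running OCR on the rendered image.

```python
import base64


def make_record(png_bytes: bytes, text: str) -> dict:
    """Package one page as an {image, text} row; image is a Base64-encoded PNG."""
    return {"image": base64.b64encode(png_bytes).decode("ascii"), "text": text}


def pdf_to_records(pdf_path: str, dpi: int = 150):
    """Yield one record per page of the PDF at pdf_path.

    Assumption: dpi=150 is a placeholder, not the dataset's actual resolution.
    """
    import fitz  # PyMuPDF; imported lazily so make_record works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Render the page to a PNG and read its embedded text layer.
            png_bytes = page.get_pixmap(dpi=dpi).tobytes("png")
            yield make_record(png_bytes, page.get_text("text"))
```

Calling `list(pdf_to_records("filing.pdf"))`, where `filing.pdf` is a placeholder path, would produce rows with the same two fields as the dataset.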
#### Who are the Annotators?
- The dataset stems from publicly available regulatory documents.
- No external annotation team was involved beyond this.
### Personal and Sensitive Information
- The GreekOCR dataset does not contain any personally identifiable information (PII) and is strictly focused on Greek-language regulatory data. No personal or sensitive information is present in the dataset.
## Considerations for Using the Data
### Social Impact of Dataset
This dataset enables AI models to extract structured information from scanned financial documents in Greek, supporting downstream applications in finance, regulation, and transparency initiatives across Greek-speaking regions. By aligning page-level PDF images with accurate ground truth text, it supports the development of fairer, more inclusive models that work across diverse formats and languages.
### Discussion of Biases
- The source data is limited to regulatory documents for securities markets, so it may underrepresent other financial document types such as tax records, bank statements, or private company reports, potentially limiting model generalizability.
### Other Known Limitations
- The ground truth text is extracted using the Python package fitz (PyMuPDF), which may introduce inaccuracies in complex layouts, potentially affecting training quality and evaluation reliability.
- While the dataset covers regulatory documents, it may lack sufficient variety in layout styles (e.g., handwritten notes, non-standard financial forms, embedded charts), which could limit a model’s ability to generalize to less structured or unconventional financial documents.
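
Since the ground truth is extracted automatically, users may want a quick sanity check before trusting individual rows. The sketch below is one hypothetical heuristic, not part of the dataset's tooling: it flags rows whose text is empty or contains few Greek letters. The function names and the `min_ratio=0.5` threshold are assumptions.

```python
def greek_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the Greek Unicode blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = sum(
        1
        for ch in letters
        # U+0370..U+03FF: Greek and Coptic; U+1F00..U+1FFF: Greek Extended.
        if "\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"
    )
    return greek / len(letters)


def looks_suspicious(text: str, min_ratio: float = 0.5) -> bool:
    """Flag empty ground truth or text with few Greek letters (assumed threshold)."""
    return not text.strip() or greek_ratio(text) < min_ratio
```

Pages that consist mostly of numeric tables or English text would also be flagged, so the result is a prompt for manual review rather than a verdict on extraction quality.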
## Additional Information
### Dataset Curators
- Yueru He
- Ruoyu Xiang
- The FinAI Team
### Licensing Information
- **License:** Apache License 2.0
### Citation Information
If you use this dataset, please cite:
```bibtex
@misc{peng2025multifinbenmultilingualmultimodaldifficultyaware,
title={MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation},
author={Xueqing Peng and Lingfei Qian and Yan Wang and Ruoyu Xiang and Yueru He and Yang Ren and Mingyang Jiang and Jeff Zhao and Huan He and Yi Han and Yun Feng and Yuechen Jiang and Yupeng Cao and Haohang Li and Yangyang Yu and Xiaoyu Wang and Penglei Gao and Shengyuan Lin and Keyi Wang and Shanshan Yang and Yilun Zhao and Zhiwei Liu and Peng Lu and Jerry Huang and Suyuchen Wang and Triantafillos Papadopoulos and Polydoros Giannouris and Efstathia Soufleri and Nuo Chen and Guojun Xiong and Zhiyang Deng and Yijia Zhao and Mingquan Lin and Meikang Qiu and Kaleb E Smith and Arman Cohan and Xiao-Yang Liu and Jimin Huang and Alejandro Lopez-Lira and Xi Chen and Junichi Tsujii and Jian-Yun Nie and Sophia Ananiadou and Qianqian Xie},
year={2025},
eprint={2506.14028},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.14028},
}
```
Provider:
TheFinAI



