Brazilian_Bills_and_Invoices_Dataset
Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/Kratos-AI/Brazilian_Bills_and_Invoices_Dataset
Resource description:
# Brazilian Bills and Invoices Dataset
*This dataset contains high-quality scanned and photographed images of Brazilian bills, invoices, and utility payment documents. It supports AI research in OCR, financial document understanding, and structured data extraction for Portuguese-language financial contexts.*
## Contact
For queries or collaborations related to this dataset, contact:
- anoushka@kgen.io
- abhishek.vadapalli@kgen.io
## Supported Tasks
- **Task Categories**:
  - Text Recognition (OCR)
  - Document Classification
  - Document Understanding
- **Supported Tasks**:
  - Extraction of key financial fields (amounts, due dates, customer IDs, payment codes)
  - OCR for printed and digital Brazilian utility bills and invoices
  - Document classification by service type (electricity, water, telecom, internet)
  - Multilingual text recognition (Portuguese-English) for multinational billing systems
  - Training AI models for financial data parsing and automation workflows
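
As an illustration of the key-field extraction task listed above, the following is a minimal sketch of parsing amounts and due dates from text that an OCR step has already produced. It assumes the common Brazilian conventions of `R$ 1.234,56` for currency and `dd/mm/yyyy` for dates. The function names and the sample text are hypothetical and are not part of the dataset; real bills will likely need layout-aware models rather than regular expressions alone.

```python
import re
from datetime import date
from decimal import Decimal

# Assumed currency format: "R$ 1.234,56"
# (dot = thousands separator, comma = decimal separator).
AMOUNT_RE = re.compile(r"R\$\s*(\d+(?:\.\d{3})*,\d{2})")
# Assumed date format: day/month/year with a four-digit year.
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")


def parse_amounts(text: str) -> list[Decimal]:
    """Return every R$ amount found in the text as a Decimal."""
    return [
        Decimal(raw.replace(".", "").replace(",", "."))
        for raw in AMOUNT_RE.findall(text)
    ]


def parse_dates(text: str) -> list[date]:
    """Return every valid dd/mm/yyyy date found in the text."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # OCR noise can yield impossible dates such as 31/02/2025; skip them.
            continue
    return found


if __name__ == "__main__":
    # Hypothetical OCR output, not taken from the dataset.
    sample = "Vencimento: 10/11/2025\nTotal a pagar: R$ 1.234,56"
    print(parse_amounts(sample))
    print(parse_dates(sample))
```

`Decimal` is used instead of `float` so that monetary values are not subject to binary rounding.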
## Languages
- **Primary Language**: Portuguese
- **Secondary Presence**: English (on international service invoices or bilingual corporate bills)
## Dataset Creation
### Curation Rationale
The dataset was created to help train AI systems capable of interpreting and digitizing Brazilian billing and invoicing formats. It aids automation in finance, accounting, and document intelligence applications.
### Source Data
- **Contributors**: Collected from anonymized, publicly shared, and simulated invoice data sources
- **Collection Process**: Bills were photographed or scanned from utility and corporate service providers. All personal data (names, addresses, payment info) was removed or anonymized prior to inclusion.
### Other Known Limitations
- **Bias**: Major urban and corporate service providers are overrepresented
- **Layout Diversity**: Variations in design between companies and sectors may impact OCR performance
- **Image Quality**: Folded, faded, or low-resolution documents may affect data extraction accuracy
## Intended Uses
### ✅ Direct Use
- Training OCR and document parsing models
- Research in financial automation and structured document understanding
- Extraction of invoice-level metadata for AI accounting systems
- Benchmarking for multilingual document understanding tasks
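
For the OCR training and benchmarking uses above, one widely used metric is character error rate (CER): the edit distance between the recognized text and the reference transcription, divided by the reference length. The dataset card does not prescribe a metric, so the sketch below is only one possible choice, and the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / reference length; the reference must be non-empty."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical reference and OCR output, not taken from the dataset.
    print(character_error_rate("Vencimento", "Vencimenta"))
```

Lower values are better; a CER of 0 means the OCR output matches the reference exactly, and values above 1 are possible when the output is much longer than the reference.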
### ❌ Out-of-Scope Use
- Reconstruction of individual financial histories
- Misuse of data for identity tracking or commercial exploitation
- Reproduction of proprietary billing templates for commercial gain
## License
CC BY 4.0
Provider: maas
Created: 2025-10-15



