fimu-docproc-research/CIVQA-TesseractOCR-LayoutLM
Source: Hugging Face | Updated: 2023-11-21 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/fimu-docproc-research/CIVQA-TesseractOCR-LayoutLM
Resource description:
---
dataset_info:
  features:
  - name: input_ids
    sequence: int64
  - name: bbox
    dtype:
      array2_d:
        shape:
        - 512
        - 4
        dtype: int64
  - name: attention_mask
    sequence: int64
  - name: image
    dtype:
      array3_d:
        shape:
        - 3
        - 224
        - 224
        dtype: int64
  - name: start_positions
    dtype: int64
  - name: end_positions
    dtype: int64
  - name: questions
    dtype: string
  - name: answers
    dtype: string
  splits:
  - name: train
    num_bytes: 198175471439
    num_examples: 160645
  - name: validation
    num_bytes: 20009392368
    num_examples: 16220
  download_size: 826530358
  dataset_size: 218184863807
language:
- cs
tags:
- finance
pretty_name: C
license: mit
---
# CIVQA TesseractOCR LayoutLM Dataset
The Czech Invoice Visual Question Answering (CIVQA) dataset was created with Tesseract OCR and encoded for LayoutLM.
The version of the dataset before encoding is available at: https://huggingface.co/datasets/fimu-docproc-research/CIVQA-TesseractOCR
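As an illustration of what the encoded records contain, the following minimal sketch checks one example against the feature shapes listed in the metadata and slices out the answer tokens. The helper names (`nested_shape`, `check_example`, `answer_token_ids`) are hypothetical and not part of the dataset or any library. It also assumes that `start_positions` and `end_positions` are token indices into `input_ids` with an inclusive end, as is common for extractive question answering; the dataset card does not state this explicitly.

```python
from typing import Any, Dict, List, Tuple

# Shapes taken from the dataset card metadata.
BBOX_SHAPE = (512, 4)
IMAGE_SHAPE = (3, 224, 224)


def nested_shape(value: Any) -> Tuple[int, ...]:
    """Return the shape of a rectangular nested list.

    Only the first element at each level is inspected, so ragged
    inputs are not detected.
    """
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)


def check_example(example: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the example looks consistent."""
    problems = []
    if nested_shape(example["bbox"]) != BBOX_SHAPE:
        problems.append("bbox does not have shape (512, 4)")
    if nested_shape(example["image"]) != IMAGE_SHAPE:
        problems.append("image does not have shape (3, 224, 224)")
    if len(example["input_ids"]) != len(example["attention_mask"]):
        problems.append("input_ids and attention_mask differ in length")
    # Assumption: positions are token indices and start <= end.
    start, end = example["start_positions"], example["end_positions"]
    if not 0 <= start <= end < len(example["input_ids"]):
        problems.append("answer span lies outside input_ids")
    return problems


def answer_token_ids(example: Dict[str, Any]) -> List[int]:
    """Slice the answer tokens, assuming an inclusive end position."""
    start, end = example["start_positions"], example["end_positions"]
    return example["input_ids"][start : end + 1]
```

With the Hugging Face `datasets` library, records could be obtained without downloading the full dataset by calling `load_dataset("fimu-docproc-research/CIVQA-TesseractOCR-LayoutLM", split="validation", streaming=True)` and passing each record to `check_example`. The token ids returned by `answer_token_ids` would then need to be decoded with the same tokenizer that was used for encoding, which the card does not name.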
All invoices used in this dataset were obtained from public sources. In these invoices, we focused on 15 entities that are crucial for invoice processing:
- Invoice number
- Variable symbol
- Specific symbol
- Constant symbol
- Bank code
- Account number
- ICO
- Total amount
- Invoice date
- Due date
- Name of supplier
- IBAN
- DIC
- QR code
- Supplier's address
The invoices included in this dataset were gathered from the internet. We understand that privacy is of utmost importance, and we sincerely apologise for any inconvenience caused by the inclusion of your identifiable information in this dataset. If you have identified your data in this dataset and wish to have it removed from research use, please submit a request via the following form: https://forms.gle/tUVJKoB22oeTncUD6
We profoundly appreciate your cooperation and understanding in this matter.
Provider:
fimu-docproc-research
Summary of original information
CIVQA TesseractOCR LayoutLM Dataset
Dataset information
Features
- input_ids: sequence of int64
- bbox: 2D array of shape [512, 4], dtype int64
- attention_mask: sequence of int64
- image: 3D array of shape [3, 224, 224], dtype int64
- start_positions: int64
- end_positions: int64
- questions: string
- answers: string
Splits
- train: 198175471439 bytes, 160645 examples
- validation: 20009392368 bytes, 16220 examples
Size
- Download size: 826530358 bytes
- Dataset size: 218184863807 bytes
Language
- Czech (cs)
Tags
- finance
Name
- C
License
- MIT
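
The size figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers from the metadata; the function names are hypothetical. It confirms that the two split sizes add up to the reported dataset size, and it compares the average bytes per example with the size of a single 3 x 224 x 224 image stored as int64 (8 bytes per value). The result suggests that the image tensor accounts for most of each record.

```python
# Figures copied from the dataset card metadata.
SPLITS = {
    "train": {"num_bytes": 198_175_471_439, "num_examples": 160_645},
    "validation": {"num_bytes": 20_009_392_368, "num_examples": 16_220},
}
DATASET_SIZE = 218_184_863_807
DOWNLOAD_SIZE = 826_530_358

INT64_BYTES = 8  # size of one int64 value


def total_split_bytes() -> int:
    """Sum of num_bytes over all splits."""
    return sum(split["num_bytes"] for split in SPLITS.values())


def bytes_per_example(split_name: str) -> float:
    """Average uncompressed size of one example in the given split."""
    split = SPLITS[split_name]
    return split["num_bytes"] / split["num_examples"]


def image_bytes() -> int:
    """Size of one 3 x 224 x 224 image stored as int64."""
    return 3 * 224 * 224 * INT64_BYTES


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    print("splits sum to dataset_size:", total_split_bytes() == DATASET_SIZE)
    print(f"dataset size: {to_gib(DATASET_SIZE):.1f} GiB")
    print(f"download size: {to_gib(DOWNLOAD_SIZE):.2f} GiB")
    for name in SPLITS:
        print(f"{name}: {bytes_per_example(name):,.0f} bytes per example")
    print(f"one int64 image: {image_bytes():,} bytes")
```

The large gap between the download size and the dataset size is consistent with the files being stored in compressed form, although the card does not say so.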



