
fimu-docproc-research/CIVQA-TesseractOCR-LayoutLM

Source: Hugging Face | Updated: 2023-11-21 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/fimu-docproc-research/CIVQA-TesseractOCR-LayoutLM
Resource description:
---
dataset_info:
  features:
  - name: input_ids
    sequence: int64
  - name: bbox
    dtype:
      array2_d:
        shape:
        - 512
        - 4
        dtype: int64
  - name: attention_mask
    sequence: int64
  - name: image
    dtype:
      array3_d:
        shape:
        - 3
        - 224
        - 224
        dtype: int64
  - name: start_positions
    dtype: int64
  - name: end_positions
    dtype: int64
  - name: questions
    dtype: string
  - name: answers
    dtype: string
  splits:
  - name: train
    num_bytes: 198175471439
    num_examples: 160645
  - name: validation
    num_bytes: 20009392368
    num_examples: 16220
  download_size: 826530358
  dataset_size: 218184863807
language:
- cs
tags:
- finance
pretty_name: C
license: mit
---

# CIVQA TesseractOCR LayoutLM Dataset

The Czech Invoice Visual Question Answering dataset was created with Tesseract OCR and encoded for LayoutLM.

The pre-encoded dataset can be found at this link: https://huggingface.co/datasets/fimu-docproc-research/CIVQA-TesseractOCR

All invoices used in this dataset were obtained from public sources. Across these invoices, we focused on 15 entities that are crucial for invoice processing:

- Invoice number
- Variable symbol
- Specific symbol
- Constant symbol
- Bank code
- Account number
- ICO
- Total amount
- Invoice date
- Due date
- Name of supplier
- IBAN
- DIC
- QR code
- Supplier's address

The invoices included in this dataset were gathered from the internet. We understand that privacy is of utmost importance, and we sincerely apologise for any inconvenience caused by the inclusion of your identifiable information in this dataset. If you have identified your data in this dataset and wish to have it removed from research use, please submit a request through the following form: https://forms.gle/tUVJKoB22oeTncUD6

We greatly appreciate your cooperation and understanding in this matter.
Provided by:
fimu-docproc-research
Summary of original information

CIVQA TesseractOCR LayoutLM Dataset

Dataset information

Features

  • input_ids: sequence, dtype int64
  • bbox: 2D array, shape [512, 4], dtype int64
  • attention_mask: sequence, dtype int64
  • image: 3D array, shape [3, 224, 224], dtype int64
  • start_positions: dtype int64
  • end_positions: dtype int64
  • questions: dtype string
  • answers: dtype string
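
To make the feature list concrete, the following minimal Python sketch checks a record against the listed shapes and reads the answer span out of `input_ids`. It rests on two assumptions that the dataset card does not confirm: that `input_ids` and `attention_mask` hold one entry per `bbox` row, and that `start_positions` and `end_positions` are inclusive token indices, as is conventional for extractive question answering. The record is synthetic, and the helper names (`nested_shape`, `validate_record`, `answer_token_ids`) are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Shapes stated in the dataset card for the fixed-size array features.
EXPECTED_SHAPES: Dict[str, Tuple[int, ...]] = {
    "bbox": (512, 4),
    "image": (3, 224, 224),
}


def nested_shape(value: Any) -> Tuple[int, ...]:
    """Return the shape of a nested list, following the first element at each level."""
    shape: List[int] = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the record passed."""
    problems: List[str] = []
    for name, expected in EXPECTED_SHAPES.items():
        actual = nested_shape(record[name])
        if actual != expected:
            problems.append(f"{name}: expected shape {expected}, got {actual}")
    # Assumption: one token id and one mask value per bounding box row.
    n_tokens = len(record["bbox"])
    for name in ("input_ids", "attention_mask"):
        if len(record[name]) != n_tokens:
            problems.append(f"{name}: expected length {n_tokens}, got {len(record[name])}")
    start, end = record["start_positions"], record["end_positions"]
    if not 0 <= start <= end < len(record["input_ids"]):
        problems.append(f"answer span ({start}, {end}) is out of range")
    return problems


def answer_token_ids(record: Dict[str, Any]) -> List[int]:
    """Slice the answer tokens, treating both positions as inclusive (assumption)."""
    return record["input_ids"][record["start_positions"]: record["end_positions"] + 1]


# Synthetic record with placeholder values; not taken from the dataset.
example = {
    "input_ids": [101] + list(range(1000, 1010)) + [102] + [0] * 500,
    "bbox": [[0, 0, 0, 0] for _ in range(512)],
    "attention_mask": [1] * 12 + [0] * 500,
    "image": [[[0] * 224 for _ in range(224)] for _ in range(3)],
    "start_positions": 3,
    "end_positions": 5,
    "questions": "placeholder question",
    "answers": "placeholder answer",
}

print(validate_record(example))   # []
print(answer_token_ids(example))  # [1002, 1003, 1004]
```

Decoding the sliced ids back to text would require the tokenizer used for the encoding, which the card does not name.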

Splits

  • train: 198175471439 bytes, 160645 examples
  • validation: 20009392368 bytes, 16220 examples
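
Given the size of these splits, reading them lazily is probably more practical than downloading everything first. The sketch below assumes the Hugging Face `datasets` library is installed and uses its `load_dataset` function with `streaming=True`. The wrapper `stream_split` and the `SPLITS` table are hypothetical conveniences, not part of the dataset or the library.

```python
REPO_ID = "fimu-docproc-research/CIVQA-TesseractOCR-LayoutLM"

# Example counts per split, as listed on the dataset card.
SPLITS = {"train": 160645, "validation": 16220}


def stream_split(split: str):
    """Return a lazily evaluated iterable over one split of the dataset."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {sorted(SPLITS)}")
    # Imported here so the sketch can be read and imported without the dependency.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of materialising the split on disk.
    return load_dataset(REPO_ID, split=split, streaming=True)
```

With network access, `next(iter(stream_split("validation")))` would fetch a single record as a dictionary keyed by the feature names.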

Size

  • Download size: 826530358 bytes
  • Dataset size: 218184863807 bytes
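
The size figures can be cross-checked with a little arithmetic. The sketch below uses only numbers listed on this page, plus the fact that an int64 value occupies 8 bytes. The reading that the `image` array accounts for most of each example follows from the listed shapes and is not a statement made by the dataset authors.

```python
# Figures copied from the dataset card.
TRAIN_BYTES = 198_175_471_439
VALIDATION_BYTES = 20_009_392_368
DATASET_BYTES = 218_184_863_807
DOWNLOAD_BYTES = 826_530_358
TRAIN_EXAMPLES = 160_645
VALIDATION_EXAMPLES = 16_220

INT64_BYTES = 8  # size of one int64 value

# The two splits together should account for the whole dataset.
splits_total = TRAIN_BYTES + VALIDATION_BYTES

# Raw size of the fixed-shape features in a single example.
image_bytes = 3 * 224 * 224 * INT64_BYTES
bbox_bytes = 512 * 4 * INT64_BYTES

# Average stored size per example in each split.
train_avg = TRAIN_BYTES / TRAIN_EXAMPLES
validation_avg = VALIDATION_BYTES / VALIDATION_EXAMPLES

# How much larger the decoded dataset is than the files that are downloaded.
expansion = DATASET_BYTES / DOWNLOAD_BYTES

print(f"splits sum to dataset size: {splits_total == DATASET_BYTES}")
print(f"image bytes per example:    {image_bytes}")
print(f"bbox bytes per example:     {bbox_bytes}")
print(f"average bytes per example:  train {train_avg:.0f}, validation {validation_avg:.0f}")
print(f"image share of an example:  {image_bytes / train_avg:.1%}")
print(f"dataset / download size:    {expansion:.0f}x")
```

Both splits average roughly 1.23 MB per example, of which the int64 `image` array alone contributes 1,204,224 bytes, so the image encoding largely determines the 218 GB total.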

Language

  • Czech (cs)

Tags

  • finance

Name

  • C

License

  • MIT