
deepdoctection/FRFPE

Hugging Face | Updated 2023-06-08 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/deepdoctection/FRFPE
Description:
---
license: odc-by
task_categories:
- token-classification
language:
- de
- en
- fr
tags:
- finance
pretty_name: 'Funds report token classification '
size_categories:
- n<1K
---

**F**unds **R**eport **F**ront **P**age **E**ntities (FRFPE) is a dataset for document understanding and token classification. It contains 356 title/front pages of annual and semi-annual reports, together with the extracted text and annotations for five token categories.

FRFPE serves as an example of how to train and evaluate multimodal models such as LayoutLM on a custom dataset using the deepdoctection framework.

FRFPE contains documents in three languages:

- English: 167
- German: 149
- French: 9

and the following token categories:

- report_date (1096 samples) - reporting date of the report
- report_type (738 samples) - annual/semi-annual report
- umbrella (912 samples) - fund issued as umbrella
- fund_name (2122 samples) - subfund, as part of an umbrella fund, or standalone fund
- other (12903 samples) - none of the above categories

The annotations have been made to the best of our knowledge and belief, but we make no claim of correctness.

Some cursory notes:

- The images were created by converting the PDF files at a resolution of 300 dpi.
- The text was extracted from the PDF files using PDFPlumber. Some PDFs contain embedded images that in turn contain text, such as corporate names. This text is not extracted and is therefore not taken into account.
- The annotation was carried out with the annotation tool Prodigy.
- The category `report_date` is self-explanatory. `report_type` indicates whether the report is an annual report, a semi-annual report, or a report on a different cycle.
- `umbrella`/`fund_name` classifies any token that is part of a fund name representing an umbrella, a subfund or an individual fund. Whether a fund is an umbrella or a single fund is not always apparent from the context of the document, which makes the classification particularly challenging. To keep the annotation correct, information from the Bafin database was used for cases that could not be clarified from the context.

To explore the dataset we suggest using **deep**doctection. Place the unzipped folder in the **deep**doctection `~/.cache/datasets` folder.

```python
import deepdoctection as dd
from pathlib import Path


# Register the FRFPE token categories as a custom object type.
@dd.object_types_registry.register("ner_first_page")
class FundsFirstPage(dd.ObjectTypes):
    report_date = "report_date"
    umbrella = "umbrella"
    report_type = "report_type"
    fund_name = "fund_name"


dd.update_all_types_dict()

# expanduser() resolves "~"; pathlib does not expand it automatically.
path = Path("~/.cache/datasets/fund_ar_front_page/40952248ba13ae8bfdd39f56af22f7d9_0.json").expanduser()

page = dd.Page.from_file(path)
# The page image lives in the sibling "image" folder, stored as PNG.
page.image = dd.load_image_from_file(path.parents[0] / "image" / page.file_name.replace("pdf", "png"))

page.viz(interactive=True, show_words=True)  # close interactive window with q

for word in page.words:
    print(f"text: {word.characters}, token class: {word.token_class}")
```
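The final loop in the example prints one label per word. For a quick per-page overview, the labels can be tallied instead. The following is a minimal sketch, assuming only that each word object exposes a `token_class` attribute as in that loop. The helper name `count_token_classes` and the stand-in words are hypothetical and not part of deepdoctection or FRFPE.

```python
from collections import Counter
from types import SimpleNamespace


def count_token_classes(words):
    """Tally how often each token class occurs in an iterable of words.

    Each word only needs a hashable `token_class` attribute.
    """
    return Counter(word.token_class for word in words)


# Stand-in words for illustration only (hypothetical, not taken from FRFPE).
demo_words = [
    SimpleNamespace(characters="Annual", token_class="report_type"),
    SimpleNamespace(characters="Report", token_class="report_type"),
    SimpleNamespace(characters="2021", token_class="report_date"),
    SimpleNamespace(characters="as", token_class="other"),
]

counts = count_token_classes(demo_words)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

With a loaded page, `count_token_classes(page.words)` should produce the same kind of tally, provided `token_class` is set on every word.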
Provided by:
deepdoctection
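Summary of original information

Dataset overview

Dataset name

  • Funds Report Front Page Entities (FRFPE)

Dataset purpose

  • Document understanding and token classification.

Dataset contents

  • 356 title/front pages of annual and semi-annual reports.
  • Extracted text and annotations for five token categories.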

Languages

  • English: 167 documents
  • German: 149 documents
  • French: 9 documents

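Token categories

  • report_date: 1096 samples, the reporting date of the report.
  • report_type: 738 samples, the report type (annual/semi-annual).
  • umbrella: 912 samples, a fund issued as an umbrella.
  • fund_name: 2122 samples, a subfund of an umbrella fund, or a standalone fund.
  • other: 12903 samples, anything that belongs to none of the categories above.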
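The counts above are heavily skewed toward `other`. A minimal sketch of how that skew could be quantified is shown below. The inverse-frequency weighting is an assumed, commonly used heuristic for imbalanced token classification. It is not something the dataset description prescribes.

```python
# Sample counts per token category, as listed above.
SAMPLE_COUNTS = {
    "report_date": 1096,
    "report_type": 738,
    "umbrella": 912,
    "fund_name": 2122,
    "other": 12903,
}

total = sum(SAMPLE_COUNTS.values())  # 17771 samples in total

# Share of each category among all samples.
shares = {label: n / total for label, n in SAMPLE_COUNTS.items()}

# Inverse-frequency weights: total / (num_classes * count).
# Assumed heuristic, not part of the dataset description.
num_classes = len(SAMPLE_COUNTS)
weights = {label: total / (num_classes * n) for label, n in SAMPLE_COUNTS.items()}

for label in SAMPLE_COUNTS:
    print(f"{label:12s} share={shares[label]:.3f} weight={weights[label]:.2f}")
```

Weights of this kind could be passed to a weighted loss during training. Whether that helps on FRFPE would need to be checked empirically.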

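Dataset characteristics

  • Images were created by converting the PDF files at a resolution of 300 dpi.
  • Text was extracted from the PDF files using PDFPlumber.
  • Annotation was carried out with the Prodigy tool.
  • For fund types that could not be determined from the document content, information from the Bafin database was consulted.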
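Because the page images were rendered at 300 dpi while PDF coordinates are commonly expressed in points (1/72 inch), a box measured in points would need to be scaled by 300/72 to line up with the image. The sketch below illustrates that conversion. Whether FRFPE stores word boxes in points or already in pixels is not stated here, so treat it as an illustration only. `points_to_pixels` is a hypothetical helper.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch


def points_to_pixels(box, dpi=300):
    """Scale an (x0, y0, x1, y1) box from PDF points to pixels at `dpi`."""
    return tuple(round(value * dpi / POINTS_PER_INCH) for value in box)


# A 72 x 36 point box (1 inch x 0.5 inch), offset 72 points from the origin.
box_in_points = (72.0, 72.0, 144.0, 108.0)
box_in_pixels = points_to_pixels(box_in_points)
print(box_in_pixels)  # (300, 300, 600, 450)
```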