RealKIE

arXiv | Updated 2024-03-29 | Indexed 2024-06-21
Download link:
https://indicodatasolutions.github.io/RealKIE/
Description:
RealKIE is a benchmark of five challenging datasets intended to advance key information extraction (KIE) methods, with an emphasis on enterprise applications. The datasets cover a range of document types: SEC S1 filings, U.S. non-disclosure agreements (NDAs), UK charity reports, FCC invoices, and resource contracts. Each dataset presents its own challenges, such as poor text serialization, sparse annotations in long documents, and complex table layouts. Together they provide a realistic testbed for KIE tasks such as investment analysis and legal data processing. The accompanying paper also describes the annotation process, document processing techniques, and baseline modeling approaches, with the aim of supporting NLP models that can handle real-world documents and further research on industry-specific information extraction.
Provider:
Indico Data Solutions
Created:
2024-03-29