
LoRaLay

Source: arXiv. Updated: 2023-01-27. Indexed: 2024-06-21.
Download links:
https://hf.co/datasets/nglaura/arxivlay-summarization
https://hf.co/datasets/nglaura/pubmedlay-summarization
https://hf.co/datasets/nglaura/hal-summarization
https://hf.co/datasets/nglaura/scielo-summarization
https://hf.co/datasets/nglaura/koreascience-summarization
Overview:
LoRaLay is a multilingual, multimodal dataset designed for long-range, layout-aware summarization. Created by reciTAL, it contains 128,000 documents in English, French, Spanish, Portuguese, and Korean, drawn mainly from academic articles. Each document comes with visual and layout information, so researchers can study how that information helps capture long-range dependencies across the text. The dataset was built by collecting PDF files from several academic sources and extracting both the text and the layout elements. Its application areas include natural language processing, document understanding, and machine learning, with the aim of processing and understanding long documents with complex layouts.
Provider:
reciTAL, Paris, France
Created:
2023-01-27
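
Loading example (a sketch, not an official recipe):
The download links listed above point at Hugging Face Hub dataset repositories, so each one maps to an "owner/name" repository ID that the `datasets` library can consume. The snippet below is a minimal sketch of that mapping. The helper names `repo_id_from_url` and `load_subset` are hypothetical, and the `train` split name is an assumption that this page does not confirm. Depending on the installed `datasets` version, repositories that ship a loading script may need extra options; that has not been verified here.

```python
from urllib.parse import urlparse

# Download links copied from this page.
LORALAY_URLS = [
    "https://hf.co/datasets/nglaura/arxivlay-summarization",
    "https://hf.co/datasets/nglaura/pubmedlay-summarization",
    "https://hf.co/datasets/nglaura/hal-summarization",
    "https://hf.co/datasets/nglaura/scielo-summarization",
    "https://hf.co/datasets/nglaura/koreascience-summarization",
]


def repo_id_from_url(url: str) -> str:
    """Turn a Hub dataset URL into an 'owner/name' repository ID (hypothetical helper)."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hub dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


REPO_IDS = [repo_id_from_url(u) for u in LORALAY_URLS]


def load_subset(repo_id: str, split: str = "train", streaming: bool = True):
    """Load one subset (hypothetical helper).

    Requires `pip install datasets` and network access. The default split
    name "train" is an assumption, not something this page states.
    """
    # Imported lazily so the URL parsing above works without the library.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=streaming)
```

Calling `load_subset(REPO_IDS[0])` would then attempt to stream the arXiv-Lay subset. Streaming is chosen here so that the documents are not downloaded in full before the first record can be inspected.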