E-Manual Corpus
Source: arXiv. Updated 2021-09-14; indexed 2024-06-21.
Download link:
https://github.com/abhi1nandy2/EMNLP-2021-Findings
Description:
The E-Manual Corpus is a large-scale dataset created by the Indian Institute of Technology Kharagpur. It comprises 307,957 electronic manuals and is used to pre-train a RoBERTa language model for stronger domain-specific natural language understanding. The manuals span many product and service categories, including baby care, kitchen appliances, and electronics, which keeps the data diverse. The corpus was built by collecting and preprocessing PDF files, yielding a text collection of approximately 1.165 billion words. It is used mainly to develop and test question answering systems for electronic devices, with the goal of retrieving information from electronic manuals effectively.
Provider:
Indian Institute of Technology Kharagpur
Created:
2021-09-13



