
laion/COREX-18text

Source: Hugging Face. Updated: 2024-10-04. Indexed: 2025-04-08.
Download link:
https://hf-mirror.com/datasets/laion/COREX-18text
Description:
The CORE-18 Full Text dataset is one of the first well-maintained public datasets from CORE. It contains a large collection of research papers with supplementary metadata, intended to support Artificial Intelligence and Machine Learning research as well as engineering projects. The dataset has attracted significant attention from major corporations and research laboratories, especially in Natural Language Processing. LAION's aim is to provide a well-maintained public corpus so that the general public and the open-source research community can use the CORE dataset without computationally intensive extraction and processing. The dataset is over 220GB in size, contains 9,835,064 entries, and is updated every two years. Because the texts use a variety of character sets, no textual preprocessing has been applied, to avoid Unicode corruption or unintended information loss. Users are kindly asked to use the dataset responsibly and to acknowledge our contributions by citing it when presenting their work.
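
Given the size quoted above, it may be more practical to stream records than to download the full corpus first. The following is a minimal sketch, not an official loader. It assumes the Hugging Face `datasets` library is installed, that the repository exposes a split named "train", and that "220GB" means decimal gigabytes. The helper names (`average_record_bytes`, `peek`, `open_stream`) are hypothetical and exist only for illustration. The record schema is not documented on this page, so the sketch assumes no field names.

```python
from itertools import islice

DATASET_ID = "laion/COREX-18text"   # repository name as listed on this page
APPROX_SIZE_BYTES = 220 * 10**9     # "over 220GB": a lower bound, decimal GB assumed
NUM_RECORDS = 9_835_064             # entry count stated in the description


def average_record_bytes(total_bytes=APPROX_SIZE_BYTES, records=NUM_RECORDS):
    """Rough mean size of one record in bytes, from the figures in the description."""
    return total_bytes / records


def peek(records, n=3):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(records, n))


def open_stream(split="train"):
    """Open the dataset lazily over the network.

    The split name "train" is an assumption; adjust it to whatever the
    repository actually provides.
    """
    # Imported here so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)
```

Calling `average_record_bytes()` gives roughly 22 kB per record, which is a lower bound because the description only says "over 220GB". Something like `peek(open_stream())` would fetch a few records so their keys can be inspected before any processing is written against them.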
Provider:
laion