
COREX-18text

ModelScope community · Updated 2025-11-27 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/laion/COREX-18text
Description:
<h1 style="text-align: center; margin: 0;">CORE-18 Fulltext</h1>
<div style="text-align: center;"> <img src="CORE_Greek.jpg" alt="CORE Greek" style="width: 250px; height: 250px; display: block; margin: 0 auto;"> </div>

**Introducing the CORE-18 Full Text dataset**, among the first well-maintained public datasets of CORE. CORE offers one of the largest collections of research papers, including supplementary metadata, to support Artificial Intelligence, Machine Learning research, and engineering projects. This dataset has gained significant attention among major corporations and research laboratories for Natural Language Processing research.

Recognizing the importance of accessibility, LAION's goal was to create a well-maintained public corpus. This enables the general public and the open-source research community to utilize the CORE dataset without the burden of computationally demanding extraction and processing.

Under our Open Science research initiative and project, we are making this dataset accessible to the public.

### Dataset information

**Publication Date:** 2018. The most recent datasets are not yet suitable for public sharing.

**Size:** Over 220GB (GZIP Compressed)

**Number of rows:** 9,835,064

**Update Frequency:** Every two years

**Was textual preprocessing performed on the dataset?** No, we refrained from preprocessing the dataset due to the presence of Cyrillic, Latin, and special characters. Preprocessing could have potentially resulted in unicode disruptions or unintended information loss.

To ensure ethical and transparent research, we kindly ask all users of this dataset to exercise responsible usage. When presenting your work, please acknowledge our contributions by citing our dataset accordingly.
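Since the data ships GZIP-compressed at over 220GB and has not been preprocessed, one reasonable way to read it is to stream it and decode it as UTF-8 rather than decompress everything up front. The sketch below is only an illustration: the card does not describe the file layout, so the line-oriented format and the `iter_lines` helper are assumptions, not part of the dataset's tooling.

```python
import gzip
from typing import Iterator


def iter_lines(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded lines from a gzip-compressed text file, one at a time.

    Assumption: the file is line-oriented text. The dataset card does not
    document the actual layout, so adapt this to the real format.
    """
    # Text mode ("rt") decodes while streaming, so the archive never has
    # to be fully decompressed into memory or onto disk.
    with gzip.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            # Strip only the trailing newline; leave every other character
            # (Cyrillic, Greek, special symbols) exactly as stored.
            yield line.rstrip("\n")
```

Iterating with `for line in iter_lines(path)` keeps memory use roughly constant regardless of archive size, and decoding explicitly as UTF-8 avoids the platform-default encoding altering non-Latin characters.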

Provider:
maas
Created:
2025-10-14