COREX-18
Source: ModelScope community. Updated 2025-11-27; indexed 2025-10-11.
Download link:
https://modelscope.cn/datasets/laion/COREX-18
Resource description:
<p align="center">
<img src="COREX.jpg" alt="A Greek scholar writing a manuscript in 1080 BC" width="250" height="250">
</p>
<h1 align="center">COREX 18</h1>
**Introducing COREX-18**, a comprehensive dataset derived from the 2018 version of the CORE dataset. Our goal is to support the research community by compiling open-access scientific papers and publishing them as large-scale datasets. These datasets are intended to facilitate advanced Retrieval-Augmented Generation (RAG) applications and to advance artificial intelligence research.
COREX was developed as part of our X initiative, which aims to maintain and compile publicly available data into accessible and regularly updated datasets.
**COREX-18 consists of over 85 million rows.**
We did not include all of the original metadata, because of its complexity and the large number of NULL values it contains. Instead, we kept the critical fields needed to track research papers and understand their basic information.
The abstracts and titles in COREX-18 are unaltered from the original CORE dataset. We preserved their integrity and did not apply any text cleaning.
COREX-18 is primarily aimed at RAG applications and at work on citation data and scientific knowledge.
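Because titles and abstracts ship without text cleaning and some fields may be NULL, a RAG pipeline would likely normalize each record before indexing it. The following is a minimal sketch of that step, not part of the dataset's own tooling. It assumes records have already been loaded as Python dictionaries, and the column names `title` and `abstract` are hypothetical; check the actual schema before use.

```python
import re
from typing import Iterable, Iterator, Optional


def normalize(text: Optional[str]) -> str:
    """Collapse whitespace runs; treat None (NULL) as an empty string."""
    if text is None:
        return ""
    return re.sub(r"\s+", " ", text).strip()


def to_passages(records: Iterable[dict]) -> Iterator[str]:
    """Yield one retrieval passage per record that has a usable abstract."""
    for rec in records:
        # "title" and "abstract" are assumed column names (hypothetical).
        title = normalize(rec.get("title"))
        abstract = normalize(rec.get("abstract"))
        if not abstract:
            continue  # skip rows whose abstract is NULL or blank
        yield f"{title}\n{abstract}" if title else abstract


# Tiny in-memory sample standing in for real rows.
sample = [
    {"title": "  Graph   Search ", "abstract": "We study\nsearch."},
    {"title": "No abstract", "abstract": None},
    {"title": None, "abstract": "Only an abstract."},
]
passages = list(to_passages(sample))
```

Writing `to_passages` as a generator means it can consume a streamed iterator of rows, which matters at a scale of more than 85 million rows.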
*Note: A full-text version will be added soon.*
Provider:
maas
Created:
2025-10-03



