
arxiv_abstracts

ModelScope community. Updated 2025-12-05. Indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/common-pile/arxiv_abstracts
Resource description:
# ArXiv Abstracts

## Description

Each paper uploaded to [ArXiv](https://arxiv.org/) includes structured metadata fields, including an abstract summarizing the paper’s findings and contributions.
According to [ArXiv’s licensing policy](https://info.arxiv.org/help/license/index.html), the metadata for any paper submitted to ArXiv is distributed under the CC0 license, regardless of the license of the paper itself.
Thus, this dataset contains the abstract for every paper submitted to ArXiv through late 2024.
We source the abstracts from ArXiv’s API via the Open Archives Initiative Protocol for Metadata Harvesting endpoint and reproduce them as-is.
Code for collecting, processing, and preparing this dataset is available in the [common-pile GitHub repo](https://github.com/r-three/common-pile).

## Dataset Statistics

| Documents | UTF-8 GB |
|-----------|----------|
| 2,538,935 | 2.4 |

## License Issues

While we aim to produce datasets with completely accurate licensing information, license laundering and inaccurate metadata can cause us to erroneously assign the incorrect license to some documents (for further discussion of this limitation, please see [our paper](https://huggingface.co/papers/2506.05209)).
If you believe you have found an instance of incorrect licensing in this dataset, please [start a discussion](https://github.com/r-three/common-pile/discussions/new) on this repository.

## Other Versions

This is the "raw" version of ArXiv Abstracts.
If you are looking for the filtered version used to train [Comma v0.1](https://huggingface.co/common-pile/comma-v0.1), you can find it [here](https://huggingface.co/datasets/common-pile/arxiv_abstracts_filtered).
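
The Description notes that the abstracts are harvested through ArXiv's OAI-PMH endpoint. As a rough illustration of what one step of such a harvest might look like, the Python sketch below builds a `ListRecords` request URL and parses a single response page. It is not the common-pile collection code (that lives in the linked repository). The endpoint URL, the function names `build_list_records_url` and `parse_list_records`, and the `SAMPLE_PAGE` record are assumptions made for illustration only; check ArXiv's current OAI-PMH documentation before pointing this at the live service.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed endpoint URL; verify against ArXiv's current OAI-PMH documentation.
OAI_ENDPOINT = "https://oaipmh.arxiv.org/oai"

# XML namespaces for the OAI-PMH envelope and the arXiv metadata format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "arxiv": "http://arxiv.org/OAI/arXiv/",
}


def build_list_records_url(resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it must not be combined with metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "arXiv"}
    return OAI_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) for one ListRecords response page."""
    root = ET.fromstring(xml_text)
    records = []
    for meta in root.iterfind(".//oai:record/oai:metadata/arxiv:arXiv", NS):
        records.append({
            "id": meta.findtext("arxiv:id", default="", namespaces=NS).strip(),
            # The abstract text is kept exactly as it appears in the response.
            "abstract": meta.findtext("arxiv:abstract", default="", namespaces=NS),
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An empty or missing token means there are no further pages.
    return records, (token.strip() or None)


# A made-up response page with placeholder values, used only to exercise the parser.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:arXiv.org:0000.00000</identifier></header>
      <metadata>
        <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
          <id>0000.00000</id>
          <title>Placeholder title</title>
          <abstract>Placeholder abstract text.</abstract>
        </arXiv>
      </metadata>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    records, next_token = parse_list_records(SAMPLE_PAGE)
    print(records, next_token)
    print(build_list_records_url(next_token))
```

A full harvest would repeat the request with each returned resumption token until none is returned, pausing between requests to respect the service's rate limits.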

## Citation

If you use this dataset, please cite:

```bibtex
@article{kandpal2025common,
  title={{The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text}},
  author={Nikhil Kandpal and Brian Lester and Colin Raffel and Sebastian Majstorovic and Stella Biderman and Baber Abbasi and Luca Soldaini and Enrico Shippole and A. Feder Cooper and Aviya Skowron and Shayne Longpre and Lintang Sutawika and Alon Albalak and Zhenlin Xu and Guilherme Penedo and Loubna Ben and Elie Bakouch and John David and Honglu Fan and Dashiell Stander and Guangyu Song and Aaron Gokaslan and John Kirchenbauer and Tom Goldstein and Brian R and Bhavya Kailkhura and Tyler Murray},
  journal={arXiv preprint},
  year={2025}
}
```

Provider:
maas
Created:
2025-06-11