
arxiver

Source: ModelScope community (魔搭社区). Updated 2025-11-12; indexed 2024-10-26.
Download link: https://modelscope.cn/datasets/AI-ModelScope/arxiver
## Arxiver Dataset

Arxiver consists of 63,357 [arXiv](https://arxiv.org/) papers converted to multi-markdown (**.mmd**) format. The dataset includes the original arXiv article IDs, titles, abstracts, authors, publication dates, URLs, and the corresponding markdown files for papers published between January 2023 and October 2023.

We hope the dataset will be useful for applications such as semantic search, domain-specific language modeling, question answering, and summarization.

## Curation

The Arxiver dataset is created using [Nougat](https://facebookresearch.github.io/nougat/), a neural OCR model. After OCR processing, we apply custom text processing steps to refine the data. These include extracting author information, removing reference sections, and performing additional cleaning and formatting. Please refer to our GitHub [repo](https://github.com/neuralwork/arxiver) for details.

## Using Arxiver

You can download and use the arxiver dataset with Hugging Face's [datasets](https://huggingface.co/datasets) library.

```py
from datasets import load_dataset

# The whole dataset takes 1.44 GB
dataset = load_dataset("neuralwork/arxiver")
print(dataset)
```

Alternatively, you can stream the dataset to save disk space or to download only part of it:

```py
from datasets import load_dataset

dataset = load_dataset("neuralwork/arxiver", streaming=True)
print(dataset)
print(next(iter(dataset['train'])))
```

## References

The original articles are maintained by [arXiv](https://arxiv.org/) and copyrighted to the original authors. Please refer to the arXiv license information [page](https://info.arxiv.org/help/license/index.html) for details.

We release our dataset under a Creative Commons Attribution-Noncommercial-ShareAlike (CC BY-NC-SA 4.0) license. If you use this dataset in your research or project, please cite it as follows:

```
@misc{acar_arxiver2024,
  author = {Alican Acar, Alara Dirik, Muhammet Hatipoglu},
  title = {ArXiver},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/neuralwork/arxiver}}
}
```
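
When working with the streaming option described under "Using Arxiver", it can help to sample or filter records lazily rather than materialize the whole dataset. The following is a minimal sketch, not part of the official dataset tooling: the helpers `take` and `filter_by_keyword` are hypothetical, and the field names `id`, `title`, and `abstract` are assumptions based on the field list in the description. It runs on in-memory stand-in records so it needs no download.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def take(records: Iterable[Record], n: int) -> List[Record]:
    """Return the first n records without consuming the rest of the stream."""
    return list(islice(records, n))


def filter_by_keyword(
    records: Iterable[Record], keyword: str, field: str = "abstract"
) -> Iterator[Record]:
    """Lazily yield records whose `field` contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    for record in records:
        if needle in str(record.get(field, "")).lower():
            yield record


# Stand-in records; the real column names are an assumption and should be
# checked against an actual record before relying on them.
sample = [
    {"id": "0001", "title": "Paper A", "abstract": "We study diffusion models."},
    {"id": "0002", "title": "Paper B", "abstract": "A note on quantum circuits."},
    {"id": "0003", "title": "Paper C", "abstract": "Diffusion for audio synthesis."},
]

matches = take(filter_by_keyword(sample, "diffusion"), 2)
print([r["id"] for r in matches])  # ['0001', '0003']
```

To apply the same helpers to the real data, one could pass `load_dataset("neuralwork/arxiver", streaming=True)["train"]` in place of `sample`, after inspecting one record with `next(iter(...))` to confirm the actual column names. Because both helpers are lazy, only as many records as needed are fetched.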

Provider: maas
Created: 2024-10-21