neuralwork/arxiver

Name: neuralwork/arxiver
Creator: neuralwork
Published: 2024-11-01 21:18:04
License: cc-by-nc-sa-4.0

Source: Hugging Face (updated 2024-11-01, indexed 2025-04-12)

Download link:

https://hf-mirror.com/datasets/neuralwork/arxiver


Description:

---
license: cc-by-nc-sa-4.0
size_categories:
- 10K<n<100K
---

## Arxiver Dataset

Arxiver consists of 63,357 [arXiv](https://arxiv.org/) papers converted to multi-markdown (**.mmd**) format. The dataset includes the original arXiv article IDs, titles, abstracts, authors, publication dates, URLs and corresponding markdown files for papers published between January 2023 and October 2023. We hope it will be useful for applications such as semantic search, domain-specific language modeling, question answering and summarization.

## Curation

The Arxiver dataset is created using [Nougat](https://facebookresearch.github.io/nougat/), a neural OCR model. After OCR processing, we apply custom text processing steps to refine the data: extracting author information, removing reference sections, and performing additional cleaning and formatting. Please refer to our GitHub [repo](https://github.com/neuralwork/arxiver) for details.

## Using Arxiver

You can download and use the arxiver dataset with Hugging Face's [datasets](https://huggingface.co/datasets) library.

```py
from datasets import load_dataset

# whole dataset takes 1.44GB
dataset = load_dataset("neuralwork/arxiver")
print(dataset)
```

Alternatively, you can stream the dataset to save disk space or to download only part of it:

```py
from datasets import load_dataset

# streaming=True fetches records lazily instead of downloading everything
dataset = load_dataset("neuralwork/arxiver", streaming=True)
print(dataset)
print(next(iter(dataset['train'])))
```

## References

The original articles are maintained by [arXiv](https://arxiv.org/) and copyrighted by their original authors. Please refer to the arXiv license information [page](https://info.arxiv.org/help/license/index.html) for details.

We release our dataset under a Creative Commons Attribution-Noncommercial-ShareAlike (CC BY-NC-SA 4.0) license. If you use this dataset in your research or project, please cite it as follows:

```
@misc{acar_arxiver2024,
  author = {Alican Acar and Alara Dirik and Muhammet Hatipoglu},
  title = {ArXiver},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/neuralwork/arxiver}}
}
```
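Building on the streaming example under Using Arxiver, the following is a minimal sketch of how one might scan records lazily and keep only those whose abstract mentions a keyword, stopping after a fixed number of matches. The helper names `filter_by_keyword` and `take` are hypothetical, and the column name `abstract` is an assumption based on the field list in the dataset description; check the actual column names by printing one record first. The sketch runs on a small in-memory sample so it works offline.

```py
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def filter_by_keyword(
    records: Iterable[Dict[str, str]],
    keyword: str,
    field: str = "abstract",  # assumed column name, not confirmed by the card
) -> Iterator[Dict[str, str]]:
    """Lazily yield records whose `field` contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    for record in records:
        # Treat a missing or None field as empty text.
        if needle in (record.get(field) or "").lower():
            yield record


def take(records: Iterable[Dict[str, str]], n: int) -> List[Dict[str, str]]:
    """Consume at most `n` records, so a streamed source is not read to the end."""
    return list(islice(records, n))


# Tiny made-up sample standing in for streamed records (illustrative values only).
sample_records = [
    {"id": "sample-1", "abstract": "We evaluate a neural OCR model on scientific PDFs."},
    {"id": "sample-2", "abstract": "A study of galaxy formation."},
    {"id": "sample-3", "abstract": "Markdown conversion with OCR post-processing."},
]

matches = take(filter_by_keyword(sample_records, "ocr"), 5)
print([m["id"] for m in matches])  # prints ['sample-1', 'sample-3']

# With network access, the same helpers accept the streamed split directly:
#   from datasets import load_dataset
#   stream = load_dataset("neuralwork/arxiver", streaming=True)["train"]
#   matches = take(filter_by_keyword(stream, "ocr"), 5)
```

Because both helpers work on any iterable of dictionaries, only as many records as needed to find the requested matches are pulled from the stream.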

Provider:

neuralwork
