
neuralwork/arxiver

Hugging Face | Updated 2024-11-01 | Indexed 2025-04-12
Download link: https://hf-mirror.com/datasets/neuralwork/arxiver
Resource description:
---
license: cc-by-nc-sa-4.0
size_categories:
- 10K<n<100K
---

## Arxiver Dataset

Arxiver consists of 63,357 [arXiv](https://arxiv.org/) papers converted to multi-markdown (**.mmd**) format. The dataset includes the original arXiv article IDs, titles, abstracts, authors, publication dates, URLs and the corresponding markdown files for papers published between January 2023 and October 2023.

We hope the dataset will be useful for applications such as semantic search, domain-specific language modeling, question answering and summarization.

## Curation

The Arxiver dataset is created using a neural OCR model, [Nougat](https://facebookresearch.github.io/nougat/). After OCR processing, we apply custom text processing steps to refine the data. These include extracting author information, removing reference sections, and performing additional cleaning and formatting. Please refer to our GitHub [repo](https://github.com/neuralwork/arxiver) for details.

## Using Arxiver

You can download and use the arxiver dataset with Hugging Face's [datasets](https://huggingface.co/datasets) library.

```py
from datasets import load_dataset

# whole dataset takes 1.44GB
dataset = load_dataset("neuralwork/arxiver")
print(dataset)
```

Alternatively, you can stream the dataset to save disk space or to download only part of it:

```py
from datasets import load_dataset

dataset = load_dataset("neuralwork/arxiver", streaming=True)
print(dataset)
print(next(iter(dataset['train'])))
```

## References

The original articles are maintained by [arXiv](https://arxiv.org/) and copyrighted to the original authors. Please refer to the arXiv license information [page](https://info.arxiv.org/help/license/index.html) for details.

We release our dataset under a Creative Commons Attribution-Noncommercial-ShareAlike (CC BY-NC-SA 4.0) license. If you use this dataset in your research or project, please cite it as follows:

```
@misc{acar_arxiver2024,
  author = {Alican Acar and Alara Dirik and Muhammet Hatipoglu},
  title = {ArXiver},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/neuralwork/arxiver}}
}
```
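As a follow-up to the streaming option described under "Using Arxiver", the sketch below shows one way to pull only a handful of matching papers from a stream without downloading the full dataset. It is a minimal illustration, not part of the dataset's own tooling: the helper names `take_matching` and `preview_arxiver` are hypothetical, and the column names `abstract` and `title` are assumptions based on the fields listed in the description, so check them against an actual record before relying on them.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def take_matching(records: Iterable[Dict], keyword: str, limit: int = 3,
                  field: str = "abstract") -> Iterator[Dict]:
    """Lazily yield up to `limit` records whose `field` contains `keyword`.

    `field` defaults to "abstract", which is an assumed column name.
    """
    needle = keyword.lower()
    # Generator expression: records are only pulled from the stream on demand.
    matches = (r for r in records if needle in str(r.get(field, "")).lower())
    # islice stops consuming the stream once `limit` matches have been found.
    return islice(matches, limit)


def preview_arxiver(keyword: str = "diffusion", limit: int = 3) -> None:
    """Print the titles of the first few streamed papers matching `keyword`."""
    # Imported here so take_matching works without the library installed.
    from datasets import load_dataset

    stream = load_dataset("neuralwork/arxiver", split="train", streaming=True)
    for paper in take_matching(stream, keyword, limit=limit):
        # "title" is an assumed column name; .get avoids a KeyError if it differs.
        print(paper.get("title"))
```

Calling `preview_arxiver()` needs network access and the `datasets` package. Because the filter is lazy, streaming stops as soon as `limit` matches are found, although a rare keyword can still scan a large part of the stream.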
Provided by: neuralwork