
MiMouse Paper Dataset (咪鼠论文数据)

Source: ModelScope (魔搭社区) · Updated 2025-09-22 · Indexed 2025-06-21
Download link:
https://modelscope.cn/datasets/MiMouse/MiMouseArxiv
Description:
# MiMouse Paper Dataset

## 1. Introduction

This repository is a backup of downloaded arXiv paper data. To simplify backup, the papers are packed into one compressed archive per month, covering arXiv papers from April 2007 onward.

The data comes from Google Cloud Storage and is downloaded with the gsutil client.

## 2. Downloading Paper Data from Google Cloud Storage

### 1. Install the tool

The Python version must be 3.12 or lower (as of July 2025).

```
pip install gsutil
```

### 2. Configure hosts

Direct access from networks in mainland China can fail, apparently because of DNS resolution; some regions can connect without changes. If you cannot connect, add the following entry to your hosts file:

```
142.250.217.91 storage.googleapis.com
```

* On Windows, the hosts file is at `C:/Windows/System32/drivers/etc/hosts`
* On Linux and macOS, the hosts file is at `/etc/hosts`

### 3. Download data

* Download all data (not recommended; the total is several terabytes)

```bash
gsutil -m cp -R gs://arxiv-dataset/arxiv/arxiv/pdf/ .
```

The command above downloads in parallel (multi-threaded) into the current directory.

* Download one year of data

```bash
gsutil -m cp -R gs://arxiv-dataset/arxiv/arxiv/pdf/24* .
```

The source groups papers by month, in folders named `year + month`; the command above downloads the data for 2024.

* Download one month of data

```
gsutil -m cp -R gs://arxiv-dataset/arxiv/arxiv/pdf/1308 .
```

To download a single month, concatenate the year and the month; `1308` is August 2013.

### 4. Download and compress one month of data

```
filename=1909 && \
gsutil -m cp -R gs://arxiv-dataset/arxiv/arxiv/pdf/$filename . && \
tar -I "pigz -p100" -cf $filename.tgz ./$filename && \
rm -rf ./$filename
```

* The command above downloads the data for September 2019 into the current directory, compresses it with 100 threads, then deletes the original folder so that only the compressed archive remains.
* pigz must be installed. On Ubuntu, the installation command is:

```
apt update -y && apt install pigz -y
```
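The single-month command above can be extended to a range of months. The following is a minimal sketch, not part of the original repository: `month_keys` and `fetch_month` are hypothetical helper names, the default of 8 compression threads is an arbitrary assumption, and the bucket path is the one used in the commands above. It assumes GNU tar, pigz, and gsutil are installed.

```shell
# Sketch only: month_keys and fetch_month are hypothetical helpers,
# not part of gsutil or of the original repository.

# Print every YYMM folder name from $1 to $2 inclusive, one per line.
# Assumes both arguments are four digits and fall in the same century.
month_keys() (
  yy=${1%??}; yy=${yy#0}            # first two digits, leading zero stripped
  mm=${1#??}; mm=${mm#0}            # last two digits, leading zero stripped
  end_yy=${2%??}; end_yy=${end_yy#0}
  end_mm=${2#??}; end_mm=${end_mm#0}
  while [ "$yy" -lt "$end_yy" ] ||
        { [ "$yy" -eq "$end_yy" ] && [ "$mm" -le "$end_mm" ]; }; do
    printf '%02d%02d\n' "$yy" "$mm"
    if [ "$mm" -eq 12 ]; then
      yy=$((yy + 1)); mm=1
    else
      mm=$((mm + 1))
    fi
  done
)

# Download one month, compress it with pigz, and remove the raw folder.
# $1 = YYMM folder name, $2 = pigz thread count (assumed default: 8).
fetch_month() (
  key=$1
  threads=${2:-8}
  if [ -f "$key.tgz" ]; then
    # Already archived: skip, so the loop can be re-run after a failure.
    echo "skip $key: $key.tgz already exists"
  else
    gsutil -m cp -R "gs://arxiv-dataset/arxiv/arxiv/pdf/$key" . &&
      tar -I "pigz -p$threads" -cf "$key.tgz" "./$key" &&
      rm -rf "./$key"
  fi
)
```

For example, `for key in $(month_keys 1901 1912); do fetch_month "$key" 16 || break; done` would process all of 2019 one month at a time and stop at the first failure. Working month by month keeps the uncompressed data on disk limited to a single folder.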
Provider:
maas
Created:
2025-06-08