arxiv

Name: arxiv
Creator: maas
Published: 2025-12-05 12:06:02
License: No description provided

ModelScope (魔搭社区); updated 2025-12-05; indexed 2024-12-28

Download link:

https://modelscope.cn/datasets/Intelligent-Internet/arxiv

Resource overview:

# `arXiv`

This is an [arXiv](https://arxiv.org/) dataset for use with the [II-Commons-Local](https://github.com/Intelligent-Internet/II-Commons-Local) project.

## Dataset Details

### Dataset Description

This dataset comprises a curated arXiv dataset. We provide a series of pre-computed embedding vector datasets based on arXiv paper data to help users quickly start and test the semantic search API. These datasets contain paper metadata, text from certain sections, and optimized embedding vectors. They can be downloaded and used directly, eliminating the need for time-consuming local embedding calculations.

More details about the dataset can be found [here](https://github.com/Intelligent-Internet/ii-commons-local/blob/main/DATASETS.md).

### Dataset Sources

This dataset is derived and organized from arXiv. The original license information for each paper can be found in the corresponding entry of the original database.

We merged two major [OAI](https://info.arxiv.org/help/oa/index.html) snapshots of arXiv:

- [github.com/mattbierbaum/arxiv-public-datasets](https://github.com/mattbierbaum/arxiv-public-datasets/releases)
  - 1.5M+ papers
  - cutoff at 2019-03-01
  - categories: ALL.
- [kaggle.com/datasets/Cornell-University/arxiv](https://www.kaggle.com/datasets/Cornell-University/arxiv/data)
  - 1.7M+ papers
  - cutoff at 2025-03-13
  - categories: math, statistics, electrical engineering, quantitative biology, and economics.

We are still working on this dataset, and fresher data will be added soon.

## Dataset Structure

### `ts_arxiv`

- id: A unique identifier for the paper.
- paper_id: The arXiv paper ID.
- submitter: The submitter of the paper.
- authors: The authors of the paper.
- title: The title of the paper.
- comments: The comments of the paper.
- journal_ref: The journal reference of the paper.
- doi: The DOI of the paper.
- report_no: The report number of the paper.
- versions: The versions of the paper.
- url: The URL of the paper.
- license: The license of the paper.
- abstract: The abstract of the paper.
- introduction: The introduction of the paper. [1]
- conclusion: The conclusion of the paper. [1]
- categories_flat: The categories of the paper.

[1] We only populate the introduction and conclusion fields if the paper uses a CC license.

### `ts_arxiv_embed`

- id: A unique identifier for the paper.
- abstract_vector: The vector embedding of the abstract of the paper.
- introduction_vector: The vector embedding of the introduction of the paper.
- conclusion_vector: The vector embedding of the conclusion of the paper.

## How to use

### Use `II-Commons-Local`

The easiest way to use the arXiv dataset is through the [`II-Commons-Local`(ii-commons-desktop Semantic Search API)](https://github.com/Intelligent-Internet/ii-commons-local) project. The project is a high-performance semantic search and storage API built with FastAPI and DuckDB. It allows you to store text data, generate vector embeddings for it, and perform efficient semantic searches.

### Use the pre-built [DuckDB](https://duckdb.org/) database

You can download the pre-built DuckDB database from [here](https://huggingface.co/datasets/Intelligent-Internet/arxiv/tree/main/duckdb).

### Build your own DuckDB database from exported csv files

You can build the DuckDB database from the exported csv files with the following steps:

1. Creating DuckDB database

```bash
$ duckdb arxiv.duckdb
```

2. Importing arxiv meta table `ts_arxiv`

```sql
CREATE TABLE ts_arxiv (
    id BIGINT,
    paper_id VARCHAR,
    submitter VARCHAR[],
    authors VARCHAR[],
    title VARCHAR,
    comments VARCHAR,
    journal_ref VARCHAR,
    doi VARCHAR,
    report_no VARCHAR,
    versions VARCHAR[],
    url VARCHAR,
    license VARCHAR,
    abstract VARCHAR,
    introduction VARCHAR,
    conclusion VARCHAR,
    categories_flat VARCHAR[]
);

INSERT INTO ts_arxiv (
    id, paper_id, submitter, authors, title, comments, journal_ref, doi,
    report_no, versions, url, license, abstract, introduction, conclusion,
    categories_flat
)
SELECT
    id, paper_id, submitter, authors, title, comments, journal_ref, doi,
    report_no, versions, url, license, abstract, introduction, conclusion,
    categories_flat
FROM read_csv_auto('lite/ts_arxiv/0000000.csv');
```

3. Importing vector embeddings table `ts_arxiv_embed`

```sql
CREATE TABLE ts_arxiv_embed (
    id BIGINT,
    abstract_vector UTINYINT[128],
    introduction_vector UTINYINT[128],
    conclusion_vector UTINYINT[128]
);

-- The vector columns are read as VARCHAR and parsed here: the brackets are
-- trimmed, the remainder is split on spaces, and each token is cast to UTINYINT.
INSERT INTO ts_arxiv_embed (id, abstract_vector, introduction_vector, conclusion_vector)
SELECT
    id,
    CASE
        WHEN abstract_vector IS NULL OR abstract_vector = '' OR abstract_vector = '[]' THEN NULL
        WHEN trim(trim(abstract_vector, '[]'), ' ') = '' THEN CAST([] AS UTINYINT[])
        ELSE list_transform(
            list_filter(
                string_split(trim(trim(abstract_vector, '[]'), ' '), ' '),
                element -> element <> '' AND element IS NOT NULL
            ),
            x -> CAST(x AS UTINYINT)
        )
    END AS abstract_vector,
    CASE
        WHEN introduction_vector IS NULL OR introduction_vector = '' OR introduction_vector = '[]' THEN NULL
        WHEN trim(trim(introduction_vector, '[]'), ' ') = '' THEN CAST([] AS UTINYINT[])
        ELSE list_transform(
            list_filter(
                string_split(trim(trim(introduction_vector, '[]'), ' '), ' '),
                element -> element <> '' AND element IS NOT NULL
            ),
            x -> CAST(x AS UTINYINT)
        )
    END AS introduction_vector,
    CASE
        WHEN conclusion_vector IS NULL OR conclusion_vector = '' OR conclusion_vector = '[]' THEN NULL
        WHEN trim(trim(conclusion_vector, '[]'), ' ') = '' THEN CAST([] AS UTINYINT[])
        ELSE list_transform(
            list_filter(
                string_split(trim(trim(conclusion_vector, '[]'), ' '), ' '),
                element -> element <> '' AND element IS NOT NULL
            ),
            x -> CAST(x AS UTINYINT)
        )
    END AS conclusion_vector
FROM read_csv_auto(
    'lite/ts_arxiv_embed_section_text/0000000.csv',
    types={'abstract_vector': 'VARCHAR', 'introduction_vector': 'VARCHAR', 'conclusion_vector': 'VARCHAR'}
);
```

### Use the full PostgreSQL version database

We also provide a full, un-trimmed PostgreSQL version of the database. You can download it from [here](https://huggingface.co/datasets/Intelligent-Internet/arxiv/tree/main/data).
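The `CASE` expressions in the `ts_arxiv_embed` import treat each vector column as a bracketed, space-separated string. For readers who want to inspect the exported csv files outside DuckDB, the same parsing rule can be sketched in plain Python. This is only an illustration of what the SQL appears to do: the function name `parse_vector` is hypothetical, and the 0 to 255 range check is an assumption derived from the `UTINYINT` column type.

```python
from typing import List, Optional


def parse_vector(raw: Optional[str]) -> Optional[List[int]]:
    """Parse a vector string such as '[12 0 255]' into a list of ints.

    Hypothetical helper mirroring the CASE branches of the SQL import.
    """
    # First WHEN branch: NULL, empty string, or '[]' becomes NULL (None here).
    if raw is None or raw == "" or raw == "[]":
        return None

    # Same as trim(trim(s, '[]'), ' '): drop brackets, then outer spaces.
    inner = raw.strip("[]").strip(" ")

    # Second WHEN branch: nothing left after trimming gives an empty list.
    if inner == "":
        return []

    # ELSE branch: split on single spaces and skip empty tokens, which
    # appear when the string contains consecutive spaces.
    values = [int(token) for token in inner.split(" ") if token != ""]

    # Assumption: values must fit an unsigned 8-bit integer (UTINYINT).
    for value in values:
        if not 0 <= value <= 255:
            raise ValueError(f"value {value} does not fit in UTINYINT")
    return values
```

Calling `parse_vector` on each vector field of a row read with `csv.DictReader` gives lists that can be compared or converted to arrays without loading the data into DuckDB first.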

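Both `ts_arxiv` and `ts_arxiv_embed` carry an `id` column, which suggests that a metadata row and its embedding row can be matched on that key. The sketch below shows one way to do this with the standard library only. It assumes that `id` values agree across the two tables; the names `read_rows` and `join_on_id` are hypothetical, and the sample rows are made-up placeholders using a subset of the columns, not real records.

```python
import csv
import io
from typing import Dict, Iterator, List, Tuple

Row = Dict[str, str]

# Placeholder rows for illustration only (not taken from the dataset).
SAMPLE_META = (
    "id,paper_id,title\n"
    "1,example-0001,Example title A\n"
    "2,example-0002,Example title B\n"
)
SAMPLE_EMBED = (
    "id,abstract_vector\n"
    "1,[1 2 3]\n"
    "3,[4 5 6]\n"
)


def read_rows(text: str) -> List[Row]:
    """Read csv text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_id(meta_rows: List[Row], embed_rows: List[Row]) -> Iterator[Tuple[Row, Row]]:
    """Yield (metadata, embedding) pairs whose id values match (inner join)."""
    # Index the embedding rows by id so each lookup is a dict access.
    embed_by_id = {row["id"]: row for row in embed_rows}
    for meta in meta_rows:
        embed = embed_by_id.get(meta["id"])
        if embed is not None:
            yield meta, embed


if __name__ == "__main__":
    for meta, embed in join_on_id(read_rows(SAMPLE_META), read_rows(SAMPLE_EMBED)):
        print(meta["paper_id"], embed["abstract_vector"])
```

Inside DuckDB the same result would normally come from an ordinary SQL join on `id`; the Python version is only meant for quick checks on the exported csv files.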

Provider:

maas

Created:

2025-07-08

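Collected summary

Dataset introduction

Background and challenges

Background overview

This dataset is a resource built on arXiv paper data that provides pre-computed embedding vectors to help users quickly start and test the semantic search API. It combines two arXiv data snapshots, one from GitHub and one from Kaggle, and contains paper metadata, text content, and optimized vector embeddings that can be downloaded and used directly. Its structure consists of a paper metadata table and a vector embedding table, and it can be accessed and built with several tools.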

The content above was collected and summarized by 遇见数据集.