LinxSci/arxiv-paper-insights

Name: LinxSci/arxiv-paper-insights
Creator: LinxSci
Published: 2026-04-21 10:54:48
License: not specified in the listing (the dataset card declares cc-by-4.0)

Source: Hugging Face (updated 2026-04-21, indexed 2026-04-26)

Download link:

https://hf-mirror.com/datasets/LinxSci/arxiv-paper-insights

Description:

---
license: cc-by-4.0
pretty_name: arXiv Paper Insights
language:
- en
tags:
- arxiv
- paper-insights
- scientific-llm
- summarization
- research
- linxsci
task_categories:
- text-generation
- feature-extraction
- sentence-similarity
size_categories:
- 1K<n<10K
---

# arXiv Paper Insights

## Overview

arXiv Paper Insights is a public dataset for arXiv paper discovery, recommendations, and structured research insights.

The dataset is derived from LinxSci (`https://linxsci.com`), a paper reading and insight platform focused on arXiv papers.

For any paper in this dataset, you can open `https://linxsci.com/pdf/{arxiv_id}` to read the paper and view the corresponding insights on LinxSci.

## Related Links

- LinxSci: `https://linxsci.com`
- GitHub project page: `https://github.com/LinxSci/arxiv-paper-insights`
- Kaggle dataset page: `https://www.kaggle.com/datasets/linxsci/arxiv-paper-insights`
- Hugging Face dataset page: `https://huggingface.co/datasets/LinxSci/arxiv-paper-insights`

## Data Source

- Source platform: LinxSci (`https://linxsci.com`)
- Upstream corpus: arXiv papers
- Enrichment: recommendation pipeline outputs and structured insights derived by LinxSci

## Files

- `recommendations.parquet` / `recommendations.jsonl`
- `insights.parquet` / `insights.jsonl`
- `SCHEMA.md`

## What Is Included

- recommendation records with lightweight paper metadata
- structured and sanitized paper insights
- dataset fields designed for ranking, retrieval, summarization, and scientific IE

## Update Schedule

- the dataset is updated every Monday
- each update covers the previous week's recommended arXiv papers
- corresponding insights are included when available in LinxSci's release pipeline

## What Is Excluded

- full paper text
- PDF files
- markdown copies of papers
- raw chunk provenance
- figure and table image binaries

## Field Guide

### recommendations

- `date`: recommendation date
- `arxiv_id`: arXiv paper identifier
- `rank`: rank within the recommendation list
- `title`: paper title
- `abstract`: paper abstract
- `tldr`: short summary
- `arxiv_categories`: arXiv categories
- `github_links`: extracted GitHub links
- `hf_links`: extracted Hugging Face links
- `github_stars`: GitHub stars snapshot when available
- `figures_count`: number of figures detected
- `tables_count`: number of tables detected
- `has_insight`: whether a sanitized insight record exists

### insights

- `arxiv_id`: arXiv paper identifier
- `title`: paper title
- `abstract`: paper abstract
- `paper_type`: coarse paper type label
- `keywords_extracted`: extracted keywords
- `project_page`: project page URL
- `github_repo`: GitHub repository URL
- `demo_url`: demo URL
- `problem_statement`: concise problem statement
- `proposed_solution_overview`: concise solution summary
- `key_contributions`: list of contribution statements
- `limitations`: list of limitation statements
- `future_work`: future work summary
- `research_questions_or_hypotheses`: list of research questions
- `dataset_names`: dataset names mentioned in the paper
- `evaluation_metric_names`: names of evaluation metrics
- `baseline_method_names`: baseline or comparison method names
- `implementation_details`: concise implementation summary
- `key_references_summary`: summarized key references
- `main_findings`: main findings
- `sota_comparison`: summary of comparison to prior work
- `ablation_findings`: ablation summaries
- `failure_cases`: failure case summaries
- `is_code_available`: whether code availability is indicated
- `code_url`: code URL when available
- `is_data_available`: whether data availability is indicated
- `gpu_requirement`: GPU requirement summary when available
- `training_time`: training time summary when available
- `personal_summary`: reader-oriented summary
- `strengths`: strengths list
- `weaknesses`: weaknesses list
- `questions_and_ideas`: follow-up questions or ideas
- `tags`: tag list
- `prerequisite_knowledge`: prerequisite knowledge list
- `figures_count`: number of figures
- `tables_count`: number of tables
- `formulas_count`: number of formulas

## How To Use

Typical use cases:

- paper recommendation
- scientific information extraction
- research-agent memory
- retrieval and summarization

To inspect a specific paper on LinxSci, open `https://linxsci.com/pdf/{arxiv_id}`.

Example: `https://linxsci.com/pdf/2401.01234`

## Loading Examples

### Load with pandas

```python
import pandas as pd

recommendations = pd.read_parquet("recommendations.parquet")
insights = pd.read_parquet("insights.parquet")
```

### Load JSONL

```python
import pandas as pd

recommendations = pd.read_json("recommendations.jsonl", lines=True)
insights = pd.read_json("insights.jsonl", lines=True)
```

### Join recommendations and insights

```python
merged = recommendations.merge(insights, on="arxiv_id", how="left")
```

## Limitations

- some snapshot-based fields can become stale
- recommendation and insight outputs are derived artifacts
- structured insights may contain extraction errors
- not all papers include external project links or code links

## License Notes

License: `CC BY 4.0` for derived dataset artifacts only.

Original paper contents, PDFs, and other third-party source materials remain subject to their respective licenses and rights. Users are responsible for complying with upstream source terms.

## Citation

If you use this dataset, cite the project and link back to LinxSci:

```bibtex
@misc{arxiv-paper-insights,
  title = {arXiv Paper Insights},
  author = {LinxSci},
  year = {2026},
  howpublished = {\url{https://linxsci.com}},
  note = {Public dataset for arXiv paper insights and recommendation-ready metadata}
}
```

## Version History

- `2026-04-20`: initial public release
- Weekly cadence: every Monday we publish the previous week's recommended papers
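
## Join sketch with overlapping columns

According to the Field Guide, `recommendations` and `insights` share several column names besides `arxiv_id` (`title`, `abstract`, `figures_count`, `tables_count`), so the plain join under "Join recommendations and insights" leaves pandas to disambiguate them with its default `_x` / `_y` suffixes. The sketch below is one possible way to make that explicit. It is not part of the official card: the sample rows, the `load_jsonl` and `join_with_insights` helpers, the `_insight` suffix, and the `linxsci_url` column are all assumptions made for illustration, and it assumes at most one insight row per `arxiv_id`.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for recommendations.jsonl / insights.jsonl.
# Column names follow the Field Guide; the values are invented.
RECS_JSONL = "\n".join([
    '{"arxiv_id": "2401.01230", "rank": 1, "title": "Paper A", "has_insight": true}',
    '{"arxiv_id": "2401.05678", "rank": 2, "title": "Paper B", "has_insight": false}',
])
INSIGHTS_JSONL = '{"arxiv_id": "2401.01230", "title": "Paper A", "paper_type": "method"}'


def load_jsonl(text: str) -> pd.DataFrame:
    """Read JSON Lines text, forcing the join key to be a string."""
    # Forcing str keeps an ID such as "2401.01230" from being treated as a number.
    return pd.read_json(io.StringIO(text), lines=True, dtype={"arxiv_id": str})


def join_with_insights(recs: pd.DataFrame, ins: pd.DataFrame) -> pd.DataFrame:
    """Left-join insights onto recommendations with explicit column suffixes."""
    merged = recs.merge(
        ins,
        on="arxiv_id",
        how="left",
        suffixes=("", "_insight"),  # shared columns from insights get "_insight"
        validate="many_to_one",     # assumption: one insight row per arxiv_id
    )
    # URL pattern taken from the card: https://linxsci.com/pdf/{arxiv_id}
    merged["linxsci_url"] = "https://linxsci.com/pdf/" + merged["arxiv_id"]
    return merged


recommendations = load_jsonl(RECS_JSONL)
insights = load_jsonl(INSIGHTS_JSONL)
merged = join_with_insights(recommendations, insights)
print(merged[["arxiv_id", "title", "title_insight", "paper_type", "linxsci_url"]])
```

Rows without an insight record keep their recommendation fields and carry missing values in the insight columns, which should line up with `has_insight` being false.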
