
LinxSci/arxiv-paper-insights

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/LinxSci/arxiv-paper-insights
Resource description:
---
license: cc-by-4.0
pretty_name: arXiv Paper Insights
language:
- en
tags:
- arxiv
- paper-insights
- scientific-llm
- summarization
- research
- linxsci
task_categories:
- text-generation
- feature-extraction
- sentence-similarity
size_categories:
- 1K<n<10K
---

# arXiv Paper Insights

## Overview

arXiv Paper Insights is a public dataset for arXiv paper discovery, recommendations, and structured research insights.

The dataset is derived from LinxSci (`https://linxsci.com`), a paper reading and insight platform focused on arXiv papers. For any paper in this dataset, you can open `https://linxsci.com/pdf/{arxiv_id}` to read the paper and view the corresponding insights on LinxSci.

## Related Links

- LinxSci: `https://linxsci.com`
- GitHub project page: `https://github.com/LinxSci/arxiv-paper-insights`
- Kaggle dataset page: `https://www.kaggle.com/datasets/linxsci/arxiv-paper-insights`
- Hugging Face dataset page: `https://huggingface.co/datasets/LinxSci/arxiv-paper-insights`

## Data Source

- Source platform: LinxSci (`https://linxsci.com`)
- Upstream corpus: arXiv papers
- Enrichment: recommendation pipeline outputs and structured insights derived by LinxSci

## Files

- `recommendations.parquet` / `recommendations.jsonl`
- `insights.parquet` / `insights.jsonl`
- `SCHEMA.md`

## What Is Included

- recommendation records with lightweight paper metadata
- structured and sanitized paper insights
- dataset fields designed for ranking, retrieval, summarization, and scientific IE

## Update Schedule

- the dataset is updated every Monday
- each update covers the previous week's recommended arXiv papers
- corresponding insights are included when available in LinxSci's release pipeline

## What Is Excluded

- full paper text
- PDF files
- markdown copies of papers
- raw chunk provenance
- figure and table image binaries

## Field Guide

### recommendations

- `date`: recommendation date
- `arxiv_id`: arXiv paper identifier
- `rank`: rank within the recommendation list
- `title`: paper title
- `abstract`: paper abstract
- `tldr`: short summary
- `arxiv_categories`: arXiv categories
- `github_links`: extracted GitHub links
- `hf_links`: extracted Hugging Face links
- `github_stars`: GitHub stars snapshot when available
- `figures_count`: number of figures detected
- `tables_count`: number of tables detected
- `has_insight`: whether a sanitized insight record exists

### insights

- `arxiv_id`: arXiv paper identifier
- `title`: paper title
- `abstract`: paper abstract
- `paper_type`: coarse paper type label
- `keywords_extracted`: extracted keywords
- `project_page`: project page URL
- `github_repo`: GitHub repository URL
- `demo_url`: demo URL
- `problem_statement`: concise problem statement
- `proposed_solution_overview`: concise solution summary
- `key_contributions`: list of contribution statements
- `limitations`: list of limitation statements
- `future_work`: future work summary
- `research_questions_or_hypotheses`: list of research questions
- `dataset_names`: dataset names mentioned in the paper
- `evaluation_metric_names`: names of evaluation metrics
- `baseline_method_names`: baseline or comparison method names
- `implementation_details`: concise implementation summary
- `key_references_summary`: summarized key references
- `main_findings`: main findings
- `sota_comparison`: summary of comparison to prior work
- `ablation_findings`: ablation summaries
- `failure_cases`: failure case summaries
- `is_code_available`: whether code availability is indicated
- `code_url`: code URL when available
- `is_data_available`: whether data availability is indicated
- `gpu_requirement`: GPU requirement summary when available
- `training_time`: training time summary when available
- `personal_summary`: reader-oriented summary
- `strengths`: strengths list
- `weaknesses`: weaknesses list
- `questions_and_ideas`: follow-up questions or ideas
- `tags`: tag list
- `prerequisite_knowledge`: prerequisite knowledge list
- `figures_count`: number of figures
- `tables_count`: number of tables
- `formulas_count`: number of formulas

## How To Use

Typical use cases:

- paper recommendation
- scientific information extraction
- research-agent memory
- retrieval and summarization

To inspect a specific paper on LinxSci, open:

`https://linxsci.com/pdf/{arxiv_id}`

Example:

`https://linxsci.com/pdf/2401.01234`

## Loading Examples

### Load with pandas

```python
import pandas as pd

recommendations = pd.read_parquet("recommendations.parquet")
insights = pd.read_parquet("insights.parquet")
```

### Load JSONL

```python
import pandas as pd

recommendations = pd.read_json("recommendations.jsonl", lines=True)
insights = pd.read_json("insights.jsonl", lines=True)
```

### Join recommendations and insights

```python
merged = recommendations.merge(insights, on="arxiv_id", how="left")
```

## Limitations

- some snapshot-based fields can become stale
- recommendation and insight outputs are derived artifacts
- structured insights may contain extraction errors
- not all papers include external project links or code links

## License Notes

License: `CC BY 4.0` for derived dataset artifacts only.

Original paper contents, PDFs, and other third-party source materials remain subject to their respective licenses and rights. Users are responsible for complying with upstream source terms.

## Citation

If you use this dataset, cite the project and link back to LinxSci:

```bibtex
@misc{arxiv-paper-insights,
  title = {arXiv Paper Insights},
  author = {LinxSci},
  year = {2026},
  howpublished = {\url{https://linxsci.com}},
  note = {Public dataset for arXiv paper insights and recommendation-ready metadata}
}
```

## Version History

- `2026-04-20`: initial public release
- Weekly cadence: every Monday we publish the previous week's recommended papers
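
## Join sketch with overlapping columns

The field guide lists `title`, `abstract`, `figures_count`, and `tables_count` in both tables, so a plain merge on `arxiv_id` leaves pandas to rename the overlapping columns with its default `_x` / `_y` suffixes. The sketch below is one possible way to keep the recommendation columns unchanged and mark the insight copies explicitly. It also builds the LinxSci reader URL from the documented pattern. The two small frames are hypothetical placeholders, not rows from the dataset; `linxsci_url` is an illustrative helper name; and `validate="many_to_one"` encodes an assumption (not stated in the card) that `insights` holds at most one row per `arxiv_id`.

```python
import pandas as pd

# Hypothetical stand-ins for recommendations.parquet / insights.parquet.
# All values are placeholders chosen for illustration only.
recommendations = pd.DataFrame(
    {
        "date": ["2026-04-20", "2026-04-20"],
        "arxiv_id": ["2401.01234", "2401.05678"],
        "rank": [1, 2],
        "title": ["Placeholder title A", "Placeholder title B"],
        "has_insight": [True, False],
    }
)
insights = pd.DataFrame(
    {
        "arxiv_id": ["2401.01234"],
        "title": ["Placeholder title A"],
        "paper_type": ["placeholder"],
    }
)


def linxsci_url(arxiv_id: str) -> str:
    """Build the reader URL from the pattern given in the dataset card."""
    return f"https://linxsci.com/pdf/{arxiv_id}"


# Left join keeps every recommendation row. Overlapping insight columns
# get an explicit "_insight" suffix instead of the default "_x"/"_y".
# validate="many_to_one" raises if insights has duplicate arxiv_id values
# (an assumption about the data, not a documented guarantee).
merged = recommendations.merge(
    insights,
    on="arxiv_id",
    how="left",
    suffixes=("", "_insight"),
    validate="many_to_one",
)
merged["linxsci_url"] = merged["arxiv_id"].map(linxsci_url)

# Rows flagged as having a sanitized insight record.
with_insight = merged[merged["has_insight"]]
print(merged[["arxiv_id", "rank", "title", "paper_type", "linxsci_url"]])
```

Rows without a matching insight keep their recommendation fields and carry missing values in the insight columns, so filtering on `has_insight` before analysis avoids mixing the two cases.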

Provider:
LinxSci