asta-summary-citation-counts
ModelScope community · updated 2025-12-03 · indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/allenai/asta-summary-citation-counts
## Dataset Summary
This dataset tracks which scientific papers are most often cited by [**Asta**](https://asta.ai), an agentic research platform that uses retrieval-augmented generation (RAG) to answer scientific questions. Each record is a paper cited by Asta's _Summarize Literature_ tool, ranked by the number of times the system cited that paper. Across more than 113,000 user queries, we track **4M citations** to over **2M distinct papers**. By making this data public, we aim to create a transparent, trackable measure of which research most directly powers AI-generated answers—helping ensure that scientific contributions are visible and credited in the AI era.
Weekly updates reflect ongoing usage patterns as Asta continues to evolve. We invite researchers, bibliometricians, and AI developers to explore citation dynamics across fields, assess how AI systems surface influential work, and help build a future where credit and accountability are integral to AI-assisted discovery.
The most recent update to the data can always be retrieved using the `latest` config:
`dataset = load_dataset("allenai/asta-summary-citation-counts", "latest")`
Older checkpoints can be retrieved by date. For example:
`dataset = load_dataset("allenai/asta-summary-citation-counts", "2025-10-07")`
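The two calls above differ only in the config name, so a small wrapper can keep that choice in one place. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that checkpoints are named by ISO date as in the example above. The helper names `config_name`, `available_checkpoints`, and `load_checkpoint` are hypothetical and not part of the dataset or the library. The sketch does not check that a checkpoint exists for a given date.

```python
from datetime import date
from typing import Optional, Union

REPO_ID = "allenai/asta-summary-citation-counts"


def config_name(checkpoint: Optional[Union[str, date]] = None) -> str:
    """Return the config name: 'latest' or an ISO date such as '2025-10-07'."""
    if checkpoint is None or checkpoint == "latest":
        return "latest"
    if isinstance(checkpoint, date):
        return checkpoint.isoformat()
    # Parse strings early so a malformed date fails before any download starts.
    return date.fromisoformat(checkpoint).isoformat()


def available_checkpoints():
    """List the config names published for the dataset (needs network access)."""
    # Imported lazily so config_name() works without the library installed.
    from datasets import get_dataset_config_names

    return get_dataset_config_names(REPO_ID)


def load_checkpoint(checkpoint: Optional[Union[str, date]] = None):
    """Load the latest data, or the checkpoint for a given date (needs network access)."""
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(checkpoint))
```

With this in place, `load_checkpoint()` would fetch the `latest` config and `load_checkpoint("2025-10-07")` the dated one. Since a checkpoint may not exist for every date, calling `available_checkpoints()` first is a reasonable precaution.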
## Column Descriptions
| **Field Name** | **Description** |
|---|---|
| `corpus_id` | Unique identifier for the paper from [Semantic Scholar](https://www.semanticscholar.org/) |
| `title` | Title of the paper |
| `sqa_citation_rank` | Overall rank of the paper by the number of distinct queries in which Asta's Summarize Literature tool cited it |
| `sqa_citation_count_queries` | Number of distinct queries whose answer cited the paper; this count determines its `sqa_citation_rank` |
| `sqa_citation_count_total_citations` | Total number of times the paper was cited across all queries (a paper can be cited multiple times in the answer report for a single query) |
| `authors` | Comma-separated string of paper authors |
| `venue` | Publishing venue/conference/journal of the paper |
| `year` | Publication year of the paper |
| `s2FieldsOfStudy` | Academic field of study categories assigned to the paper in Semantic Scholar by their [classifier](https://blog.allenai.org/announcing-s2fos-an-open-source-academic-field-of-study-classifier-9d2f641949e5). The possible fields are: Computer Science, Medicine, Chemistry, Biology, Materials Science, Physics, Geology, Psychology, Art, History, Geography, Sociology, Business, Political Science, Economics, Philosophy, Mathematics, Engineering, Environmental Science, Agricultural and Food Sciences, Education, Law, and Linguistics. |
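The difference between the two count columns is easiest to see on a toy example. The sketch below uses a made-up citation log (the query and corpus IDs are invented) and derives both counts and a rank from it. The tie-breaking rule (total citations, then `corpus_id`) is an assumption for illustration only; the dataset card does not say how ties are ranked.

```python
from collections import Counter

# Hypothetical citation log: one (query_id, corpus_id) pair per citation
# appearing in an answer report. All IDs are invented for illustration.
citation_log = [
    ("q1", 101), ("q1", 101), ("q1", 202),
    ("q2", 101), ("q2", 303),
    ("q3", 202), ("q3", 202), ("q3", 202),
]


def summarize(log):
    """Build rows with the two count columns and a rank from a citation log."""
    # Every citation counts toward the total, including repeats within a query.
    total = Counter(corpus_id for _, corpus_id in log)
    # De-duplicating the pairs counts each paper at most once per query.
    per_query = Counter(corpus_id for _, corpus_id in set(log))
    # Rank by distinct-query count; the tie-breakers are an assumption.
    ordered = sorted(
        per_query,
        key=lambda cid: (-per_query[cid], -total[cid], cid),
    )
    return [
        {
            "corpus_id": cid,
            "sqa_citation_rank": rank,
            "sqa_citation_count_queries": per_query[cid],
            "sqa_citation_count_total_citations": total[cid],
        }
        for rank, cid in enumerate(ordered, start=1)
    ]


rows = summarize(citation_log)
```

Here paper 101 is cited three times but in only two queries, so its query count is 2 and its total is 3. By construction, `sqa_citation_count_queries` can never exceed `sqa_citation_count_total_citations`.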
## Dataset Details
- **Dataset name:** Asta Summary Citation Counts
- **Maintainer:** Allen Institute for AI (Ai2)
- **License and Use:** This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
- **Update frequency:** Weekly
- **Source platform:** Asta (https://asta.ai)
- **System Paper:** [Ai2 Scholar QA: Organized Literature Synthesis with Attribution](https://www.semanticscholar.org/paper/Ai2-Scholar-QA%3A-Organized-Literature-Synthesis-with-Singh-Chang/6dfbddc07e942116c7a95b23a393e9deb5a47484?utm_source=direct_link)
- **System Code:** [ai2-scholarqa-lib](https://github.com/allenai/ai2-scholarqa-lib)
- **Primary use cases:** bibliometrics, AI transparency, citation dynamics, evaluation of retrieval-augmented generation systems
Provider: maas
Created: 2025-10-09



