
LitSearch

ModelScope community. Updated 2025-12-05. Indexed 2025-08-23.
Download link:
https://modelscope.cn/datasets/princeton-nlp/LitSearch
Resource description:
# LitSearch: A Retrieval Benchmark for Scientific Literature Search

This dataset contains the query set and retrieval corpus for our paper **LitSearch: A Retrieval Benchmark for Scientific Literature Search**. We introduce LitSearch, a retrieval benchmark comprising 597 realistic literature search queries about recent ML and NLP papers. LitSearch is constructed using a combination of (1) questions generated by GPT-4 based on paragraphs containing inline citations from research papers and (2) questions about recently published papers, manually written by their authors. All LitSearch questions were manually examined or edited by experts to ensure high quality.

This dataset contains three configurations:

1. `query` contains 597 queries accompanied by gold paper IDs, specificity and quality annotations, and metadata about the source of the query.
2. `corpus_clean` contains 64183 documents. We provide the extracted titles, abstracts and outgoing citation paper IDs.
3. `corpus_s2orc` contains the same set of 64183 documents expressed in the Semantic Scholar Open Research Corpus (S2ORC) schema along with all available metadata.

Each configuration has a single `full` split.

## Usage

You can load the configurations as follows:

```python
from datasets import load_dataset

query_data = load_dataset("princeton-nlp/LitSearch", "query", split="full")
corpus_clean_data = load_dataset("princeton-nlp/LitSearch", "corpus_clean", split="full")
corpus_s2orc_data = load_dataset("princeton-nlp/LitSearch", "corpus_s2orc", split="full")
```
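
Since each query comes with gold paper IDs, a retriever's ranked output can be scored against them. The following is a minimal sketch of recall@k on toy in-memory data, not necessarily the evaluation protocol used in the paper. The helper names `recall_at_k` and `mean_recall_at_k` and the toy IDs are hypothetical; mapping the real `query` configuration onto these inputs is left out because the column names are not listed in this description.

```python
from typing import Dict, Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[int], gold_ids: Iterable[int], k: int) -> float:
    """Fraction of gold paper IDs that appear among the top-k ranked IDs."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(gold & top_k) / len(gold)


def mean_recall_at_k(
    rankings: Dict[str, Sequence[int]],
    gold: Dict[str, Iterable[int]],
    k: int,
) -> float:
    """Average recall@k over all queries that have gold annotations."""
    scores = [recall_at_k(rankings[qid], gold[qid], k) for qid in gold]
    return sum(scores) / len(scores)


# Toy data (hypothetical IDs): retriever rankings and gold paper IDs per query.
toy_rankings = {"q0": [12, 40, 11, 7], "q1": [7, 8, 30]}
toy_gold = {"q0": [11, 12], "q1": [30]}

print(mean_recall_at_k(toy_rankings, toy_gold, k=2))
print(mean_recall_at_k(toy_rankings, toy_gold, k=3))
```

With real data, `toy_gold` would be built from the gold paper IDs in the `query` configuration, and `toy_rankings` from whatever retriever is run over `corpus_clean`.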

Provider:
maas
Created:
2025-08-16