
uva-irlab/trec-cast-2019-multi-turn

Source: Hugging Face. Updated: 2022-10-25. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/uva-irlab/trec-cast-2019-multi-turn
Description:
---
language:
- en
multilinguality:
- monolingual
size_categories:
- 10M<n<100M
task_categories:
- text-retrieval
task_ids:
- document-retrieval
language_bcp47:
- en-US
---

# TREC Cast 2019

[TREC Cast](http://www.treccast.ai) has released a document collection with topics and qrels, a subset of which has been annotated so that it is suitable for multi-turn conversational search.

## Dataset statistics

- # Passages: 38,426,252
- # Topics: 20
- # Queries: 173

## Subsets

### CAR + MSMARCO Collection

Together, CAR and MSMARCO are 6.13 GB in size, so downloading will take a while. You can load the collection as follows:

```python
from datasets import load_dataset

collection = load_dataset('uva-irlab/trec-cast-2019-multi-turn', 'test_collection')
```

The collection has the following data format:

```
docno: str
    The document id format is [collection_id_paragraph_id] with collection id and paragraph id separated by an underscore.
    The collection ids are in the set: {MARCO, CAR}. E.g.: CAR_6869dee46ab12f0f7060874f7fc7b1c57d53144a
text: str
    The content of the passage.
```

#### Sample

Instead of using the entire data set, you can also download a sample set containing only 200,000 items:

```python
from datasets import load_dataset

collection = load_dataset('uva-irlab/trec-cast-2019-multi-turn', 'test_collection_sample')
```

### Topics

You can get the topics as follows:

```python
from datasets import load_dataset

topics = load_dataset('uva-irlab/trec-cast-2019-multi-turn', 'topics')
```

The topics have the following data format:

```
qid: str
    Query ID of the format "topicId_questionNumber"
history: str[]
    A list of queries. It can be empty for the first question in a topic.
query: str
    The query
```

### Qrels

You can get the qrels as follows:

```python
from datasets import load_dataset

qrels = load_dataset('uva-irlab/trec-cast-2019-multi-turn', 'qrels')
```

The qrels have the following data format:

```
qid: str
    Query ID of the format "topicId_questionNumber"
qrels: List[dict]
    A list of dictionaries with the keys 'docno' and 'relevance'. Relevance is an integer in the range [0, 4]
```
Provider:
uva-irlab
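Summary of the original information

TREC Cast 2019 dataset overview

Basic information

  • Language: English (en-US)
  • Multilinguality: monolingual
  • Size: 10M<n<100M
  • Task category: text retrieval
  • Task ID: document retrieval

Dataset statistics

  • Passages: 38,426,252
  • Topics: 20
  • Queries: 173

Subsets

Main collection

  • Name: CAR + MSMARCO Collection

  • Size: 6.13 GB

  • Loading: `collection = load_dataset('uva-irlab/trec-cast-2019-multi-turn', 'test_collection')`

  • Data format:

    • docno: document ID in the format [collection_id_paragraph_id]
    • text: passage content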
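
The docno format described above can be taken apart with a few lines of Python. This is a minimal sketch, not part of the dataset's own tooling: `split_docno` and `KNOWN_COLLECTIONS` are hypothetical names introduced here, and the sketch assumes the collection id never contains an underscore.

```python
from typing import Tuple

# Collection ids listed in the dataset card.
KNOWN_COLLECTIONS = {"MARCO", "CAR"}


def split_docno(docno: str) -> Tuple[str, str]:
    """Split a docno such as 'CAR_6869...' into (collection_id, paragraph_id)."""
    # Split on the first underscore only, in case a paragraph id contains one.
    collection_id, sep, paragraph_id = docno.partition("_")
    if not sep or not paragraph_id or collection_id not in KNOWN_COLLECTIONS:
        raise ValueError(f"Unexpected docno format: {docno!r}")
    return collection_id, paragraph_id


if __name__ == "__main__":
    print(split_docno("CAR_6869dee46ab12f0f7060874f7fc7b1c57d53144a"))
```

Splitting on the first underscore keeps the whole remainder as the paragraph id, which is useful when you want to route passages back to their source collection.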

Sample collection

  • Size: 200,000 items
  • Loading: `collection = load_dataset('uva-irlab/trec-cast-2019-multi-turn', 'test_collection_sample')`

Topics

  • Loading: `topics = load_dataset('uva-irlab/trec-cast-2019-multi-turn', 'topics')`

  • Data format:

    • qid: query ID in the format "topicId_questionNumber"
    • history: list of previous queries
    • query: the query text
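
Because each topic record carries both the current query and the earlier turns, one simple way to use it is to join the history and the query into a single context-aware string. The sketch below assumes records shaped as described above. The helper names `split_qid` and `contextual_query` and the example record are hypothetical, and plain concatenation is only a naive baseline, not a method prescribed by the dataset authors.

```python
from typing import Dict, List, Tuple


def split_qid(qid: str) -> Tuple[str, str]:
    """Split 'topicId_questionNumber' into (topic_id, question_number)."""
    # Split on the last underscore so the question number is always the tail.
    topic_id, _, question_number = qid.rpartition("_")
    return topic_id, question_number


def contextual_query(topic: Dict, separator: str = " ") -> str:
    """Join earlier turns and the current query into one string."""
    turns: List[str] = list(topic["history"]) + [topic["query"]]
    return separator.join(turns)


if __name__ == "__main__":
    # Illustrative record only; the values are not copied from the dataset.
    record = {
        "qid": "31_2",
        "history": ["What is throat cancer?"],
        "query": "Is it treatable?",
    }
    print(split_qid(record["qid"]))
    print(contextual_query(record))
```

For the first question in a topic the history is empty, so the function returns the query unchanged.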

Qrels

  • Loading: `qrels = load_dataset('uva-irlab/trec-cast-2019-multi-turn', 'qrels')`

  • Data format:

    • qid: query ID in the format "topicId_questionNumber"
    • qrels: list of dictionaries with the keys docno and relevance; relevance is an integer in the range [0, 4]
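
The nested qrels records can be flattened into the four-column text layout commonly used for TREC qrels files (`qid iteration docno relevance`). This is a minimal sketch assuming rows shaped as described above. The function name `to_trec_qrels` and the sample rows are hypothetical, and the iteration column is filled with a constant `0` by convention.

```python
from typing import Dict, Iterable, Iterator


def to_trec_qrels(rows: Iterable[Dict]) -> Iterator[str]:
    """Yield one 'qid 0 docno relevance' line per judged passage."""
    for row in rows:
        for judgment in row["qrels"]:
            relevance = int(judgment["relevance"])
            # The dataset card states relevance lies in the range [0, 4].
            if not 0 <= relevance <= 4:
                raise ValueError(f"Relevance out of range: {relevance}")
            yield f"{row['qid']} 0 {judgment['docno']} {relevance}"


if __name__ == "__main__":
    # Illustrative rows only; the values are not copied from the dataset.
    sample_rows = [
        {
            "qid": "31_1",
            "qrels": [
                {"docno": "MARCO_955948", "relevance": 2},
                {"docno": "CAR_6869dee46ab12f0f7060874f7fc7b1c57d53144a", "relevance": 0},
            ],
        }
    ]
    for line in to_trec_qrels(sample_rows):
        print(line)
```

Writing these lines to a file, one per row, gives a file that standard TREC-style evaluation tools should be able to read, assuming they expect this four-column layout.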