uva-irlab/trec-cast-2019-multi-turn

Name: uva-irlab/trec-cast-2019-multi-turn
Creator: uva-irlab
Published: 2022-10-25 09:56:59
License: No description available

Source: Hugging Face. Updated: 2022-10-25. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/uva-irlab/trec-cast-2019-multi-turn

Resource description:

---
language:
- en
multilinguality:
- monolingual
size_categories:
- 10M<n<100M
task_categories:
- text-retrieval
task_ids:
- document-retrieval
language_bcp47:
- en-US
---

# TREC Cast 2019

[TREC Cast](http://www.treccast.ai) has released a document collection with topics and qrels, of which a subset has been annotated so that it is suitable for multi-turn conversational search.

## Dataset statistics

- # Passages: 38,426,252
- # Topics: 20
- # Queries: 173

## Subsets

### CAR + MSMARCO Collection

Together, CAR and MSMARCO have a size of 6.13 GB, so downloading will take a while. You can use the collection as follows:

```python
from datasets import load_dataset
collection = load_dataset('trec-cast-2019-multi-turn', 'test_collection')
```

The collection has the following data format:

```
docno: str
  The document id format is [collection_id_paragraph_id] with collection id and paragraph id separated by an underscore.
  The collection ids are in the set: {MARCO, CAR}. E.g.: CAR_6869dee46ab12f0f7060874f7fc7b1c57d53144a
text: str
  The content of the passage.
```

#### Sample

Instead of using the entire data set, you can also download a sample set containing only 200,000 items:

```python
collection = load_dataset('trec-cast-2019-multi-turn', 'test_collection_sample')
```

### Topics

You can get the topics as follows:

```python
topics = load_dataset('trec-cast-2019-multi-turn', 'topics')
```

The topics have the following data format:

```
qid: str
  Query ID of the format "topicId_questionNumber"
history: str[]
  A list of queries. It can be empty for the first question in a topic.
query: str
  The query
```

### Qrels

You can get the qrels as follows:

```python
qrels = load_dataset('trec-cast-2019-multi-turn', 'qrels')
```

The qrels have the following data format:

```
qid: str
  Query ID of the format "topicId_questionNumber"
qrels: List[dict]
  A list of dictionaries with the keys 'docno' and 'relevance'. Relevance is an integer in the range [0, 4]
```

Provider:

uva-irlab

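Summary of original information

TREC Cast 2019 dataset overview

Basic information

Language: English (en-US)
Multilinguality: monolingual
Size: 10M<n<100M
Task category: text retrieval
Task ID: document retrieval

Dataset statistics

Number of passages: 38,426,252
Number of topics: 20
Number of queries: 173

Dataset subsets

Main collection

Name: CAR + MSMARCO Collection
Size: 6.13 GB
Loading (Python): collection = load_dataset('trec-cast-2019-multi-turn', 'test_collection')
Data format:
- docno: document ID, in the format [collection_id_paragraph_id]
- text: passage content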
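
The docno convention described above (a collection id and a paragraph id joined by an underscore) can be unpacked with a small helper. This is only a sketch: `split_docno` and `KNOWN_COLLECTIONS` are hypothetical names introduced here and are not part of the dataset or of any library. It assumes the collection id itself never contains an underscore, which holds for the two documented ids, MARCO and CAR.

```python
from typing import Tuple

# Collection ids listed in the dataset card; the constant name is hypothetical.
KNOWN_COLLECTIONS = {"MARCO", "CAR"}


def split_docno(docno: str) -> Tuple[str, str]:
    """Split a docno such as 'CAR_<paragraph id>' into (collection_id, paragraph_id)."""
    # Split on the first underscore only, so the paragraph id is kept whole.
    collection_id, sep, paragraph_id = docno.partition("_")
    if not sep or not paragraph_id or collection_id not in KNOWN_COLLECTIONS:
        raise ValueError(f"Unexpected docno format: {docno!r}")
    return collection_id, paragraph_id


# The example id is the one quoted in the dataset card.
print(split_docno("CAR_6869dee46ab12f0f7060874f7fc7b1c57d53144a"))
```

A helper like this can be handy for filtering the collection by source, for example keeping only the MARCO passages.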

Sample collection

Size: 200,000 items
Loading (Python): collection = load_dataset('trec-cast-2019-multi-turn', 'test_collection_sample')

Topics data

Loading (Python): topics = load_dataset('trec-cast-2019-multi-turn', 'topics')
Data format:
- qid: query ID, in the format "topicId_questionNumber"
- history: list of previous queries
- query: the query text
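
Because each topic record carries its conversation history, one simple baseline for multi-turn retrieval is to concatenate earlier turns with the current query. The sketch below illustrates that idea on a hand-written record that only mimics the documented fields (qid, history, query); the record contents, the function names `split_qid` and `contextualize`, and the `[SEP]` separator are all assumptions made for illustration, not something the dataset prescribes.

```python
from typing import Dict, List, Optional, Tuple


def split_qid(qid: str) -> Tuple[str, str]:
    """Split 'topicId_questionNumber' into (topic_id, question_number)."""
    topic_id, sep, question_number = qid.rpartition("_")
    if not sep or not topic_id or not question_number:
        raise ValueError(f"Unexpected qid format: {qid!r}")
    return topic_id, question_number


def contextualize(record: Dict, max_history: Optional[int] = None,
                  sep: str = " [SEP] ") -> str:
    """Join the last `max_history` earlier turns with the current query."""
    history: List[str] = list(record["history"])
    if max_history is not None:
        # Guard against history[-0:], which would keep the whole list.
        history = history[-max_history:] if max_history > 0 else []
    return sep.join(history + [record["query"]])


# Illustrative record only; not copied from the dataset.
record = {
    "qid": "31_2",
    "history": ["What is throat cancer?"],
    "query": "Is it treatable?",
}
print(split_qid(record["qid"]))
print(contextualize(record))
```

Plain concatenation is the crudest way to use the history; query rewriting models are a common alternative, but they are outside what this listing describes.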

Qrels data

Loading (Python): qrels = load_dataset('trec-cast-2019-multi-turn', 'qrels')
Data format:
- qid: query ID, in the format "topicId_questionNumber"
- qrels: list of dictionaries with the keys docno and relevance; relevance is an integer in the range [0, 4]
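
Many evaluation tools expect relevance judgments in the flat TREC qrels layout, one `qid iteration docno relevance` line per judgment. As a rough sketch, the nested records described above could be flattened as follows. The function name `to_trec_qrels_lines` is hypothetical, the example rows are invented to match the documented schema, and the iteration column is filled with the conventional placeholder `0`.

```python
from typing import Dict, Iterable, List


def to_trec_qrels_lines(rows: Iterable[Dict]) -> List[str]:
    """Flatten {'qid', 'qrels': [{'docno', 'relevance'}, ...]} rows into TREC qrels lines."""
    lines = []
    for row in rows:
        for judgment in row["qrels"]:
            relevance = int(judgment["relevance"])
            # The dataset card documents relevance as an integer in [0, 4].
            if not 0 <= relevance <= 4:
                raise ValueError(f"Relevance out of range for {row['qid']}: {relevance}")
            lines.append(f"{row['qid']} 0 {judgment['docno']} {relevance}")
    return lines


# Invented rows that follow the documented schema; not real judgments.
rows = [
    {"qid": "31_1", "qrels": [
        {"docno": "MARCO_100", "relevance": 2},
        {"docno": "CAR_abc123", "relevance": 0},
    ]},
]
for line in to_trec_qrels_lines(rows):
    print(line)
```

The resulting lines can be written to a text file and passed to a TREC-style evaluation tool alongside a run file.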
