McGill-NLP/statcan-dialogue-dataset-retrieval

Name: McGill-NLP/statcan-dialogue-dataset-retrieval
Creator: McGill-NLP
Published: 2024-05-22 21:27:16
License: 暂无描述

Hugging Face2024-05-22 更新2024-06-12 收录

下载链接：

https://hf-mirror.com/datasets/McGill-NLP/statcan-dialogue-dataset-retrieval

下载链接

链接失效反馈

官方服务：

资源简介：

--- task_categories: - question-answering - table-question-answering size_categories: - 1K<n<10K language: - en - fr configs: - config_name: queries_english data_files: - split: train path: english/train.csv - split: dev path: english/dev.csv - split: test path: english/test.csv - config_name: queries_french data_files: - split: train path: french/train.csv - split: dev path: french/dev.csv - split: test path: french/test.csv - config_name: corpus data_files: - split: english path: english/corpus.csv - split: french path: french/corpus.csv --- # Statcan Dialogue Dataset (Processed for Retrieval Tasks) This is a variant of the [Statcan Dialogue Dataset](https://huggingface.co/datasets/McGill-NLP/statcan-dialogue-dataset), which we processed specifically for multilingual retrieval (english, french). It contains everything in CSVs, rather than having metadata hosted separately. ## Quickstart ```python from datasets import load_dataset repo = 'McGill-NLP/statcan-dialogue-dataset-retrieval' # load english queries, training split queries_en = load_dataset(repo, 'queries_english', split='train') # or 'dev', 'test' queries_fr = load_dataset(repo, 'queries_french', split='train') # Dataset({ # features: ['query', 'query_id', 'doc_id'], # num_rows: 3782 # }) # load corpus (available in french and english) corpus_en = load_dataset(repo, 'corpus', split='english') corpus_fr = load_dataset(repo, 'corpus', split='french') # Dataset({ # features: ['doc_id', 'title', 'doc'], # num_rows: 5907 # }) ``` The queries is given in list of dicts, with the following keys: - role: either "user" or "operator". Operator is the expert in charge of helping the user. - content: the utterance of the user or operator. ```python # ...continued q = queries_en[0]['query'] doc_id = queries_en[0]['doc_id'] print("Query:", q) # You can filter the corpus to find your table (english and french are the same docs) d = corpus_fr.filter(lambda r: r['doc_id'] == doc_id)[0] print(d['title']) print('-'*50) print(d['doc']) ``` Note that `d['title']` only contains the title of the table, whereas `d['doc']` is the full document and contains: * title, * time period, * table dimensions, * subject (for taxonomy), * name of survey where it was sourced, * update frequency, * columns for each of the dimensions Warning: please do not do `d['title'] + d['doc']` as that would be redundant. ## License By using this dataset, you agree to the the following terms of use and restrictions: ### Terms of use Researchers must agree to the following terms: 1. These data represent anonymized (de-identified) data from individuals. Best efforts have been implemented to ensure that all directly and indirectly identifiable information has been removed. Researchers who download this dataset must agree to notify Graeme Gilmour (`graeme.gilmour <at> statcan.gc.ca`) and Harm de Vries (`harm.devries <at> servicenow.com`) if any inadvertently remaining identifiable information is discovered during the process of re-using this dataset. Researchers must agree to destroy any version of this dataset containing identifiable information. 2. The terms of this dataset require that reusers give credit to the creators. It allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, even for commercial purposes. 3. Have read and acknowledged the Appendix B (Dataset Card) of the latest version of the paper prior to using the dataset. ### Restrictions Downloaders cannot: 1. obtain information from the dataset that results in the researcher or any third party(ies) directly or indirectly identifying any participant with the aid of other information acquired elsewhere; 2. produce connections or links among or between the information included in the dataset and other third-party information that could be used to identify any individuals; and 3. extract information from the dataset that could aid researchers (downloaders) in gaining knowledge about or obtaining any means of contacting any individuals already known to the downloader/researcher

提供机构：

McGill-NLP

原始信息汇总

数据集概述

任务类别

问答
表格问答

数据集大小

1K<n<10K

语言

英语
法语

配置详情

queries_english
- 训练集: english/train.csv
- 验证集: english/dev.csv
- 测试集: english/test.csv
queries_french
- 训练集: french/train.csv
- 验证集: french/dev.csv
- 测试集: french/test.csv
corpus
- 英语文档: english/corpus.csv
- 法语文档: french/corpus.csv

数据集内容

查询数据
- 特征: query, query_id, doc_id
- 示例: queries_en[0][query], queries_en[0][doc_id]
文档数据
- 特征: doc_id, title, doc
- doc 包含: 标题, 时间周期, 表格维度, 主题, 来源调查名称, 更新频率, 各维度的列

使用许可

用户必须同意数据集的条款和限制，包括但不限于：
- 发现任何可识别信息时必须通知数据集提供者
- 使用数据集时需给予创作者适当的信用
- 使用前需阅读并承认相关论文的附录B（数据集卡）

5,000+

优质数据集

54 个

任务类型

进入经典数据集