circa

Name: circa
Creator: maas
Published: 2025-12-05 16:41:06
License: No description provided

ModelScope Community. Updated: 2025-12-05. Indexed: 2025-07-12.

Download link:

https://modelscope.cn/datasets/google-research-datasets/circa

Description:

# Dataset Card for CIRCA

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [CIRCA homepage](https://github.com/google-research-datasets/circa)
- **Repository:** [CIRCA repository](https://github.com/google-research-datasets/circa)
- **Paper:** [“I’d rather just go to bed”: Understanding Indirect Answers](https://arxiv.org/abs/2010.03450)
- **Point of Contact:** [Circa team, Google](mailto:circa@google.com)

### Dataset Summary

The Circa (meaning ‘approximately’) dataset aims to help machine learning systems interpret indirect answers to polar questions. The dataset contains pairs of yes/no questions and indirect answers, together with annotations for the interpretation of each answer. The data was collected in 10 different social conversational situations (e.g. the food preferences of a friend).

The following are the situational contexts for the dialogs in the data.

```
1. X wants to know about Y’s food preferences.
2. X wants to know what activities Y likes to do during weekends.
3. X wants to know what sorts of books Y likes to read.
4. Y has just moved into a neighbourhood and meets his/her new neighbour X.
5. X and Y are colleagues who are leaving work on a Friday at the same time.
6. X wants to know about Y's music preferences.
7. Y has just travelled from a different city to meet X.
8. X and Y are childhood neighbours who unexpectedly run into each other at a cafe.
9. Y has just told X that he/she is thinking of buying a flat in New York.
10. Y has just told X that he/she is considering switching his/her job.
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The text in the dataset is in English.

## Dataset Structure

### Data Instances

An example record looks like this:

```
id : 1
context : X wants to know about Y's food preferences.
question-X : Are you vegan?
canquestion-X : I am vegan.
answer-Y : I love burgers too much.
judgements : no#no#no#no#no
goldstandard1 : no (label(s) used for the classification task)
goldstandard2 : no (label(s) used for the classification task)
```

### Data Fields

The columns indicate:

```
1. id : unique id for the question-answer pair
2. context : the social situation for the dialogue. One of the 10 situations listed in the Dataset Summary. Each situation is a dialogue between a person who poses the question (X) and the person who answers (Y).
3. question-X : the question posed by X
4. canquestion-X : an automatically rewritten version of the question in declarative form, e.g. Do you like Italian? --> I like Italian. See the paper for details.
5. answer-Y : the answer given by Y to X
6. judgements : the interpretations for the QA pair from 5 annotators. The value is a list of 5 strings, separated by the token ‘#’.
7. goldstandard1 : a gold standard majority judgement from the annotators. The value is the most common interpretation, which must have been picked by at least 3 of the 5 annotators. When no interpretation reaches this majority, the value is ‘NA’.
8. goldstandard2 : Here the labels ‘Probably yes / sometimes yes’, ‘Probably no’, and ‘I am not sure how X will interpret Y’s answer’ are mapped respectively to ‘Yes’, ‘No’, and ‘In the middle, neither yes nor no’ before computing the majority. A label must still be given at least 3 times to become the majority choice. This method represents a less strict way of analyzing the interpretations.
```

### Data Splits

There are no explicit train/val/test splits in this dataset.

## Dataset Creation

### Curation Rationale

The authors revisited a pragmatic inference problem in dialog: understanding indirect responses to questions. Humans can interpret ‘I’m starving.’ in response to ‘Hungry?’, even without direct cue words such as ‘yes’ and ‘no’. In dialog systems, allowing natural responses rather than closed vocabularies would be similarly beneficial. However, today’s systems are only as sensitive to these pragmatic moves as their language model allows. The authors created and released ‘Circa’, the first large-scale English language corpus for this task, with 34,268 (polar question, indirect answer) pairs.

### Source Data

#### Initial Data Collection and Normalization

The QA pairs and judgements were collected using crowd annotations in three phases. The authors recruited native English speakers. The full description of the data collection and quality control is given in the [EMNLP 2020 paper](https://arxiv.org/pdf/2010.03450.pdf). Below is a brief overview only.

Phase 1: In the first phase, they collected questions only. They designed 10 imaginary social situations which give the annotator a context for the conversation. Examples are:

```
‘asking a friend for food preferences’
‘meeting your childhood neighbour’
‘your friend wants to buy a flat in New York’
```

Annotators were asked to suggest questions which could be asked in each situation, such that each question only requires a ‘yes’ or ‘no’ answer.
100 annotators produced 5 questions each for the 10 situations, resulting in 5000 questions.

Phase 2: Here they focused on eliciting answers to the questions. They sampled 3500 questions from the previous set. For each question, they collected possible answers from 10 different annotators. The annotators were instructed to provide a natural phrase or a sentence as the answer and to avoid the use of explicit ‘yes’ and ‘no’ words.

Phase 3: Finally, the QA pairs (34,268) were given to a third set of annotators, who were asked how the question seeker would likely interpret a particular answer. These annotators had the following options to choose from:

```
* 'Yes'
* 'Probably yes' / 'sometimes yes'
* 'Yes, subject to some conditions'
* 'No'
* 'Probably no'
* 'In the middle, neither yes nor no'
* 'I am not sure how X will interpret Y's answer'
```

#### Who are the source language producers?

The rest of the data apart from 10 initial questions was collected using crowd workers. They ran pilots for each step of data collection, and perused their results manually to ensure clarity in guidelines, and quality of the data. They also recruited native English speakers, mostly from the USA, and a few from the UK and Canada. They did not collect any further information about the crowd workers.

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

The rest of the data apart from 10 initial questions was collected using crowd workers. They ran pilots for each step of data collection, and perused their results manually to ensure clarity in guidelines, and quality of the data. They also recruited native English speakers, mostly from the USA, and a few from the UK and Canada. They did not collect any further information about the crowd workers.
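The way the five annotator judgements are turned into the `goldstandard1` and `goldstandard2` columns can be sketched roughly as follows. This is a minimal illustration of the rules stated in this card, not the authors' code. The label strings in `RELAXED_MAP` are copied from the option list in this card and may be spelled or capitalised differently in the released file, and returning ‘NA’ for `goldstandard2` when no label reaches three votes is an assumption (the card only states this explicitly for `goldstandard1`). The function names are hypothetical.

```python
from collections import Counter

# Assumed label strings, taken from the option list in this card; the
# released data may use different spelling or capitalisation.
RELAXED_MAP = {
    "Probably yes / sometimes yes": "Yes",
    "Probably no": "No",
    "I am not sure how X will interpret Y's answer":
        "In the middle, neither yes nor no",
}


def parse_judgements(raw):
    """Split the '#'-separated judgements string into a list of labels."""
    return [label.strip() for label in raw.split("#")]


def majority_label(labels, min_votes=3):
    """Return the most common label if it has at least min_votes, else 'NA'."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else "NA"


def gold_standards(raw):
    """Compute (strict, relaxed) majority labels for one judgements string."""
    labels = parse_judgements(raw)
    strict = majority_label(labels)
    # Relaxed variant: collapse the hedged labels before voting.
    relaxed = majority_label([RELAXED_MAP.get(label, label) for label in labels])
    return strict, relaxed


if __name__ == "__main__":
    print(gold_standards("no#no#no#no#no"))
```

With five judgements and a threshold of three votes, at most one label can reach the majority, so ties do not need special handling.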
### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

This dataset is the work of Annie Louis, Dan Roth, and Filip Radlinski from Google LLC.

### Licensing Information

This dataset was made available under the Creative Commons Attribution 4.0 License. A full copy of the license can be found at https://creativecommons.org/licenses/by-sa/4.0/

### Citation Information

```
@InProceedings{louis_emnlp2020,
  author = "Annie Louis and Dan Roth and Filip Radlinski",
  title = {"{I}'d rather just go to bed": {U}nderstanding {I}ndirect {A}nswers},
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
  year = "2020",
}
```

### Contributions

Thanks to [@bhavitvyamalik](https://github.com/bhavitvyamalik) for adding this dataset.
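Because the card states that the dataset ships without explicit train/val/test splits, users have to create their own. The sketch below shows two simple options, assuming each record is a dict with at least a `context` key as in the column list above: a seeded random split, and a split that holds out one conversational situation for testing. The function names, the 80/10/10 fractions, and the seed are hypothetical choices for illustration, not values taken from the card or the paper.

```python
import random


def random_split(rows, fractions=(0.8, 0.1, 0.1), seed=13):
    """Shuffle rows with a fixed seed and cut them into train/val/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to test
    return train, val, test


def held_out_context_split(rows, held_out_context):
    """Put every record from one situation in test and the rest in train."""
    train = [r for r in rows if r["context"] != held_out_context]
    test = [r for r in rows if r["context"] == held_out_context]
    return train, test


if __name__ == "__main__":
    # Toy records standing in for real dataset rows.
    demo = [{"id": i, "context": "A" if i % 2 else "B"} for i in range(10)]
    train, val, test = random_split(demo)
    print(len(train), len(val), len(test))
```

Holding out a whole situation tests whether a model generalises to an unseen conversational context, while the random split mixes all ten situations across the partitions.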


Provider: maas

Created: 2025-07-07
