RobZamp/sick
Hugging Face · Updated 2024-01-18 · Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/RobZamp/sick
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- cc-by-nc-sa-3.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- extended|image-flickr-8k
- extended|semeval2012-sts-msr-video
task_categories:
- text-classification
task_ids:
- natural-language-inference
paperswithcode_id: sick
pretty_name: Sentences Involving Compositional Knowledge
dataset_info:
  features:
  - name: id
    dtype: string
  - name: sentence_A
    dtype: string
  - name: sentence_B
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': entailment
          '1': neutral
          '2': contradiction
  - name: relatedness_score
    dtype: float32
  - name: entailment_AB
    dtype: string
  - name: entailment_BA
    dtype: string
  - name: sentence_A_original
    dtype: string
  - name: sentence_B_original
    dtype: string
  - name: sentence_A_dataset
    dtype: string
  - name: sentence_B_dataset
    dtype: string
  splits:
  - name: train
    num_bytes: 1180530
    num_examples: 4439
  - name: validation
    num_bytes: 132913
    num_examples: 495
  - name: test
    num_bytes: 1305846
    num_examples: 4906
  download_size: 217584
  dataset_size: 2619289
---
# Dataset Card for sick
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** http://marcobaroni.org/composes/sick.html
- **Repository:** [Needs More Information]
- **Paper:** https://www.aclweb.org/anthology/L14-1314/
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
### Dataset Summary
Shared and internationally recognized benchmarks are fundamental for the development of any computational system. We aim to help the research community working on compositional distributional semantic models (CDSMs) by providing SICK (Sentences Involving Compositional Knowledge), a large English benchmark tailored for them. SICK consists of about 10,000 English sentence pairs that include many examples of the lexical, syntactic and semantic phenomena that CDSMs are expected to account for, but do not require dealing with other aspects of existing sentential data sets (idiomatic multiword expressions, named entities, telegraphic language) that are not within the scope of CDSMs. By means of crowdsourcing techniques, each pair was annotated for two crucial semantic tasks: relatedness in meaning (with a 5-point rating scale as gold score) and entailment relation between the two elements (with three possible gold labels: entailment, contradiction, and neutral). The SICK data set was used in SemEval-2014 Task 1, and it is freely available for research purposes.
### Supported Tasks and Leaderboards
[Needs More Information]
### Languages
The dataset is in English.
## Dataset Structure
### Data Instances
Example instance:
```
{
"entailment_AB": "A_neutral_B",
"entailment_BA": "B_neutral_A",
"label": 1,
"id": "1",
"relatedness_score": 4.5,
"sentence_A": "A group of kids is playing in a yard and an old man is standing in the background",
"sentence_A_dataset": "FLICKR",
"sentence_A_original": "A group of children playing in a yard, a man in the background.",
"sentence_B": "A group of boys in a yard is playing and a man is standing in the background",
"sentence_B_dataset": "FLICKR",
"sentence_B_original": "A group of children playing in a yard, a man in the background."
}
```
### Data Fields
- id: sentence pair ID (named pair_ID in the original SICK distribution)
- sentence_A: sentence A
- sentence_B: sentence B
- label: textual entailment gold label: entailment (0), neutral (1) or contradiction (2)
- relatedness_score: semantic relatedness gold score (on a 1-5 continuous scale)
- entailment_AB: entailment for the A-B order (A_neutral_B, A_entails_B, or A_contradicts_B)
- entailment_BA: entailment for the B-A order (B_neutral_A, B_entails_A, or B_contradicts_A)
- sentence_A_original: original sentence from which sentence A is derived
- sentence_B_original: original sentence from which sentence B is derived
- sentence_A_dataset: dataset from which the original sentence A was extracted (FLICKR vs. SEMEVAL)
- sentence_B_dataset: dataset from which the original sentence B was extracted (FLICKR vs. SEMEVAL)
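The field descriptions above can be exercised on the example instance. The following is a minimal sketch, not part of the original card: the helper `describe_pair` and the mapping `AB_TO_LABEL` are hypothetical names, and the assumption that the gold `label` mirrors the A-to-B direction string is only verified here for the single example record.

```python
# Label ids and names as declared in the card's class_label metadata.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

# Assumed correspondence between the gold label and the A-to-B direction string.
AB_TO_LABEL = {
    "A_entails_B": "entailment",
    "A_neutral_B": "neutral",
    "A_contradicts_B": "contradiction",
}


def describe_pair(record: dict) -> dict:
    """Decode the integer label and run basic sanity checks on one record."""
    label_name = LABEL_NAMES[record["label"]]
    score = float(record["relatedness_score"])
    # The card describes relatedness as a continuous 1-5 score.
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"relatedness_score out of range: {score}")
    return {
        "id": record["id"],
        "label_name": label_name,
        "matches_ab": AB_TO_LABEL.get(record["entailment_AB"]) == label_name,
        "score": score,
    }


# The example instance from the card, reduced to the fields used here.
example = {
    "id": "1",
    "label": 1,
    "relatedness_score": 4.5,
    "entailment_AB": "A_neutral_B",
    "entailment_BA": "B_neutral_A",
}

summary = describe_pair(example)
print(summary)
```

The `matches_ab` flag is reported rather than enforced, so rows where the assumption fails (if any exist) can be inspected instead of raising an error.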
### Data Splits
| Train | Trial (validation) | Test |
|-------|--------------------|------|
| 4439  | 495                | 4906 |
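As a quick check on these counts, the short sketch below sums the three splits and reports each split's share of the total. The dictionary name `SPLIT_SIZES` is illustrative, and the key `validation` is taken from the card metadata, where the 495-pair trial set is listed under that name.

```python
# Split sizes as reported in the card.
SPLIT_SIZES = {"train": 4439, "validation": 495, "test": 4906}

# Total number of sentence pairs across all splits.
total = sum(SPLIT_SIZES.values())

# Percentage share of each split, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in SPLIT_SIZES.items()}

print(total)
for name, pct in shares.items():
    print(f"{name}: {SPLIT_SIZES[name]} pairs ({pct}%)")
```

The total of 9840 pairs is consistent with both the "about 10,000" figure in the summary and the `1K<n<10K` size category in the metadata.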
## Dataset Creation
### Curation Rationale
[Needs More Information]
### Source Data
#### Initial Data Collection and Normalization
[Needs More Information]
#### Who are the source language producers?
[Needs More Information]
### Annotations
#### Annotation process
[Needs More Information]
#### Who are the annotators?
[Needs More Information]
### Personal and Sensitive Information
[Needs More Information]
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
The card metadata declares the license as `cc-by-nc-sa-3.0` (Creative Commons Attribution-NonCommercial-ShareAlike 3.0).
### Citation Information
```
@inproceedings{marelli-etal-2014-sick,
title = "A {SICK} cure for the evaluation of compositional distributional semantic models",
author = "Marelli, Marco and
Menini, Stefano and
Baroni, Marco and
Bentivogli, Luisa and
Bernardi, Raffaella and
Zamparelli, Roberto",
booktitle = "Proceedings of the Ninth International Conference on Language Resources and Evaluation ({LREC}'14)",
month = may,
year = "2014",
address = "Reykjavik, Iceland",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf",
pages = "216--223",
}
```
### Contributions
Thanks to [@calpt](https://github.com/calpt) for adding this dataset.
Provider:
RobZamp
Summary of original information
Dataset overview
Dataset name
- Name: Sentences Involving Compositional Knowledge (SICK)
Basic information
- Language: English
- License: cc-by-nc-sa-3.0
- Multilinguality: monolingual
- Size: 1K<n<10K
- Source datasets:
  - extended|image-flickr-8k
  - extended|semeval2012-sts-msr-video
- Task category: text classification
- Task ID: natural-language-inference
Dataset structure
- Features:
  - id: string
  - sentence_A: string
  - sentence_B: string
  - label:
    - Classes:
      - 0: entailment
      - 1: neutral
      - 2: contradiction
  - relatedness_score: float32
  - entailment_AB: string
  - entailment_BA: string
  - sentence_A_original: string
  - sentence_B_original: string
  - sentence_A_dataset: string
  - sentence_B_dataset: string
Data splits
- Train: 4439 examples, 1180530 bytes
- Validation: 495 examples, 132913 bytes
- Test: 4906 examples, 1305846 bytes
Dataset creation
- Annotation creators: crowdsourced
- Language creators: crowdsourced
Considerations for using the data
- License: the dataset is released under cc-by-nc-sa-3.0, and any use must comply with its terms.
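Collected summary
Dataset introduction

Background and challenges
Background overview
SICK is a benchmark of about 10,000 English sentence pairs designed for research on compositional distributional semantic models (CDSMs). Each pair is annotated for semantic relatedness and for its entailment relation, which makes the dataset suitable for natural language inference tasks such as relatedness scoring and entailment classification.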