tum-nlp/cannot-dataset
Source: Hugging Face. Updated: 2025-10-23. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/tum-nlp/cannot-dataset
---
language:
- en
pretty_name: CANNOT
license: cc-by-sa-4.0
size_categories:
- 10K<n<100K
---
<p align="center"><img width="500" src="https://github.com/dmlls/cannot-dataset/assets/22967053/a380dfdf-3514-4771-90c4-636698d5043d" alt="CANNOT dataset"></p>
<p align="center" display="inline-block">
<a href="https://github.com/dmlls/cannot-dataset/">
<img src="https://img.shields.io/badge/version-1.1-green">
</a>
</p>
<h1 align="center">Compilation of ANnotated, Negation-Oriented Text-pairs</h1>
---
# Dataset Card for CANNOT
## Dataset Description
- **Homepage: https://github.com/dmlls/cannot-dataset**
- **Repository: https://github.com/dmlls/cannot-dataset**
- **Paper: [This is not correct! Negation-aware Evaluation of Language Generation Systems](https://arxiv.org/abs/2307.13989)**
### Dataset Summary
**CANNOT** is a dataset that focuses on negated textual pairs. It currently
contains **77,376 samples**, of which roughly half are negated pairs of
sentences, and the other half are not (they are paraphrased versions of each
other).
The most frequent negation that appears in the dataset is verbal negation (e.g.,
will → won't), although it also contains pairs with antonyms (cold → hot).
### Languages
CANNOT contains only **English** text.
## Dataset Structure
The dataset is given as a
[`.tsv`](https://en.wikipedia.org/wiki/Tab-separated_values) file with the
following structure:
| premise | hypothesis | label |
|:------------|:---------------------------------------------------|:-----:|
| A sentence. | An equivalent, non-negated sentence (paraphrased). | 0 |
| A sentence. | The sentence negated. | 1 |
The dataset can be easily loaded into a Pandas DataFrame by running:
```Python
import pandas as pd

# The file is tab-separated, with the columns: premise, hypothesis, label.
dataset = pd.read_csv('negation_dataset_v1.0.tsv', sep='\t')
```
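Once loaded, the `label` column can be used to separate negated pairs from paraphrased ones. The following is a minimal sketch that reads a small in-memory sample instead of the real file. The three rows are made up for illustration, and the column names are assumed to match the table above.

```python
import io

import pandas as pd

# Toy stand-in for the real .tsv file; these rows are illustrative only.
sample_tsv = (
    "premise\thypothesis\tlabel\n"
    "It will rain today.\tIt won't rain today.\t1\n"
    "It will rain today.\tToday it is going to rain.\t0\n"
    "The soup is cold.\tThe soup is hot.\t1\n"
)

dataset = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# label == 1 marks negated pairs, label == 0 marks paraphrased pairs.
negated = dataset[dataset["label"] == 1]
paraphrased = dataset[dataset["label"] == 0]

# Number of pairs per label.
label_counts = dataset["label"].value_counts().to_dict()
print(label_counts)
```

On the full dataset, the same `value_counts()` call should show the two labels in roughly equal proportion, in line with the summary above.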
## Dataset Creation
The dataset has been created by cleaning up and merging the following datasets:
1. _Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal
Negation_ (see
[`datasets/nan-nli`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/nan-nli)).
2. _GLUE Diagnostic Dataset_ (see
[`datasets/glue-diagnostic`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/glue-diagnostic)).
3. _Automated Fact-Checking of Claims from Wikipedia_ (see
[`datasets/wikifactcheck-english`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/wikifactcheck-english)).
4. _From Group to Individual Labels Using Deep Features_ (see
[`datasets/sentiment-labelled-sentences`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/sentiment-labelled-sentences)).
In this case, the negated sentences were obtained by using the Python module
[`negate`](https://github.com/dmlls/negate).
5. _It Is Not Easy To Detect Paraphrases: Analysing Semantic Similarity With
Antonyms and Negation Using the New SemAntoNeg Benchmark_ (see
[`datasets/antonym-substitution`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/antonym-substitution)).
<br>
Once processed, the number of remaining samples in each of the datasets above is:
| Dataset | Samples |
|:--------------------------------------------------------------------------|-----------:|
| Not another Negation Benchmark | 118 |
| GLUE Diagnostic Dataset | 154 |
| Automated Fact-Checking of Claims from Wikipedia | 14,970 |
| From Group to Individual Labels Using Deep Features | 2,110 |
| It Is Not Easy To Detect Paraphrases | 8,597 |
| <div align="right"><b>Total</b></div> | **25,949** |
<br>
Additionally, for each of the negated samples, another pair of non-negated
sentences has been added by paraphrasing them with the pre-trained model
[`🤗tuner007/pegasus_paraphrase`](https://huggingface.co/tuner007/pegasus_paraphrase).
Finally, the swapped version of each pair (premise ⇋ hypothesis) has also been
included, and any duplicates have been removed.
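The swap-and-deduplicate step can be sketched with pandas as follows. This is an illustration on made-up rows, not the authors' actual processing script, and the helper name `add_swapped_pairs` is hypothetical.

```python
import pandas as pd


def add_swapped_pairs(pairs: pd.DataFrame) -> pd.DataFrame:
    """Append the premise/hypothesis-swapped copy of every pair, then drop duplicates."""
    swapped = pairs.rename(
        columns={"premise": "hypothesis", "hypothesis": "premise"}
    )
    # concat aligns columns by name, so the swapped frame lines up correctly.
    combined = pd.concat([pairs, swapped], ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)


# Illustrative rows: the first two are already swapped copies of each other.
pairs = pd.DataFrame(
    {
        "premise": ["The soup is cold.", "The soup is hot.", "He will come."],
        "hypothesis": ["The soup is hot.", "The soup is cold.", "He won't come."],
        "label": [1, 1, 1],
    }
)

augmented = add_swapped_pairs(pairs)
print(len(augmented))
```

Because the first two rows already mirror each other, their swapped copies are dropped as duplicates, and only the third pair gains a new swapped row.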
With this, the number of premises/hypotheses in the CANNOT dataset that appear
in the original datasets is:
| <div align="left"><b>Dataset</b></div> | <div align="center"><b>Sentences</b></div> |
|:--------------------------------------------------------------------------|----------------------:|
| Not another Negation Benchmark | 552 (0.36 %) |
| GLUE Diagnostic Dataset | 586 (0.38 %) |
| Automated Fact-Checking of Claims from Wikipedia | 89,728 (57.98 %) |
| From Group to Individual Labels Using Deep Features | 12,626 (8.16 %) |
| It Is Not Easy To Detect Paraphrases | 17,198 (11.11 %) |
| <div align="right"><b>Total</b></div> | **120,690** (77.99 %) |
The percentages above are relative to the total number of premises and
hypotheses in the CANNOT dataset. The remaining 22.01 % (34,062 sentences) are
the novel premises/hypotheses added through paraphrasing and rule-based negation.
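The totals quoted above can be checked with a few lines of arithmetic. This sketch assumes that each of the 77,376 samples contributes exactly one premise and one hypothesis, which is consistent with the reported figures.

```python
TOTAL_PAIRS = 77_376              # samples reported in the Dataset Summary
SENTENCES_FROM_SOURCES = 120_690  # total row of the table above

# Each sample contributes one premise and one hypothesis.
total_sentences = 2 * TOTAL_PAIRS
novel_sentences = total_sentences - SENTENCES_FROM_SOURCES

source_share = round(100 * SENTENCES_FROM_SOURCES / total_sentences, 2)
novel_share = round(100 * novel_sentences / total_sentences, 2)

print(total_sentences, novel_sentences, source_share, novel_share)
```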
## Additional Information
### Licensing Information
The CANNOT dataset is released under [CC BY-SA
4.0](https://creativecommons.org/licenses/by-sa/4.0/).
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">
<img alt="Creative Commons License" width="100px" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png"/>
</a>
### Citation
Please cite our [INLG 2023 paper](https://aclanthology.org/2023.inlg-main.12/) if you use our dataset.
**BibTeX:**
```bibtex
@inproceedings{anschutz-etal-2023-correct,
title = "This is not correct! Negation-aware Evaluation of Language Generation Systems",
author = {Ansch{\"u}tz, Miriam and
Miguel Lozano, Diego and
Groh, Georg},
editor = "Keet, C. Maria and
Lee, Hung-Yi and
Zarrie{\ss}, Sina",
booktitle = "Proceedings of the 16th International Natural Language Generation Conference",
month = sep,
year = "2023",
address = "Prague, Czechia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.inlg-main.12/",
doi = "10.18653/v1/2023.inlg-main.12",
pages = "163--175",
abstract = "Large language models underestimate the impact of negations on how much they change the meaning of a sentence. Therefore, learned evaluation metrics based on these models are insensitive to negations. In this paper, we propose NegBLEURT, a negation-aware version of the BLEURT evaluation metric. For that, we designed a rule-based sentence negation tool and used it to create the CANNOT negation evaluation dataset. Based on this dataset, we fine-tuned a sentence transformer and an evaluation metric to improve their negation sensitivity. Evaluating these models on existing benchmarks shows that our fine-tuned models outperform existing metrics on the negated sentences by far while preserving their base models' performances on other perturbations."
}
```
### Contributions
Contributions to the dataset can be submitted through the [project
repository](https://github.com/dmlls/cannot-dataset).
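Provider: tum-nlp
## Summary of the Original Information
### Dataset Overview: CANNOT
#### Dataset Description
- Name: CANNOT
- Full name: Compilation of ANnotated, Negation-Oriented Text-pairs
- Language: English text only
- Number of samples: 77,376
- Content: mainly negation-oriented text pairs; roughly half are negated sentence pairs, and the other half are paraphrases of each other.
- Common forms of negation: verbal negation (e.g., will → won't) and antonyms (e.g., cold → hot).
#### Dataset Structure
- Format: `.tsv` file
- Structure:

| premise | hypothesis | label |
|:------------|:---------------------------------------------------|:-----:|
| A sentence. | An equivalent, non-negated sentence (paraphrased). | 0 |
| A sentence. | The sentence negated. | 1 |

- Loading: the file can be read into a Pandas DataFrame with `import pandas as pd` followed by `dataset = pd.read_csv('negation_dataset_v1.0.tsv', sep='\t')`.
#### Dataset Creation
- Sources: merged from several datasets, including:
  - Not another Negation Benchmark
  - GLUE Diagnostic Dataset
  - Automated Fact-Checking of Claims from Wikipedia
  - From Group to Individual Labels Using Deep Features
  - It Is Not Easy To Detect Paraphrases
- Processing: for each negated sample, a non-negated pair of sentences was added by paraphrasing with the model 🤗tuner007/pegasus_paraphrase.
#### Additional Information
- License: the dataset is released under CC BY-SA 4.0.
- Citation: if you use this dataset, please cite the INLG 2023 paper.
- Contributions: contributions can be submitted through the project repository.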



