tum-nlp/cannot-dataset

Name: tum-nlp/cannot-dataset
Creator: tum-nlp
Published: 2025-10-23 14:09:16
License: CC BY-SA 4.0

Hugging Face, updated 2025-10-23, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/tum-nlp/cannot-dataset

Resource description:

---
language:
- en
pretty_name: CANNOT
license: cc-by-sa-4.0
size_categories:
- 10K<n<100K
---

<p align="center"><img width="500" src="https://github.com/dmlls/cannot-dataset/assets/22967053/a380dfdf-3514-4771-90c4-636698d5043d" alt="CANNOT dataset"></p>
<p align="center" display="inline-block">
  <a href="https://github.com/dmlls/cannot-dataset/">
    <img src="https://img.shields.io/badge/version-1.1-green">
  </a>
</p>

<h1 align="center">Compilation of ANnotated, Negation-Oriented Text-pairs</h1>

---

# Dataset Card for CANNOT

## Dataset Description

- **Homepage: https://github.com/dmlls/cannot-dataset**
- **Repository: https://github.com/dmlls/cannot-dataset**
- **Paper: [This is not correct! Negation-aware Evaluation of Language Generation Systems](https://arxiv.org/abs/2307.13989)**

### Dataset Summary

**CANNOT** is a dataset that focuses on negated textual pairs. It currently contains **77,376 samples**, of which roughly half are negated pairs of sentences, and the other half are not (they are paraphrased versions of each other).

The most frequent negation that appears in the dataset is verbal negation (e.g., will → won't), although it also contains pairs with antonyms (cold → hot).

### Languages

CANNOT includes exclusively texts in **English**.

## Dataset Structure

The dataset is given as a [`.tsv`](https://en.wikipedia.org/wiki/Tab-separated_values) file with the following structure:

| premise     | hypothesis                                         | label |
|:------------|:---------------------------------------------------|:-----:|
| A sentence. | An equivalent, non-negated sentence (paraphrased). | 0     |
| A sentence. | The sentence negated.                              | 1     |

The dataset can be easily loaded into a Pandas DataFrame by running:

```python
import pandas as pd

dataset = pd.read_csv('negation_dataset_v1.0.tsv', sep='\t')
```

## Dataset Creation

The dataset has been created by cleaning up and merging the following datasets:

1. _Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation_ (see [`datasets/nan-nli`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/nan-nli)).
2. _GLUE Diagnostic Dataset_ (see [`datasets/glue-diagnostic`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/glue-diagnostic)).
3. _Automated Fact-Checking of Claims from Wikipedia_ (see [`datasets/wikifactcheck-english`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/wikifactcheck-english)).
4. _From Group to Individual Labels Using Deep Features_ (see [`datasets/sentiment-labelled-sentences`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/sentiment-labelled-sentences)). In this case, the negated sentences were obtained by using the Python module [`negate`](https://github.com/dmlls/negate).
5. _It Is Not Easy To Detect Paraphrases: Analysing Semantic Similarity With Antonyms and Negation Using the New SemAntoNeg Benchmark_ (see [`datasets/antonym-substitution`](https://github.com/dmlls/cannot-dataset/tree/main/datasets/antonym-substitution)).

<br>

Once processed, the number of remaining samples in each of the datasets above is:

| Dataset                                             |    Samples |
|:----------------------------------------------------|-----------:|
| Not another Negation Benchmark                      |        118 |
| GLUE Diagnostic Dataset                             |        154 |
| Automated Fact-Checking of Claims from Wikipedia    |     14,970 |
| From Group to Individual Labels Using Deep Features |      2,110 |
| It Is Not Easy To Detect Paraphrases                |      8,597 |
| <div align="right"><b>Total</b></div>               | **25,949** |

<br>

Additionally, for each of the negated samples, another pair of non-negated sentences has been added by paraphrasing them with the pre-trained model [`🤗tuner007/pegasus_paraphrase`](https://huggingface.co/tuner007/pegasus_paraphrase).

Finally, the swapped version of each pair (premise ⇋ hypothesis) has also been included, and any duplicates have been removed.
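The swap-and-deduplicate step described above can be sketched roughly as follows. This is an illustration only, not the authors' actual pipeline: the function name `add_swapped_pairs`, the tuple layout, and the toy rows are all assumptions made for this example, and only the standard library is used.

```python
from typing import List, Tuple

# Hypothetical row layout for this sketch: (premise, hypothesis, label).
Pair = Tuple[str, str, int]


def add_swapped_pairs(pairs: List[Pair]) -> List[Pair]:
    """Append the premise/hypothesis-swapped copy of each pair, then drop exact duplicates."""
    swapped = [(hypothesis, premise, label) for premise, hypothesis, label in pairs]
    # dict.fromkeys keeps the first occurrence of each tuple and preserves order.
    return list(dict.fromkeys(pairs + swapped))


# Toy rows invented for illustration; they are not taken from the dataset.
toy_pairs = [
    ("It will rain.", "It won't rain.", 1),
    ("The room is cold.", "The room is hot.", 1),
    ("He left early.", "He left early.", 0),  # swapping yields an exact duplicate
]

augmented = add_swapped_pairs(toy_pairs)
print(len(augmented))
```

The label is carried over unchanged when a pair is swapped, on the assumption that "A negates B" and "B negates A" should share the same label.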
With this, the number of premises/hypotheses in the CANNOT dataset that appear in the original datasets is:

| <div align="left"><b>Dataset</b></div>              | <div align="center"><b>Sentences</b></div> |
|:----------------------------------------------------|-------------------------------------------:|
| Not another Negation Benchmark                      |                               552 (0.36 %) |
| GLUE Diagnostic Dataset                             |                               586 (0.38 %) |
| Automated Fact-Checking of Claims from Wikipedia    |                           89,728 (57.98 %) |
| From Group to Individual Labels Using Deep Features |                            12,626 (8.16 %) |
| It Is Not Easy To Detect Paraphrases                |                           17,198 (11.11 %) |
| <div align="right"><b>Total</b></div>               |                      **120,690** (77.99 %) |

The percentages above are in relation to the total number of premises and hypotheses in the CANNOT dataset. The remaining 22.01 % (34,062 sentences) are the novel premises/hypotheses added through paraphrase and rule-based negation.

## Additional Information

### Licensing Information

The CANNOT dataset is released under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">
  <img alt="Creative Commons License" width="100px" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png"/>
</a>

### Citation

Please cite our [INLG 2023 paper](https://aclanthology.org/2023.inlg-main.12/) if you use our dataset.

**BibTeX:**

```bibtex
@inproceedings{anschutz-etal-2023-correct,
    title = "This is not correct! Negation-aware Evaluation of Language Generation Systems",
    author = {Ansch{\"u}tz, Miriam and
      Miguel Lozano, Diego and
      Groh, Georg},
    editor = "Keet, C. Maria and
      Lee, Hung-Yi and
      Zarrie{\ss}, Sina",
    booktitle = "Proceedings of the 16th International Natural Language Generation Conference",
    month = sep,
    year = "2023",
    address = "Prague, Czechia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.inlg-main.12/",
    doi = "10.18653/v1/2023.inlg-main.12",
    pages = "163--175",
    abstract = "Large language models underestimate the impact of negations on how much they change the meaning of a sentence. Therefore, learned evaluation metrics based on these models are insensitive to negations. In this paper, we propose NegBLEURT, a negation-aware version of the BLEURT evaluation metric. For that, we designed a rule-based sentence negation tool and used it to create the CANNOT negation evaluation dataset. Based on this dataset, we fine-tuned a sentence transformer and an evaluation metric to improve their negation sensitivity. Evaluating these models on existing benchmarks shows that our fine-tuned models outperform existing metrics on the negated sentences by far while preserving their base models' performances on other perturbations."
}
```

### Contributions

Contributions to the dataset can be submitted through the [project repository](https://github.com/dmlls/cannot-dataset).

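Provider:

tum-nlp

Summary of original information

Dataset overview: CANNOT

Dataset description

Name: CANNOT
Full name: Compilation of ANnotated, Negation-Oriented Text-pairs
Language: English text only
Number of samples: 77,376
Content: mainly negated text pairs; roughly half are negated sentence pairs and the other half are paraphrases of each other.
Common negation forms: verbal negation (e.g., will → won't) and antonyms (e.g., cold → hot).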

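Dataset structure

Format: .tsv file
Structure:

| premise     | hypothesis                                         | label |
|:------------|:---------------------------------------------------|:-----:|
| A sentence. | An equivalent, non-negated sentence (paraphrased). | 0     |
| A sentence. | The sentence negated.                              | 1     |

Loading: the file can be read into a Pandas DataFrame with `import pandas as pd` followed by `dataset = pd.read_csv('negation_dataset_v1.0.tsv', sep='\t')`.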
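For readers who prefer not to depend on Pandas, the sketch below parses the same three-column layout with Python's standard `csv` module and separates paraphrase pairs (label 0) from negated pairs (label 1). The inline `SAMPLE_TSV` text and the helper name `split_by_label` are assumptions made for illustration; the rows are invented and do not come from the dataset.

```python
import csv
import io

# Inline stand-in for the TSV file; rows are invented for illustration.
SAMPLE_TSV = (
    "premise\thypothesis\tlabel\n"
    "It will rain.\tIt is going to rain.\t0\n"
    "It will rain.\tIt won't rain.\t1\n"
)


def split_by_label(tsv_text):
    """Return (paraphrase_pairs, negated_pairs) parsed from tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    paraphrases, negations = [], []
    for row in reader:
        pair = (row["premise"], row["hypothesis"])
        # Label 1 marks a negated pair, label 0 a paraphrased pair.
        if int(row["label"]) == 1:
            negations.append(pair)
        else:
            paraphrases.append(pair)
    return paraphrases, negations


paraphrases, negations = split_by_label(SAMPLE_TSV)
print(len(paraphrases), len(negations))
```

To run this against the real file, one would pass an open file handle to `csv.DictReader` instead of the `io.StringIO` wrapper; whether the file needs special quoting options is not stated in the source, so that should be checked on the actual data.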

Dataset creation

Sources: merged from several datasets, including:
1. Not another Negation Benchmark
2. GLUE Diagnostic Dataset
3. Automated Fact-Checking of Claims from Wikipedia
4. From Group to Individual Labels Using Deep Features
5. It Is Not Easy To Detect Paraphrases
Processing: for each negated sample, a non-negated sentence pair was added by paraphrasing with the pre-trained model 🤗tuner007/pegasus_paraphrase.

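Additional information

License: the dataset is released under CC BY-SA 4.0.
Citation: if you use this dataset, please cite the INLG 2023 paper.
Contributions: contributions can be submitted through the project repository.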
