ibm-research/clinic150-sur
Source: Hugging Face. Updated 2023-05-30; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/ibm-research/clinic150-sur
Resource description:
---
license: mit
annotations_creators: other
language_creators: other
language: en
multilinguality: monolingual
size_categories: 100K<n<1M
source_datasets: extended|clinic150
task_categories:
- text-classification
pretty_name: Clinic150-SUR
---
dataset_info:
  features:
  - name: intent
    dtype: string
  - name: user_utterance
    dtype: string
  - name: origin
    dtype: string
# Dataset Card for "clinic150-SUR"
### Dataset Summary
The Clinic150-SUR dataset is an augmented dataset designed to simulate natural human behavior in interactions with customer-service-style centers.
It extends the [Clinic150 dataset](https://aclanthology.org/D19-1131/) with two augmentation techniques, IBM's [LAMBADA](https://arxiv.org/abs/1911.03118) and the [Parrot](https://github.com/PrithivirajDamodaran/Parrot_Paraphraser) paraphraser, and with carefully curated duplicated utterances.
This dataset aims to provide a more comprehensive and realistic representation of customer service interactions,
facilitating the development and evaluation of robust and efficient dialogue systems.
Key Features:
- Augmentation with IBM's [LAMBADA Model](https://arxiv.org/abs/1911.03118): The Clinic150-SUR dataset uses IBM's LAMBADA model, a language generation model trained on a large corpus of text, to augment the original dataset. This augmentation increases the diversity and complexity of the dialogue data, allowing for a broader range of interactions.
- Integration of the [Parrot](https://github.com/PrithivirajDamodaran/Parrot_Paraphraser) Model: In addition to LAMBADA, the dataset incorporates the Parrot model, which generates paraphrases and so adds further variations of existing utterances.
- Duplicated Utterances: The dataset includes carefully curated duplicated utterances to mimic real-world scenarios where users rephrase or repeat commonly asked queries. This feature adds variability to the data, reflecting the natural tendencies of human interactions, and enables dialogue systems to handle such instances better.
- [Clinic150](https://aclanthology.org/D19-1131/) as the Foundation: The Clinic150-SUR dataset is built upon the Clinic150 dataset, which originally consisted of 150 in-domain intent classes with 150 human utterances per intent. Building on this foundation, the augmented dataset retains the in-domain coverage while better reflecting the nature of user requests to a dialogue system.
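
The duplicated utterances described above can be inspected with a simple count. The following is a minimal sketch, assuming rows are available as dictionaries with the `intent` and `user_utterance` fields; the sample rows and the helper name `repeated_utterances` are made up for illustration and are not taken from the dataset or its tooling.

```python
from collections import Counter

# Made-up sample rows for illustration only; not taken from the dataset.
sample_rows = [
    {"intent": "balance", "user_utterance": "what is my balance", "origin": "original"},
    {"intent": "balance", "user_utterance": "what is my balance", "origin": "original"},
    {"intent": "balance", "user_utterance": "how much money do i have", "origin": "parrot"},
]


def repeated_utterances(rows):
    """Return {(intent, utterance): count} for utterances occurring more than once."""
    counts = Counter((row["intent"], row["user_utterance"]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    for (intent, utterance), n in repeated_utterances(sample_rows).items():
        print(f"{intent!r}: {utterance!r} appears {n} times")
```

Exact string matching is only one possible definition of a duplicate; the card does not state how duplicates were curated, so any normalization (case, whitespace) would be a further assumption.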
### Data Instances
#### clinic150-SUR
- **Size of downloaded dataset file:** 29 MB
### Data Fields
#### clinic150-SUR
- `intent`: a `string` feature.
- `user_utterance`: a `string` feature.
- `origin`: a `string` feature ('original', 'lambada', 'parrot').
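
As an illustration of how the `origin` field might be used, the sketch below counts rows per origin and selects rows from one source. It assumes rows are plain dictionaries with the three fields listed above; the sample rows and the helper names `count_by_origin` and `select_origin` are hypothetical.

```python
from collections import Counter

# Origin labels listed in the Data Fields section of this card.
KNOWN_ORIGINS = ("original", "lambada", "parrot")

# Made-up sample rows for illustration only; not taken from the dataset.
sample_rows = [
    {"intent": "balance", "user_utterance": "what is my balance", "origin": "original"},
    {"intent": "balance", "user_utterance": "show me my account balance", "origin": "lambada"},
    {"intent": "balance", "user_utterance": "how much money do i have", "origin": "parrot"},
    {"intent": "weather", "user_utterance": "is it going to rain today", "origin": "original"},
]


def count_by_origin(rows):
    """Count rows per origin label, rejecting labels the card does not list."""
    counts = Counter(row["origin"] for row in rows)
    unknown = set(counts) - set(KNOWN_ORIGINS)
    if unknown:
        raise ValueError(f"unexpected origin labels: {sorted(unknown)}")
    return {origin: counts.get(origin, 0) for origin in KNOWN_ORIGINS}


def select_origin(rows, origin):
    """Return only the rows produced by the given source."""
    return [row for row in rows if row["origin"] == origin]


if __name__ == "__main__":
    print(count_by_origin(sample_rows))
    print(select_origin(sample_rows, "parrot"))
```

Real rows could presumably be obtained with `datasets.load_dataset("ibm-research/clinic150-sur")` from the Hugging Face `datasets` library; the available split names are not stated in this card, so they would need to be checked after loading.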
### Citation Information
```
@inproceedings{rabinovich2022reliable,
title={Reliable and Interpretable Drift Detection in Streams of Short Texts},
author={Rabinovich, Ella and Vetzler, Matan and Ackerman, Samuel and Anaby-Tavor, Ateret},
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (industry track)",
publisher = "Association for Computational Linguistics",
year={2023},
url={https://arxiv.org/abs/2305.17750}
}
```
### Contributions
Thanks to [Matan Vetzler](https://www.linkedin.com/in/matanvetzler/) and [Ella Rabinovich](https://www.linkedin.com/in/ella-rabinovich-7b9a06/) for adding this dataset.
Provider: ibm-research



