ibm-research/clinic150-sur

Name: ibm-research/clinic150-sur
Creator: ibm-research
Published: 2023-05-30 11:22:19
License: MIT

Source: Hugging Face. Updated 2023-05-30. Indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/ibm-research/clinic150-sur


Description:

---
license: mit
annotations_creators: other
language_creators: other
language: en
multilinguality: monolingual
size_categories: 100K<n<1M
source_datasets: extended|clinic150
task_categories:
- text-classification
pretty_name: Clinic150-SUR
---
dataset_info:
  features:
  - name: intent
    dtype: string
  - name: user_utterance
    dtype: string
  - name: origin
    dtype: string

# Dataset Card for "clinic150-SUR"

### Dataset Summary

The Clinic150-SUR dataset is an augmented dataset designed to simulate natural human behavior during interactions with customer-service-like centers. Extending the [Clinic150 dataset](https://aclanthology.org/D19-1131/), it adds utterances generated by two augmentation models, IBM's [LAMBADA](https://arxiv.org/abs/1911.03118) and [Parrot](https://github.com/PrithivirajDamodaran/Parrot_Paraphraser), along with carefully curated duplicated utterances. The dataset aims to provide a more comprehensive and realistic representation of customer service interactions, supporting the development and evaluation of robust and efficient dialogue systems.

Key Features:

- Augmentation with IBM's [LAMBADA Model](https://arxiv.org/abs/1911.03118): The dataset uses IBM's LAMBADA model, a language generation model trained on a large corpus of text, to augment the original data. This increases the diversity and complexity of the dialogue data and covers a broader range of interactions.
- Integration of the [Parrot](https://github.com/PrithivirajDamodaran/Parrot_Paraphraser) Model: In addition to LAMBADA, the dataset incorporates the Parrot model, which contributes paraphrased variations of existing utterances.
- Duplicated Utterances: The dataset includes carefully curated duplicated utterances to mimic real-world scenarios in which users rephrase or repeat commonly asked queries. These duplicates reflect the natural tendencies of human interaction and help dialogue systems handle such cases better.
- [Clinic150](https://aclanthology.org/D19-1131/) as the Foundation: The dataset is built on Clinic150, which originally consisted of 150 in-domain intent classes and 150 human utterances for each intent. Building on this foundation, the augmented dataset retains the in-domain coverage while better reflecting the nature of user requests to a dialogue system.

### Data Instances

#### clinic150-SUR

- **Size of downloaded dataset file:** 29 MB

### Data Fields

#### clinic150-SUR

- `intent`: a `string` feature.
- `user_utterance`: a `string` feature.
- `origin`: a `string` feature, one of `'original'`, `'lambada'`, or `'parrot'`.

### Citation Information

```
@inproceedings{rabinovich2022reliable,
  title={Reliable and Interpretable Drift Detection in Streams of Short Texts},
  author={Rabinovich, Ella and Vetzler, Matan and Ackerman, Samuel and Anaby-Tavor, Ateret},
  booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (industry track)",
  publisher = "Association for Computational Linguistics",
  year={2023},
  url={https://arxiv.org/abs/2305.17750}
}
```

### Contributions

Thanks to [Matan Vetzler](https://www.linkedin.com/in/matanvetzler/) and [Ella Rabinovich](https://www.linkedin.com/in/ella-rabinovich-7b9a06/) for adding this dataset.
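### Example: inspecting the fields

The three fields listed under Data Fields are enough to separate original from augmented utterances and to spot repeated queries. The following is a minimal sketch, not part of the original card. The sample rows are hypothetical and written by hand to match the schema, and the helper names `count_by_origin` and `duplicated_utterances` are illustrative. In practice the rows would come from the downloaded dataset rather than a literal list.

```python
from collections import Counter

# Hypothetical sample rows that follow the card's schema
# (intent, user_utterance, origin). They are NOT taken from the dataset.
rows = [
    {"intent": "balance", "user_utterance": "what is my balance", "origin": "original"},
    {"intent": "balance", "user_utterance": "how much money do i have", "origin": "parrot"},
    {"intent": "balance", "user_utterance": "what is my balance", "origin": "original"},
    {"intent": "translate", "user_utterance": "how do you say hello in french", "origin": "lambada"},
]

# The card lists these three values for the `origin` field.
VALID_ORIGINS = {"original", "lambada", "parrot"}


def count_by_origin(rows):
    """Count rows per `origin` value, rejecting values the card does not list."""
    for row in rows:
        if row["origin"] not in VALID_ORIGINS:
            raise ValueError(f"unexpected origin: {row['origin']!r}")
    return Counter(row["origin"] for row in rows)


def duplicated_utterances(rows):
    """Return (intent, utterance) pairs that occur more than once, with counts."""
    counts = Counter((row["intent"], row["user_utterance"]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


print(count_by_origin(rows))
print(duplicated_utterances(rows))
```

Grouping on the `(intent, user_utterance)` pair rather than on the utterance alone is a design choice of this sketch: it keeps identical text filed under different intents from being counted as the same repeated query.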

Provider:

ibm-research
