
dialogstudio

ModelScope community, updated 2025-12-05, indexed 2025-08-23
Download link:
https://modelscope.cn/datasets/Salesforce/dialogstudio
Resource overview:
<img src="https://huggingface.co/datasets/jianguozhang/logos/resolve/main/logo.png" alt="drawing" width="510"/>

# DialogStudio: Unified Dialog Datasets and Instruction-Aware Models for Conversational AI

**Authors**: [Jianguo Zhang](https://github.com/jianguoz), [Kun Qian](https://github.com/qbetterk)

[Paper](https://arxiv.org/pdf/2307.10172.pdf)|[GitHub](https://github.com/salesforce/DialogStudio)|[GDrive]

🎉 **March 18, 2024: Update for AI Agent**. Check [xLAM](https://github.com/SalesforceAIResearch/xLAM) for the latest data and models relevant to AI Agent!

🎉 **March 10, 2024: Update for dataset viewer issues:**

- Please refer to https://github.com/salesforce/DialogStudio to view each dataset; every data folder provides 5 converted examples along with 5 original examples.
- For example, https://github.com/salesforce/DialogStudio/tree/main/open-domain-dialogues/ShareGPT contains two files: [converted_example.json](https://github.com/salesforce/DialogStudio/blob/main/open-domain-dialogues/ShareGPT/converted_example.json) and [original_example.json](https://github.com/salesforce/DialogStudio/blob/main/open-domain-dialogues/ShareGPT/original_example.json).

<img src="https://huggingface.co/datasets/jianguozhang/logos/resolve/main/DialogStudio_Stats.jpg" alt="drawing" width="800"/>

**Follow the [DialogStudio](https://github.com/salesforce/DialogStudio) GitHub repository for the latest information.**

### Datasets

### Load dataset

The datasets are split into several categories on HuggingFace:

```
Datasets/
├── Knowledge-Grounded-Dialogues
├── Natural-Language-Understanding
├── Open-Domain-Dialogues
├── Task-Oriented-Dialogues
├── Dialogue-Summarization
├── Conversational-Recommendation-Dialogs
```

You can load any dataset in DialogStudio from the [HuggingFace hub](https://huggingface.co/datasets/Salesforce/dialogstudio) by specifying the `{dataset_name}`, which is exactly the dataset folder name.
All available datasets are described in [dataset content](https://github.com/salesforce/DialogStudio/blob/main/Dataset_Stats.csv). For easier reference, [available dataset names](#available-datasets) are also listed below.

Below is one example that loads the [MULTIWOZ2_2](https://huggingface.co/datasets/Salesforce/dialogstudio/blob/main/task_oriented/MULTIWOZ2_2.zip) dataset under the [task-oriented-dialogues](https://huggingface.co/datasets/Salesforce/dialogstudio/tree/main/task_oriented) category:

Load the dataset:

```python
from datasets import load_dataset

# The second argument must match the dataset folder name exactly.
dataset = load_dataset('Salesforce/dialogstudio', 'MULTIWOZ2_2')
```

Here is the output structure of MultiWOZ 2.2:

```python
DatasetDict({
    train: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 8437
    })
    validation: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 1000
    })
})
```

### Available Datasets

The ``data_name`` for ``load_dataset("Salesforce/dialogstudio", data_name)`` can be found below. More detailed information for each dataset can be found in our [GitHub](https://github.com/salesforce/DialogStudio/blob/main/Dataset_Stats.csv) repository.
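Because `data_name` has to match a folder name exactly (compare the spellings `MultiWOZ_2.1` and `MULTIWOZ2_2` in this section), a small local check can catch typos before any download starts. The sketch below is an illustration and not part of DialogStudio: `KNOWN_NAMES`, `find_category`, and `load_dialogstudio` are hypothetical names, and the mapping copies only a handful of entries from the lists in this section.

```python
from typing import Dict, List, Optional

# Hypothetical, abbreviated copy of the category -> names mapping from this
# section; extend it with the remaining names as needed.
KNOWN_NAMES: Dict[str, List[str]] = {
    "task_oriented": ["MultiWOZ_2.1", "MULTIWOZ2_2", "SGD", "KVRET"],
    "open_domain": ["ShareGPT", "SODA", "ConvAI2"],
    "dialogue_summarization": ["SAMSum", "DialogSum"],
}


def find_category(data_name: str) -> Optional[str]:
    """Return the category that lists data_name, or None if it is unknown."""
    for category, names in KNOWN_NAMES.items():
        if data_name in names:  # exact, case-sensitive comparison
            return category
    return None


def load_dialogstudio(data_name: str):
    """Check the name locally, then defer to datasets.load_dataset."""
    if find_category(data_name) is None:
        raise ValueError(f"Unknown DialogStudio dataset name: {data_name!r}")
    # Imported lazily so the name check works without the datasets package.
    from datasets import load_dataset

    return load_dataset("Salesforce/dialogstudio", data_name)


print(find_category("MULTIWOZ2_2"))  # task_oriented
print(find_category("multiwoz2_2"))  # None
```

Calling `load_dialogstudio("MULTIWOZ2_2")` would then issue the same `load_dataset` call as the example in the "Load dataset" section, while a misspelled name fails immediately with a `ValueError`.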
```python
"natural_language_understanding": [
    "ATIS", "ATIS-NER", "BANKING77", "BANKING77-OOS",
    "CLINC-Single-Domain-OOS-banking", "CLINC-Single-Domain-OOS-credit_cards",
    "CLINC150", "DSTC8-SGD", "HWU64", "MIT-Movie", "MIT-Restaurant",
    "RESTAURANTS8K", "SNIPS", "SNIPS-NER", "TOP", "TOP-NER"
],
"task_oriented": [
    "ABCD", "AirDialogue", "BiTOD", "CaSiNo", "CraigslistBargains",
    "Disambiguation", "DSTC2-Clean", "FRAMES", "GECOR", "HDSA-Dialog",
    "KETOD", "KVRET", "MetaLWOZ", "MS-DC", "MuDoCo", "MulDoGO",
    "MultiWOZ_2.1", "MULTIWOZ2_2", "SGD", "SimJointGEN", "SimJointMovie",
    "SimJointRestaurant", "STAR", "Taskmaster1", "Taskmaster2",
    "Taskmaster3", "WOZ2_0"
],
"dialogue_summarization": [
    "AMI", "CRD3", "DialogSum", "ECTSum", "ICSI", "MediaSum", "QMSum",
    "SAMSum", "TweetSumm", "ConvoSumm", "SummScreen_ForeverDreaming",
    "SummScreen_TVMegaSite"
],
"conversational_recommendation": [
    "Redial", "DuRecDial-2.0", "OpenDialKG", "SalesBot",
],
"open_domain": [
    "chitchat-dataset", "ConvAI2", "AntiScam", "Empathetic", "HH-RLHF",
    "PLACES3.5", "Prosocial", "SODA", "ShareGPT"
],
"knowledge_grounded": [
    "CompWebQ", "CoQA", "CoSQL", "DART", "FeTaQA", "GrailQA", "HybridQA",
    "MTOP", "MultiModalQA", "SParC", "Spider", "SQA", "ToTTo", "WebQSP",
    "WikiSQL", "WikiTQ", "wizard_of_internet", "wizard_of_wikipedia"
],
```

# License

Our project follows the following structure with respect to licensing:

1. For all the modified datasets in DialogStudio:
   - A portion of these datasets is under the [Apache License 2.0](https://github.com/salesforce/DialogStudio/blob/main/LICENSE.txt).
   - Some retain their original licenses even after modification.
   - For a few datasets that lacked a license, we have cited the relevant papers.
2. Original dataset licenses: For reference, we also put the original available licenses for each dataset into their respective dataset folders.
3. Code: Our codebase is under the [Apache License 2.0](https://github.com/salesforce/DialogStudio/blob/main/LICENSE.txt).

For detailed licensing information, please refer to the specific licenses accompanying the datasets. If you utilize datasets from DialogStudio, we kindly request that you cite our work.

# Ethical Considerations

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.

# Citation

The data and code in this repository are mostly developed for or derived from the paper below. If you utilize datasets from DialogStudio, we kindly request that you cite both the original work and our own (accepted by EACL 2024 Findings as a long paper).

```
@article{zhang2023dialogstudio,
  title={DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI},
  author={Zhang, Jianguo and Qian, Kun and Liu, Zhiwei and Heinecke, Shelby and Meng, Rui and Liu, Ye and Yu, Zhou and Savarese, Silvio and Xiong, Caiming},
  journal={arXiv preprint arXiv:2307.10172},
  year={2023}
}
```
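Returning to the MultiWOZ 2.2 output structure shown under "Load dataset": the split sizes can be summarized with a few lines of Python. This is a minimal sketch and not part of DialogStudio; `split_sizes` and `split_fractions` are hypothetical helpers, and the stand-in dictionary only mimics the reported row counts so the example runs without downloading anything. The helpers rely on `.items()` and `len()`, which a `DatasetDict` and its splits also support.

```python
from typing import Dict, Mapping, Sized


def split_sizes(splits: Mapping[str, Sized]) -> Dict[str, int]:
    """Map each split name to its number of rows."""
    return {name: len(split) for name, split in splits.items()}


def split_fractions(splits: Mapping[str, Sized]) -> Dict[str, float]:
    """Map each split name to its share of all rows."""
    sizes = split_sizes(splits)
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


# Stand-in for the loaded dataset: ranges with the reported row counts.
fake_multiwoz = {
    "train": range(8437),
    "validation": range(1000),
    "test": range(1000),
}

print(split_sizes(fake_multiwoz))
# {'train': 8437, 'validation': 1000, 'test': 1000}
print(round(split_fractions(fake_multiwoz)["train"], 3))
# 0.808
```

Passing the object returned by `load_dataset('Salesforce/dialogstudio', 'MULTIWOZ2_2')` in place of `fake_multiwoz` should give the same counts if the hosted data still matches the structure printed above.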

Provider:
maas
Created:
2025-08-16