dialogstudio | Conversational AI dataset

ModelScope community. Updated 2025-08-22; indexed 2025-08-23.

Conversational AI

Dataset

Download link:

https://modelscope.cn/datasets/Salesforce/dialogstudio

Resource description:

# DialogStudio: Unified Dialog Datasets and Instruction-Aware Models for Conversational AI

**Author**: [Jianguo Zhang](https://github.com/jianguoz), [Kun Qian](https://github.com/qbetterk)

[Paper](https://arxiv.org/pdf/2307.10172.pdf)|[Github](https://github.com/salesforce/DialogStudio)|[GDrive]

🎉 **March 18, 2024: Update for AI Agent**. Check [xLAM](https://github.com/SalesforceAIResearch/xLAM) for the latest data and models relevant to AI Agent!

🎉 **March 10, 2024: Update for dataset viewer issues:**

- Please refer to https://github.com/salesforce/DialogStudio for a view of each dataset, where we provide 5 converted examples along with 5 original examples under each data folder.
- For example, https://github.com/salesforce/DialogStudio/tree/main/open-domain-dialogues/ShareGPT contains two files: [converted_example.json](https://github.com/salesforce/DialogStudio/blob/main/open-domain-dialogues/ShareGPT/converted_example.json) and [original_example.json](https://github.com/salesforce/DialogStudio/blob/main/open-domain-dialogues/ShareGPT/original_example.json).

**Follow the [DialogStudio](https://github.com/salesforce/DialogStudio) GitHub repository for the latest information.**

### Datasets

### Load dataset

The datasets are split into several categories on HuggingFace:

```
Datasets/
├── Knowledge-Grounded-Dialogues
├── Natural-Language-Understanding
├── Open-Domain-Dialogues
├── Task-Oriented-Dialogues
├── Dialogue-Summarization
├── Conversational-Recommendation-Dialogs
```

You can load any dataset in DialogStudio from the [HuggingFace hub](https://huggingface.co/datasets/Salesforce/dialogstudio) by specifying the `{dataset_name}`, which is exactly the dataset folder name. All available datasets are described in [dataset content](https://github.com/salesforce/DialogStudio/blob/main/Dataset_Stats.csv). For easier reference, [available dataset names](#Available Datasets) are also listed below.

Below is an example that loads the [MULTIWOZ2_2](https://huggingface.co/datasets/Salesforce/dialogstudio/blob/main/task_oriented/MULTIWOZ2_2.zip) dataset under the [task-oriented-dialogues](https://huggingface.co/datasets/Salesforce/dialogstudio/tree/main/task_oriented) category.

Load the dataset:

```python
from datasets import load_dataset

dataset = load_dataset('Salesforce/dialogstudio', 'MULTIWOZ2_2')
```

Here is the output structure of MultiWOZ 2.2:

```python
DatasetDict({
    train: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 8437
    })
    validation: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 1000
    })
})
```

### Available Datasets

The ``data_name`` for ``load_dataset("Salesforce/dialogstudio", data_name)`` can be found below. More detailed information for each dataset can be found in our [github](https://github.com/salesforce/DialogStudio/blob/main/Dataset_Stats.csv).

```python
"natural_language_understanding": [
    "ATIS", "ATIS-NER", "BANKING77", "BANKING77-OOS",
    "CLINC-Single-Domain-OOS-banking", "CLINC-Single-Domain-OOS-credit_cards",
    "CLINC150", "DSTC8-SGD", "HWU64", "MIT-Movie", "MIT-Restaurant",
    "RESTAURANTS8K", "SNIPS", "SNIPS-NER", "TOP", "TOP-NER"
],
"task_oriented": [
    "ABCD", "AirDialogue", "BiTOD", "CaSiNo", "CraigslistBargains",
    "Disambiguation", "DSTC2-Clean", "FRAMES", "GECOR", "HDSA-Dialog",
    "KETOD", "KVRET", "MetaLWOZ", "MS-DC", "MuDoCo", "MulDoGO",
    "MultiWOZ_2.1", "MULTIWOZ2_2", "SGD", "SimJointGEN", "SimJointMovie",
    "SimJointRestaurant", "STAR", "Taskmaster1", "Taskmaster2",
    "Taskmaster3", "WOZ2_0"
],
"dialogue_summarization": [
    "AMI", "CRD3", "DialogSum", "ECTSum", "ICSI", "MediaSum", "QMSum",
    "SAMSum", "TweetSumm", "ConvoSumm", "SummScreen_ForeverDreaming",
    "SummScreen_TVMegaSite"
],
"conversational_recommendation": [
    "Redial", "DuRecDial-2.0", "OpenDialKG", "SalesBot",
],
"open_domain": [
    "chitchat-dataset", "ConvAI2", "AntiScam", "Empathetic", "HH-RLHF",
    "PLACES3.5", "Prosocial", "SODA", "ShareGPT"
],
"knowledge_grounded": [
    "CompWebQ", "CoQA", "CoSQL", "DART", "FeTaQA", "GrailQA", "HybridQA",
    "MTOP", "MultiModalQA", "SParC", "Spider", "SQA", "ToTTo", "WebQSP",
    "WikiSQL", "WikiTQ", "wizard_of_internet", "wizard_of_wikipedia"
],
```

# License

Licensing for this project is structured as follows:

1. For all the modified datasets in DialogStudio:
   - A portion of these datasets is under the [Apache License 2.0](https://github.com/salesforce/DialogStudio/blob/main/LICENSE.txt).
   - Some retain their original licenses even after modification.
   - For a few datasets that lacked a license, we have cited the relevant papers.
2. Original dataset licenses: For reference, we also put the original available licenses for each dataset into their respective dataset folders.
3. Code: Our codebase is under the [Apache License 2.0](https://github.com/salesforce/DialogStudio/blob/main/LICENSE.txt).

For detailed licensing information, please refer to the specific licenses accompanying the datasets. If you utilize datasets from DialogStudio, we kindly request that you cite our work.

# Ethical Considerations

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.

# Citation

The data and code in this repository are mostly developed for or derived from the paper below. If you utilize datasets from DialogStudio, we kindly request that you cite both the original work and our own (accepted by EACL 2024 Findings as a long paper).

```
@article{zhang2023dialogstudio,
  title={DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI},
  author={Zhang, Jianguo and Qian, Kun and Liu, Zhiwei and Heinecke, Shelby and Meng, Rui and Liu, Ye and Yu, Zhou and Savarese, Silvio and Xiong, Caiming},
  journal={arXiv preprint arXiv:2307.10172},
  year={2023}
}
```
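
Returning to the loading instructions above: because the `data_name` passed to `load_dataset` must match a dataset folder name exactly (for example `MULTIWOZ2_2` versus `MultiWOZ_2.1`), it may help to check a name against the category listing before starting a download. The following is a minimal sketch, not part of DialogStudio itself. The helper names `category_of` and `load_dialogstudio` are hypothetical, and the mapping holds only a small subset of the names listed in this README.

```python
from typing import Optional

# Subset of the category -> dataset-name listing from this README.
# Assumption: extend it with the remaining names as needed.
DATASETS_BY_CATEGORY = {
    "task_oriented": ["MULTIWOZ2_2", "SGD", "KVRET"],
    "open_domain": ["ShareGPT", "SODA"],
    "dialogue_summarization": ["SAMSum", "DialogSum"],
}


def category_of(data_name: str) -> Optional[str]:
    """Return the category that lists `data_name`, or None if it is unknown.

    The comparison is case-sensitive, mirroring the exact folder names.
    """
    for category, names in DATASETS_BY_CATEGORY.items():
        if data_name in names:
            return category
    return None


def load_dialogstudio(data_name: str):
    """Hypothetical wrapper: validate the name, then defer to `load_dataset`."""
    if category_of(data_name) is None:
        raise ValueError(f"Unknown DialogStudio dataset name: {data_name!r}")
    # Imported lazily so the lookup helpers work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("Salesforce/dialogstudio", data_name)
```

Calling `load_dialogstudio("MULTIWOZ2_2")` would then perform the same download as the README example, while a misspelled or wrongly cased name fails early with a `ValueError` before any network request is made.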

Provider:

maas

Created:

2025-08-16
