silver/mmchat

Name: silver/mmchat
Creator: silver
Published: 2022-07-10 13:04:36
License: No description available

Hugging Face. Updated: 2022-07-10. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/silver/mmchat

Resource description:

---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- zh
license:
- other
multilinguality:
- monolingual
paperswithcode_id: mmchat-multi-modal-chat-dataset-on-social
pretty_name: "MMChat: Multi-Modal Chat Dataset on Social Media"
size_categories:
- 10M<n<100M
source_datasets:
- original
task_categories:
- conversational
task_ids:
- dialogue-generation
---

# Dataset Card for MMChat

## Table of Contents

- [Dataset Card for MMChat](#dataset-card-for-mmchat)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://www.zhengyinhe.com/datasets/
- **Repository:** https://github.com/silverriver/MMChat
- **Paper:** https://arxiv.org/abs/2108.07154

### Dataset Summary

MMChat is a large-scale dialogue dataset that contains image-grounded dialogues in Chinese. Each dialogue in MMChat is associated with one or more images (maximum 9 images per dialogue). We design various strategies to ensure the quality of the dialogues in MMChat.

MMChat comes in 4 versions:

- `mmchat`: The MMChat dataset used in our paper.
- `mmchat_hf`: Contains human annotation on 100K sessions of dialogues.
- `mmchat_raw`: Raw dialogues used to construct MMChat.
- `mmchat_lccc_filtered`: Raw dialogues filtered using the LCCC dataset.

If you want high-quality multi-modal dialogues that are closely related to the given images, I suggest using the `mmchat_hf` version. If you only care about the quality of the dialogue text, I suggest using the `mmchat_lccc_filtered` version.

### Supported Tasks and Leaderboards

- dialogue-generation: The dataset can be used to train a model for generating dialogue responses.
- response-retrieval: The dataset can be used to train a reranker model that can be used to implement a retrieval-based dialogue model.

### Languages

The dialogues in MMChat are in Chinese.

## Dataset Structure

### Data Instances

Several versions of MMChat are available. For `mmchat`, `mmchat_raw`, and `mmchat_lccc_filtered`, the following instance applies:

```json
{
  "dialog": ["你只拍出了你十分之一的美", "你的头像竟然换了，奥"],
  "weibo_content": "分享图片",
  "imgs": ["https://wx4.sinaimg.cn/mw2048/d716a6e2ly1fmug2w2l9qj21o02yox6p.jpg"]
}
```

For `mmchat_hf`, the following instance applies:

```json
{
  "dialog": ["白百合", "啊？", "有点像", "还好吧哈哈哈牙像", "有男盆友没呢", "还没", "和你说话呢。没回我"],
  "weibo_content": "补一张昨天礼仪的照片",
  "imgs": ["https://ww2.sinaimg.cn/mw2048/005Co9wdjw1eyoz7ib9n5j307w0bu3z5.jpg"],
  "labels": {
    "image_qualified": true,
    "dialog_qualified": true,
    "dialog_image_related": true
  }
}
```

### Data Fields

- `dialog` (list of strings): List of utterances that make up a dialogue.
- `weibo_content` (string): Weibo content of the dialogue.
- `imgs` (list of strings): List of URLs of images.
- `labels` (dict): Human-annotated labels of the dialogue (shown only in the `mmchat_hf` instance above).
  - `image_qualified` (bool): Whether the image is of high quality.
  - `dialog_qualified` (bool): Whether the dialogue is of high quality.
  - `dialog_image_related` (bool): Whether the dialogue is related to the image.

### Data Splits

For `mmchat`, we provide the following splits:

| train | valid | test |
|---:|---:|---:|
| 115,842 | 4,000 | 1,000 |

For the other versions, we do not provide an official split.

More statistics are listed here:

| `mmchat` | Count |
|--------------------------------------|--------:|
| Sessions | 120.84 K |
| Sessions with more than 4 utterances | 17.32 K |
| Utterances | 314.13 K |
| Images | 198.82 K |
| Avg. utterances per session | 2.599 |
| Avg. images per session | 2.791 |
| Avg. characters per utterance | 8.521 |

| `mmchat_hf` | Count |
|--------------------------------------|--------:|
| Sessions | 19.90 K |
| Sessions with more than 4 utterances | 8.91 K |
| Total annotated sessions | 100.01 K |
| Utterances | 81.06 K |
| Images | 52.66 K |
| Avg. utterances per session | 4.07 |
| Avg. images per session | 2.70 |
| Avg. characters per utterance | 11.93 |

| `mmchat_raw` | Count |
|--------------------------------------|---------:|
| Sessions | 4.257 M |
| Sessions with more than 4 utterances | 2.304 M |
| Utterances | 18.590 M |
| Images | 4.874 M |
| Avg. utterances per session | 4.367 |
| Avg. images per session | 1.670 |
| Avg. characters per utterance | 14.104 |

| `mmchat_lccc_filtered` | Count |
|--------------------------------------|--------:|
| Sessions | 492.6 K |
| Sessions with more than 4 utterances | 208.8 K |
| Utterances | 1.986 M |
| Images | 1.066 M |
| Avg. utterances per session | 4.031 |
| Avg. images per session | 2.514 |
| Avg. characters per utterance | 11.336 |

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

other-weibo

This dataset is collected from Weibo. Refer to the [detailed policy](https://weibo.com/signup/v5/privacy) that governs use of this dataset, and restrict usage of the dataset to non-commercial purposes.

### Citation Information

```
@inproceedings{zheng2022MMChat,
  author    = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian},
  title     = {MMChat: Multi-Modal Chat Dataset on Social Media},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  year      = {2022},
  publisher = {European Language Resources Association},
}

@inproceedings{wang2020chinese,
  title     = {A Large-Scale Chinese Short-Text Conversation Dataset},
  author    = {Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle = {NLPCC},
  year      = {2020},
  url       = {https://arxiv.org/abs/2008.03946}
}
```

### Contributions

Thanks to [Yinhe Zheng](https://github.com/silverriver) for adding this dataset.
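The three boolean `labels` documented for `mmchat_hf` lend themselves to simple filtering. The following is a minimal sketch, not the authors' own selection procedure: it parses the `mmchat_hf` sample record from this card and keeps only records where all three labels are true. The helper names `is_fully_qualified` and `filter_qualified` are hypothetical, and requiring all three labels at once is an assumption about what "high quality" should mean for a given use case.

```python
import json

# Sample `mmchat_hf` record, copied from the Data Instances section of the card.
SAMPLE = json.loads("""
{
  "dialog": ["白百合", "啊？", "有点像", "还好吧哈哈哈牙像", "有男盆友没呢", "还没", "和你说话呢。没回我"],
  "weibo_content": "补一张昨天礼仪的照片",
  "imgs": ["https://ww2.sinaimg.cn/mw2048/005Co9wdjw1eyoz7ib9n5j307w0bu3z5.jpg"],
  "labels": {
    "image_qualified": true,
    "dialog_qualified": true,
    "dialog_image_related": true
  }
}
""")

# Label names as documented under Data Fields.
LABEL_KEYS = ("image_qualified", "dialog_qualified", "dialog_image_related")


def is_fully_qualified(record: dict) -> bool:
    """Return True only when every documented label is present and true.

    Hypothetical helper: records without a `labels` dict (for example, from
    the `mmchat`, `mmchat_raw`, or `mmchat_lccc_filtered` versions) are rejected.
    """
    labels = record.get("labels") or {}
    return all(labels.get(key) is True for key in LABEL_KEYS)


def filter_qualified(records):
    """Keep the records that pass `is_fully_qualified`, preserving order."""
    return [record for record in records if is_fully_qualified(record)]
```

For example, `filter_qualified([SAMPLE])` returns a list containing the sample record, while a copy of the sample with `dialog_image_related` set to `false` would be dropped.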

Provider:

silver

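Summary of Original Information

Dataset Overview

Dataset Name

MMChat: Multi-Modal Chat Dataset on Social Media

Dataset Details

Language: Chinese
License: other-weibo
Multilinguality: monolingual
Size: 10M<n<100M
Source data: original
Task category: conversational
Task ID: dialogue-generation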

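Dataset Versions

mmchat: The MMChat dataset used in the paper.
mmchat_hf: Contains human annotation on 100K dialogue sessions.
mmchat_raw: Raw dialogues used to construct MMChat.
mmchat_lccc_filtered: Raw dialogues filtered using the LCCC dataset.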

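Dataset Structure

Data instances: Each instance contains a dialogue, the Weibo content, and image links.
Data fields: dialogue (`dialog`), Weibo content (`weibo_content`), image links (`imgs`), and annotation labels (`labels`).
Data splits: The mmchat version provides train, validation, and test splits.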
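The field types and the limit of 9 images per dialogue stated in the dataset card can be checked mechanically before training. The sketch below is illustrative only: `validate_record` and `MAX_IMAGES` are hypothetical names, and the rule that `dialog` must be non-empty is an assumption rather than something the card states.

```python
from typing import List

MAX_IMAGES = 9  # the card states a maximum of 9 images per dialogue


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found; an empty list means the record looks well-formed."""
    problems = []

    dialog = record.get("dialog")
    if not isinstance(dialog, list) or not all(isinstance(u, str) for u in dialog):
        problems.append("dialog must be a list of strings")
    elif not dialog:
        # Assumption: a dialogue with zero utterances is treated as invalid.
        problems.append("dialog must not be empty")

    if not isinstance(record.get("weibo_content"), str):
        problems.append("weibo_content must be a string")

    imgs = record.get("imgs")
    if not isinstance(imgs, list) or not all(isinstance(u, str) for u in imgs):
        problems.append("imgs must be a list of strings")
    elif not 1 <= len(imgs) <= MAX_IMAGES:
        # The card says each dialogue has one or more images, at most 9.
        problems.append("imgs must contain between 1 and %d URLs" % MAX_IMAGES)

    return problems


# Record copied from the dataset card's example for `mmchat`.
example = {
    "dialog": ["你只拍出了你十分之一的美", "你的头像竟然换了，奥"],
    "weibo_content": "分享图片",
    "imgs": ["https://wx4.sinaimg.cn/mw2048/d716a6e2ly1fmug2w2l9qj21o02yox6p.jpg"],
}
```

Calling `validate_record(example)` yields an empty list, while a record with ten image URLs or a missing `weibo_content` produces a descriptive message for each violated rule.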

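Dataset Creation

Licensing information: The dataset is collected from Weibo and is restricted to non-commercial use.
Citation information: When citing, use the references provided in the dataset card.

Considerations for Use

License: Use of the dataset must comply with Weibo's detailed policy.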

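Dataset Statistics

`mmchat`

Sessions: 120.84 K
Avg. utterances per session: 2.599
Avg. images per session: 2.791

`mmchat_hf`

Sessions: 19.90 K
Total annotated sessions: 100.01 K
Avg. utterances per session: 4.07
Avg. images per session: 2.70

`mmchat_raw`

Sessions: 4.257 M
Avg. utterances per session: 4.367
Avg. images per session: 1.670

`mmchat_lccc_filtered`

Sessions: 492.6 K
Avg. utterances per session: 4.031
Avg. images per session: 2.514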
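These figures can be cross-checked against each other with a little arithmetic. The sketch below multiplies each version's session count by its average utterances per session and compares the result with the utterance totals reported in the dataset card's tables. It also sums the `mmchat` split sizes from the card. The function names and the 0.5% tolerance are assumptions chosen to absorb rounding in the published numbers.

```python
import math

# version: (sessions, avg. utterances per session, reported utterances),
# all copied from the dataset card's statistics tables.
STATS = {
    "mmchat": (120_840, 2.599, 314_130),
    "mmchat_hf": (19_900, 4.07, 81_060),
    "mmchat_raw": (4_257_000, 4.367, 18_590_000),
    "mmchat_lccc_filtered": (492_600, 4.031, 1_986_000),
}

# Split sizes for `mmchat`, copied from the card's Data Splits table.
MMCHAT_SPLITS = {"train": 115_842, "valid": 4_000, "test": 1_000}


def implied_utterances(version: str) -> float:
    """Sessions multiplied by average utterances per session."""
    sessions, avg_utterances, _ = STATS[version]
    return sessions * avg_utterances


def is_consistent(version: str, rel_tol: float = 0.005) -> bool:
    """True when the implied total is within `rel_tol` of the reported total."""
    reported = STATS[version][2]
    return math.isclose(implied_utterances(version), reported, rel_tol=rel_tol)


if __name__ == "__main__":
    for name in STATS:
        print(name, round(implied_utterances(name)), is_consistent(name))
    print("mmchat split total:", sum(MMCHAT_SPLITS.values()))
```

Under these inputs, all four versions pass the check, and the three `mmchat` splits sum to 120,842 sessions, which matches the rounded figure of 120.84 K. The same multiplication does not reproduce the reported image counts (for `mmchat`, 120.84 K sessions at 2.791 images each is far more than the 198.82 K images listed), so the image totals appear to be counted differently, possibly as unique images. The card does not say.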
