five

silver/mmchat

收藏
Hugging Face2022-07-10 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/silver/mmchat
下载链接
链接失效反馈
官方服务:
资源简介:
--- annotations_creators: - no-annotation language_creators: - found language: - zh license: - other multilinguality: - monolingual paperswithcode_id: mmchat-multi-modal-chat-dataset-on-social pretty_name: "MMChat: Multi-Modal Chat Dataset on Social Media" size_categories: - 10M<n<100M source_datasets: - original task_categories: - conversational task_ids: - dialogue-generation --- # Dataset Card for MMChat ## Table of Contents - [Dataset Card for MMChat](#dataset-card-for-mmchat) - [Table of Contents](#table-of-contents) - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization) - [Who are the source language producers?](#who-are-the-source-language-producers) - [Annotations](#annotations) - [Annotation process](#annotation-process) - [Who are the annotators?](#who-are-the-annotators) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://www.zhengyinhe.com/datasets/ - **Repository:** https://github.com/silverriver/MMChat - **Paper:** https://arxiv.org/abs/2108.07154 ### Dataset Summary MMChat is a large-scale dialogue dataset that contains image-grounded dialogues in Chinese. Each dialogue in MMChat is associated with one or more images (maximum 9 images per dialogue). We design various strategies to ensure the quality of the dialogues in MMChat. MMChat comes with 4 different versions: - `mmchat`: The MMChat dataset used in our paper. - `mmchat_hf`: Contains human annotation on 100K sessions of dialogues. - `mmchat_raw`: Raw dialogues used to construct MMChat. `mmchat_lccc_filtered`: Raw dialogues filtered using the LCCC dataset. If you what to use high quality multi-modal dialogues that are closed related to the given images, I suggest you to use the `mmchat_hf` version. If you only care about the quality of dialogue texts, I suggest you to use the `mmchat_lccc_filtered` version. ### Supported Tasks and Leaderboards - dialogue-generation: The dataset can be used to train a model for generating dialogue responses. - response-retrieval: The dataset can be used to train a reranker model that can be used to implement a retrieval-based dialogue model. ### Languages MMChat is in Chinese MMChat中的对话是中文的 ## Dataset Structure ### Data Instances Several versions of MMChat are available. For `mmchat`, `mmchat_raw`, `mmchat_lccc_filtered`, the following instance applies: ```json { "dialog": ["你只拍出了你十分之一的美", "你的头像竟然换了,奥"], "weibo_content": "分享图片", "imgs": ["https://wx4.sinaimg.cn/mw2048/d716a6e2ly1fmug2w2l9qj21o02yox6p.jpg"] } ``` For `mmchat_hf`, the following instance applies: ```json { "dialog": ["白百合", "啊?", "有点像", "还好吧哈哈哈牙像", "有男盆友没呢", "还没", "和你说话呢。没回我"], "weibo_content": "补一张昨天礼仪的照片", "imgs": ["https://ww2.sinaimg.cn/mw2048/005Co9wdjw1eyoz7ib9n5j307w0bu3z5.jpg"], "labels": { "image_qualified": true, "dialog_qualified": true, "dialog_image_related": true } } ``` ### Data Fields - `dialog` (list of strings): List of utterances consisting of a dialogue. - `weibo_content` (string): Weibo content of the dialogue. - `imgs` (list of strings): List of URLs of images. - `labels` (dict): Human-annotated labels of the dialogue. - `image_qualified` (bool): Whether the image is of high quality. - `dialog_qualified` (bool): Whether the dialogue is of high quality. - `dialog_image_related` (bool): Whether the dialogue is related to the image. ### Data Splits For `mmchat`, we provide the following splits: |train|valid|test| |---:|---:|---:| |115,842 | 4,000 | 1,000 | For other versions, we do not provide the offical split. More stastics are listed here: | `mmchat` | Count | |--------------------------------------|--------:| | Sessions | 120.84 K | | Sessions with more than 4 utterances | 17.32 K | | Utterances | 314.13 K | | Images | 198.82 K | | Avg. utterance per session | 2.599 | | Avg. image per session | 2.791 | | Avg. character per utterance | 8.521 | | `mmchat_hf` | Count | |--------------------------------------|--------:| | Sessions | 19.90 K | | Sessions with more than 4 utterances | 8.91 K | | Totally annotated sessions | 100.01 K | | Utterances | 81.06 K | | Images | 52.66K | | Avg. utterance per session | 4.07 | | Avg. image per session | 2.70 | | Avg. character per utterance | 11.93 | | `mmchat_raw` | Count | |--------------------------------------|---------:| | Sessions | 4.257 M | | Sessions with more than 4 utterances | 2.304 M | | Utterances | 18.590 M | | Images | 4.874 M | | Avg. utterance per session | 4.367 | | Avg. image per session | 1.670 | | Avg. character per utterance | 14.104 | | `mmchat_lccc_filtered` | Count | |--------------------------------------|--------:| | Sessions | 492.6 K | | Sessions with more than 4 utterances | 208.8 K | | Utterances | 1.986 M | | Images | 1.066 M | | Avg. utterance per session | 4.031 | | Avg. image per session | 2.514 | | Avg. character per utterance | 11.336 | ## Dataset Creation ### Curation Rationale [Needs More Information] ### Source Data #### Initial Data Collection and Normalization [Needs More Information] #### Who are the source language producers? [Needs More Information] ### Annotations #### Annotation process [Needs More Information] #### Who are the annotators? [Needs More Information] ### Personal and Sensitive Information [Needs More Information] ## Considerations for Using the Data ### Social Impact of Dataset [Needs More Information] ### Discussion of Biases [Needs More Information] ### Other Known Limitations [Needs More Information] ## Additional Information ### Dataset Curators [Needs More Information] ### Licensing Information other-weibo This dataset is collected from Weibo. You can refer to the [detailed policy](https://weibo.com/signup/v5/privacy) required to use this dataset. Please restrict the usage of this dataset to non-commerical purposes. ### Citation Information ``` @inproceedings{zheng2022MMChat, author = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian}, title = {MMChat: Multi-Modal Chat Dataset on Social Media}, booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference}, year = {2022}, publisher = {European Language Resources Association}, } @inproceedings{wang2020chinese, title={A Large-Scale Chinese Short-Text Conversation Dataset}, author={Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie}, booktitle={NLPCC}, year={2020}, url={https://arxiv.org/abs/2008.03946} } ``` ### Contributions Thanks to [Yinhe Zheng](https://github.com/silverriver) for adding this dataset.
提供机构:
silver
原始信息汇总

数据集概述

数据集名称

  • MMChat: Multi-Modal Chat Dataset on Social Media

数据集详情

  • 语言: 中文
  • 许可证: 其他-微博
  • 多语言性: 单语种
  • 大小: 10M<n<100M
  • 源数据: 原始数据
  • 任务类别: 对话生成
  • 任务ID: 对话生成

数据集版本

  • mmchat: 论文中使用的MMChat数据集。
  • mmchat_hf: 包含100K对话会话的人工标注。
  • mmchat_raw: 构建MMChat的原始对话。
  • mmchat_lccc_filtered: 使用LCCC数据集过滤的原始对话。

数据集结构

  • 数据实例: 每个实例包含对话、微博内容和图片链接。
  • 数据字段: 对话、微博内容、图片链接、标注标签。
  • 数据分割: mmchat版本提供训练、验证和测试分割。

数据集创建

  • 许可证信息: 数据集收集自微博,仅限非商业用途。
  • 引用信息: 引用时需使用提供的文献信息。

使用考虑

  • 许可证: 使用时需遵守微博的详细政策。

数据集统计

mmchat

  • 会话数: 120.84 K
  • 平均每会话发言数: 2.599
  • 平均每会话图片数: 2.791

mmchat_hf

  • 会话数: 19.90 K
  • 总标注会话数: 100.01 K
  • 平均每会话发言数: 4.07
  • 平均每会话图片数: 2.70

mmchat_raw

  • 会话数: 4.257 M
  • 平均每会话发言数: 4.367
  • 平均每会话图片数: 1.670

mmchat_lccc_filtered

  • 会话数: 492.6 K
  • 平均每会话发言数: 4.031
  • 平均每会话图片数: 2.514
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作