MMCR | Multimodal Dialogue Dataset | Natural Language Processing Dataset

Source: arXiv. Updated: 2025-03-24. Indexed: 2025-03-28.
Multimodal dialogue
Natural language processing
Download link:
http://arxiv.org/abs/2503.18533v1
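Overview:
MMCR is a multimodal multi-turn dialogue dataset created jointly by Northwestern Polytechnical University and Alibaba Group. It has two parts: MMCR-310k and MMCR-Bench. MMCR-310k contains 310,000 dialogues, each covering 1 to 4 images and running for either 4 or 8 turns. MMCR-Bench is a diagnostic benchmark whose dialogues span 8 domains and 40 sub-topics. By simulating real-world interaction between users and chatbots, the dataset emphasizes contextual relevance and logical progression from turn to turn, and it aims to improve the contextual reasoning of vision-language models in multi-turn dialogue.
Provided by:
Northwestern Polytechnical University
Created:
2025-03-24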
AI-collected summary
Dataset introduction
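Construction
Building multi-turn, multi-image dialogue datasets has long been a challenge in vision-language modeling. MMCR was built by first selecting 1.2 million single-image samples and 400,000 multi-image samples from OmniCorpus-CC-210M as base data. Carefully designed prompts then guided GPT-4o to generate multi-turn dialogues with strong contextual logic. For quality control, a CLIP model scored the semantic similarity of the generated dialogues, and samples that failed the check were filtered out, leaving 210,000 single-image and 100,000 multi-image dialogues. The process puts particular weight on contextual relevance, so that each turn explores image details and thematic connections in depth.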
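The filtering step can be pictured with a minimal sketch. It assumes that each sample already carries a precomputed image embedding and text embedding, and that a sample is kept when their cosine similarity reaches a cutoff. The field names `image_embedding` and `text_embedding` and the threshold value 0.25 are hypothetical; the page does not say what the similarity was computed between or which cutoff was used.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(samples, threshold=0.25):
    """Keep samples whose embeddings are similar enough.

    `threshold` is a hypothetical value chosen for illustration only.
    """
    kept = []
    for sample in samples:
        score = cosine_similarity(sample["image_embedding"], sample["text_embedding"])
        if score >= threshold:
            kept.append(sample)
    return kept


# Toy embeddings standing in for real CLIP outputs.
toy_samples = [
    {"id": "a", "image_embedding": [1.0, 0.0], "text_embedding": [0.9, 0.1]},
    {"id": "b", "image_embedding": [1.0, 0.0], "text_embedding": [0.0, 1.0]},
]
kept_ids = [s["id"] for s in filter_by_similarity(toy_samples)]
```

In a real pipeline the embeddings would come from a CLIP image encoder and text encoder; the toy vectors here only exercise the filtering logic.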
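Features
Presented as the largest multi-image, multi-turn dialogue dataset to date, MMCR contains 310,000 contextual dialogues covering 1 to 4 images and 4 or 8 turns. The dataset stresses contextual relevance and logical progression: each turn must build on the preceding turns and dig further into image details and thematic connections. It covers 8 major domains (including humanities, nature, science, and education) and 40 sub-topics, which gives it broad and varied content. These properties make MMCR well suited to evaluating and improving the contextual reasoning of vision-language models.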
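To make the stated shape of a dialogue concrete, here is a minimal sketch of a record type with a structural check. It assumes each dialogue has 1 to 4 images and either 4 or 8 turns, as the overview states. The class and field names (`Turn`, `Dialogue`, `image_paths`, `turns`) are hypothetical and do not reflect the released file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    question: str  # user message for this turn
    answer: str    # assistant reply for this turn


@dataclass
class Dialogue:
    image_paths: List[str]  # hypothetical field: the images the dialogue refers to
    turns: List[Turn]       # hypothetical field: ordered question/answer turns


def is_well_formed(dialogue: Dialogue) -> bool:
    """Check the image count (1 to 4) and the turn count (4 or 8)."""
    return 1 <= len(dialogue.image_paths) <= 4 and len(dialogue.turns) in (4, 8)


four_turns = [Turn(f"question {i}", f"answer {i}") for i in range(4)]
example = Dialogue(image_paths=["img_0.jpg", "img_1.jpg"], turns=four_turns)
too_many_images = Dialogue(image_paths=["x.jpg"] * 5, turns=four_turns)

example_ok = is_well_formed(example)
too_many_ok = is_well_formed(too_many_images)
```

A check like this is useful as a sanity pass after loading data, before any training or evaluation code runs.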
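Usage
MMCR-310k can be used for fine-tuning: training a model on multi-turn, multi-image dialogues to improve its contextual reasoning. MMCR-Bench can be used for evaluation; its 600 curated samples span multiple domains and topics. Evaluation uses GPT-4o as the judge, scoring along five dimensions that include descriptive accuracy, contextual consistency, and logical relations. The task types should be kept in a balanced distribution when using the dataset: experiments indicate that a sensible data mix is critical to model performance, which reflects a 'less is more' approach to data use.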
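Once a judge has scored each benchmark sample, the per-dimension results can be summarized with a short aggregation step. The sketch below assumes the judge returns one numeric score per dimension for every sample. The dimension keys cover only the three dimensions named on this page, and both the key names and the score scale are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_by_dimension(judge_scores):
    """Average judge scores per dimension across all evaluated samples."""
    per_dimension = defaultdict(list)
    for sample_scores in judge_scores:
        for dimension, score in sample_scores.items():
            per_dimension[dimension].append(score)
    return {dimension: mean(values) for dimension, values in per_dimension.items()}


# Toy scores for two samples; real scores would come from the judge model.
toy_scores = [
    {"descriptive_accuracy": 8, "contextual_consistency": 6, "logical_relations": 7},
    {"descriptive_accuracy": 6, "contextual_consistency": 8, "logical_relations": 9},
]
averages = average_by_dimension(toy_scores)
```

Reporting each dimension separately, instead of one pooled number, keeps a weakness in one area (for example contextual consistency) from being hidden by strength in another.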
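Background and challenges
Background
MMCR (Multimodal Multi-turn Contextual Reasoning) was proposed in 2025 by a research team from the School of Cybersecurity at Northwestern Polytechnical University, an AI business unit of Alibaba Group, and the College of Computer Science and Technology at Zhejiang University, with the goal of advancing vision-language models (VLMs) in multi-turn, multi-image dialogue. The dataset has two parts: MMCR-310k, with 310,000 multi-turn dialogues, and MMCR-Bench, an evaluation benchmark covering 8 major domains and 40 sub-topics. MMCR addresses a gap in evaluating the multi-turn, multi-image dialogue ability of existing VLMs in real human-computer interaction. Its design draws on the focused topics and logical coherence of human conversation. The data were generated with GPT-4o and filtered with a CLIP model, and models trained on it are reported to perform better on contextual reasoning tasks.
Current challenges
The challenges are of two kinds. At the task level, conventional VLMs are optimized mainly for single-image, single-turn dialogue and struggle with the complex semantic associations and long-range dependencies that span multiple images and turns, which leads to weak dialogue coherence and topic consistency. At the construction level, the builders had to overcome hallucination when generating multi-turn dialogues from single-image data, ensure sufficiently strong semantic association among multiple images, and use prompt engineering to control how the dialogue deepens and progresses logically turn by turn. In addition, the effect of data balance on model performance exposed a 'less is more' phenomenon, so construction has to weigh dataset scale against a sensible distribution of task types.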
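One simple way to keep task types balanced is to cap the number of samples drawn from each type. The sketch below is an illustration under stated assumptions, not the authors' procedure: the `task_type` field, the type labels, and the equal-cap rule are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_subset(samples, per_type, seed=0):
    """Draw at most `per_type` samples from each task type, reproducibly."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for sample in samples:
        by_type[sample["task_type"]].append(sample)
    subset = []
    for task_type in sorted(by_type):  # sorted for a deterministic order
        group = by_type[task_type]
        subset.extend(rng.sample(group, min(per_type, len(group))))
    return subset


# Toy pool: ten single-image samples and three multi-image samples.
toy_pool = [{"task_type": "single_image", "id": i} for i in range(10)]
toy_pool += [{"task_type": "multi_image", "id": i} for i in range(3)]

subset = balanced_subset(toy_pool, per_type=3)
counts = Counter(sample["task_type"] for sample in subset)
```

Capping each type trades raw data volume for a more even mix, which is the trade-off the 'less is more' observation points to.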
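Common scenarios
Typical use cases
In VLM research, MMCR is used to train and evaluate models in multi-turn, multi-image dialogue. By simulating the continuous conversation patterns of real human-computer interaction, it gives models rich context to reason over. With the 310,000 dialogues in MMCR-310k, each 4 or 8 turns long, researchers can train models to handle complex dialogue sequences involving 1 to 4 images and improve coherence in cross-modal contexts.
Academic problems addressed
MMCR targets core problems that current vision-language models show in long multi-turn dialogue, such as breaks in logic and weak contextual consistency. Its evaluation framework, MMCR-Bench, covers 8 major domains and 40 sub-topics and uses 600 annotated dialogue samples to quantify model behavior along five dimensions, including cross-image reference, topic continuity, and redundancy control. Reported experiments show that models fine-tuned on the dataset improve contextual accuracy by 5.2%, along with gains of 1.1% to 1.2% on existing benchmarks such as AI2D.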
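Derived and related work
Work connected to the dataset includes the Ovis model, which uses a learnable visual embedding table for cross-modal alignment. The 'less is more' training phenomenon reported with MMCR challenges the assumption that more data is always better. MMDU-45k is a related dataset for multi-image dialogue.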
The content above was collected and summarized by AI.