SpursgoZmy/MMTab|多模态学习数据集|表格理解数据集

hugging_face2024-07-18 更新2024-06-12 收录

多模态学习

表格理解

下载链接：

https://hf-mirror.com/datasets/SpursgoZmy/MMTab

下载链接

链接失效反馈

资源简介：

MMTab是一个大规模多模态指令调优数据集，旨在增强和评估多模态大语言模型（LLMs）的视觉表格理解能力。该数据集包含多样化的表格图像和指令跟随数据，涵盖15种表格任务，如表格问答、表格转文本、表格结构理解等。数据集分为三个部分：MMTab-pre（预训练）、MMTab-instruct（指令微调）和MMTab-eval（评估）。每个部分包含不同数量的表格图像和样本数据，且数据格式遵循LLaVA对话格式。数据集基于14个公开的表格数据集构建，涵盖了8个领域，并通过脚本将原始文本表格转换为表格图像。数据集的主要用途是研究大型多模态模型和聊天机器人，主要用户是计算机视觉、自然语言处理、机器学习和人工智能领域的研究人员和爱好者。数据集的局限性包括主要关注英文单表场景，未考虑多表场景和更广泛的语言覆盖，且表格图像主要来自高质量数据集，未涵盖低质量图像。

MMTab is a large-scale multimodal instruction-tuning dataset for enhancing and evaluating the visual table understanding ability of multimodal LLMs. It contains diversified table images and instruction following data, covering 15 tabular tasks, e.g., table question answering, table2text, table structure understanding. The dataset is divided into three parts: MMTab-pre (pre-training), MMTab-instruct (instruction fine-tuning), and MMTab-eval (evaluation). Each part contains different amounts of table images and sample data, and the data format follows the LLaVA dialogue format. The dataset is constructed based on 14 publicly available table datasets from 8 domains, and the original textual tables are converted into table images using scripts. The primary use of MMTab is research on large multimodal models and chatbots, and the primary intended users are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence. The limitations of the dataset include its focus on single-table scenarios in English, lack of consideration for multi-table scenarios and broader language coverage, and the fact that the table images are mainly from high-quality datasets, not covering low-quality images.

提供机构：

SpursgoZmy

原始信息汇总

MMTab 数据集概述

数据集描述

MMTab 是一个大规模多模态指令调优数据集，旨在增强和评估多模态大型语言模型（LLMs）的视觉表格理解能力。该数据集要求模型根据表格图像和输入请求生成正确的响应。MMTab 包含多样化的表格图像和指令跟随数据，涵盖15种表格任务，例如表格问答、表格到文本转换、表格结构理解等。

MMTab 可以分为三个部分（MMTab-pre、MMTab-instruct、MMTab-eval），分别用于预训练、指令微调和评估。

数据集详情

分割	文件名	数据大小	描述
MMTab-eval	`MMTab-eval_table_images_23K.zip`	23K	用于评估的23K表格图像
	`MMTab-eval_test_data_49K.json`	49K	45K样本用于内部评估，4K样本用于外部评估
MMTab-instruct	`MMTab-instruct_table_images_82K.zip`	82K	用于指令微调的82K表格图像
	`MMTab-instruct_sft_data_llava_format_232K.json`	232K	195K单轮和37K多轮指令微调样本，采用LLaVA对话格式
	`enhanced_llava_sft_data_898k.json`	898K	232K MMTab-instruct样本 + 665K原始LLaVA-1.5指令微调样本，用于微调Table-LLaVA
MMTab-pre	`MMTab-instruct_table_images_82K.zip`	82K	这部分表格图像也用于预训练，即作为`MMTab-pre_table_images_part_1_82K.zip`
	`MMTab-pre_table_images_part_2_16K.zip`	16K	从ToTTo数据集额外收集的16K表格图像，用于预训练
	`MMTab-pre_pretrain_data_llava_format_150K.json`	150K	150K表格识别样本，用于预训练，采用LLaVA对话格式
	`enhanced_llava_pretrain_data_708K.json`	708K	150K MMTab-pre样本 + 558K原始LLaVA-1.5预训练样本，用于预训练Table-LLaVA

数据集结构

指令微调和预训练样本遵循LLaVA的对话数据格式，如下所示：

Python {id: ToTTo_train_item_534, # 样本ID image: table_instructV/images/ToTTo_train_table_21297.jpg, # 对应的表格图像文件路径 conversations: [{from: human, # 发言来自人类还是模型 value: "Provide a single-sentence description for the highlighted table cells in a Wikipedia table labeled Chesney Hawkes along with its metadata. <image>"}, # 对话内容 {from: gpt, value: Chesney Hawkes released a single called "Another Fine Mess" in 2005 that reached number 48.}] }

数据集创建

为了支持多模态表格理解的多模态大型语言模型（MLLMs）的开发和评估，我们基于14个公开可用的8个领域的表格数据集构建了MMTab。我们精心设计脚本，将这些数据集中的原始文本表格转换为突出广泛表格结构和样式的表格图像，并将所有特定任务样本转换为具有统一格式的多模态指令微调样本<表格图像，输入请求，输出响应>。

预期用途

主要预期用途： MMTab 主要用于大型多模态模型和聊天机器人的研究。

主要预期用户： MMTab 的主要预期用户是计算机视觉、自然语言处理、机器学习和人工智能领域的研究人员和爱好者。

限制

首先，该数据集主要关注单个英语表格。多表格场景和更广泛的语言覆盖尚未考虑。其次，MMTab 基于精心选择的表格数据集中的真实世界表格，并包含由自动化脚本渲染的多样化高质量表格图像。然而，现实世界中的表格图像可能是低质量的，例如模糊、手写或不完整的表格图像。为了进一步缩小学术研究和实际应用场景之间的差距，未来可以收集更多样化的现实世界表格图像，并构建相应的指令跟随数据。

AI搜集汇总

数据集介绍

构建方式

MMTab数据集的构建基于14个公开的表格数据集，涵盖8个领域。通过精心设计的脚本，原始文本表格被转换为图像，突出了广泛的表格结构和样式。所有任务特定的样本被转换为多模态指令调优样本，采用统一的<表格图像，输入请求，输出响应>格式。这一过程确保了数据集在多模态表格理解任务中的广泛适用性和高质量。

特点

MMTab数据集具有多模态特性，结合了图像和文本数据，适用于增强和评估多模态大语言模型（LLMs）的视觉表格理解能力。数据集分为三个部分：预训练（MMTab-pre）、指令微调（MMTab-instruct）和评估（MMTab-eval），分别用于模型的不同训练阶段。此外，数据集采用了LLaVA对话格式，便于模型的指令调优和评估。

使用方法

MMTab数据集主要用于研究大型多模态模型和聊天机器人。用户可以通过加载数据集中的图像和对话数据，进行模型的预训练、指令微调和评估。数据集的结构遵循LLaVA对话格式，便于直接应用于现有的多模态模型训练框架。此外，数据集的多样性和高质量图像使其成为评估和提升模型在多模态表格理解任务中性能的理想选择。

背景与挑战

背景概述

在多模态语言模型（MLLMs）领域，视觉表格理解能力的重要性日益凸显。MMTab数据集应运而生，旨在通过大规模的多模态指令调优数据集，提升和评估多模态语言模型对表格图像的理解能力。该数据集由ACL 2024会议论文《Multimodal Table Understanding》提出，主要研究人员和机构通过整合14个公开的表格数据集，覆盖8个领域，精心设计脚本将原始文本表格转换为表格图像，并将其转化为多模态指令调优样本。MMTab数据集的构建不仅丰富了表格图像的多样性，还为多模态语言模型的预训练、指令微调和评估提供了坚实的基础，对推动多模态语言模型在实际应用中的表现具有重要意义。

当前挑战

尽管MMTab数据集在多模态表格理解领域取得了显著进展，但仍面临若干挑战。首先，该数据集主要聚焦于单一的英文表格，尚未涵盖多表格场景和更广泛的语言覆盖。其次，虽然MMTab基于高质量的表格图像构建，但现实应用中的表格图像可能存在模糊、手写或不完整等问题，这限制了其在实际场景中的应用。此外，数据集的构建过程中，如何确保表格图像的多样性和代表性，以及如何有效处理和标注这些图像，都是亟待解决的问题。未来，进一步收集和处理多样化、低质量的表格图像，并构建相应的指令跟随数据，将是提升MMTab数据集实用性和广泛性的关键。

常用场景

经典使用场景

在多模态语言模型的研究领域，MMTab数据集以其丰富的表格图像和指令跟随数据，成为评估和增强模型视觉表格理解能力的重要资源。该数据集的经典使用场景包括表格问答、表格到文本的生成以及表格结构理解等任务。通过这些任务，研究者能够深入探索模型在处理复杂表格数据时的表现，从而推动多模态语言模型在实际应用中的性能提升。

实际应用

在实际应用中，MMTab数据集为开发和评估多模态语言模型提供了宝贵的资源。例如，在金融、医疗和教育等领域，表格数据的处理是常见任务。通过使用MMTab数据集，开发者可以训练和优化模型，使其能够准确理解和生成复杂的表格信息，从而提高这些领域中自动化数据处理和分析的效率。

衍生相关工作

MMTab数据集的发布催生了一系列相关研究工作，特别是在多模态语言模型的预训练和微调方面。例如，基于MMTab数据集的预训练和指令微调方法已被应用于Table-LLaVA模型的开发，显著提升了模型在表格理解和生成任务中的表现。此外，该数据集还激发了更多关于多模态数据处理和模型评估的研究，推动了多模态语言模型领域的整体进步。

以上内容由AI搜集并总结生成

用户留言

有没有相关的论文或文献参考？

这个数据集是基于什么背景创建的？

数据集的作者是谁？

能帮我联系到这个数据集的作者吗？

这个数据集如何下载？

点击留言

数据主题

具身智能

数据集 4098个

机构 8个

大模型

数据集 439个

机构 10个

无人机

数据集 37个

机构 6个

指令微调

数据集 36个

机构 6个

蛋白质结构

数据集 50个

机构 8个

空间智能

数据集 21个

机构 5个

5,000+

优质数据集

54 个

任务类型

进入经典数据集

热门数据集

Canadian Census

**Overview** The data package provides demographics for Canadian population groups according to multiple location categories: Forward Sortation Areas (FSAs), Census Metropolitan Areas (CMAs) and Census Agglomerations (CAs), Federal Electoral Districts (FEDs), Health Regions (HRs) and provinces. **Description** The data are available through the Canadian Census and the National Household Survey (NHS), separated or combined. The main demographic indicators provided for the population groups, stratified not only by location but also for the majority by demographical and socioeconomic characteristics, are population number, females and males, usual residents and private dwellings. The primary use of the data at the Health Region level is for health surveillance and population health research. Federal and provincial departments of health and human resources, social service agencies, and other types of government agencies use the information to monitor, plan, implement and evaluate programs to improve the health of Canadians and the efficiency of health services. Researchers from various fields use the information to conduct research to improve health. Non-profit health organizations and the media use the health region data to raise awareness about health, an issue of concern to all Canadians. The Census population counts for a particular geographic area representing the number of Canadians whose usual place of residence is in that area, regardless of where they happened to be on Census Day. Also included are any Canadians who were staying in that area on Census Day and who had no usual place of residence elsewhere in Canada, as well as those considered to be 'non-permanent residents'. National Household Survey (NHS) provides demographic data for various levels of geography, including provinces and territories, census metropolitan areas/census agglomerations, census divisions, census subdivisions, census tracts, federal electoral districts and health regions. In order to provide a comprehensive overview of an area, this product presents data from both the NHS and the Census. NHS data topics include immigration and ethnocultural diversity; aboriginal peoples; education and labor; mobility and migration; language of work; income and housing. 2011 Census data topics include population and dwelling counts; age and sex; families, households and marital status; structural type of dwelling and collectives; and language. The data are collected for private dwellings occupied by usual residents. A private dwelling is a dwelling in which a person or a group of persons permanently reside. Information for the National Household Survey does not include information for collective dwellings. Collective dwellings are dwellings used for commercial, institutional or communal purposes, such as a hotel, a hospital or a work camp. **Benefits** - Useful for canada public health stakeholders, for public health specialist or specialized public and other interested parties. for health surveillance and population health research. for monitoring, planning, implementation and evaluation of health-related programs. media agencies may use the health regions data to raise awareness about health, an issue of concern to all canadians. giving the addition of longitude and latitude in some of the datasets the data can be useful to transpose the values into geographical representations. the fields descriptions along with the dataset description are useful for the user to quickly understand the data and the dataset. **License Information** The use of John Snow Labs datasets is free for personal and research purposes. For commercial use please subscribe to the [Data Library](https://www.johnsnowlabs.com/marketplace/) on John Snow Labs website. The subscription will allow you to use all John Snow Labs datasets and data packages for commercial purposes. **Included Datasets** - [Canadian Population and Dwelling by FSA 2011](https://www.johnsnowlabs.com/marketplace/canadian-population-and-dwelling-by-fsa-2011) - This Canadian Census dataset covers data on population, total private dwellings and private dwellings occupied by usual residents by forward sortation area (FSA). It is enriched with the percentage of the population or dwellings versus the total amount as well as the geographical area, province, and latitude and longitude. The whole Canada's population is marked as 100, referring to 100% for the percentages. - [Detailed Canadian Population Statistics by CMAs and CAs 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-cmas-and-cas-2011) - This dataset covers the population statistics of Canada by Census Metropolitan Areas (CMAs) and Census Agglomerations (CAs). It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. - [Detailed Canadian Population Statistics by FED 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-fed-2011) - This dataset covers the population statistics of Canada from 2011 by Federal Electoral District of 2013 Representation Order. It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. - [Detailed Canadian Population Statistics by Health Region 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-health-region-2011) - This dataset covers the population statistics of Canada by health region. It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. - [Detailed Canadian Population Statistics by Province 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-province-2011) - This dataset covers the population statistics of Canada by provinces and territories. It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. **Data Engineering Overview** **We deliver high-quality data** - Each dataset goes through 3 levels of quality review - 2 Manual reviews are done by domain experts - Then, an automated set of 60+ validations enforces every datum matches metadata & defined constraints - Data is normalized into one unified type system - All dates, unites, codes, currencies look the same - All null values are normalized to the same value - All dataset and field names are SQL and Hive compliant - Data and Metadata - Data is available in both CSV and Apache Parquet format, optimized for high read performance on distributed Hadoop, Spark & MPP clusters - Metadata is provided in the open Frictionless Data standard, and its every field is normalized & validated - Data Updates - Data updates support replace-on-update: outdated foreign keys are deprecated, not deleted **Our data is curated and enriched by domain experts** Each dataset is manually curated by our team of doctors, pharmacists, public health & medical billing experts: - Field names, descriptions, and normalized values are chosen by people who actually understand their meaning - Healthcare & life science experts add categories, search keywords, descriptions and more to each dataset - Both manual and automated data enrichment supported for clinical codes, providers, drugs, and geo-locations - The data is always kept up to date – even when the source requires manual effort to get updates - Support for data subscribers is provided directly by the domain experts who curated the data sets - Every data source’s license is manually verified to allow for royalty-free commercial use and redistribution. **Need Help?** If you have questions about our products, contact us at [info@johnsnowlabs.com](mailto:info@johnsnowlabs.com).

Databricks 收录

poi

本项目收集国内POI兴趣点，当前版本数据来自于openstreetmap。

github 收录

THCHS-30

“THCHS30是由清华大学语音与语言技术中心（CSLT）发布的开放式汉语语音数据库。原始录音是2002年在清华大学国家重点实验室的朱晓燕教授的指导下，由王东完成的。清华大学计算机科学系智能与系统，原名“TCMSD”，意思是“清华连续普通话语音数据库”，时隔13年出版，由王东博士发起，并得到了教授的支持。朱小燕。我们希望为语音识别领域的新研究人员提供一个玩具数据库。因此，该数据库对学术用户完全免费。整个软件包包含建立中文语音识别所需的全套语音和语言资源系统。”

OpenDataLab 收录

YOLO Drone Detection Dataset

为了促进无人机检测模型的开发和评估，我们引入了一个新颖且全面的数据集，专门为训练和测试无人机检测算法而设计。该数据集来源于Kaggle上的公开数据集，包含在各种环境和摄像机视角下捕获的多样化的带注释图像。数据集包括无人机实例以及其他常见对象，以实现强大的检测和分类。

github 收录

中国劳动力动态调查

“中国劳动力动态调查” （China Labor-force Dynamics Survey，简称 CLDS）是“985”三期“中山大学社会科学特色数据库建设”专项内容，CLDS的目的是通过对中国城乡以村/居为追踪范围的家庭、劳动力个体开展每两年一次的动态追踪调查，系统地监测村/居社区的社会结构和家庭、劳动力个体的变化与相互影响，建立劳动力、家庭和社区三个层次上的追踪数据库，从而为进行实证导向的高质量的理论研究和政策研究提供基础数据。

中国学术调查数据资料库收录