SpursgoZmy/MMTab|多模态学习数据集|表格理解数据集
收藏MMTab 数据集概述
数据集描述
MMTab 是一个大规模多模态指令调优数据集,旨在增强和评估多模态大型语言模型(LLMs)的视觉表格理解能力。该数据集要求模型根据表格图像和输入请求生成正确的响应。MMTab 包含多样化的表格图像和指令跟随数据,涵盖15种表格任务,例如表格问答、表格到文本转换、表格结构理解等。
MMTab 可以分为三个部分(MMTab-pre、MMTab-instruct、MMTab-eval),分别用于预训练、指令微调和评估。
数据集详情
分割 | 文件名 | 数据大小 | 描述 |
---|---|---|---|
MMTab-eval | MMTab-eval_table_images_23K.zip |
23K | 用于评估的23K表格图像 |
MMTab-eval_test_data_49K.json |
49K | 45K样本用于内部评估,4K样本用于外部评估 | |
MMTab-instruct | MMTab-instruct_table_images_82K.zip |
82K | 用于指令微调的82K表格图像 |
MMTab-instruct_sft_data_llava_format_232K.json |
232K | 195K单轮和37K多轮指令微调样本,采用LLaVA对话格式 | |
enhanced_llava_sft_data_898k.json |
898K | 232K MMTab-instruct样本 + 665K原始LLaVA-1.5指令微调样本,用于微调Table-LLaVA | |
MMTab-pre | MMTab-instruct_table_images_82K.zip |
82K | 这部分表格图像也用于预训练,即作为MMTab-pre_table_images_part_1_82K.zip |
MMTab-pre_table_images_part_2_16K.zip |
16K | 从ToTTo数据集额外收集的16K表格图像,用于预训练 | |
MMTab-pre_pretrain_data_llava_format_150K.json |
150K | 150K表格识别样本,用于预训练,采用LLaVA对话格式 | |
enhanced_llava_pretrain_data_708K.json |
708K | 150K MMTab-pre样本 + 558K原始LLaVA-1.5预训练样本,用于预训练Table-LLaVA |
数据集结构
指令微调和预训练样本遵循LLaVA的对话数据格式,如下所示:
Python {id: ToTTo_train_item_534, # 样本ID image: table_instructV/images/ToTTo_train_table_21297.jpg, # 对应的表格图像文件路径 conversations: [{from: human, # 发言来自人类还是模型 value: "Provide a single-sentence description for the highlighted table cells in a Wikipedia table labeled Chesney Hawkes along with its metadata. <image>"}, # 对话内容 {from: gpt, value: Chesney Hawkes released a single called "Another Fine Mess" in 2005 that reached number 48.}] }
数据集创建
为了支持多模态表格理解的多模态大型语言模型(MLLMs)的开发和评估,我们基于14个公开可用的8个领域的表格数据集构建了MMTab。我们精心设计脚本,将这些数据集中的原始文本表格转换为突出广泛表格结构和样式的表格图像,并将所有特定任务样本转换为具有统一格式的多模态指令微调样本<表格图像,输入请求,输出响应>。
预期用途
主要预期用途: MMTab 主要用于大型多模态模型和聊天机器人的研究。
主要预期用户: MMTab 的主要预期用户是计算机视觉、自然语言处理、机器学习和人工智能领域的研究人员和爱好者。
限制
首先,该数据集主要关注单个英语表格。多表格场景和更广泛的语言覆盖尚未考虑。其次,MMTab 基于精心选择的表格数据集中的真实世界表格,并包含由自动化脚本渲染的多样化高质量表格图像。然而,现实世界中的表格图像可能是低质量的,例如模糊、手写或不完整的表格图像。为了进一步缩小学术研究和实际应用场景之间的差距,未来可以收集更多样化的现实世界表格图像,并构建相应的指令跟随数据。

Canadian Census
**Overview** The data package provides demographics for Canadian population groups according to multiple location categories: Forward Sortation Areas (FSAs), Census Metropolitan Areas (CMAs) and Census Agglomerations (CAs), Federal Electoral Districts (FEDs), Health Regions (HRs) and provinces. **Description** The data are available through the Canadian Census and the National Household Survey (NHS), separated or combined. The main demographic indicators provided for the population groups, stratified not only by location but also for the majority by demographical and socioeconomic characteristics, are population number, females and males, usual residents and private dwellings. The primary use of the data at the Health Region level is for health surveillance and population health research. Federal and provincial departments of health and human resources, social service agencies, and other types of government agencies use the information to monitor, plan, implement and evaluate programs to improve the health of Canadians and the efficiency of health services. Researchers from various fields use the information to conduct research to improve health. Non-profit health organizations and the media use the health region data to raise awareness about health, an issue of concern to all Canadians. The Census population counts for a particular geographic area representing the number of Canadians whose usual place of residence is in that area, regardless of where they happened to be on Census Day. Also included are any Canadians who were staying in that area on Census Day and who had no usual place of residence elsewhere in Canada, as well as those considered to be 'non-permanent residents'. National Household Survey (NHS) provides demographic data for various levels of geography, including provinces and territories, census metropolitan areas/census agglomerations, census divisions, census subdivisions, census tracts, federal electoral districts and health regions. In order to provide a comprehensive overview of an area, this product presents data from both the NHS and the Census. NHS data topics include immigration and ethnocultural diversity; aboriginal peoples; education and labor; mobility and migration; language of work; income and housing. 2011 Census data topics include population and dwelling counts; age and sex; families, households and marital status; structural type of dwelling and collectives; and language. The data are collected for private dwellings occupied by usual residents. A private dwelling is a dwelling in which a person or a group of persons permanently reside. Information for the National Household Survey does not include information for collective dwellings. Collective dwellings are dwellings used for commercial, institutional or communal purposes, such as a hotel, a hospital or a work camp. **Benefits** - Useful for canada public health stakeholders, for public health specialist or specialized public and other interested parties. for health surveillance and population health research. for monitoring, planning, implementation and evaluation of health-related programs. media agencies may use the health regions data to raise awareness about health, an issue of concern to all canadians. giving the addition of longitude and latitude in some of the datasets the data can be useful to transpose the values into geographical representations. the fields descriptions along with the dataset description are useful for the user to quickly understand the data and the dataset. **License Information** The use of John Snow Labs datasets is free for personal and research purposes. For commercial use please subscribe to the [Data Library](https://www.johnsnowlabs.com/marketplace/) on John Snow Labs website. The subscription will allow you to use all John Snow Labs datasets and data packages for commercial purposes. **Included Datasets** - [Canadian Population and Dwelling by FSA 2011](https://www.johnsnowlabs.com/marketplace/canadian-population-and-dwelling-by-fsa-2011) - This Canadian Census dataset covers data on population, total private dwellings and private dwellings occupied by usual residents by forward sortation area (FSA). It is enriched with the percentage of the population or dwellings versus the total amount as well as the geographical area, province, and latitude and longitude. The whole Canada's population is marked as 100, referring to 100% for the percentages. - [Detailed Canadian Population Statistics by CMAs and CAs 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-cmas-and-cas-2011) - This dataset covers the population statistics of Canada by Census Metropolitan Areas (CMAs) and Census Agglomerations (CAs). It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. - [Detailed Canadian Population Statistics by FED 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-fed-2011) - This dataset covers the population statistics of Canada from 2011 by Federal Electoral District of 2013 Representation Order. It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. - [Detailed Canadian Population Statistics by Health Region 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-health-region-2011) - This dataset covers the population statistics of Canada by health region. It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. - [Detailed Canadian Population Statistics by Province 2011](https://www.johnsnowlabs.com/marketplace/detailed-canadian-population-statistics-by-province-2011) - This dataset covers the population statistics of Canada by provinces and territories. It is categorized also by citizen/immigration status, ethnic origin, religion, mobility, education, language, work, housing, income etc. There is detailed characteristics categorization within these stated categories that are in 5 layers. **Data Engineering Overview** **We deliver high-quality data** - Each dataset goes through 3 levels of quality review - 2 Manual reviews are done by domain experts - Then, an automated set of 60+ validations enforces every datum matches metadata & defined constraints - Data is normalized into one unified type system - All dates, unites, codes, currencies look the same - All null values are normalized to the same value - All dataset and field names are SQL and Hive compliant - Data and Metadata - Data is available in both CSV and Apache Parquet format, optimized for high read performance on distributed Hadoop, Spark & MPP clusters - Metadata is provided in the open Frictionless Data standard, and its every field is normalized & validated - Data Updates - Data updates support replace-on-update: outdated foreign keys are deprecated, not deleted **Our data is curated and enriched by domain experts** Each dataset is manually curated by our team of doctors, pharmacists, public health & medical billing experts: - Field names, descriptions, and normalized values are chosen by people who actually understand their meaning - Healthcare & life science experts add categories, search keywords, descriptions and more to each dataset - Both manual and automated data enrichment supported for clinical codes, providers, drugs, and geo-locations - The data is always kept up to date – even when the source requires manual effort to get updates - Support for data subscribers is provided directly by the domain experts who curated the data sets - Every data source’s license is manually verified to allow for royalty-free commercial use and redistribution. **Need Help?** If you have questions about our products, contact us at [info@johnsnowlabs.com](mailto:info@johnsnowlabs.com).
Databricks 收录
poi
本项目收集国内POI兴趣点,当前版本数据来自于openstreetmap。
github 收录
THCHS-30
“THCHS30是由清华大学语音与语言技术中心(CSLT)发布的开放式汉语语音数据库。原始录音是2002年在清华大学国家重点实验室的朱晓燕教授的指导下,由王东完成的。清华大学计算机科学系智能与系统,原名“TCMSD”,意思是“清华连续普通话语音数据库”,时隔13年出版,由王东博士发起,并得到了教授的支持。朱小燕。我们希望为语音识别领域的新研究人员提供一个玩具数据库。因此,该数据库对学术用户完全免费。整个软件包包含建立中文语音识别所需的全套语音和语言资源系统。”
OpenDataLab 收录
YOLO Drone Detection Dataset
为了促进无人机检测模型的开发和评估,我们引入了一个新颖且全面的数据集,专门为训练和测试无人机检测算法而设计。该数据集来源于Kaggle上的公开数据集,包含在各种环境和摄像机视角下捕获的多样化的带注释图像。数据集包括无人机实例以及其他常见对象,以实现强大的检测和分类。
github 收录
中国劳动力动态调查
“中国劳动力动态调查” (China Labor-force Dynamics Survey,简称 CLDS)是“985”三期“中山大学社会科学特色数据库建设”专项内容,CLDS的目的是通过对中国城乡以村/居为追踪范围的家庭、劳动力个体开展每两年一次的动态追踪调查,系统地监测村/居社区的社会结构和家庭、劳动力个体的变化与相互影响,建立劳动力、家庭和社区三个层次上的追踪数据库,从而为进行实证导向的高质量的理论研究和政策研究提供基础数据。
中国学术调查数据资料库 收录