
google/wit | Multimodal Learning Dataset | Natural Language Processing Dataset

hugging_face · Updated 2022-07-04 · Indexed 2024-03-04
Multimodal Learning
Natural Language Processing
Download link:
https://hf-mirror.com/datasets/google/wit
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language: [af, ar, ast, azb, be, bg, bn, br, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gl, hr, hu, hy, id, it, iw, ja, ka, ko, la, lt, lv, mk, ml, ms, nl, nn, 'no', pl, pt, ro, ru, sk, sl, sr, sv, th, tr, uk, ur, vi, vo, zh]
license:
- cc-by-sa-3.0
multilinguality:
- multilingual
paperswithcode_id: wit
pretty_name: Wikipedia-based Image Text
size_categories:
- 10M<n<100M
source_datasets:
- original
- extended|wikipedia
task_categories:
- text-retrieval
- image-to-text
task_ids:
- text-retrieval-other-text-image-retrieval
- image-captioning
---

# Dataset Card for WIT

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Dataset Preprocessing](#dataset-preprocessing)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [WIT homepage](https://github.com/google-research-datasets/wit)
- **Repository:** [WIT repository](https://github.com/google-research-datasets/wit)
- **Paper:** [WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning](https://arxiv.org/abs/2103.01913)
- **Leaderboard:** [WIT leaderboard](https://www.kaggle.com/c/wikipedia-image-caption)
- **Point of Contact:** [WIT e-mail](mailto:wit-dataset@google.com)

### Dataset Summary

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

A few unique advantages of WIT:

* The largest multimodal dataset (at the time of this writing) by the number of image-text examples.
* A massively multilingual dataset (first of its kind) with coverage for over 100 languages.
* A diverse collection of concepts and real-world entities.
* Challenging real-world test sets.

### Dataset Preprocessing

This dataset doesn't download the images locally by default. Instead, it exposes URLs to the images. To fetch the images, use the following code:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import io
import urllib.request

import PIL.Image

from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent


def fetch_single_image(image_url, timeout=None, retries=0):
    # Try to download and decode a single image, retrying on failure.
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": get_datasets_user_agent()},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image


def fetch_images(batch, num_threads, timeout=None, retries=0):
    # Download all images of a batch in parallel and add them as a new column.
    fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image_url"]))
    return batch


num_threads = 20
dset = load_dataset("wit")
dset = dset.map(fetch_images, batched=True, batch_size=100, fn_kwargs={"num_threads": num_threads})
```

### Supported Tasks and Leaderboards

- `image-captioning`: This dataset can be used to train a model for image captioning, where the goal is to predict a caption given the image.
- `text-retrieval`: The goal in this task is to build a model that retrieves the text closest to an image.

In these tasks, any combination of the `caption_reference_description`, `caption_attribution_description` and `caption_alt_text_description` fields can be used as the input text/caption.

### Languages

The dataset contains examples from all Wikipedia languages, with the following stats:

Image-Text   | # Lang | Uniq. Images  | # Lang
------------ | ------ | ------------- | ------
total > 1M   | 9      | images > 1M   | 6
total > 500K | 10     | images > 500K | 12
total > 100K | 36     | images > 100K | 35
total > 50K  | 15     | images > 50K  | 17
total > 14K  | 38     | images > 13K  | 38

## Dataset Structure

### Data Instances

```
{
  'language': 'en',
  'page_url': 'https://en.wikipedia.org/wiki/Oxydactylus',
  'image_url': 'https://upload.wikimedia.org/wikipedia/commons/5/5f/Oxydactylus_longipes_fm.jpg',
  'page_title': 'Oxydactylus',
  'section_title': None,
  'hierarchical_section_title': 'Oxydactylus',
  'caption_reference_description': None,
  'caption_attribution_description': 'English: Mounted skeleton of Oxydactylus longipes in the Field Museum of Natural History.',
  'caption_alt_text_description': None,
  'mime_type': 'image/jpeg',
  'original_height': 3564,
  'original_width': 2748,
  'is_main_image': True,
  'attribution_passes_lang_id': True,
  'page_changed_recently': True,
  'context_page_description': 'Oxydactylus is an extinct genus of camelid endemic to North America. It lived from the Late Oligocene to the Middle Miocene, existing for approximately 14 million years. The name is from the Ancient Greek οξύς and δάκτυλος.\nThey had very long legs and necks, and were probably adapted to eating high vegetation, much like modern giraffes. Unlike modern camelids, they had hooves, rather than tough sole-pads, and splayed toes.',
  'context_section_description': 'Oxydactylus is an extinct genus of camelid endemic to North America. It lived from the Late Oligocene to the Middle Miocene (28.4–13.7 mya), existing for approximately 14 million years. The name is from the Ancient Greek οξύς (oxys, "sharp") and δάκτυλος (daktylos, "finger").\n \nThey had very long legs and necks, and were probably adapted to eating high vegetation, much like modern giraffes. Unlike modern camelids, they had hooves, rather than tough sole-pads, and splayed toes.'
}
```

### Data Fields

- `language`: Language code depicting the Wikipedia language of the page
- `page_url`: URL of the Wikipedia page
- `image_url`: URL of the Wikipedia image
- `page_title`: Wikipedia page's title
- `section_title`: Section's title
- `hierarchical_section_title`: Hierarchical section's title
- `caption_reference_description`: This is the caption that is visible on the wiki page directly below the image.
- `caption_attribution_description`: This is the text found on the Wikimedia page of the image. This text is common to all occurrences of that image across all Wikipedias and thus can be in a language different from that of the original page article.
- `caption_alt_text_description`: This is the "alt" text associated with the image. While not visible in general, it is commonly used for accessibility / screen readers.
- `mime_type`: MIME type associated with the image.
- `original_height`: Image height
- `original_width`: Image width
- `is_main_image`: Flag indicating whether the image is the first image of the page. Usually displayed in the top-right part of the page when using web browsers.
- `attribution_passes_lang_id`: Whether the `language` field matches the attribution language (written in the prefix of the attribution description).
- `page_changed_recently`: [More Information Needed]
- `context_page_description`: Short description of the page. It provides a concise explanation of the scope of the page.
- `context_section_description`: Text within the image's section.

<p align='center'>
  <img width='75%' src='https://production-media.paperswithcode.com/datasets/Screenshot_2021-03-04_at_14.26.02.png' alt="WIT annotation example" />
  <br/>
  <b>Figure: WIT annotation example.</b>
</p>

Details on the field content can be found directly in the [paper, figure 5 and table 12](https://arxiv.org/abs/2103.01913).

### Data Splits

All data is held in the `train` split, with a total of 37,046,386 rows.

## Dataset Creation

### Curation Rationale

From the [repository](https://github.com/google-research-datasets/wit#motivation):

> Multimodal visio-linguistic models rely on a rich dataset to help them learn to model the relationship between images and texts. Having large image-text datasets can significantly improve performance, as shown by recent works. Furthermore the lack of language coverage in existing datasets (which are mostly only in English) also impedes research in the multilingual multimodal space – we consider this a lost opportunity given the potential shown in leveraging images (as a language-agnostic medium) to help improve our multilingual textual understanding.
>
> To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.
>
> The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (publicly available at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).

### Source Data

#### Initial Data Collection and Normalization

From the [paper, section 3.1](https://arxiv.org/abs/2103.01913):

> We started with all Wikipedia content pages (i.e., ignoring other pages that have discussions, comments and such). These number about ∼124M pages across 279 languages.

#### Who are the source language producers?

Text was extracted from Wikipedia.

### Annotations

#### Annotation process

WIT was constructed using an automatic process. However, it was human-validated.

From the [paper, section 3.7](https://arxiv.org/abs/2103.01913):

> To further verify the quality of the WIT dataset we performed a study using (crowd-sourced) human annotators. As seen in Fig. 3, we asked raters to answer 3 questions. Given an image and the page title, raters first evaluate the quality of the attribution description and reference description in the first two questions (order randomized). The third question understands the contextual quality of these text descriptions given the page description and caption. Each response is on a 3-point scale: "Yes" if the text perfectly describes the image, "Maybe" if it is sufficiently explanatory and "No" if it is irrelevant or the image is inappropriate.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

From the [paper, section 3.4](https://arxiv.org/abs/2103.01913):

> Lastly we found that certain image-text pairs occurred very frequently. These were often generic images that did not have much to do with the main article page. Common examples included flags, logos, maps, insignia and such. To prevent biasing the data, we heavily under-sampled all such images.

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```bibtex
@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}
```

### Contributions

Thanks to [@thomasw21](https://github.com/thomasw21), [@nateraw](https://github.com/nateraw) and [@hassiahk](https://github.com/hassiahk) for adding this dataset.
Provider:
google
Original Information Summary

Dataset Overview

Dataset Name

  • Name: Wikipedia-based Image Text (WIT)
  • Alias: WIT

Basic Information

  • Type: multimodal multilingual dataset
  • Scale: 37.6 million image-text examples covering 11.5 million unique images across 108 Wikipedia languages
  • Languages: multilingual, including but not limited to English, Chinese, and Arabic
  • License: cc-by-sa-3.0

Key Characteristics

  • Scale: the largest multimodal dataset at the time of release
  • Multilinguality: covers more than 100 languages, the first dataset of its kind at this scale
  • Content diversity: spans a wide range of concepts and real-world entities
  • Challenge: provides challenging real-world test sets

Dataset Structure

  • Data instances: each instance contains the language, page URL, image URL and other details
  • Data fields: include language, page URL, image URL, page title, and more
  • Data splits: all data is held in the train split, 37,046,386 rows in total (a quick inspection sketch follows this list)
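The structure above can be checked directly with the Hugging Face `datasets` library. The following is a minimal sketch, assuming the hub ID `google/wit` from the download link on this page; only the metadata table is downloaded, since images are exposed as URLs.

```python
# Minimal sketch: load the metadata and inspect the structure described above.
# The hub ID "google/wit" is assumed from the download link on this page.
from datasets import load_dataset

dset = load_dataset("google/wit")

# A single "train" split with ~37M rows.
print(dset)
print(dset["train"].num_rows)

# Peek at one example and its caption-related fields.
example = dset["train"][0]
for field in (
    "language",
    "page_title",
    "image_url",
    "caption_reference_description",
    "caption_attribution_description",
    "caption_alt_text_description",
):
    print(field, "->", example[field])
```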

Dataset Creation

  • Source: extracted from Wikipedia content pages
  • Annotation process: generated automatically, followed by human validation

Supported Tasks

  • Image captioning: train a model to predict a caption for a given image
  • Text retrieval: build a model that retrieves the text closest to an image (see the sketch after this list)
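For both tasks, the dataset card notes that any combination of the `caption_reference_description`, `caption_attribution_description` and `caption_alt_text_description` fields can serve as the text side of an image-text pair. The sketch below shows one way to assemble such a caption; the `build_caption` helper and the sample record are illustrative, not part of the dataset tooling.

```python
# Minimal sketch: build a text target for captioning / text retrieval by
# combining the three caption fields, any of which may be None.
def build_caption(example):
    parts = [
        example.get("caption_reference_description"),
        example.get("caption_attribution_description"),
        example.get("caption_alt_text_description"),
    ]
    # Keep only the non-empty fields and join them into a single caption.
    return " ".join(part for part in parts if part)


# Hypothetical record using the field names from the dataset card.
record = {
    "caption_reference_description": None,
    "caption_attribution_description": "English: Mounted skeleton of Oxydactylus longipes.",
    "caption_alt_text_description": None,
}
print(build_caption(record))  # -> "English: Mounted skeleton of Oxydactylus longipes."
```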

Usage Considerations

  • Data bias: frequently occurring generic images (flags, logos, maps and the like) were heavily under-sampled to avoid biasing the data

Additional Information

  • Citation: please use the BibTeX entry provided in the dataset card
  • Contributors: thanks to the GitHub users who contributed this dataset
AI-Generated Summary
Dataset Introduction
Construction
The WIT dataset was built by extracting the various texts associated with an image from Wikipedia articles and Wikimedia image links. Data was first gathered from Wikipedia content pages and then rigorously filtered to keep only high-quality image-text pairs. The dataset was assembled by an automatic process and its quality was verified by humans: crowd-sourced annotators rated how well each text description matches its image.
Characteristics
WIT has several notable characteristics. First, it was the largest multimodal dataset at the time of release, containing over 37.6 million image-text examples across 108 Wikipedia languages. Second, it is the first massively multilingual dataset of its kind, with more than 12,000 examples per language and over 100,000 image-text pairs in 53 of those languages. In addition, WIT covers a diverse set of concepts and real-world entities and provides challenging real-world test sets.
Usage
Using WIT is straightforward. By default the dataset does not download images; it only exposes their URLs. To fetch the images, you can download and open them with Python's `urllib` and `PIL` libraries. The `datasets` library can be used to load and manipulate the dataset, and its `map` function can run preprocessing steps such as downloading images in batches (a condensed sketch follows). Depending on the task, you can then select the appropriate data fields and splits.
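The sketch below condenses the image-fetching code from the dataset card above into a single-record example: it loads the metadata, takes one row, and opens its image with `urllib` and `PIL`. The hub ID `google/wit` and the user-agent string are assumptions for illustration.

```python
# Minimal sketch: fetch and open a single image from its URL.
import io
import urllib.request

import PIL.Image
from datasets import load_dataset

dset = load_dataset("google/wit", split="train")  # hub ID assumed from this page

url = dset[0]["image_url"]
request = urllib.request.Request(url, headers={"user-agent": "wit-example/0.1"})  # illustrative UA
with urllib.request.urlopen(request, timeout=10) as response:
    image = PIL.Image.open(io.BytesIO(response.read()))

print(image.size, image.mode)
```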
Background and Challenges
Background
WIT (Wikipedia-based Image Text) is a large multimodal multilingual dataset created by a Google research team and released in 2021. It contains 37.6 million entity-rich image-text examples and 11.5 million unique images drawn from 108 Wikipedia languages. WIT was created to address the limited language coverage of existing datasets and to advance research on multilingual multimodal learning. It gives machine learning models rich visual and textual signals for learning the relationship between images and text, supporting tasks such as image captioning and text retrieval.
Current Challenges
Building and using WIT involves several challenges. Because the dataset spans many languages, language diversity has to be handled carefully. Its sheer size also calls for efficient data processing and storage (a streaming sketch follows). In addition, the data may carry biases tied to particular cultures or regions, which need to be mitigated through cleaning and balancing. Finally, the dataset requires ongoing updates and maintenance to keep it accurate and relevant.
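One way to keep the processing and storage footprint manageable, mentioned above as a challenge, is the streaming mode of the `datasets` library, which iterates over records without materializing the full ~37M-row split locally. A minimal sketch, again assuming the hub ID `google/wit`:

```python
# Minimal sketch: stream records lazily instead of downloading the full split.
from itertools import islice

from datasets import load_dataset

stream = load_dataset("google/wit", split="train", streaming=True)

# Iterate over the first few records without a full local copy.
for example in islice(stream, 3):
    print(example["language"], example["page_title"])
```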
Common Use Cases
Typical Use Cases
In multimodal machine learning, WIT is widely used for model pretraining. Its large collection of image-text examples and broad multilingual coverage make it well suited for training and evaluating multimodal models. The supported tasks include image captioning and text retrieval, with image captioning being the most established: researchers use WIT's rich text and image examples to train models that generate descriptions matching the image content, which matters for applications such as intelligent image recognition and automatic report generation.
Derived Work
Research based on WIT has led to a number of follow-up works. For example, some studies have used WIT to explore the cross-lingual transfer capabilities of multimodal pretrained models, demonstrating its value for multilingual multimodal learning. Others have used WIT to study how image-text correspondences are learned, informing the design of more accurate image captioning models.
Recent Research
Latest Research Directions
As one of the largest multimodal multilingual datasets available, WIT provides a rich resource for cross-lingual and cross-modal machine learning research, with broad applications in image captioning and text retrieval. For image captioning, researchers can train models on WIT to generate accurate textual descriptions of images; for text retrieval, WIT helps models better understand image content and retrieve the most relevant text. Its multilingual nature also makes it valuable for cross-lingual information retrieval and machine translation research.
The above content was collected and summarized by AI.