wit

Name: wit
Creator: maas
Published: 2026-01-06 16:29:30
License: No description available

Source: ModelScope community. Updated: 2026-01-06. Indexed: 2025-04-26.

Download link:

https://modelscope.cn/datasets/google/wit


Resource description:

# Dataset Card for WIT

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Dataset Preprocessing](#dataset-preprocessing)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [WIT homepage](https://github.com/google-research-datasets/wit)
- **Repository:** [WIT repository](https://github.com/google-research-datasets/wit)
- **Paper:** [WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning](https://arxiv.org/abs/2103.01913)
- **Leaderboard:** [WIT leaderboard](https://www.kaggle.com/c/wikipedia-image-caption)
- **Point of Contact:** [WIT e-mail](mailto:wit-dataset@google.com)

### Dataset Summary

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

A few unique advantages of WIT:

* The largest multimodal dataset (at the time of this writing) by the number of image-text examples.
* A massively multilingual dataset (the first of its kind), covering 100+ languages.
* A diverse collection of concepts and real-world entities.
* Brings forth challenging real-world test sets.

### Dataset Preprocessing

This dataset doesn't download the images locally by default. Instead, it exposes URLs to the images. To fetch the images, use the following code:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import io
import urllib.request  # a bare `import urllib` does not reliably expose urllib.request

import PIL.Image

from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent


def fetch_single_image(image_url, timeout=None, retries=0):
    # Try the download up to `retries + 1` times; return None if every attempt fails.
    image = None
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": get_datasets_user_agent()},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image


def fetch_images(batch, num_threads, timeout=None, retries=0):
    # Download every URL in the batch concurrently and store the results under "image".
    fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image_url"]))
    return batch


num_threads = 20
dset = load_dataset("wit")
dset = dset.map(fetch_images, batched=True, batch_size=100, fn_kwargs={"num_threads": num_threads})
```

### Supported Tasks and Leaderboards

- `image-captioning`: This dataset can be used to train a model for image captioning where the goal is to predict a caption given the image.
- `text-retrieval`: The goal in this task is to build a model that retrieves the text closest to an image.

In these tasks, any combination of the `caption_reference_description`, `caption_attribution_description` and `caption_alt_text_description` fields can be used as the input text/caption.

### Languages

The dataset contains examples from all Wikipedia languages, with the following stats:

Image-Text   | # Lang | Uniq. Images  | # Lang
------------ | ------ | ------------- | ------
total > 1M   | 9      | images > 1M   | 6
total > 500K | 10     | images > 500K | 12
total > 100K | 36     | images > 100K | 35
total > 50K  | 15     | images > 50K  | 17
total > 14K  | 38     | images > 13K  | 38

## Dataset Structure

### Data Instances

```
{
  'language': 'en',
  'page_url': 'https://en.wikipedia.org/wiki/Oxydactylus',
  'image_url': 'https://upload.wikimedia.org/wikipedia/commons/5/5f/Oxydactylus_longipes_fm.jpg',
  'page_title': 'Oxydactylus',
  'section_title': None,
  'hierarchical_section_title': 'Oxydactylus',
  'caption_reference_description': None,
  'caption_attribution_description': 'English: Mounted skeleton of Oxydactylus longipes in the Field Museum of Natural History.',
  'caption_alt_text_description': None,
  'mime_type': 'image/jpeg',
  'original_height': 3564,
  'original_width': 2748,
  'is_main_image': True,
  'attribution_passes_lang_id': True,
  'page_changed_recently': True,
  'context_page_description': 'Oxydactylus is an extinct genus of camelid endemic to North America. It lived from the Late Oligocene to the Middle Miocene, existing for approximately 14 million years. The name is from the Ancient Greek οξύς and δάκτυλος.\nThey had very long legs and necks, and were probably adapted to eating high vegetation, much like modern giraffes. Unlike modern camelids, they had hooves, rather than tough sole-pads, and splayed toes.',
  'context_section_description': 'Oxydactylus is an extinct genus of camelid endemic to North America. It lived from the Late Oligocene to the Middle Miocene (28.4–13.7 mya), existing for approximately 14 million years. The name is from the Ancient Greek οξύς (oxys, "sharp")and δάκτυλος (daktylos, "finger").\n \nThey had very long legs and necks, and were probably adapted to eating high vegetation, much like modern giraffes. Unlike modern camelids, they had hooves, rather than tough sole-pads, and splayed toes.'
}
```

### Data Fields

- `language`: Language code of the Wikipedia edition the page belongs to
- `page_url`: URL of the Wikipedia page
- `image_url`: URL of the Wikipedia image
- `page_title`: Wikipedia page's title
- `section_title`: Section's title
- `hierarchical_section_title`: Hierarchical section's title
- `caption_reference_description`: This is the caption that is visible on the wiki page directly below the image.
- `caption_attribution_description`: This is the text found on the Wikimedia page of the image. This text is common to all occurrences of that image across all Wikipedias and thus can be in a language different from that of the original page article.
- `caption_alt_text_description`: This is the “alt” text associated with the image. While not visible in general, it is commonly used for accessibility / screen readers.
- `mime_type`: MIME type associated with the image.
- `original_height`: Image height
- `original_width`: Image width
- `is_main_image`: Flag determining if the image is the first image of the page. Usually displayed on the top-right part of the page when using web browsers.
- `attribution_passes_lang_id`: Result of comparing the `language` field with the attribution language (written in the prefix of the attribution description).
- `page_changed_recently`: [More Information Needed]
- `context_page_description`: Page description corresponds to the short description of the page. It provides a concise explanation of the scope of the page.
- `context_section_description`: Text within the image's section.

<img width='75%' src='https://production-media.paperswithcode.com/datasets/Screenshot_2021-03-04_at_14.26.02.png' alt="Half Dome" />

Figure: WIT annotation example.

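Since any combination of the three caption fields listed above can serve as the text for an example, and individual fields are often `None`, a small fallback helper can be convenient. The following is a minimal sketch rather than part of the dataset tooling: the names `pick_caption` and `CAPTION_FIELDS`, and the preference order they encode, are illustrative assumptions.

```python
from typing import Optional

# Caption fields named in this card; the order is an assumed preference,
# not something the dataset prescribes.
CAPTION_FIELDS = (
    "caption_reference_description",
    "caption_attribution_description",
    "caption_alt_text_description",
)


def pick_caption(example: dict, fields=CAPTION_FIELDS) -> Optional[str]:
    """Return the first non-empty caption among `fields`, or None if all are missing."""
    for name in fields:
        text = example.get(name)
        if text and text.strip():
            return text.strip()
    return None


# Subset of the data instance shown in this card.
example = {
    "caption_reference_description": None,
    "caption_attribution_description": (
        "English: Mounted skeleton of Oxydactylus longipes "
        "in the Field Museum of Natural History."
    ),
    "caption_alt_text_description": None,
}
print(pick_caption(example))
```

For this instance only the attribution description is populated, so that is the text the helper falls back to. Passing a different `fields` tuple changes the preference order.
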
Details on the field content can be found directly in the [paper, figure 5 and table 12.](https://arxiv.org/abs/2103.01913)

### Data Splits

All data is held in the `train` split, with a total of 37046386 rows.

## Dataset Creation

### Curation Rationale

From the [repository](https://github.com/google-research-datasets/wit#motivation):

> Multimodal visio-linguistic models rely on a rich dataset to help them learn to model the relationship between images and texts. Having large image-text datasets can significantly improve performance, as shown by recent works. Furthermore the lack of language coverage in existing datasets (which are mostly only in English) also impedes research in the multilingual multimodal space – we consider this a lost opportunity given the potential shown in leveraging images (as a language-agnostic medium) to help improve our multilingual textual understanding.
>
> To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.
>
> The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (publicly available at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).

### Source Data

#### Initial Data Collection and Normalization

From the [paper, section 3.1](https://arxiv.org/abs/2103.01913):

> We started with all Wikipedia content pages (i.e., ignoring other pages that have discussions, comments and such). These number about ∼124M pages across 279 languages.

#### Who are the source language producers?

Text was extracted from Wikipedia.

### Annotations

#### Annotation process

WIT was constructed using an automatic process. However, it was human-validated.

From the [paper, section 3.7](https://arxiv.org/abs/2103.01913):

> To further verify the quality of the WIT dataset we performed a study using (crowd-sourced) human annotators. As seen in Fig. 3, we asked raters to answer 3 questions. Given an image and the page title, raters first evaluate the quality of the attribution description and reference description in the first two questions (order randomized). The third question understands the contextual quality of these text descriptions given the page description and caption. Each response is on a 3-point scale: "Yes" if the text perfectly describes the image, "Maybe" if it is sufficiently explanatory and "No" if it is irrelevant or the image is inappropriate.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

From the [paper, section 3.4](https://arxiv.org/abs/2103.01913):

> Lastly we found that certain image-text pairs occurred very frequently. These were often generic images that did not have much to do with the main article page. Common examples included flags, logos, maps, insignia and such. To prevent biasing the data, we heavily under-sampled all such images

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```bibtex
@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}
```

### Contributions

Thanks to [@thomasw21](https://github.com/thomasw21), [@nateraw](https://github.com/nateraw) and [@hassiahk](https://github.com/hassiahk) for adding this dataset.


Provider:

maas

Created:

2025-04-21
