wikimedia/wit_base

Name: wikimedia/wit_base
Creator: wikimedia
Published: 2022-11-04 15:09:33
License: not stated in this listing (the dataset card below specifies CC BY-SA 4.0)

Source: Hugging Face. Updated 2022-11-04; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/wikimedia/wit_base

Resource description:

---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- af
- an
- ar
- arz
- ast
- az
- azb
- ba
- bar
- be
- bg
- bn
- br
- bs
- ca
- ce
- ceb
- ckb
- cs
- cv
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gl
- hi
- hr
- hsb
- ht
- hu
- hy
- ia
- id
- io
- is
- it
- iw
- ja
- jv
- ka
- kk
- kn
- ko
- la
- lah
- lb
- lmo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- nan
- nds
- ne
- nl
- nn
- 'no'
- nv
- oc
- pa
- pl
- pt
- qu
- ro
- ru
- sco
- si
- sk
- sl
- sq
- sr
- sv
- sw
- ta
- te
- tg
- th
- tr
- tt
- uk
- ur
- uz
- vec
- vi
- vo
- war
- xmf
- yue
- zh
license:
- cc-by-sa-4.0
multilinguality:
- multilingual
size_categories:
- 1M<n<10M
source_datasets:
- original
- extended|wikipedia
task_categories:
- image-to-text
- text-retrieval
task_ids:
- image-captioning
paperswithcode_id: wit
pretty_name: Wikipedia-based Image Text
language_bcp47:
- af
- an
- ar
- arz
- ast
- az
- azb
- ba
- bar
- be
- be-tarask
- bg
- bn
- br
- bs
- ca
- ce
- ceb
- ckb
- cs
- cv
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gl
- hi
- hr
- hsb
- ht
- hu
- hy
- ia
- id
- io
- is
- it
- iw
- ja
- jv
- ka
- kk
- kn
- ko
- la
- lah
- lb
- lmo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- nan
- nds
- ne
- nl
- nn
- 'no'
- nv
- oc
- pa
- pl
- pt
- qu
- ro
- ru
- sco
- si
- sk
- sl
- sq
- sr
- sr-Latn
- sv
- sw
- ta
- te
- tg
- th
- tr
- tt
- uk
- ur
- uz
- vec
- vi
- vo
- war
- xmf
- yue
- zh
- zh-TW
tags:
- text-image-retrieval
---

# Dataset Card for WIT

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [WIT homepage](https://github.com/google-research-datasets/wit)
- **Paper:** [WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning](https://arxiv.org/abs/2103.01913)
- **Leaderboard:** [WIT leaderboard](https://paperswithcode.com/sota/text-image-retrieval-on-wit) and [WIT Kaggle competition](https://www.kaggle.com/competitions/wikipedia-image-caption/leaderboard)
- **Point of Contact:** [Miriam Redi](mailto:miriam@wikimedia.org)

### Dataset Summary

Wikimedia's version of the Wikipedia-based Image Text (WIT) Dataset, a large multimodal multilingual dataset.

From the [official blog post](https://techblog.wikimedia.org/2021/09/09/the-wikipedia-image-caption-matching-challenge-and-a-huge-release-of-image-data-for-research/):

> The core training data is taken from the Wikipedia Image-Text (WIT) Dataset, a large curated set of more than 37 million image-text associations extracted from Wikipedia articles in 108 languages that was recently released by Google Research.
>
> The WIT dataset offers extremely valuable data about the pieces of text associated with Wikipedia images. However, due to licensing and data volume issues, the Google dataset only provides the image name and corresponding URL for download and not the raw image files.
>
> Getting easy access to the image files is crucial for participants to successfully develop competitive models. Therefore, today, the Wikimedia Research team is releasing its first large image dataset. It contains more than six million image files from Wikipedia articles in 100+ languages, which correspond to almost [1] all captioned images in the WIT dataset. Image files are provided at a 300-px resolution, a size that is suitable for most of the learning frameworks used to classify and analyze images.

> [1] We are publishing all images having a non-null “reference description” in the WIT dataset. For privacy reasons, we are not publishing images where a person is the primary subject, i.e., where a person’s face covers more than 10% of the image surface. To identify faces and their bounding boxes, we use the RetinaFace detector. In addition, to avoid the inclusion of inappropriate images or images that violate copyright constraints, we have removed all images that are candidate for deletion on Commons from the dataset.

**Note**: Compared to [Google's version](https://huggingface.co/datasets/google/wit), which has contents of one Wikipedia page per data sample, this version groups contents of all Wikipedia pages available in different languages for the image in one single data sample to avoid duplication of image bytes.

### Supported Tasks and Leaderboards

- `image-captioning`: This dataset can be used to train a model for image captioning where the goal is to predict a caption given the image.
- `text-retrieval`: The goal in this task is to build a model that retrieves the text (`caption_title_and_reference_description`) closest to an image. The leaderboard for this task can be found [here](https://paperswithcode.com/sota/text-image-retrieval-on-wit). This task also has a competition on [Kaggle](https://www.kaggle.com/c/wikipedia-image-caption).

In these tasks, any combination of the `caption_reference_description`, `caption_attribution_description` and `caption_alt_text_description` fields can be used as the input text/caption.

### Languages

The dataset contains examples from all Wikipedia languages.

## Dataset Structure

### Data Instances

Each instance is an image, its representation in bytes, a pre-computed embedding, and the set of captions attached to the image in Wikipedia.

```
{
  'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=300x225 at 0x7F88F3876358>,
  'image_url': 'https://upload.wikimedia.org/wikipedia/commons/8/8b/Scolopendra_gigantea.jpg',
  'embedding': [1.4784087, 2.8710432, 0.0, 0.51603067, ..., 10.266883, 0.51142216, 0.0, 2.3464653],
  'metadata_url': 'http://commons.wikimedia.org/wiki/File:Scolopendra_gigantea.jpg',
  'original_height': 3000,
  'original_width': 4000,
  'mime_type': 'image/jpeg',
  'caption_attribution_description': 'English: Puerto Rican Giant Centipede, Scolopendra gigantea; Vieques, Puerto Rico Slovenčina: Stonožka obrovská, Scolopendra gigantea; Vieques, Portoriko',
  'wit_features': {
    'language': ['ro', 'vi', 'sk', ..., 'nl', 'th', 'lv'],
    'page_url': [
      'https://ro.wikipedia.org/wiki/Scolopendra_gigantea',
      'https://vi.wikipedia.org/wiki/Scolopendra_gigantea',
      'https://sk.wikipedia.org/wiki/Scolopendra_gigantea',
      ...,
      'https://nl.wikipedia.org/wiki/Scolopendra_gigantea',
      'https://th.wikipedia.org/wiki/%E0%B8%95%E0%B8%B0%E0%B8%82%E0%B8%B2%E0%B8%9A%E0%B8%A2%E0%B8%B1%E0%B8%81%E0%B8%A9%E0%B9%8C%E0%B8%82%E0%B8%B2%E0%B9%80%E0%B8%AB%E0%B8%A5%E0%B8%B7%E0%B8%AD%E0%B8%87%E0%B9%80%E0%B8%9B%E0%B8%A3%E0%B8%B9',
      'https://lv.wikipedia.org/wiki/Skolopendru_dzimta'
    ],
    'attribution_passes_lang_id': [True, True, True, ..., True, True, True],
    'caption_alt_text_description': [None, None, None, ..., 'Scolopendra gigantea', None, 'Milzu skolopendra (Scolopendra gigantea)'],
    'caption_reference_description': [None, None, None, ..., None, None, 'Milzu skolopendra (Scolopendra gigantea)'],
    'caption_title_and_reference_description': [None, 'Scolopendra gigantea [SEP] ', None, ..., 'Scolopendra gigantea [SEP] ', None, 'Skolopendru dzimta [SEP] Milzu skolopendra (Scolopendra gigantea)'],
    'context_page_description': [
      'Scolopendra gigantea este un miriapod din clasa Chilopoda, fiind cel mai mare reprezentant al genului Scolopendra. Adultul poate atinge o lungime de 26 cm, uneori depășind 30 cm. Această specie habitează în regiunile de nord și de vest a Americii de Sud, pe insulele Trinidad, insulele Virgine, Jamaica Hispaniola ș.a. Localnicii denumesc scolopendra chilopodul gigant galben și chilopodul gigant amazonian.',
      'Scolopendra gigantea là đại diện lớn nhất của chi Scolopendra nói riêng và cả lớp rết nói chung, thường đạt độ dài 26 cm và có thể vượt quá 30 cm. Sinh sống ở khu vực phía bắc và tây của Nam Mỹ và các đảo Trinidad, Puerto Rico, Saint Thomas, U.S. Virgin Islands, Jamaica, và Hispaniola.',
      'Scolopendra gigantea, starší slovenský nazov: štípavica veľká, je živočích z rodu Scolopendra, s veľkosťou do 30 cm.',
      ...,
      'Scolopendra gigantea is een tijgerduizendpoot uit Zuid-Amerika. De soort jaagt onder andere op grote geleedpotigen, amfibieën, reptielen en kleine zoogdieren. Het is voor zover bekend de grootste niet uitgestorven duizendpoot ter wereld.',
      'ตะขาบยักษ์ขาเหลืองเปรู หรือ ตะขาบยักษ์อเมซอน เป็นตะขาบชนิดที่มีขนาดใหญ่ที่สุดในสกุล Scolopendra โดยปกติเมื่อโตเต็มที่จะยาว 26 เซนติเมตร แต่บางครั้งก็สามารถโตได้ถึง 30 เซนติเมตร ตะขาบชนิดนี้อาศัยอยู่ทางแถบเหนือและตะวันตกของทวีปอเมริกาใต้ และตามเกาะแก่งของประเทศตรินิแดดและจาไมกา เป็นสัตว์กินเนื้อ โดยกินจิ้งจก, กบ, นก, หนู และแม้แต่ค้างคาวเป็นอาหาร และขึ้นชื่อในเรื่องความดุร้าย',
      'Skolpendru dzimta pieder pie simtkāju kārtas. Ap 400 dzimtas sugas sastopamas visā pasaulē, īpaši subtropu un tropu apgabalos. Mitinās augsnē, nobirušās lapās, plaisās, spraugās.'
    ],
    'context_section_description': [
      None,
      'Scolopendra gigantea (còn được gọi là Rết chân vàng khổng lồ Peru và Rết khổng lồ Amazon) là đại diện lớn nhất của chi Scolopendra nói riêng và cả lớp rết nói chung, thường đạt độ dài 26\xa0cm (10\xa0in) và có thể vượt quá 30\xa0cm (12\xa0in). Sinh sống ở khu vực phía bắc và tây của Nam Mỹ và các đảo Trinidad, Puerto Rico, Saint Thomas, U.S. Virgin Islands, Jamaica, và Hispaniola.',
      None,
      ...,
      'Scolopendra gigantea is een tijgerduizendpoot uit Zuid-Amerika. De soort jaagt onder andere op grote geleedpotigen, amfibieën, reptielen en kleine zoogdieren. Het is voor zover bekend de grootste niet uitgestorven duizendpoot ter wereld.',
      None,
      'Skolpendru dzimta (Scolopendridae) pieder pie simtkāju kārtas. Ap 400 dzimtas sugas sastopamas visā pasaulē, īpaši subtropu un tropu apgabalos. Mitinās augsnē, nobirušās lapās, plaisās, spraugās.'
    ],
    'hierarchical_section_title': ['Scolopendra gigantea', 'Scolopendra gigantea', 'Scolopendra gigantea', ..., 'Scolopendra gigantea', 'ตะขาบยักษ์ขาเหลืองเปรู', 'Skolopendru dzimta'],
    'is_main_image': [True, True, True, ..., True, True, True],
    'page_title': ['Scolopendra gigantea', 'Scolopendra gigantea', 'Scolopendra gigantea', ..., 'Scolopendra gigantea', 'ตะขาบยักษ์ขาเหลืองเปรู', 'Skolopendru dzimta'],
    'section_title': [None, None, None, ..., None, None, None]
  }
}
```

**Note**: The dataset is stored in Parquet for better performance. This dataset was generated from the original files using [this script](wit_base/blob/main/scripts/wit.py). Additionally, 120 examples from the original files have incorrectly formatted one or more of the following fields: `original_height`, `original_width`, `mime_type` and `caption_attribution_description`. The fixed versions of these examples that were used in the generation script can be found [here](wit_base/blob/main/scripts/corrected_examples.py).

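In the instance shown above, `wit_features` holds parallel lists: position `i` in every list describes the same Wikipedia page. The following is a minimal sketch of pulling out the entries for one language. The helper name `captions_for_language` and the hand-made `toy_sample` are illustrative assumptions, not part of the dataset or its tooling.

```python
from typing import Any, Dict, List, Optional


def captions_for_language(sample: Dict[str, Any], lang: str) -> List[Dict[str, Optional[str]]]:
    """Collect the per-page caption fields of `sample` for one language code.

    Assumes `wit_features` is a dict of equally long lists, as in the
    instance printed in the dataset card.
    """
    feats = sample["wit_features"]
    rows = []
    for i, code in enumerate(feats["language"]):
        if code != lang:
            continue
        rows.append({
            "page_url": feats["page_url"][i],
            "page_title": feats["page_title"][i],
            "caption_reference_description": feats["caption_reference_description"][i],
            "caption_alt_text_description": feats["caption_alt_text_description"][i],
        })
    return rows


# Hand-made toy sample that mimics the card's structure (values abridged).
toy_sample = {
    "image_url": "https://upload.wikimedia.org/wikipedia/commons/8/8b/Scolopendra_gigantea.jpg",
    "wit_features": {
        "language": ["ro", "nl", "lv"],
        "page_url": [
            "https://ro.wikipedia.org/wiki/Scolopendra_gigantea",
            "https://nl.wikipedia.org/wiki/Scolopendra_gigantea",
            "https://lv.wikipedia.org/wiki/Skolopendru_dzimta",
        ],
        "page_title": ["Scolopendra gigantea", "Scolopendra gigantea", "Skolopendru dzimta"],
        "caption_reference_description": [None, None, "Milzu skolopendra (Scolopendra gigantea)"],
        "caption_alt_text_description": [None, "Scolopendra gigantea", "Milzu skolopendra (Scolopendra gigantea)"],
    },
}

latvian_rows = captions_for_language(toy_sample, "lv")
print(latvian_rows[0]["page_title"])
```

Many caption entries are `None`, so downstream code should expect missing values for any given page.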
### Data Fields

- `image`: A `PIL.Image.Image` object containing the image resized to a width of 300 px while preserving its aspect ratio. Note that when accessing the image column: `dataset[0]["image"]` the image file is automatically decoded. Decoding of a large number of image files might take a significant amount of time. Thus it is important to first query the sample index before the `"image"` column, *i.e.* `dataset[0]["image"]` should **always** be preferred over `dataset["image"][0]`.
- `image_url`: URL to the Wikipedia image
- `embedding`: Precomputed image embedding. Each image is described with a 2048-dimensional signature extracted from the second-to-last layer of a [ResNet-50](https://arxiv.org/abs/1512.03385) neural network trained with [Imagenet](https://www.image-net.org/) data. These embeddings contain rich information about the image content and layout, in a compact form
- `metadata_url`: URL to the Wikimedia page containing the image and the metadata
- `original_height`: Original image height before resizing
- `original_width`: Original image width before resizing
- `mime_type`: MIME type associated with the image
- `caption_attribution_description`: This is the text found on the Wikimedia page of the image. This text is common to all occurrences of that image across all Wikipedias.
- `wit_features`: Sequence of captions for the image with language, page URL, information about the page, caption text, etc.
  - `language`: Language code of the Wikipedia edition the page belongs to
  - `page_url`: URL to the Wikipedia page
  - `attribution_passes_lang_id`: Whether the `language` field matches the attribution language (written in the prefix of the attribution description).
  - `caption_alt_text_description`: This is the “alt” text associated with the image. While not visible in general, it is commonly used for accessibility / screen readers
  - `caption_reference_description`: This is the caption that is visible on the Wikipedia page directly below the image.
  - `caption_title_and_reference_description`: Concatenation of `page_title` and `caption_reference_description`.
  - `context_page_description`: Corresponds to the short description of the page. It provides a concise explanation of the scope of the page.
  - `context_section_description`: Text within the image's section
  - `hierarchical_section_title`: Hierarchical section's title
  - `is_main_image`: Flag determining if the image is the first image of the page. Usually displayed on the top-right part of the page when using web browsers.
  - `page_changed_recently`: [More Information Needed]
  - `page_title`: Wikipedia page's title
  - `section_title`: Section's title

<img width='75%' src='https://production-media.paperswithcode.com/datasets/Screenshot_2021-03-04_at_14.26.02.png' alt="Half Dome" />

Figure: WIT annotation example.

Details on the field content can be found directly in the [paper, figure 5 and table 12.](https://arxiv.org/abs/2103.01913)

### Data Splits

All data is held in the `train` split, with a total of 6477255 examples.

## Dataset Creation

### Curation Rationale

From the [official blog post](https://techblog.wikimedia.org/2021/09/09/the-wikipedia-image-caption-matching-challenge-and-a-huge-release-of-image-data-for-research/):

> The WIT dataset offers extremely valuable data about the pieces of text associated with Wikipedia images.

> Getting easy access to the image files is crucial for participants to successfully develop competitive models.

> With this large release of visual data, we aim to help the competition participants—as well as researchers and practitioners who are interested in working with Wikipedia images—find and download the large number of image files associated with the challenge, in a compact form.

### Source Data

#### Initial Data Collection and Normalization

From the [paper, section 3.1](https://arxiv.org/abs/2103.01913):

> We started with all Wikipedia content pages (i.e., ignoring other pages that have discussions, comments and such). These number about ~124M pages across 279 languages.

#### Who are the source language producers?

Text was extracted from Wikipedia.

### Annotations

#### Annotation process

WIT was constructed using an automatic process. However, it was human-validated.

From the [paper, section 3.7](https://arxiv.org/abs/2103.01913):

> To further verify the quality of the WIT dataset we performed a study using (crowd-sourced) human annotators. As seen in Fig. 3, we asked raters to answer 3 questions. Given an image and the page title, raters first evaluate the quality of the attribution description and reference description in the first two questions (order randomized). The third question understands the contextual quality of these text descriptions given the page description and caption. Each response is on a 3-point scale: "Yes" if the text perfectly describes the image, "Maybe" if it is sufficiently explanatory and "No" if it is irrelevant or the image is inappropriate.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

From the [official blog post](https://techblog.wikimedia.org/2021/09/09/the-wikipedia-image-caption-matching-challenge-and-a-huge-release-of-image-data-for-research/#FN1):

> For privacy reasons, we are not publishing images where a person is the primary subject, i.e., where a person’s face covers more than 10% of the image surface. To identify faces and their bounding boxes, we use the [RetinaFace](https://arxiv.org/abs/1905.00641) detector. In addition, to avoid the inclusion of inappropriate images or images that violate copyright constraints, we have removed all images that are [candidate for deletion](https://commons.wikimedia.org/wiki/Commons:Deletion_requests) on Commons from the dataset.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

From the [paper, section 3.4](https://arxiv.org/abs/2103.01913):

> Lastly we found that certain image-text pairs occurred very frequently. These were often generic images that did not have much to do with the main article page. Common examples included flags, logos, maps, insignia and such. To prevent biasing the data, we heavily under-sampled all such images

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Miriam Redi, Fabian Kaelin and Tiziano Piccardi.

### Licensing Information

[CC BY-SA 4.0 international license](https://creativecommons.org/licenses/by-sa/4.0/)

### Citation Information

```bibtex
@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}
```

### Contributions

Thanks to [@nateraw](https://github.com/nateraw), [yjernite](https://github.com/yjernite) and [mariosasko](https://github.com/mariosasko) for adding this dataset.

Provider:

wikimedia

Summary of original information

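Dataset Card for WIT

Dataset Description

Dataset Summary

This is Wikimedia's version of the Wikipedia-based Image Text (WIT) dataset, a large multimodal, multilingual dataset. The original WIT dataset released by Google Research contains more than 37 million image-text associations extracted from Wikipedia articles in 108 languages. Wikimedia's version adds the image files themselves: more than six million images from Wikipedia articles in 100+ languages.

Supported Tasks and Leaderboards

image-captioning: The dataset can be used to train an image-captioning model, where the goal is to predict a caption given an image.
text-retrieval: The goal is to build a model that retrieves the text (caption_title_and_reference_description) closest to an image. The leaderboard for this task is at https://paperswithcode.com/sota/text-image-retrieval-on-wit.

Languages

The dataset contains examples from all Wikipedia languages.

Dataset Structure

Data Instances

Each instance contains an image, its representation in bytes, a precomputed embedding, and the set of captions attached to that image across Wikipedia.

Data Fields

image: A PIL.Image.Image object containing the image resized to a width of 300 pixels while preserving its aspect ratio.
image_url: URL of the Wikipedia image.
embedding: Precomputed image embedding; each image is described by a 2048-dimensional signature.
metadata_url: URL of the Wikimedia page containing the image and its metadata.
original_height: Original image height before resizing.
original_width: Original image width before resizing.
mime_type: MIME type associated with the image.
caption_attribution_description: The text found on the Wikimedia page of the image.
wit_features: Sequence of captions for the image, with language, page URL, information about the page, caption text, etc.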
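The relationship between `original_width`, `original_height` and the stored 300-pixel-wide image can be worked out with simple arithmetic. The sketch below assumes the height is scaled by the same factor as the width and rounded to the nearest pixel; the exact rounding used when the dataset was built is not stated in the card, and `resized_size` is an illustrative helper name.

```python
from typing import Tuple


def resized_size(original_width: int, original_height: int, target_width: int = 300) -> Tuple[int, int]:
    """Return (width, height) after scaling to `target_width` with a fixed aspect ratio.

    Assumption: nearest-pixel rounding; the dataset's actual resizing code may differ.
    """
    scale = target_width / original_width
    return target_width, max(1, round(original_height * scale))


# The card's example instance is 4000 x 3000 originally and 300 x 225 as stored.
print(resized_size(4000, 3000))  # (300, 225)
```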

Data Splits

All data is held in the train split, with a total of 6477255 examples.
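Because the single `train` split holds millions of images, it can be convenient to iterate over it lazily instead of downloading everything first. The sketch below assumes the Hugging Face `datasets` package is installed and that the repository id is `wikimedia/wit_base`, as in the download link of this listing. `load_wit_base_stream` and `take` are illustrative helper names.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def load_wit_base_stream():
    """Open the `train` split in streaming mode (needs `datasets` and network access)."""
    # Imported lazily so that `take` below can be used without the package installed.
    from datasets import load_dataset

    return load_dataset("wikimedia/wit_base", split="train", streaming=True)


def take(samples: Iterable[Dict[str, Any]], n: int) -> Iterator[Dict[str, Any]]:
    """Yield at most the first `n` samples of any iterable of samples."""
    return islice(samples, n)


# Example usage (not executed here because it requires network access):
#
#     for sample in take(load_wit_base_stream(), 3):
#         print(sample["image_url"], sample["original_width"], sample["original_height"])
```

Streaming avoids materialising the whole split locally, at the cost of sequential access only.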

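Dataset Creation

Curation Rationale

The WIT dataset offers extremely valuable data about the pieces of text associated with Wikipedia images. Easy access to the image files is crucial for participants to develop competitive models.

Source Data

Initial Data Collection and Normalization

Collection started from all Wikipedia content pages (ignoring other pages such as discussions and comments), about 124 million pages across 279 languages.

Annotations

Annotation process

WIT was constructed using an automatic process, but it was human-validated.

Personal and Sensitive Information

For privacy reasons, images in which a person is the primary subject, i.e. where a person's face covers more than 10% of the image surface, are not published. In addition, all images that are candidates for deletion on Commons have been removed from the dataset.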
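The 10% rule above is an area ratio between a face bounding box and the whole image. Here is a minimal sketch of that check, assuming boxes arrive as `(x_min, y_min, x_max, y_max)` pixel coordinates from some face detector and that a single face exceeding the threshold is enough to exclude the image. The box format, the function names and the per-face interpretation are assumptions; the actual Wikimedia filtering code is not shown in the source.

```python
from typing import Iterable, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def face_area_fraction(box: Box, image_width: int, image_height: int) -> float:
    """Fraction of the image surface covered by one face box (clipped to the image)."""
    x_min, y_min, x_max, y_max = box
    width = max(0.0, min(x_max, image_width) - max(x_min, 0.0))
    height = max(0.0, min(y_max, image_height) - max(y_min, 0.0))
    return (width * height) / (image_width * image_height)


def person_is_primary_subject(boxes: Iterable[Box], image_width: int, image_height: int,
                              threshold: float = 0.10) -> bool:
    """True if any single face covers more than `threshold` of the image surface."""
    return any(face_area_fraction(b, image_width, image_height) > threshold for b in boxes)


# A 100 x 100 face in a 300 x 225 image covers about 14.8% of the surface.
print(person_is_primary_subject([(0, 0, 100, 100)], 300, 225))  # True
# A 50 x 50 face covers about 3.7%.
print(person_is_primary_subject([(0, 0, 50, 50)], 300, 225))    # False
```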

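Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

Certain image-text pairs occurred very frequently; these were often generic images that had little to do with the main article page. To prevent biasing the data, all such images were heavily under-sampled.

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Miriam Redi, Fabian Kaelin and Tiziano Piccardi.

Licensing Information

CC BY-SA 4.0 international license

Citation Information

@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}

Contributions

Thanks to @nateraw, yjernite and mariosasko for adding this dataset.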

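Collected summary

Dataset introduction

Construction

The dataset is built from Wikipedia content. The underlying WIT data comprises more than 37 million image-text associations extracted automatically from Wikipedia pages in 108 languages, and its quality was checked by human raters. To protect privacy, images in which a face covers more than 10% of the image surface were excluded, with the RetinaFace detector used to identify them. Images that are candidates for deletion on Commons, for example because of copyright problems, were also removed.

Features

The dataset is both multimodal and multilingual: it pairs images taken from Wikipedia with their descriptions in many languages. Each sample contains the image, a precomputed embedding of the image, and the titles and descriptions associated with that image across languages. Its scale and diversity make it well suited to tasks such as image captioning and text retrieval.

Usage

The dataset can be used to train image-captioning and text-retrieval models. Each sample exposes the image, its embedding vector and the multilingual descriptions, so image-text pairs for multimodal learning can be extracted directly. The precomputed embeddings can also save the cost of computing image features from scratch.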
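One simple way to use the precomputed embeddings is nearest-neighbour search by cosine similarity. The sketch below uses short toy vectors in place of the real 2048-dimensional embeddings and plain Python instead of a vector library; cosine similarity is a choice made here for illustration, not something the dataset card prescribes, and both function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: Sequence[float], candidates: Sequence[Sequence[float]],
                 top_k: int = 1) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs of the `top_k` candidates closest to `query`."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional stand-ins for the `embedding` field of three samples.
toy_embeddings = [[2.0, 0.0, 4.0], [0.0, 3.0, 0.0], [1.0, 1.0, 1.0]]
best_index, best_score = most_similar([1.0, 0.0, 2.0], toy_embeddings)[0]
print(best_index)
```

For the full dataset, a vectorised or approximate nearest-neighbour library would be the practical choice; this loop only illustrates the computation.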

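Background and challenges

Background overview

WIT (Wikipedia-based Image Text) is a large multimodal, multilingual dataset released by Google Research; the Wikimedia Research team subsequently published this version, which includes the image files. The core training data consists of image-text associations from Wikipedia articles: more than 37 million pairs in 108 languages. WIT was created to provide a rich resource for multimodal machine learning, particularly image captioning and text retrieval. By supplying image files together with their text descriptions, it supports research on multilingual and multimodal learning.

Current challenges

Building the dataset involved several challenges. First, its scale, spanning many languages and images, makes data cleaning and validation complex. Second, for privacy and copyright reasons, images in which a face covers more than 10% of the surface and images that may violate copyright were excluded, which required a face detector (RetinaFace) for filtering. Third, generic images such as flags and logos occur very frequently and could bias the data, so they were under-sampled. Finally, the multilingual nature of the data requires models to handle semantic differences between languages, which places higher demands on cross-lingual ability.

Common scenarios

Classic use cases

Classic use cases for WIT centre on multimodal learning and cross-lingual image description. With its images paired with text in many languages, the dataset supports image captioning and text retrieval. In image captioning, a model learns the association between images and text in order to generate a description for a given image. In text retrieval, a model must find the description most relevant to an image among a large pool of texts, which matters for information retrieval in multilingual settings.

Academic problems addressed

WIT addresses the problem of modelling image-text associations in a multilingual setting. Its large collection of multilingual image-text pairs gives researchers a resource for studying cross-lingual image captioning and retrieval, and for examining how image descriptions differ across languages and cultures.

Derived related work

The release of WIT has prompted follow-up work in multimodal learning and cross-lingual information retrieval. Researchers have built models and algorithms on it to improve image captioning and text retrieval, for example by exploiting its multilingual information to strengthen cross-lingual understanding. It has also fed discussion about how multimodal datasets should be constructed and evaluated.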

The content above was collected and summarized by 遇见数据集.
