google/wit
Hugging Face · Updated 2022-07-04 · Added 2024-03-04
Download link: https://hf-mirror.com/datasets/google/wit
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- af
- ar
- ast
- azb
- be
- bg
- bn
- br
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gl
- hr
- hu
- hy
- id
- it
- iw
- ja
- ka
- ko
- la
- lt
- lv
- mk
- ml
- ms
- nl
- nn
- 'no'
- pl
- pt
- ro
- ru
- sk
- sl
- sr
- sv
- th
- tr
- uk
- ur
- vi
- vo
- zh
license:
- cc-by-sa-3.0
multilinguality:
- multilingual
paperswithcode_id: wit
pretty_name: Wikipedia-based Image Text
size_categories:
- 10M<n<100M
source_datasets:
- original
- extended|wikipedia
task_categories:
- text-retrieval
- image-to-text
task_ids:
- text-retrieval-other-text-image-retrieval
- image-captioning
---
# Dataset Card for WIT
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Dataset Preprocessing](#dataset-preprocessing)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** [WIT homepage](https://github.com/google-research-datasets/wit)
- **Repository:** [WIT repository](https://github.com/google-research-datasets/wit)
- **Paper:** [WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
](https://arxiv.org/abs/2103.01913)
- **Leaderboard:** [WIT leaderboard](https://www.kaggle.com/c/wikipedia-image-caption)
- **Point of Contact:** [WIT e-mail](mailto:wit-dataset@google.com)
### Dataset Summary
Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.
A few unique advantages of WIT:
* The largest multimodal dataset (at the time of writing) by the number of image-text examples.
* The first massively multilingual image-text dataset, covering more than 100 languages.
* A diverse collection of concepts and real-world entities.
* Challenging real-world test sets.
### Dataset Preprocessing
This dataset doesn't download the images locally by default. Instead, it exposes URLs to the images. To fetch the images, use the following code:
```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import io
import urllib.request
import PIL.Image
from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent
def fetch_single_image(image_url, timeout=None, retries=0):
    # Try the download up to `retries + 1` times; return None on failure.
    image = None
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": get_datasets_user_agent()},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image
def fetch_images(batch, num_threads, timeout=None, retries=0):
fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
with ThreadPoolExecutor(max_workers=num_threads) as executor:
batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image_url"]))
return batch
num_threads = 20
dset = load_dataset("google/wit")
dset = dset.map(fetch_images, batched=True, batch_size=100, fn_kwargs={"num_threads": num_threads})
```
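Because `fetch_single_image` returns `None` when a download fails, a cleanup pass is usually needed before training. Below is a minimal sketch of that filtering on a plain column-oriented batch dict; `drop_failed_downloads` is an illustrative helper, not part of the official snippet (with `datasets`, a `filter` call plays the same role):

```python
def drop_failed_downloads(batch):
    """Drop rows whose image download failed (i.e. image is None)."""
    keep = [img is not None for img in batch["image"]]
    return {
        key: [value for value, keep_it in zip(values, keep) if keep_it]
        for key, values in batch.items()
    }

# A toy batch in the same column-oriented layout `map(batched=True)` uses.
batch = {
    "image": ["<pixels-0>", None, "<pixels-2>"],
    "image_url": ["u0", "u1", "u2"],
}
cleaned = drop_failed_downloads(batch)
print(cleaned["image_url"])  # → ['u0', 'u2']
```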
### Supported Tasks and Leaderboards
- `image-captioning`: This dataset can be used to train a model for image captioning where the goal is to predict a caption given the image.
- `text-retrieval`: The goal in this task is to build a model that retrieves the text closest to an image.
In these tasks, any combination of the `caption_reference_description`, `caption_attribution_description` and `caption_alt_text_description` fields can be used as the input text/caption.
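One simple combination strategy is to fall back across the three caption fields in a fixed priority order. The sketch below illustrates this; the priority used (reference, then alt text, then attribution) is an illustrative choice, not something prescribed by the dataset:

```python
def build_caption(example):
    """Return the first non-empty caption field, in a fixed priority order.

    The priority (reference > alt text > attribution) is an illustrative
    choice; any combination of the three fields is valid input text.
    """
    for field in (
        "caption_reference_description",
        "caption_alt_text_description",
        "caption_attribution_description",
    ):
        text = example.get(field)
        if text:  # skip None and empty strings
            return text
    return None

example = {
    "caption_reference_description": None,
    "caption_alt_text_description": None,
    "caption_attribution_description": "English: Mounted skeleton of Oxydactylus longipes.",
}
print(build_caption(example))  # → English: Mounted skeleton of Oxydactylus longipes.
```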
### Languages
The dataset contains examples from all Wikipedia languages, with the following stats:

| Image-text pairs | # Languages | Unique images | # Languages |
| ---------------- | ----------- | ------------- | ----------- |
| total > 1M       | 9           | images > 1M   | 6           |
| total > 500K     | 10          | images > 500K | 12          |
| total > 100K     | 36          | images > 100K | 35          |
| total > 50K      | 15          | images > 50K  | 17          |
| total > 14K      | 38          | images > 13K  | 38          |
## Dataset Structure
### Data Instances
```
{
'language': 'en',
'page_url': 'https://en.wikipedia.org/wiki/Oxydactylus',
'image_url': 'https://upload.wikimedia.org/wikipedia/commons/5/5f/Oxydactylus_longipes_fm.jpg',
'page_title': 'Oxydactylus',
'section_title': None,
'hierarchical_section_title': 'Oxydactylus',
'caption_reference_description': None,
'caption_attribution_description': 'English: Mounted skeleton of Oxydactylus longipes in the Field Museum of Natural History.',
'caption_alt_text_description': None,
'mime_type': 'image/jpeg',
'original_height': 3564,
'original_width': 2748,
'is_main_image': True,
'attribution_passes_lang_id': True,
'page_changed_recently': True,
'context_page_description': 'Oxydactylus is an extinct genus of camelid endemic to North America. It lived from the Late Oligocene to the Middle Miocene, existing for approximately 14 million years. The name is from the Ancient Greek οξύς and δάκτυλος.\nThey had very long legs and necks, and were probably adapted to eating high vegetation, much like modern giraffes. Unlike modern camelids, they had hooves, rather than tough sole-pads, and splayed toes.',
'context_section_description': 'Oxydactylus is an extinct genus of camelid endemic to North America. It lived from the Late Oligocene to the Middle Miocene (28.4–13.7 mya), existing for approximately 14 million years. The name is from the Ancient Greek οξύς (oxys, "sharp")and δάκτυλος (daktylos, "finger").\n \nThey had very long legs and necks, and were probably adapted to eating high vegetation, much like modern giraffes. Unlike modern camelids, they had hooves, rather than tough sole-pads, and splayed toes.'
}
```
### Data Fields
- `language`: Language code depicting wikipedia language of the page
- `page_url`: URL to wikipedia page
- `image_url`: URL to wikipedia image
- `page_title`: Wikipedia page's title
- `section_title`: Section's title
- `hierarchical_section_title`: Hierarchical section's title
- `caption_reference_description`: This is the caption that is visible on the wiki page directly below the image.
- `caption_attribution_description`: This is the text found on the Wikimedia page of the image. This text is common to all occurrences of that image across all Wikipedias and thus can be in a language different to the original page article.
- `caption_alt_text_description`: This is the “alt” text associated with the image. While not visible in general, it is commonly used for accessibility / screen readers.
- `mime_type`: Mime type associated to the image.
- `original_height`: Image height
- `original_width`: Image width
- `is_main_image`: Flag determining if the image is the first image of the page. Usually displayed on the top-right part of the page when using web browsers.
- `attribution_passes_lang_id`: Whether the `language` field matches the language of the attribution description (written as a prefix of that description).
- `page_changed_recently`: [More Information Needed]
- `context_page_description`: Page description corresponds to the short description of the page. It provides a concise explanation of the scope of the page.
- `context_section_description`: Text within the image's section.
<p align='center'>
<img width='75%' src='https://production-media.paperswithcode.com/datasets/Screenshot_2021-03-04_at_14.26.02.png' alt="WIT annotation example" /> <br />
<b>Figure: WIT annotation example. </b>
</p>
Details on the field content can be found directly in the [paper, figure 5 and table 12.](https://arxiv.org/abs/2103.01913)
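As the sample instance shows, `caption_attribution_description` is prefixed with the name of its language (e.g. `English: …`), which is what `attribution_passes_lang_id` is compared against. A hedged sketch of stripping that prefix (`split_attribution` is an illustrative helper that assumes the `Language: text` convention seen in the sample instance):

```python
def split_attribution(text):
    """Split 'English: Mounted skeleton ...' into (language_name, caption).

    Assumes the 'Language: text' prefix convention seen in the sample
    instance; returns (None, text) when no such prefix is present.
    """
    prefix, sep, rest = text.partition(": ")
    # Heuristic: treat a short, single-word prefix as a language name.
    if sep and " " not in prefix:
        return prefix, rest
    return None, text

print(split_attribution("English: Mounted skeleton of Oxydactylus longipes."))
# → ('English', 'Mounted skeleton of Oxydactylus longipes.')
```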
### Data Splits
All data is held in the `train` split, with a total of 37,046,386 rows.
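Since everything ships in a single `train` split, users who need a held-out set must carve one out themselves. One hedged sketch (not an official recipe) is a deterministic hash-based split keyed on `page_url`, which keeps every image-text pair from the same page on the same side:

```python
import hashlib

def assign_split(page_url, eval_fraction=0.05):
    """Deterministically assign a page to 'train' or 'validation'.

    Hashing page_url (rather than sampling per row) keeps all image-text
    pairs from one page in the same split. The 5% eval fraction is an
    illustrative choice.
    """
    digest = hashlib.sha256(page_url.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "validation" if bucket < eval_fraction else "train"

urls = [f"https://en.wikipedia.org/wiki/Page_{i}" for i in range(1000)]
splits = [assign_split(u) for u in urls]
print(splits.count("validation"))  # roughly 50 out of 1000
```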
## Dataset Creation
### Curation Rationale
From the [repository](https://github.com/google-research-datasets/wit#motivation):
> Multimodal visio-linguistic models rely on a rich dataset to help them learn to model the relationship between images and texts. Having large image-text datasets can significantly improve performance, as shown by recent works. Furthermore the lack of language coverage in existing datasets (which are mostly only in English) also impedes research in the multilingual multimodal space – we consider this a lost opportunity given the potential shown in leveraging images (as a language-agnostic medium) to help improve our multilingual textual understanding.
>
> To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.
>
> The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (publicly available at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).
### Source Data
#### Initial Data Collection and Normalization
From the [paper, section 3.1](https://arxiv.org/abs/2103.01913):
> We started with all Wikipedia content pages (i.e., ignoring other pages that have discussions, comments and such). These number about ∼124M pages across 279 languages.
#### Who are the source language producers?
Text was extracted from Wikipedia.
### Annotations
#### Annotation process
WIT was constructed using an automatic process. However it was human-validated.
From the [paper, section 3.7](https://arxiv.org/abs/2103.01913):
> To further verify the quality of the WIT dataset we performed a study using (crowd-sourced) human annotators. As seen in Fig. 3, we asked raters to answer 3 questions. Given an image and the page title, raters first evaluate the quality of the attribution description and reference description in the first two questions (order randomized). The third question understands the contextual quality of these text descriptions given the page description and caption. Each response is on a 3-point scale: "Yes" if the text perfectly describes the image, "Maybe" if it is sufficiently explanatory and "No" if it is irrelevant or the image is inappropriate.
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
From the [paper, section 3.4](https://arxiv.org/abs/2103.01913):
> Lastly we found that certain image-text pairs occurred very frequently. These were often generic images that did not have much to do with the main article page. Common examples included flags, logos, maps, insignia and such. To prevent biasing the data, we heavily under-sampled all such images.
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
```bibtex
@article{srinivasan2021wit,
title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
journal={arXiv preprint arXiv:2103.01913},
year={2021}
}
```
### Contributions
Thanks to [@thomasw21](https://github.com/thomasw21), [@nateraw](https://github.com/nateraw) and [@hassiahk](https://github.com/hassiahk) for adding this dataset.
Provider: google
## Summary of the Original Information

### Dataset Overview

#### Dataset Name
- Name: Wikipedia-based Image Text (WIT)
- Alias: WIT

#### Basic Information
- Type: multimodal, multilingual dataset
- Scale: 37.6 million image-text pairs with 11.5 million unique images, covering 108 Wikipedia languages
- Languages: many languages, including but not limited to English, Chinese, and Arabic
- License: cc-by-sa-3.0

#### Characteristics
- Scale: the largest multimodal dataset at the time of its release
- Multilinguality: covers more than 100 languages, the first dataset of its kind
- Content diversity: includes a wide range of concepts and real-world entities
- Challenge: provides challenging real-world test sets

#### Dataset Structure
- Data instances: each instance includes the language, page URL, image URL, and other details
- Data fields: include the language, page URL, image URL, page title, and more
- Data splits: all data is stored in the `train` split, 37,046,386 rows in total

#### Dataset Creation
- Data source: extracted from Wikipedia content pages
- Annotation process: automatically generated, then human-validated

#### Supported Tasks
- Image captioning: train a model to predict a caption for a given image
- Text retrieval: build a model that retrieves the text closest to a given image

#### Usage Considerations
- Data bias: frequently occurring generic images were heavily under-sampled to avoid biasing the data

#### Additional Information
- Citation: see the BibTeX entry provided above
- Contributors: thanks to several GitHub users for adding this dataset
## Collected Summaries

### Dataset Introduction

#### Construction
The WIT dataset is built by extracting the various texts associated with each image from Wikipedia articles and Wikimedia image links. Data is first gathered from Wikipedia content pages and rigorously filtered to retain only high-quality image-text pairs. The dataset is then assembled through an automated pipeline and its quality is verified by humans: to ensure that text descriptions are highly relevant to the image content, the researchers designed a human-annotator evaluation to assess the quality of the descriptions.

#### Characteristics
WIT has several notable characteristics. First, it is the largest multimodal dataset of its time, containing over 37.6 million image-text examples across 108 Wikipedia languages. Second, it is the first massively multilingual dataset of this scale, with more than 12,000 examples in every language and over 100K image-text pairs in 53 of them. In addition, WIT covers a diverse collection of concepts and real-world entities and provides challenging real-world test sets.

#### Usage
Using WIT is relatively straightforward. By default the dataset does not download images; it exposes image URLs instead. Images can be fetched with Python's `urllib` and `PIL` libraries, and the `datasets` library can be used to load and manipulate the dataset, e.g. applying `map` to preprocess the data (such as downloading the images). Depending on the task, the appropriate data fields and splits can then be selected.
### Background and Challenges

#### Background
WIT (Wikipedia-based Image Text) is a large multimodal, multilingual dataset created by a Google research team and released in 2021. It contains 37.6 million entity-rich image-text examples across 108 Wikipedia languages, with 11.5 million unique images. WIT was created to address the limited language coverage of existing datasets and to advance research on multilingual multimodal learning. It supplies machine learning models with rich visual and textual information, helping them better model the relationship between images and text for tasks such as image captioning and text retrieval.

#### Current Challenges
Building WIT involved several challenges. Because the dataset spans many languages, language diversity must be handled. Its sheer scale demands efficient data processing and storage. The data may also contain culture- or region-specific biases that need to be mitigated through cleaning and balancing. Finally, the dataset requires ongoing updates and maintenance to preserve its quality and relevance.
### Common Use Cases

#### Typical Scenarios
In multimodal machine learning, WIT is widely used for model pretraining: its massive collection of image-text examples and broad language coverage make it well suited for training and evaluating multimodal models. WIT supports image captioning and text retrieval, with image captioning being the classic application. Researchers use WIT's rich text and image examples to train models that automatically generate descriptions matching image content, which matters for applications such as intelligent image recognition and automatic report generation.

#### Derived Work
Research on WIT has spawned notable follow-up work. For example, some researchers have used it to explore cross-lingual transfer in multimodal pretrained models, demonstrating its value for multilingual multimodal learning; others have used it to study how image-text correspondences are learned, informing more accurate image-captioning models.
### Recent Research

#### Current Directions
As one of the largest multimodal multilingual datasets available, WIT provides a rich resource for cross-lingual and cross-modal machine learning research, with broad application prospects in areas such as image captioning and text retrieval. In image captioning, researchers can train models on WIT to generate accurate textual descriptions of images; in text retrieval, it helps models better understand image content and retrieve matching text more accurately. Its multilingual nature also makes it valuable for cross-lingual information retrieval and machine translation research.

The above content was collected and summarized by 遇见数据集 (selectdataset.com).