
google/wit | Multimodal Learning Dataset | Natural Language Processing Dataset

hugging_face · Updated 2022-07-04 · Indexed 2024-03-04
Multimodal Learning
Natural Language Processing
Download link:
https://hf-mirror.com/datasets/google/wit
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- af
- ar
- ast
- azb
- be
- bg
- bn
- br
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gl
- hr
- hu
- hy
- id
- it
- iw
- ja
- ka
- ko
- la
- lt
- lv
- mk
- ml
- ms
- nl
- nn
- 'no'
- pl
- pt
- ro
- ru
- sk
- sl
- sr
- sv
- th
- tr
- uk
- ur
- vi
- vo
- zh
license:
- cc-by-sa-3.0
multilinguality:
- multilingual
paperswithcode_id: wit
pretty_name: Wikipedia-based Image Text
size_categories:
- 10M<n<100M
source_datasets:
- original
- extended|wikipedia
task_categories:
- text-retrieval
- image-to-text
task_ids:
- text-retrieval-other-text-image-retrieval
- image-captioning
---

# Dataset Card for WIT

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Dataset Preprocessing](#dataset-preprocessing)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [WIT homepage](https://github.com/google-research-datasets/wit)
- **Repository:** [WIT repository](https://github.com/google-research-datasets/wit)
- **Paper:** [WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning](https://arxiv.org/abs/2103.01913)
- **Leaderboard:** [WIT leaderboard](https://www.kaggle.com/c/wikipedia-image-caption)
- **Point of Contact:** [WIT e-mail](mailto:wit-dataset@google.com)

### Dataset Summary

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

A few unique advantages of WIT:

* The largest multimodal dataset (at the time of this writing) by the number of image-text examples.
* A massively multilingual dataset (first of its kind) with coverage for 100+ languages.
* A diverse collection of concepts and real-world entities.
* Brings forth challenging real-world test sets.

### Dataset Preprocessing

This dataset doesn't download the images locally by default. Instead, it exposes URLs to the images. To fetch the images, use the following code:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import io
import urllib.request

import PIL.Image

from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent


def fetch_single_image(image_url, timeout=None, retries=0):
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": get_datasets_user_agent()},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image


def fetch_images(batch, num_threads, timeout=None, retries=0):
    fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image_url"]))
    return batch


num_threads = 20
dset = load_dataset("wit")
dset = dset.map(fetch_images, batched=True, batch_size=100, fn_kwargs={"num_threads": num_threads})
```

### Supported Tasks and Leaderboards

- `image-captioning`: This dataset can be used to train a model for image captioning where the goal is to predict a caption given the image.
- `text-retrieval`: The goal in this task is to build a model that retrieves the text closest to an image.

In these tasks, any combination of the `caption_reference_description`, `caption_attribution_description` and `caption_alt_text_description` fields can be used as the input text/caption.

### Languages

The dataset contains examples from all Wikipedia languages, with the following stats:

| Image-Text   | # Lang | Uniq. Images  | # Lang |
| ------------ | ------ | ------------- | ------ |
| total > 1M   | 9      | images > 1M   | 6      |
| total > 500K | 10     | images > 500K | 12     |
| total > 100K | 36     | images > 100K | 35     |
| total > 50K  | 15     | images > 50K  | 17     |
| total > 14K  | 38     | images > 13K  | 38     |

## Dataset Structure

### Data Instances

```
{
  'language': 'en',
  'page_url': 'https://en.wikipedia.org/wiki/Oxydactylus',
  'image_url': 'https://upload.wikimedia.org/wikipedia/commons/5/5f/Oxydactylus_longipes_fm.jpg',
  'page_title': 'Oxydactylus',
  'section_title': None,
  'hierarchical_section_title': 'Oxydactylus',
  'caption_reference_description': None,
  'caption_attribution_description': 'English: Mounted skeleton of Oxydactylus longipes in the Field Museum of Natural History.',
  'caption_alt_text_description': None,
  'mime_type': 'image/jpeg',
  'original_height': 3564,
  'original_width': 2748,
  'is_main_image': True,
  'attribution_passes_lang_id': True,
  'page_changed_recently': True,
  'context_page_description': 'Oxydactylus is an extinct genus of camelid endemic to North America. It lived from the Late Oligocene to the Middle Miocene, existing for approximately 14 million years. The name is from the Ancient Greek οξύς and δάκτυλος.\nThey had very long legs and necks, and were probably adapted to eating high vegetation, much like modern giraffes. Unlike modern camelids, they had hooves, rather than tough sole-pads, and splayed toes.',
  'context_section_description': 'Oxydactylus is an extinct genus of camelid endemic to North America. It lived from the Late Oligocene to the Middle Miocene (28.4–13.7 mya), existing for approximately 14 million years. The name is from the Ancient Greek οξύς (oxys, "sharp")and δάκτυλος (daktylos, "finger").\n \nThey had very long legs and necks, and were probably adapted to eating high vegetation, much like modern giraffes. Unlike modern camelids, they had hooves, rather than tough sole-pads, and splayed toes.'
}
```

### Data Fields

- `language`: Language code depicting wikipedia language of the page
- `page_url`: URL to wikipedia page
- `image_url`: URL to wikipedia image
- `page_title`: Wikipedia page's title
- `section_title`: Section's title
- `hierarchical_section_title`: Hierarchical section's title
- `caption_reference_description`: This is the caption that is visible on the wiki page directly below the image.
- `caption_attribution_description`: This is the text found on the Wikimedia page of the image. This text is common to all occurrences of that image across all Wikipedias and thus can be in a language different from the original page article.
- `caption_alt_text_description`: This is the "alt" text associated with the image. While not visible in general, it is commonly used for accessibility / screen readers.
- `mime_type`: Mime type associated with the image.
- `original_height`: Image height
- `original_width`: Image width
- `is_main_image`: Flag determining if the image is the first image of the page. Usually displayed on the top-right part of the page when using web browsers.
- `attribution_passes_lang_id`: Compares the `language` field with the attribution language (written in the prefix of the attribution description).
- `page_changed_recently`: [More Information Needed]
- `context_page_description`: Page description corresponds to the short description of the page. It provides a concise explanation of the scope of the page.
- `context_section_description`: Text within the image's section.

<p align='center'>
  <img width='75%' src='https://production-media.paperswithcode.com/datasets/Screenshot_2021-03-04_at_14.26.02.png' alt="WIT annotation example" /><br/>
  <b>Figure: WIT annotation example.</b>
</p>

Details on the field content can be found directly in the [paper, figure 5 and table 12](https://arxiv.org/abs/2103.01913).

### Data Splits

All data is held in the `train` split, with a total of 37,046,386 rows.

## Dataset Creation

### Curation Rationale

From the [repository](https://github.com/google-research-datasets/wit#motivation):

> Multimodal visio-linguistic models rely on a rich dataset to help them learn to model the relationship between images and texts. Having large image-text datasets can significantly improve performance, as shown by recent works. Furthermore the lack of language coverage in existing datasets (which are mostly only in English) also impedes research in the multilingual multimodal space – we consider this a lost opportunity given the potential shown in leveraging images (as a language-agnostic medium) to help improve our multilingual textual understanding.
>
> To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.
>
> The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (publicly available at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).

### Source Data

#### Initial Data Collection and Normalization

From the [paper, section 3.1](https://arxiv.org/abs/2103.01913):

> We started with all Wikipedia content pages (i.e., ignoring other pages that have discussions, comments and such). These number about ∼124M pages across 279 languages.

#### Who are the source language producers?

Text was extracted from Wikipedia.

### Annotations

#### Annotation process

WIT was constructed using an automatic process. However, it was human-validated.

From the [paper, section 3.7](https://arxiv.org/abs/2103.01913):

> To further verify the quality of the WIT dataset we performed a study using (crowd-sourced) human annotators. As seen in Fig. 3, we asked raters to answer 3 questions. Given an image and the page title, raters first evaluate the quality of the attribution description and reference description in the first two questions (order randomized). The third question understands the contextual quality of these text descriptions given the page description and caption. Each response is on a 3-point scale: "Yes" if the text perfectly describes the image, "Maybe" if it is sufficiently explanatory and "No" if it is irrelevant or the image is inappropriate.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

From the [paper, section 3.4](https://arxiv.org/abs/2103.01913):

> Lastly we found that certain image-text pairs occurred very frequently. These were often generic images that did not have much to do with the main article page. Common examples included flags, logos, maps, insignia and such. To prevent biasing the data, we heavily under-sampled all such images.

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```bibtex
@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}
```

### Contributions

Thanks to [@thomasw21](https://github.com/thomasw21), [@nateraw](https://github.com/nateraw) and [@hassiahk](https://github.com/hassiahk) for adding this dataset.
Provider:
google
Original Information Summary

Dataset Overview

Dataset Name

  • Name: Wikipedia-based Image Text (WIT)
  • Alias: WIT

Basic Information

  • Type: Multimodal, multilingual dataset
  • Scale: 37.6 million image-text examples covering 11.5 million unique images across 108 Wikipedia languages
  • Languages: Many languages, including but not limited to English, Chinese, and Arabic
  • License: cc-by-sa-3.0

Key Features

  • Scale: The largest multimodal dataset (at the time of writing) by number of image-text examples
  • Multilinguality: Coverage of more than 100 languages, the first dataset of its kind
  • Content diversity: A diverse set of concepts and real-world entities
  • Challenge: Provides challenging real-world test sets

Dataset Structure

  • Data instances: Each instance contains the language, page URL, image URL, and other details
  • Data fields: Include language, page URL, image URL, page title, and more
  • Data splits: All data is stored in the train split, totaling 37,046,386 rows (a sketch for creating a held-out slice follows this list)
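
Since all examples live in that single split, an evaluation slice has to be carved out locally if one is needed. A minimal sketch, assuming the Hugging Face `datasets` loader and the Hub id `google/wit` (the dataset card's own snippet loads it as `"wit"`):

```python
# Minimal sketch (an assumption, not from the dataset card): WIT ships only a
# `train` split, so hold out a small evaluation slice locally.
from datasets import load_dataset

dset = load_dataset("google/wit", split="train")          # 37,046,386 rows of metadata + image URLs
splits = dset.train_test_split(test_size=0.01, seed=42)   # hold out ~1% for evaluation
train_ds, eval_ds = splits["train"], splits["test"]
print(len(train_ds), len(eval_ds))
```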

Dataset Creation

  • Source: Extracted from Wikipedia content pages
  • Annotation process: Automatically generated, then human-validated

Supported Tasks

  • Image captioning: Train a model to predict a caption for a given image
  • Text retrieval: Build a model that retrieves the text closest to an image; any of the three caption fields can serve as the text side (see the caption-selection sketch below)
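
Because many rows leave one or more caption fields empty, a common first step for both tasks is choosing which caption to use per example. A small illustrative helper showing one possible fallback order (an assumption for illustration, not the authors' method):

```python
# Illustrative helper (not from the dataset card): pick one caption string per
# example, falling back across the three caption fields in a fixed order.
def pick_caption(example):
    for field in (
        "caption_reference_description",    # caption shown under the image on the page
        "caption_alt_text_description",     # accessibility "alt" text
        "caption_attribution_description",  # text from the Wikimedia file page
    ):
        text = example.get(field)
        if text:                            # skip None / empty strings
            return text
    return None


example = {
    "caption_reference_description": None,
    "caption_alt_text_description": None,
    "caption_attribution_description": "English: Mounted skeleton of Oxydactylus longipes.",
}
print(pick_caption(example))  # falls back to the attribution description
```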

Usage Notes

  • Data bias: Frequently occurring generic images (flags, logos, maps, and the like) were heavily under-sampled by the curators to avoid biasing the data

Additional Information

  • Citation: Use the BibTeX entry provided in the dataset card
  • Contributors: Thanks to the GitHub users who contributed this dataset
AI-Collected Summary
Dataset Introduction
Construction
WIT is built by extracting the various texts associated with an image from Wikipedia articles and Wikimedia image links. Data is first gathered from Wikipedia content pages and then rigorously filtered to retain only high-quality image-text pairs. The dataset is assembled by an automated pipeline and its quality is verified by humans: to confirm that the text descriptions match the image content, the authors ran a crowd-sourced evaluation in which human raters scored the quality of the descriptions.
Features
WIT has several distinguishing features. It is the largest multimodal dataset at the time of writing, with over 37.6 million image-text examples covering 108 Wikipedia languages. It is also the first massively multilingual dataset of its kind, with more than 12,000 examples per language and over 100K image-text pairs in each of 53 languages. In addition, WIT spans a diverse collection of concepts and real-world entities and provides challenging real-world test sets.
Usage
Using WIT is straightforward. By default the dataset does not download the images; it only provides their URLs. Images can be fetched with Python's `urllib` and opened with `PIL`, and the `datasets` library can load the data and preprocess it (for example, downloading images with its `map` function), as sketched below. Depending on the task, the appropriate data fields and splits can then be selected.
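
A minimal sketch of that workflow, assuming the Hub id `google/wit` and that streaming is supported (the dataset card's own snippet instead loads `"wit"` and maps a threaded fetch over the full split):

```python
# Minimal sketch (assumptions: Hub id "google/wit", streaming supported).
import io
import urllib.request

import PIL.Image
from datasets import load_dataset

# Stream the split so nothing is materialized locally; rows carry image URLs only.
dset = load_dataset("google/wit", split="train", streaming=True)
row = next(iter(dset))

# Download the image bytes for this one row and decode them with PIL.
request = urllib.request.Request(
    row["image_url"],
    headers={"user-agent": "wit-example/0.1"},
)
with urllib.request.urlopen(request, timeout=10) as response:
    image = PIL.Image.open(io.BytesIO(response.read()))

print(row["page_title"], image.size)
```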
Background and Challenges
Background
WIT (Wikipedia-based Image Text) is a large multimodal, multilingual dataset created by a Google Research team and released in 2021. It contains 37.6 million entity-rich image-text examples and 11.5 million unique images drawn from 108 Wikipedia languages. WIT was created to address the limited language coverage of existing datasets and to advance multilingual, multimodal research. It supplies machine learning models with rich visual and textual information for learning the relationship between images and text, supporting tasks such as image captioning and text retrieval.
Current Challenges
Building and using WIT involves several challenges. Because the dataset spans many languages, language diversity has to be handled; its sheer size calls for efficient data processing and storage; culture- or region-specific biases in the data need to be mitigated through cleaning and balancing; and the dataset requires ongoing updates and maintenance to preserve its quality and relevance.
Common Use Cases
Classic Use Cases
In multimodal machine learning, WIT is widely used for model pretraining. Its large number of image-text examples and broad language coverage make it well suited to training and evaluating multimodal models. WIT supports image captioning and text retrieval, with image captioning the most established task: researchers train models on its paired text and images to generate descriptions that match an image's content, which matters for image understanding, automatic report generation, and related applications.
Derived Work
Research based on WIT has produced a range of follow-up work. Some studies use it to examine the cross-lingual transfer ability of multimodal pretrained models, demonstrating its value for multilingual multimodal learning; others use it to study how image-text correspondences are learned, informing the design of more accurate image captioning models.
Recent Research
Latest Research Directions
As the largest multimodal, multilingual dataset at the time of writing, WIT provides a rich resource for cross-lingual and cross-modal machine learning research and has broad applications in image captioning and text retrieval. In captioning, models trained on WIT can generate accurate textual descriptions of images; in retrieval, it helps models understand image content well enough to retrieve matching text. Its multilinguality also makes it valuable for cross-lingual information retrieval and machine translation research.
The above content was collected and summarized by AI.