
google/wit | Multimodal Learning Dataset | Natural Language Processing Dataset

hugging_face · Updated 2022-07-04 · Indexed 2024-03-04
Multimodal Learning
Natural Language Processing
Download link:
https://hf-mirror.com/datasets/google/wit
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language: [af, ar, ast, azb, be, bg, bn, br, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gl, hr, hu, hy, id, it, iw, ja, ka, ko, la, lt, lv, mk, ml, ms, nl, nn, 'no', pl, pt, ro, ru, sk, sl, sr, sv, th, tr, uk, ur, vi, vo, zh]
license:
- cc-by-sa-3.0
multilinguality:
- multilingual
paperswithcode_id: wit
pretty_name: Wikipedia-based Image Text
size_categories:
- 10M<n<100M
source_datasets:
- original
- extended|wikipedia
task_categories:
- text-retrieval
- image-to-text
task_ids:
- text-retrieval-other-text-image-retrieval
- image-captioning
---

# Dataset Card for WIT

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Dataset Preprocessing](#dataset-preprocessing)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [WIT homepage](https://github.com/google-research-datasets/wit)
- **Repository:** [WIT repository](https://github.com/google-research-datasets/wit)
- **Paper:** [WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning](https://arxiv.org/abs/2103.01913)
- **Leaderboard:** [WIT leaderboard](https://www.kaggle.com/c/wikipedia-image-caption)
- **Point of Contact:** [WIT e-mail](mailto:wit-dataset@google.com)

### Dataset Summary

The Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

A few unique advantages of WIT:

* The largest multimodal dataset (at the time of this writing) by the number of image-text examples.
* The first massively multilingual dataset of its kind, with coverage for over 100 languages.
* A diverse collection of concepts and real-world entities.
* Brings forth challenging real-world test sets.

### Dataset Preprocessing

This dataset doesn't download the images locally by default. Instead, it exposes URLs to the images. To fetch the images, use the following code:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import io
import urllib

import PIL.Image

from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent


def fetch_single_image(image_url, timeout=None, retries=0):
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": get_datasets_user_agent()},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image


def fetch_images(batch, num_threads, timeout=None, retries=0):
    fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image_url"]))
    return batch


num_threads = 20
dset = load_dataset("wit")
dset = dset.map(fetch_images, batched=True, batch_size=100, fn_kwargs={"num_threads": num_threads})
```

### Supported Tasks and Leaderboards

- `image-captioning`: This dataset can be used to train a model for image captioning, where the goal is to predict a caption given the image.
- `text-retrieval`: The goal in this task is to build a model that retrieves the text closest to an image.

In these tasks, any combination of the `caption_reference_description`, `caption_attribution_description` and `caption_alt_text_description` fields can be used as the input text/caption (a sketch of assembling such pairs follows this card).

### Languages

The dataset contains examples from all Wikipedia languages, with the following stats:

Image-Text   | # Lang | Uniq. Images  | # Lang
------------ | ------ | ------------- | ------
total > 1M   | 9      | images > 1M   | 6
total > 500K | 10     | images > 500K | 12
total > 100K | 36     | images > 100K | 35
total > 50K  | 15     | images > 50K  | 17
total > 14K  | 38     | images > 13K  | 38

## Dataset Structure

### Data Instances

```
{
  'language': 'en',
  'page_url': 'https://en.wikipedia.org/wiki/Oxydactylus',
  'image_url': 'https://upload.wikimedia.org/wikipedia/commons/5/5f/Oxydactylus_longipes_fm.jpg',
  'page_title': 'Oxydactylus',
  'section_title': None,
  'hierarchical_section_title': 'Oxydactylus',
  'caption_reference_description': None,
  'caption_attribution_description': 'English: Mounted skeleton of Oxydactylus longipes in the Field Museum of Natural History.',
  'caption_alt_text_description': None,
  'mime_type': 'image/jpeg',
  'original_height': 3564,
  'original_width': 2748,
  'is_main_image': True,
  'attribution_passes_lang_id': True,
  'page_changed_recently': True,
  'context_page_description': 'Oxydactylus is an extinct genus of camelid endemic to North America. It lived from the Late Oligocene to the Middle Miocene, existing for approximately 14 million years. The name is from the Ancient Greek οξύς and δάκτυλος.\nThey had very long legs and necks, and were probably adapted to eating high vegetation, much like modern giraffes. Unlike modern camelids, they had hooves, rather than tough sole-pads, and splayed toes.',
  'context_section_description': 'Oxydactylus is an extinct genus of camelid endemic to North America. It lived from the Late Oligocene to the Middle Miocene (28.4–13.7 mya), existing for approximately 14 million years. The name is from the Ancient Greek οξύς (oxys, "sharp") and δάκτυλος (daktylos, "finger").\n \nThey had very long legs and necks, and were probably adapted to eating high vegetation, much like modern giraffes. Unlike modern camelids, they had hooves, rather than tough sole-pads, and splayed toes.'
}
```

### Data Fields

- `language`: Language code of the Wikipedia page.
- `page_url`: URL of the Wikipedia page.
- `image_url`: URL of the Wikipedia image.
- `page_title`: Wikipedia page's title.
- `section_title`: Section's title.
- `hierarchical_section_title`: Hierarchical section's title.
- `caption_reference_description`: The caption that is visible on the wiki page directly below the image.
- `caption_attribution_description`: The text found on the Wikimedia page of the image. This text is common to all occurrences of that image across all Wikipedias and can therefore be in a language different from that of the original page article.
- `caption_alt_text_description`: The "alt" text associated with the image. While not visible in general, it is commonly used for accessibility / screen readers.
- `mime_type`: MIME type associated with the image.
- `original_height`: Image height.
- `original_width`: Image width.
- `is_main_image`: Flag indicating whether the image is the first image of the page, usually displayed on the top-right part of the page in web browsers.
- `attribution_passes_lang_id`: Whether the `language` field matches the attribution language (written in the prefix of the attribution description).
- `page_changed_recently`: [More Information Needed]
- `context_page_description`: The short description of the page. It provides a concise explanation of the scope of the page.
- `context_section_description`: Text within the image's section.

<p align='center'>
  <img width='75%' src='https://production-media.paperswithcode.com/datasets/Screenshot_2021-03-04_at_14.26.02.png' alt="WIT annotation example" />
  </br>
  <b>Figure: WIT annotation example.</b>
</p>

Details on the field content can be found directly in the [paper, figure 5 and table 12](https://arxiv.org/abs/2103.01913).

### Data Splits

All data is held in the `train` split, with a total of 37,046,386 rows.

## Dataset Creation

### Curation Rationale

From the [repository](https://github.com/google-research-datasets/wit#motivation):

> Multimodal visio-linguistic models rely on a rich dataset to help them learn to model the relationship between images and texts. Having large image-text datasets can significantly improve performance, as shown by recent works. Furthermore the lack of language coverage in existing datasets (which are mostly only in English) also impedes research in the multilingual multimodal space – we consider this a lost opportunity given the potential shown in leveraging images (as a language-agnostic medium) to help improve our multilingual textual understanding.
>
> To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.
>
> The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (publicly available at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).

### Source Data

#### Initial Data Collection and Normalization

From the [paper, section 3.1](https://arxiv.org/abs/2103.01913):

> We started with all Wikipedia content pages (i.e., ignoring other pages that have discussions, comments and such). These number about ∼124M pages across 279 languages.

#### Who are the source language producers?

Text was extracted from Wikipedia.

### Annotations

#### Annotation process

WIT was constructed using an automatic process. However, it was human-validated.

From the [paper, section 3.7](https://arxiv.org/abs/2103.01913):

> To further verify the quality of the WIT dataset we performed a study using (crowd-sourced) human annotators. As seen in Fig. 3, we asked raters to answer 3 questions. Given an image and the page title, raters first evaluate the quality of the attribution description and reference description in the first two questions (order randomized). The third question understands the contextual quality of these text descriptions given the page description and caption. Each response is on a 3-point scale: "Yes" if the text perfectly describes the image, "Maybe" if it is sufficiently explanatory and "No" if it is irrelevant or the image is inappropriate.

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

From the [paper, section 3.4](https://arxiv.org/abs/2103.01913):

> Lastly we found that certain image-text pairs occurred very frequently. These were often generic images that did not have much to do with the main article page. Common examples included flags, logos, maps, insignia and such. To prevent biasing the data, we heavily under-sampled all such images.

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```bibtex
@article{srinivasan2021wit,
  title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning},
  author={Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
  journal={arXiv preprint arXiv:2103.01913},
  year={2021}
}
```

### Contributions

Thanks to [@thomasw21](https://github.com/thomasw21), [@nateraw](https://github.com/nateraw) and [@hassiahk](https://github.com/hassiahk) for adding this dataset.
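
As a complement to the Supported Tasks section above, here is a minimal sketch of turning raw WIT rows into (image URL, caption) pairs by coalescing the three caption fields. The field precedence and the first_caption/build_pairs helpers are illustrative assumptions, not part of the official loader.

```python
CAPTION_FIELDS = (
    "caption_reference_description",
    "caption_attribution_description",
    "caption_alt_text_description",
)

def first_caption(example):
    """Return the first non-empty caption field of a WIT row, or None."""
    for field in CAPTION_FIELDS:
        text = example.get(field)
        if text:
            return text
    return None

def build_pairs(rows):
    """Yield (image_url, caption) pairs, skipping rows without any caption text."""
    for example in rows:
        caption = first_caption(example)
        if caption is not None:
            yield example["image_url"], caption

# Tiny illustrative row mirroring the Data Instances example above.
row = {
    "image_url": "https://upload.wikimedia.org/wikipedia/commons/5/5f/Oxydactylus_longipes_fm.jpg",
    "caption_reference_description": None,
    "caption_attribution_description": "English: Mounted skeleton of Oxydactylus longipes ...",
    "caption_alt_text_description": None,
}
print(list(build_pairs([row])))
```

Which caption field to prefer is a design choice; the reference description is the one visible directly under the image on the article page, so it is often the most natural caption target.
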
Provider:
google
Summary of Original Information

Dataset Overview

Dataset Name

  • Name: Wikipedia-based Image Text (WIT)
  • Also known as: WIT

Basic Information

  • Type: multimodal, multilingual dataset
  • Scale: 37.6 million image-text examples covering 11.5 million unique images across 108 Wikipedia languages
  • Languages: multilingual, including (but not limited to) English, Chinese and Arabic
  • License: cc-by-sa-3.0

Key Features

  • Scale: the largest multimodal dataset by number of image-text examples at the time of writing
  • Multilinguality: the first dataset of its kind, covering more than 100 languages
  • Content diversity: a broad range of concepts and real-world entities
  • Challenge: provides challenging real-world test sets

Dataset Structure

  • Data instances: each instance records the language, page URL, image URL and other details
  • Data fields: language, page URL, image URL, page title, and so on
  • Data splits: all data is held in a single train split of 37,046,386 rows (see the streaming sketch below)
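
Because the whole dataset is one large train split, a practical way to inspect it without downloading everything is the streaming mode of the 🤗 datasets library. The sketch below is a minimal illustration and assumes the google/wit repository id on the Hub; substitute your mirror's id if it differs.

```python
from datasets import load_dataset

# Stream the single `train` split instead of materializing ~37M rows locally.
# The "google/wit" repository id is an assumption; adjust it to your mirror if needed.
wit = load_dataset("google/wit", split="train", streaming=True)

# Peek at a handful of rows to check the schema before committing to a full download.
for example in wit.take(5):
    print(example["language"], example["page_title"], example["image_url"])
```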

Dataset Creation

  • Data source: extracted from Wikipedia content pages
  • Annotation process: automatically generated, followed by human validation

Supported Tasks

  • Image captioning: train a model to predict a caption for a given image
  • Text retrieval: build a model that retrieves the text closest to an image (see the ranking sketch below)
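
To make the text-retrieval task concrete, the sketch below ranks candidate captions for one image by cosine similarity between embedding vectors. The embeddings here are random placeholders standing in for the output of whatever image/text encoder you choose, so only the ranking logic itself is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: one image vector and a pool of candidate caption vectors.
# A 512-dimensional embedding space is assumed purely for illustration.
image_emb = rng.normal(size=512)
caption_embs = rng.normal(size=(1000, 512))

def rank_captions(image_vec, text_vecs):
    """Return candidate indices sorted by cosine similarity to the image vector."""
    image_vec = image_vec / np.linalg.norm(image_vec)
    text_vecs = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    scores = text_vecs @ image_vec
    return np.argsort(-scores)

top5 = rank_captions(image_emb, caption_embs)[:5]
print("Best-matching caption indices:", top5)
```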

Usage Notes

  • Data bias: image-text pairs whose images occur very frequently (generic flags, logos, maps and the like) were heavily under-sampled to avoid biasing the data; a rough illustration of the idea follows
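
The exact under-sampling procedure is not spelled out in the card, so the sketch below is only a rough stand-in for the idea: cap how many examples any single image URL may contribute. The cap value and the single pass over rows are assumptions made for illustration.

```python
from collections import defaultdict

def cap_per_image(rows, max_per_image=5):
    """Yield rows while keeping at most `max_per_image` examples per image_url.

    A rough stand-in for under-sampling very frequent generic images
    (flags, logos, maps, ...); the authors' real procedure may differ.
    """
    seen = defaultdict(int)
    for row in rows:
        url = row["image_url"]
        if seen[url] < max_per_image:
            seen[url] += 1
            yield row

# Usage: filtered = list(cap_per_image(iterable_of_wit_rows, max_per_image=3))
```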

Additional Information

  • Citation: please use the BibTeX entry given in the card above
  • Contributors: thanks to the GitHub users credited above for adding this dataset