cmu-lti/machine-translation-for-vision

Name: cmu-lti/machine-translation-for-vision
Creator: cmu-lti
Published: 2026-03-03 20:14:15
License: No description provided

Source: Hugging Face. Updated 2026-03-03. Indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/cmu-lti/machine-translation-for-vision

Resource description:

---
license: mit
task_categories:
- image-to-text
- text-to-image
language:
- en
tags:
- image-transcreation
- cultural-adaptation
- vision-language
- cross-cultural
size_categories:
- n<1K
dataset_info:
  features:
  - name: id
    dtype: string
  - name: category
    dtype: string
  - name: text
    dtype: string
  - name: source_country
    dtype: string
  - name: image_path
    dtype: string
  - name: image
    dtype: image
  - name: target_countries
    dtype: string
  splits:
  - name: concept
    num_bytes: 815586277.0
    num_examples: 595
  - name: application
    num_bytes: 96851257.0
    num_examples: 101
  download_size: 905892844
  dataset_size: 912437534.0
---

# Machine Translation for Vision (MTV)

Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities, such as images, to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we introduce a new task of translating images to make them culturally relevant (**image transcreation**). For example, a math worksheet that teaches children in the US how to count using Halloween-themed objects would become a worksheet that teaches the same concept in India using Diwali-themed objects.

More details on the dataset and the task can be found in our [paper](https://arxiv.org/abs/2404.01247). We also won the [Best Paper](https://2024.emnlp.org/program/best_papers/) award at EMNLP 2024 for this work!

## Dataset Overview

Our test set contains images paired with concepts that need to be transcreated to different cultural contexts.

| Property | Value |
|----------|-------|
| **Total Images** | 696 |
| **Splits** | `concept` (595), `application` (101) |
| **License** | MIT |

## Splits

- **`concept`**: 595 images focusing on single concepts that are cross-culturally coherent.
- **`application`**: 101 images sourced from real-world use cases (e.g., educational materials, storybooks). Used for evaluating practical applicability.

## Dataset Fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique identifier for each image |
| `image` | PIL.Image | Image data as a PIL Image object |
| `category` | string | Category classification (see below) |
| `text` | string | Descriptive text for the image (see below) |
| `image_path` | string | URL path to the original image file |
| `source_country` | string | Country of origin for the image |
| `target_countries` | string | Comma-separated list of target countries for adaptation (see below) |

### Field Details by Split

**Concept Split:**

- `category`: The semantic category of the concept (e.g., food, beverages, housing, clothing, festivals, etc.)
- `text`: The name of the object depicted in the image (e.g., "pancakes", "yerba mate", "kimono")
- `target_countries`: All 7 countries except the source country (6 targets per image)

**Application Split:**

- `category`: Either "education" or "stories"
- `text`: For education images, the learning concept being taught (e.g., "counting", "addition"). For story images, the accompanying text/caption from the storybook.
- `target_countries`: All 7 countries (brazil, india, japan, nigeria, portugal, turkey, united-states)

## Loading the Dataset

```python
from datasets import load_dataset

# Load the entire dataset
dataset = load_dataset("cmu-lti/machine-translation-for-vision")

# Access specific splits
concept_data = dataset["concept"]
application_data = dataset["application"]

# Or load a single split
concept_only = load_dataset("cmu-lti/machine-translation-for-vision", split="concept")
```

## Example Usage

```python
from datasets import load_dataset

dataset = load_dataset("cmu-lti/machine-translation-for-vision")

# View a sample from the concept split
concept_sample = dataset["concept"][0]
print(f"ID: {concept_sample['id']}")
print(f"Category: {concept_sample['category']}")
print(f"Text: {concept_sample['text']}")
print(f"Source Country: {concept_sample['source_country']}")
print(f"Target Countries: {concept_sample['target_countries']}")

# View a sample from the application split
app_sample = dataset["application"][0]
print(f"ID: {app_sample['id']}")
print(f"Category: {app_sample['category']}")
print(f"Text: {app_sample['text']}")
print(f"Source Country: {app_sample['source_country']}")
print(f"Target Countries: {app_sample['target_countries']}")

# Access the image directly (already a PIL Image)
image = concept_sample['image']
image.show()
```

## Key Findings

Evaluation of state-of-the-art generative models on this benchmark reveals:

- Current image-editing models perform poorly, with success rates as low as 5% on concept images for certain countries
- Complete failure on application images for some regions
- Incorporating language models and retrieval systems improves outcomes

## Citation

If you use this dataset, please cite:

```bibtex
@inproceedings{khanuja-etal-2024-image,
    title = "An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance",
    author = "Khanuja, Simran and Ramamoorthy, Sathyanarayanan and Song, Yueqi and Neubig, Graham",
    editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.573/",
    doi = "10.18653/v1/2024.emnlp-main.573",
    pages = "10258--10279"
}
```

## Links

- **Paper**: [arXiv:2404.01247](https://arxiv.org/abs/2404.01247)
- **Code**: [GitHub](https://github.com/simran-khanuja/image-transcreation)

## License

This dataset is released under the MIT License.
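
## Sketch: Splitting `target_countries`

The Dataset Fields section describes `target_countries` as a comma-separated string, with 6 targets per concept image (all 7 countries minus the source). The following is a minimal sketch of how that string might be split and checked. The helper names `parse_target_countries` and `check_concept_targets`, the sample record, and the assumption that `source_country` uses the same lowercase, hyphenated spelling as the country list are illustrative and not confirmed by the dataset card. Whitespace around the commas is stripped because the exact separator format is not documented.

```python
# The 7 countries listed in the dataset card for the application split.
ALL_COUNTRIES = [
    "brazil", "india", "japan", "nigeria",
    "portugal", "turkey", "united-states",
]


def parse_target_countries(raw: str) -> list[str]:
    """Split a comma-separated country string into a list of clean names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def check_concept_targets(record: dict) -> bool:
    """Return True if the targets are all 7 countries except the source.

    Assumes (not confirmed by the card) that `source_country` is spelled
    the same way as the entries in ALL_COUNTRIES.
    """
    targets = parse_target_countries(record["target_countries"])
    expected = [c for c in ALL_COUNTRIES if c != record["source_country"]]
    return sorted(targets) == sorted(expected)


# Hypothetical record shaped like a row of the concept split (no image).
sample = {
    "id": "example-0",
    "source_country": "japan",
    "target_countries": "brazil, india, nigeria, portugal, turkey, united-states",
}

targets = parse_target_countries(sample["target_countries"])
print(targets)
print(check_concept_targets(sample))
```

With a loaded split, the same helper could be applied per row, for example `parse_target_countries(dataset["concept"][0]["target_countries"])`. If the real data spells country names differently, the check would need to be adjusted.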

Provider:

cmu-lti
