cmu-lti/machine-translation-for-vision
Source: Hugging Face. Updated 2026-03-03; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/cmu-lti/machine-translation-for-vision
Resource description:
---
license: mit
task_categories:
- image-to-text
- text-to-image
language:
- en
tags:
- image-transcreation
- cultural-adaptation
- vision-language
- cross-cultural
size_categories:
- n<1K
dataset_info:
  features:
  - name: id
    dtype: string
  - name: category
    dtype: string
  - name: text
    dtype: string
  - name: source_country
    dtype: string
  - name: image_path
    dtype: string
  - name: image
    dtype: image
  - name: target_countries
    dtype: string
  splits:
  - name: concept
    num_bytes: 815586277.0
    num_examples: 595
  - name: application
    num_bytes: 96851257.0
    num_examples: 101
  download_size: 905892844
  dataset_size: 912437534.0
---
# Machine Translation for Vision (MTV)
Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities, such as images, to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we introduce a new task of translating images to make them culturally relevant (**image transcreation**). For example, a math worksheet that teaches children in the US to count using Halloween-themed objects would become a worksheet that teaches the same concept in India using Diwali-themed objects.
More details on the dataset and the task can be found in our [paper](https://arxiv.org/abs/2404.01247). We also won the [Best Paper](https://2024.emnlp.org/program/best_papers/) award at EMNLP 2024 for this work!
## Dataset Overview
Our test set contains images paired with concepts that need to be transcreated to different cultural contexts.
| Property | Value |
|----------|-------|
| **Total Images** | 696 |
| **Splits** | `concept` (595), `application` (101) |
| **License** | MIT |
## Splits
- **`concept`**: 595 images focusing on single concepts that are cross-culturally coherent.
- **`application`**: 101 images sourced from real-world use cases (e.g., educational materials, storybooks). Used for evaluating practical applicability.
## Dataset Fields
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique identifier for each image |
| `image` | PIL.Image | Image data as a PIL Image object |
| `category` | string | Category classification (see below) |
| `text` | string | Descriptive text for the image (see below) |
| `image_path` | string | URL path to the original image file |
| `source_country` | string | Country of origin for the image |
| `target_countries` | string | Comma-separated list of target countries for adaptation (see below) |
### Field Details by Split
**Concept Split:**
- `category`: The semantic category of the concept (e.g., food, beverages, housing, clothing, festivals, etc.)
- `text`: The name of the object depicted in the image (e.g., "pancakes", "yerba mate", "kimono")
- `target_countries`: All 7 countries except the source country (6 targets per image)
**Application Split:**
- `category`: Either "education" or "stories"
- `text`: For education images, the learning concept being taught (e.g., "counting", "addition"). For story images, the accompanying text/caption from the storybook.
- `target_countries`: All 7 countries (brazil, india, japan, nigeria, portugal, turkey, united-states)
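
Because `target_countries` is stored as a single string, it usually needs to be split before use. The sketch below shows one way to parse it and to check a row against the split rules described above. It assumes that the separator is a comma with optional surrounding whitespace and that the concept split spells country names the same way as the list given for the application split; `parse_target_countries`, `expected_targets`, and the sample row are hypothetical and not part of the dataset or its code.

```python
# Country slugs as listed for the application split; assumed (not verified)
# to be spelled the same way in the concept split.
ALL_COUNTRIES = frozenset(
    ["brazil", "india", "japan", "nigeria", "portugal", "turkey", "united-states"]
)


def parse_target_countries(raw: str) -> list[str]:
    """Split a comma-separated country string into a list of clean names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def expected_targets(source_country: str, split: str) -> set[str]:
    """Return the targets implied by the split rules in the dataset card."""
    if split == "concept":
        # Concept images target every country except their own source.
        return set(ALL_COUNTRIES) - {source_country}
    # Application images target all seven countries.
    return set(ALL_COUNTRIES)


# Hypothetical row shaped like a concept-split example (values are made up).
row = {
    "source_country": "japan",
    "target_countries": "brazil, india, nigeria, portugal, turkey, united-states",
}

targets = parse_target_countries(row["target_countries"])
print(targets)
print(set(targets) == expected_targets(row["source_country"], "concept"))
```

If the real strings use a different separator or spelling, only `parse_target_countries` and `ALL_COUNTRIES` would need adjusting.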
## Loading the Dataset
```python
from datasets import load_dataset
# Load the entire dataset
dataset = load_dataset("cmu-lti/machine-translation-for-vision")
# Access specific splits
concept_data = dataset["concept"]
application_data = dataset["application"]
# Or load a single split
concept_only = load_dataset("cmu-lti/machine-translation-for-vision", split="concept")
```
## Example Usage
```python
from datasets import load_dataset
dataset = load_dataset("cmu-lti/machine-translation-for-vision")
# View a sample from the concept split
concept_sample = dataset["concept"][0]
print(f"ID: {concept_sample['id']}")
print(f"Category: {concept_sample['category']}")
print(f"Text: {concept_sample['text']}")
print(f"Source Country: {concept_sample['source_country']}")
print(f"Target Countries: {concept_sample['target_countries']}")
# View a sample from the application split
app_sample = dataset["application"][0]
print(f"ID: {app_sample['id']}")
print(f"Category: {app_sample['category']}")
print(f"Text: {app_sample['text']}")
print(f"Source Country: {app_sample['source_country']}")
print(f"Target Countries: {app_sample['target_countries']}")
# Access the image directly (already a PIL Image)
image = concept_sample['image']
image.show()
```
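
Beyond inspecting single samples, it can help to see how examples are distributed across the string fields printed above. The following is a minimal sketch that tallies rows by any field; `count_by` is a hypothetical helper, and the rows are made-up stand-ins so the snippet runs without downloading the dataset.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by(rows: Iterable[Mapping[str, str]], field: str) -> Counter:
    """Count how many rows share each value of `field`."""
    return Counter(row[field] for row in rows)


# Hypothetical rows standing in for dataset["concept"]; values are made up.
rows = [
    {"category": "food", "source_country": "japan"},
    {"category": "food", "source_country": "india"},
    {"category": "clothing", "source_country": "japan"},
]

print(count_by(rows, "source_country").most_common())
print(count_by(rows, "category").most_common())
```

With the real data, passing `dataset["concept"].select_columns(["category", "source_country"])` as `rows` should avoid decoding every image while iterating.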
## Key Findings
Evaluation of state-of-the-art generative models on this benchmark reveals:
- Current image-editing models perform poorly, with success rates as low as 5% on concept images for certain countries.
- On application images, models fail completely for some regions.
- Incorporating language models and retrieval systems improves outcomes.
## Citation
If you use this dataset, please cite:
```bibtex
@inproceedings{khanuja-etal-2024-image,
title = "An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance",
author = "Khanuja, Simran and
Ramamoorthy, Sathyanarayanan and
Song, Yueqi and
Neubig, Graham",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.573/",
doi = "10.18653/v1/2024.emnlp-main.573",
pages = "10258--10279"
}
```
## Links
- **Paper**: [arXiv:2404.01247](https://arxiv.org/abs/2404.01247)
- **Code**: [GitHub](https://github.com/simran-khanuja/image-transcreation)
## License
This dataset is released under the MIT License.
Provider:
cmu-lti



