taigatakano/transit-en-ja-5M
Source: Hugging Face | Updated: 2026-01-07 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/taigatakano/transit-en-ja-5M
Description:
---
license: odc-by
task_categories:
- translation
language:
- en
- ja
size_categories:
- 1M<n<10M
---
# 🛬 Transit-EnJa-5M 🛫
## Overview
**Transit-EnJa-5M** is an English–Japanese translation dataset built from a subset of the **C4** corpus. English source texts are restricted to **64 tokens or fewer** and then translated into Japanese using **multiple large language models (LLMs)**.
This dataset is intended for training and evaluating machine translation systems (and related bilingual modeling tasks) under short-input constraints.
## Dataset Composition
The release includes the following splits:
* **Train:** 5,000,000 pairs
* **Validation / Eval:** 400,000 pairs
* **Test:** 100,000 pairs
A larger version may be released in the future.
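As a quick sanity check on the split sizes listed above, the short sketch below computes the total number of pairs and each split's share. The dictionary keys are illustrative labels only, not confirmed split names.

```python
# Split sizes as stated in this card; the keys are illustrative labels,
# not confirmed split names.
SPLIT_SIZES = {"train": 5_000_000, "validation": 400_000, "test": 100_000}

total = sum(SPLIT_SIZES.values())
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

print(f"total pairs: {total:,}")
for name, share in shares.items():
    print(f"{name:>10}: {share:.1%}")
```

The "5M" in the dataset name corresponds to the training split; counting all three splits, the release contains 5.5 million pairs.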
## Data Creation Pipeline
1. **Source selection:** Samples are drawn from a subset of C4.
2. **Length filtering:** English inputs are filtered to **≤ 64 tokens**.
3. **Translation:** Each English input is translated into Japanese using **multiple LLMs**.
4. **Mechanical completeness check:** We programmatically verify that **all records have a translation** (i.e., no missing Japanese outputs).
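Steps 2 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual code: the card does not say which tokenizer defines the 64-token limit, so whitespace splitting is used here purely as a stand-in, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

MAX_TOKENS = 64  # limit stated in this card


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: whitespace splitting.

    Assumption: the card does not name the real tokenizer.
    """
    return len(text.split())


def filter_by_length(texts: Iterable[str], max_tokens: int = MAX_TOKENS) -> Iterator[str]:
    """Yield non-empty texts whose token count is at most max_tokens."""
    for text in texts:
        if 0 < count_tokens(text) <= max_tokens:
            yield text


def find_missing_translations(records: Iterable[Dict[str, str]]) -> List[int]:
    """Return indices of records whose `ja` field is absent, not a string, or blank."""
    missing = []
    for i, rec in enumerate(records):
        ja = rec.get("ja")
        if not isinstance(ja, str) or not ja.strip():
            missing.append(i)
    return missing


# Toy inputs: one short sentence, one 65-token text, one empty string.
sources = ["Hello world.", " ".join(["word"] * 65), ""]
kept = list(filter_by_length(sources))

records = [
    {"en": "Hello world.", "ja": "こんにちは、世界。"},
    {"en": "Good morning.", "ja": ""},
]
missing = find_missing_translations(records)
```

With a subword tokenizer the counts would differ, so the set of texts passing the filter depends on that choice.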
## Data Format
Each example is a parallel pair:
* `en`: English text (source)
* `ja`: Japanese text (translation)
(Exact file format and field names may vary by hosting platform; please refer to the dataset files for the authoritative schema.)
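One possible way to read the data is through the Hugging Face `datasets` library, sketched below. The repository ID comes from this page; the split name passed to `load_split` is an assumption and should be checked against the dataset files. `load_split` and `as_pair` are hypothetical helper names.

```python
from typing import Dict, Tuple

REPO_ID = "taigatakano/transit-en-ja-5M"


def load_split(split: str = "train", streaming: bool = True):
    """Load one split from the Hugging Face Hub.

    Requires `pip install datasets` and network access. The split name is an
    assumption; check the dataset files for the actual names.
    """
    # Imported lazily so that as_pair() below works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)


def as_pair(example: Dict[str, str]) -> Tuple[str, str]:
    """Return the (en, ja) pair, failing loudly if the schema differs from this card."""
    absent = [key for key in ("en", "ja") if key not in example]
    if absent:
        raise KeyError(f"example lacks expected field(s): {absent}")
    return example["en"], example["ja"]


# Works on any dict shaped like the schema described above.
pair = as_pair({"en": "Thank you.", "ja": "ありがとうございます。"})
```

Streaming avoids downloading all pairs before iterating; pass `streaming=False` to materialize the split locally instead.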
## Intended Use
* Supervised **EN→JA** translation training
* Short-text translation benchmarking
* Data augmentation for bilingual or multilingual models
* Evaluation of robustness under length constraints (≤ 64 tokens)
## Limitations and Notes
* Translations are **LLM-generated** and may contain occasional errors, unnatural phrasing, or hallucinations.
* C4-derived text may include noisy or imperfect web content.
* The “completeness check” confirms translation presence, **not translation quality**.
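Because the completeness check does not assess quality, users may want their own screening before training. The sketch below shows a few cheap heuristics; it is not part of the dataset's pipeline, the function name `flag_pair` is hypothetical, and the ratio thresholds are arbitrary illustrative values that would need tuning.

```python
import re
from typing import List

# Hiragana, katakana, and CJK unified ideographs.
JAPANESE_CHAR = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")


def flag_pair(en: str, ja: str, min_ratio: float = 0.2, max_ratio: float = 3.0) -> List[str]:
    """Return a list of heuristic warning flags for one (en, ja) pair.

    An empty list means no heuristic fired; it does not prove the translation is good.
    """
    flags = []
    if not JAPANESE_CHAR.search(ja):
        flags.append("no_japanese_script")  # output contains no kana or kanji
    if ja.strip() == en.strip():
        flags.append("copied_source")  # the model echoed the input
    ratio = len(ja) / max(len(en), 1)  # character-length ratio, ja over en
    if not (min_ratio <= ratio <= max_ratio):
        flags.append("length_ratio")  # suspiciously short or long output
    return flags


ok_flags = flag_pair("Thank you very much.", "どうもありがとうございます。")
copy_flags = flag_pair("Thank you very much.", "Thank you very much.")
long_flags = flag_pair("Hi.", "こんにちは。" * 5)
```

Flagged pairs are candidates for manual review or removal. Heuristics like these catch gross failures such as untranslated or runaway outputs, not subtle mistranslations.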
## Licensing
This dataset is released under **ODC-By** (Open Data Commons Attribution License).
Provider:
taigatakano



