
taigatakano/transit-en-ja-5M

Source: Hugging Face · Updated: 2026-01-07 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/taigatakano/transit-en-ja-5M
Description:
---
license: odc-by
task_categories:
- translation
language:
- en
- ja
size_categories:
- 1M<n<10M
---

# 🛬 Transit-EnJa-5M 🛫

## Overview

**Transit-EnJa-5M** is an English–Japanese translation dataset built from a subset of the **C4** corpus. English source texts are filtered to be **64 tokens or fewer**, then translated into Japanese using **multiple large language models (LLMs)**.

This dataset is intended for training and evaluating machine translation systems (and related bilingual modeling tasks) under short-input constraints.

## Dataset Composition

The release includes the following splits:

* **Train:** 5,000,000 pairs
* **Validation / Eval:** 400,000 pairs
* **Test:** 100,000 pairs

A larger version may be released in the future.

## Data Creation Pipeline

1. **Source selection:** Samples are drawn from a subset of C4.
2. **Length filtering:** English inputs are filtered to **≤ 64 tokens**.
3. **Translation:** Each English input is translated into Japanese using **multiple LLMs**.
4. **Mechanical completeness check:** We programmatically verify that **all records have a translation** (i.e., no missing Japanese outputs).

## Data Format

Each example is a parallel pair:

* `en`: English text (source)
* `ja`: Japanese text (translation)

(Exact file format and field names may vary by hosting platform; please refer to the dataset files for the authoritative schema.)

## Intended Use

* Supervised **EN→JA** translation training
* Short-text translation benchmarking
* Data augmentation for bilingual or multilingual models
* Evaluation of robustness under length constraints (≤ 64 tokens)

## Limitations and Notes

* Translations are **LLM-generated** and may contain occasional errors, unnatural phrasing, or hallucinations.
* C4-derived text may include noisy or imperfect web content.
* The “completeness check” confirms translation presence, **not translation quality**.

## Licensing

This dataset is released under **ODC-By** (Open Data Commons Attribution License).
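
## Sketch: Length Filter and Completeness Check

Steps 2 and 4 of the Data Creation Pipeline (length filtering and the mechanical completeness check) can be illustrated with a minimal sketch. This is not the authors' code. The card does not say which tokenizer defines the 64-token limit, so whitespace splitting is used here purely as a stand-in assumption. The function names and the sample records are hypothetical; only the `en` and `ja` field names and the limit of 64 come from the card.

```python
from typing import Iterable

MAX_TOKENS = 64  # limit stated in the dataset card


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokenization stands in for the
    # (unspecified) tokenizer used to build the dataset.
    return len(text.split())


def filter_by_length(texts: Iterable[str], max_tokens: int = MAX_TOKENS) -> list[str]:
    """Keep only texts with at most `max_tokens` tokens."""
    return [t for t in texts if count_tokens(t) <= max_tokens]


def find_missing_translations(records: Iterable[dict]) -> list[int]:
    """Return indices of records whose `ja` field is absent, not a string, or blank.

    Like the check described in the card, this confirms presence only,
    not translation quality.
    """
    missing = []
    for i, rec in enumerate(records):
        ja = rec.get("ja")
        if not isinstance(ja, str) or not ja.strip():
            missing.append(i)
    return missing


# Hypothetical sample inputs, not taken from the dataset.
sources = ["The airport opens at six.", "word " * 65]
kept = filter_by_length(sources)  # the 65-token text is dropped

records = [
    {"en": "The airport opens at six.", "ja": "空港は6時に開きます。"},
    {"en": "Boarding starts soon.", "ja": ""},  # blank translation
]
missing = find_missing_translations(records)  # flags the second record
```

With a subword tokenizer the token counts would differ from whitespace counts, so a text that passes this sketch may still exceed the limit used for the actual release.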
Provider:
taigatakano