premio-ai/OpenSubtitles_Translations_Dataset
Source: Hugging Face. Updated: 2024-03-27. Indexed: 2024-06-11.
Download link:
https://hf-mirror.com/datasets/premio-ai/OpenSubtitles_Translations_Dataset
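Summary:
The dataset contains parallel corpora for many language pairs, each pairing Arabic with one other language. Each language-pair configuration holds text in Arabic and in the target language. Only a train split is provided, and the byte size and example count are listed for every pair. Possible uses include machine translation and cross-lingual text analysis.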
Provider:
premio-ai
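Summary of original information
Dataset overview
This dataset contains text data for many language pairs and is intended mainly for training and evaluating cross-lingual models. Every pair consists of Arabic plus one other language. Only a train split is provided, and the data size and number of examples differ from pair to pair.
Dataset details
Language pair configurations
| Config name | Languages | Train size (bytes) | Train examples |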
|---|---|---|---|
| ar-af | Arabic, Afrikaans | 1117179 | 12336 |
| ar-bg | Arabic, Bulgarian | 2691099350 | 23076891 |
| ar-bn | Arabic, Bengali | 42626041 | 331701 |
| ar-br | Arabic, Breton | 1219315 | 14021 |
| ar-bs | Arabic, Bosnian | 813596963 | 8898484 |
| ar-ca | Arabic, Catalan | 33248285 | 349709 |
| ar-cs | Arabic, Czech | 2246574423 | 24068513 |
| ar-da | Arabic, Danish | 1003984086 | 10762663 |
| ar-de | Arabic, German | 1213035477 | 12439023 |
| ar-el | Arabic, Greek | 2740560489 | 22468462 |
| ar-en | Arabic, English | 2705676703 | 29823188 |
| ar-eo | Arabic, Esperanto | 2208348 | 26017 |
| ar-es | Arabic, Spanish | 2519883058 | 26641247 |
| ar-et | Arabic, Estonian | 888760073 | 9692040 |
| ar-eu | Arabic, Basque | 51963583 | 578303 |
| ar-fa | Arabic, Persian | 588980156 | 5493576 |
| ar-fi | Arabic, Finnish | 1636558522 | 17120182 |
| ar-fr | Arabic, French | 1949126480 | 20181740 |
| ar-gl | Arabic, Galician | 7395920 | 77718 |
| ar-he | Arabic, Hebrew | 2230264791 | 20577019 |
| ar-hi | Arabic, Hindi | 9471601 | 70935 |
| ar-hr | Arabic, Croatian | 1828664822 | 20034003 |
| ar-hu | Arabic, Hungarian | 2236530271 | 23767831 |
| ar-hy | Arabic, Armenian | 324056 | 2308 |
| ar-id | Arabic, Indonesian | 644477924 | 6950290 |
| ar-is | Arabic, Icelandic | 104565703 | 1105868 |
| ar-it | Arabic, Italian | 1930855054 | 20022861 |
| ar-ja | Arabic, Japanese | 172469229 | 1834940 |
| ar-ka | Arabic, Georgian | 20989397 | 161654 |
| ar-kk | Arabic, Kazakh | 113361 | 1279 |
| ar-ko | Arabic, Korean | 117564361 | 1249195 |
| ar-lt | Arabic, Lithuanian | 104544056 | 1177564 |
| ar-lv | Arabic, Latvian | 39523172 | 433544 |
| ar-mk | Arabic, Macedonian | 307474129 | 2699946 |
| ar-ml | Arabic, Malayalam | 49428465 | 323386 |
| ar-ms | Arabic, Malay | 139545063 | 1542856 |
| ar-nl | Arabic, Dutch | 2046940618 | 21221483 |
| ar-no | Arabic, Norwegian | 550777001 | 5954781 |
| ar-pl | Arabic, Polish | 2280330806 | 24043342 |
| ar-pt | Arabic, Portuguese(Portugal) | 1920060253 | 20343173 |
| ar-pt_br | Arabic, Portuguese(Brazil) | 2575696542 | 27512239 |
| ar-ro | Arabic, Romanian | 2451509007 | 26173933 |
| ar-ru | Arabic, Russian | 1771210582 | 14885701 |
| ar-si | Arabic, Sinhala | 65211137 | 483959 |
| ar-sk | Arabic, Slovak | 547467714 | 5914026 |
| ar-sl | Arabic, Slovenian | 1311581330 | 14469640 |
| ar-sq | Arabic, Albanian | 142141534 | 1548085 |
| ar-sr | Arabic, Serbian | 1973662770 | 21116415 |
| ar-sv | Arabic, Swedish | 1163815168 | 12276924 |
| ar-ta | Arabic, Tamil | 3612828 | 24676 |
| ar-te | Arabic, Telugu | 2647246 | 19326 |
| ar-th | Arabic, Thai | 406889596 | 2947486 |
| ar-tl | Arabic, Tagalog | 702382 | 7578 |
| ar-tr | Arabic, Turkish | 2501711267 | 26528738 |
| ar-uk | Arabic, Ukrainian | 67739320 | 591338 |
| ar-ur | Arabic, Urdu | 2806064 | 25650 |
| ar-vi | Arabic, Vietnamese | 295250936 | 2875003 |
| ar-zh_cn | Arabic, Chinese(China) | 687718260 | 7813702 |
| ar-zh_tw | Arabic, Chinese(Taiwan) | 318679989 | 3722492 |
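Because the pairs differ in size by several orders of magnitude, it can be useful to compare them before choosing one. The following is a minimal sketch in Python that uses four rows copied from the table above; the names `STATS`, `bytes_per_example`, and `rank_by_examples` are hypothetical helpers introduced only for illustration.

```python
# A few rows copied from the table above:
# config name -> (train size in bytes, number of train examples).
STATS = {
    "ar-kk": (113361, 1279),
    "ar-hy": (324056, 2308),
    "ar-fr": (1949126480, 20181740),
    "ar-en": (2705676703, 29823188),
}


def bytes_per_example(config_name: str) -> float:
    """Average listed bytes per train example for one config (hypothetical helper)."""
    size_bytes, num_examples = STATS[config_name]
    return size_bytes / num_examples


def rank_by_examples(stats=STATS) -> list:
    """Config names sorted from most to fewest train examples (hypothetical helper)."""
    return sorted(stats, key=lambda name: stats[name][1], reverse=True)


if __name__ == "__main__":
    for name in rank_by_examples():
        size_bytes, num_examples = STATS[name]
        print(
            f"{name}: {num_examples} examples, "
            f"{size_bytes / 1e6:.1f} MB, "
            f"{bytes_per_example(name):.1f} bytes/example"
        )
```

Extending `STATS` with the remaining rows gives the same comparison for the full table.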
Data file paths
The train data files for each language pair follow this path pattern:
{config_name}/train-*
For example, the train data files for Arabic and French are at:
ar-fr/train-*
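The path pattern above can be sketched in Python as follows. This is an illustration under stated assumptions, not the provider's own loading code: `train_pattern` and `load_pair` are hypothetical helper names, the shard file name in the demo is invented to show how the glob matches, and `load_pair` assumes the Hugging Face `datasets` library is installed and the repository is reachable under the ID shown on this page.

```python
from fnmatch import fnmatch

# Repository ID as listed at the top of this page.
REPO_ID = "premio-ai/OpenSubtitles_Translations_Dataset"


def train_pattern(config_name: str) -> str:
    """Build the glob for one config's train files, e.g. 'ar-fr/train-*'."""
    return f"{config_name}/train-*"


def load_pair(config_name: str, streaming: bool = True):
    """Load the train split of one language pair (hypothetical helper).

    Assumes `pip install datasets` and network access. Streaming avoids
    downloading a multi-gigabyte config before reading the first example.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)


if __name__ == "__main__":
    pattern = train_pattern("ar-fr")
    print(pattern)
    # Invented shard name, used only to show what the glob would match.
    print(fnmatch("ar-fr/train-00000-of-00010.parquet", pattern))
```

Calling `load_pair("ar-fr")` would then return an iterable over the Arabic-French train examples, provided the assumptions above hold.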



