aiana94/polynews-parallel
Hugging Face · Updated 2024-06-21 · Indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/aiana94/polynews-parallel
Description:
This dataset is a multilingual parallel news corpus for translation and text retrieval across many languages. It is sourced from mafand, wmt-news, and globalvoices, and falls in the 1K<n<10K size category. Its primary use is machine translation and text retrieval, particularly multilingual processing of news text.
Provider:
aiana94
Summary of original information
Dataset overview
Basic information
- License: cc-by-nc-4.0
- Task categories:
  - translation
  - text retrieval
- Languages:
  - am, ar, ay, bm, bbj, bn, bg, ca, cs, ku, da, de, el, en, et, ee, fil, fi, fr, fon, gu, ha, he, hi, hu, ig, id, it, ja, kk, km, ko, lv, lt, lg, luo, mk, mos, my, nl, ne, or, pa, pcm, fa, pl, pt, mg, ro, ru, es, sr, sq, sw, sv, tet, tn, tr, tw, ur, wo, yo, zh, zu
- Multilinguality:
  - translation
  - multilingual
- Dataset name: PolyNewsParallel
- Dataset size: 1K<n<10K
- Source datasets:
  - mafand
  - wmt-news
  - globalvoices
- Tags:
  - news
  - polynews-parallel
  - mafand
  - globalvoices
  - wmtnews
Configuration details
- config_name: ces_Latn-tur_Latn
  - Data files:
    - split: train
    - path: data/ces_Latn-tur_Latn/train.parquet.gzip
- config_name: mya_Mymr-rus_Cyrl
  - Data files:
    - split: train
    - path: data/mya_Mymr-rus_Cyrl/train.parquet.gzip
- config_name: plt_Latn-nld_Latn
  - Data files:
    - split: train
    - path: data/plt_Latn-nld_Latn/train.parquet.gzip
- config_name: hun_Latn-jpn_Jpan
  - Data files:
    - split: train
    - path: data/hun_Latn-jpn_Jpan/train.parquet.gzip
- config_name: bul_Cyrl-swh_Latn
  - Data files:
    - split: train
    - path: data/bul_Cyrl-swh_Latn/train.parquet.gzip
- config_name: amh_Ethi-deu_Latn
  - Data files:
    - split: train
    - path: data/amh_Ethi-deu_Latn/train.parquet.gzip
- config_name: cat_Latn-ell_Grek
  - Data files:
    - split: train
    - path: data/cat_Latn-ell_Grek/train.parquet.gzip
- config_name: cat_Latn-nld_Latn
  - Data files:
    - split: train
    - path: data/cat_Latn-nld_Latn/train.parquet.gzip
- config_name: deu_Latn-eng_Latn
  - Data files:
    - split: train
    - path: data/deu_Latn-eng_Latn/train.parquet.gzip
- config_name: ben_Beng-tet_Latn
  - Data files:
    - split: train
    - path: data/ben_Beng-tet_Latn/train.parquet.gzip
- config_name: bul_Cyrl-srp_Latn
  - Data files:
    - split: train
    - path: data/bul_Cyrl-srp_Latn/train.parquet.gzip
- config_name: arb_Arab-tur_Latn
  - Data files:
    - split: train
    - path: data/arb_Arab-tur_Latn/train.parquet.gzip
- config_name: bul_Cyrl-ita_Latn
  - Data files:
    - split: train
    - path: data/bul_Cyrl-ita_Latn/train.parquet.gzip
- config_name: ayr_Latn-plt_Latn
  - Data files:
    - split: train
    - path: data/ayr_Latn-plt_Latn/train.parquet.gzip
- config_name: hin_Deva-ita_Latn
  - Data files:
    - split: train
    - path: data/hin_Deva-ita_Latn/train.parquet.gzip
- config_name: cat_Latn-hun_Latn
  - Data files:
    - split: train
    - path: data/cat_Latn-hun_Latn/train.parquet.gzip
- config_name: cat_Latn-npi_Deva
  - Data files:
    - split: train
    - path: data/cat_Latn-npi_Deva/train.parquet.gzip
- config_name: ces_Latn-ind_Latn
  - Data files:
    - split: train
    - path: data/ces_Latn-ind_Latn/train.parquet.gzip
- config_name: ces_Latn-nld_Latn
  - Data files:
    - split: train
    - path: data/ces_Latn-nld_Latn/train.parquet.gzip
- config_name: arb_Arab-jpn_Jpan
  - Data files:
    - split: train
    - path: data/arb_Arab-jpn_Jpan/train.parquet.gzip
- config_name: eng_Latn-ibo_Latn
  - Data files:
    - split: train
    - path: data/eng_Latn-ibo_Latn/train.parquet.gzip
- config_name: ben_Beng-cat_Latn
  - Data files:
    - split: train
    - path: data/ben_Beng-cat_Latn/train.parquet.gzip
- config_name: srp_Latn-tur_Latn
  - Data files:
    - split: train
    - path: data/srp_Latn-tur_Latn/train.parquet.gzip
- config_name: ben_Beng-swh_Latn
  - Data files:
    - split: train
    - path: data/ben_Beng-swh_Latn/train.parquet.gzip
- config_name: deu_Latn-ron_Latn
  - Data files:
    - split: train
    - path: data/deu_Latn-ron_Latn/train.parquet.gzip
- config_name: heb_Hebr-ita_Latn
  - Data files:
    - split: train
    - path: data/heb_Hebr-ita_Latn/train.parquet.gzip
- config_name: pes_Arab-srp_Latn
  - Data files:
    - split: train
    - path: data/pes_Arab-srp_Latn/train.parquet.gzip
- config_name: eng_Latn-fin_Latn
  - Data files:
    - split: train
    - path: data/eng_Latn-fin_Latn/train.parquet.gzip
- config_name: ben_Beng-heb_Hebr
  - Data files:
    - split: train
    - path: data/ben_Beng-heb_Hebr/train.parquet.gzip
- config_name: bul_Cyrl-jpn_Jpan
  - Data files:
    - split: train
    - path: data/bul_Cyrl-jpn_Jpan/train.parquet.gzip
- config_name: kor_Hang-zho_Hans
  - Data files:
    - split: train
    - path: data/kor_Hang-zho_Hans/train.parquet.gzip
- config_name: nld_Latn-zho_Hant
  - Data files:
    - split: train
    - path: data/nld_Latn-zho_Hant/train.parquet.gzip
- config_name: hun_Latn-ron_Latn
  - Data files:
    - split: train
    - path: data/hun_Latn-ron_Latn/train.parquet.gzip
- config_name: npi_Deva-pol_Latn
  - Data files:
    - split: train
    - path: data/npi_Deva-pol_Latn/train.parquet.gzip
- config_name: ayr_Latn-bul_Cyrl
  - Data files:
    - split: train
    - path: data/ayr_Latn-bul_Cyrl/train.parquet.gzip
- config_name: ita_Latn-urd_Arab
  - Data files:
    - split: train
    - path: data/ita_Latn-urd_Arab/train.parquet.gzip
- config_name: ayr_Latn-mkd_Cyrl
  - Data files:
    - split: train
    - path: data/ayr_Latn-mkd_Cyrl/train.parquet.gzip
- config_name: ces_Latn-heb_Hebr
  - Data files:
    - split: train
    - path: data/ces_Latn-heb_Hebr/train.parquet.gzip
- config_name: ayr_Latn-ron_Latn
  - Data files:
    - split: train
    - path: data/ayr_Latn-ron_Latn/train.parquet.gzip
- config_name: mya_Mymr-sqi_Latn
  - Data files:
    - split: train
    - path: data/mya_Mymr-sqi_Latn/train.parquet.gzip
- config_name: fil_Latn-urd_Arab
  - Data files:
    - split: train
    - path: data/fil_Latn-urd_Arab/train.parquet.gzip
- config_name: sqi_Latn-srp_Latn
  - Data files:
    - split: train
    - path: data/sqi_Latn-srp_Latn/train.parquet.gzip
- config_name: por_Latn-tur_Latn
  - Data files:
    - split: train
    - path: data/por_Latn-tur_Latn/train.parquet.gzip
- config_name: plt_Latn-por_Latn
  - Data files:
    - split: train
    - path: data/plt_Latn-por_Latn/train.parquet.gzip
- config_name: ben_Beng-tur_Latn
  - Data files:
    - split: train
    - path: data/ben_Beng-tur_Latn/train.parquet.gzip
- config_name: khm_Khmr-zho_Hant
  - Data files:
    - split: train
    - path: data/khm_Khmr-zho_Hant/train.parquet.gzip
- config_name: ory_Orya-urd_Arab
  - Data files:
    - split: train
    - path: data/ory_Orya-urd_Arab/train.parquet.gzip
- config_name: ben_Beng-mkd_Cyrl
  - Data files:
    - split: train
    - path: data/ben_Beng-mkd_Cyrl/train.parquet.gzip
- config_name: eng_Latn-lug_Latn
  - Data files:
    - split: train
    - path: data/eng_Latn-lug_Latn/train.parquet.gzip
- config_name: hun_Latn-swh_Latn
  - Data files:
    - split: train
    - path: data/hun_Latn-swh_Latn/train.parquet.gzip
- config_name: spa_Latn-ckb_Arab
  - Data files:
    - split: train
    - path: data/spa_Latn-ckb_Arab/train.parquet.gzip
- config_name: por_Latn-srp_Latn
  - Data files:
    - split: train
    - path: data/por_Latn-srp_Latn/train.parquet.gzip
- config_name: kor_Hang-nld_Latn
  - Data files:
    - split: train
    - path: data/kor_Hang-nld_Latn/train.parquet.gzip
- config_name: amh_Ethi-zho_Hans
  - Data files:
    - split: train
    - path: data/amh_Ethi-zho_Hans/train.parquet.gzip
- config_name: ron_Latn-swe_Latn
  - Data files:
    - split: train
    - path: data/ron_Latn-swe_Latn/train.parquet.gzip
- config_name: dan_Latn-kor_Hang
  - Data files:
    - split: train
    - path: data/dan_Latn-kor_Hang/train.parquet.gzip
- config_name: amh_Ethi-nld_Latn
  - Data files:
    - split: train
    - path: data/amh_Ethi-nld_Latn/train.parquet.gzip
- config_name: ita_Latn-rus_Cyrl
  - Data files:
    - split: train
    - path: data/ita_Latn-rus_Cyrl/train.parquet.gzip
- config_name: jpn_Jpan-ory_Orya
  - Data files:
    - split: train
    - path: data/jpn_Jpan-ory_Orya/train.parquet.gzip
- config_name: ayr_Latn-ita_Latn
  - Data files:
    - split: train
    - path: data/ayr_Latn-ita_Latn/train.parquet.gzip
- config_name: eng_Latn-pcm_Latn
  - Data files:
    - split: train
    - path: data/eng_Latn-pcm_Latn/train.parquet.gzip
- config_name: ben_Beng-khm_Khmr
  - Data files:
    - split: train
    - path: data/ben_Beng-khm_Khmr/train.parquet.gzip
- config_name: ita_Latn-ory_Orya
  - Data files:
    - split: train
    - path: data/ita_Latn-ory_Orya/train.parquet.gzip
- config_name: hin_Deva-mya_Mymr
  - Data files:
    - split: train
    - path: data/hin_Deva-mya_Mymr/train.parquet.gzip
- config_name: deu_Latn-khm_Khmr
  - Data files:
    - split: train
    - path: data/deu_Latn-khm_Khmr/train.parquet.gzip
- config_name: nld_Latn-swe_Latn
  - Data files:
    - split: train
    - path: data/nld_Latn-swe_Latn/train.parquet.gzip
- config_name: spa_Latn-sqi_Latn
  - Data files:
    - split: train
    - path: data/spa_Latn-sqi_Latn/train.parquet.gzip
- config_name: ita_
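
Every complete configuration in the listing follows one pattern: the name joins two language tags with a hyphen, and the train split sits at `data/<config_name>/train.parquet.gzip`. The sketch below shows one way to work with that pattern. The helper names (`LangTag`, `parse_config_name`, `train_path`, `load_pair`) are hypothetical and not part of the dataset. Reading each tag as a three-letter language code plus a four-letter script code is an assumption based on how the names appear here, and the path pattern is inferred from the listed entries only. `load_pair` assumes the Hugging Face `datasets` package is installed and that the repository is reachable over the network.

```python
from typing import NamedTuple, Tuple


class LangTag(NamedTuple):
    language: str  # assumed: three-letter language code, e.g. "deu"
    script: str    # assumed: four-letter script code, e.g. "Latn"


def parse_config_name(config_name: str) -> Tuple[LangTag, LangTag]:
    """Split a name such as 'deu_Latn-eng_Latn' into its two language tags."""
    parts = config_name.split("-")
    if len(parts) != 2:
        raise ValueError(f"expected two hyphen-separated tags, got {config_name!r}")
    tags = []
    for part in parts:
        language, sep, script = part.partition("_")
        if not sep or not language or not script:
            raise ValueError(f"expected '<language>_<script>', got {part!r}")
        tags.append(LangTag(language, script))
    return tags[0], tags[1]


def train_path(config_name: str) -> str:
    """Relative path of the train split, following the pattern seen in the listing."""
    return f"data/{config_name}/train.parquet.gzip"


def load_pair(config_name: str):
    """Load the train split of one language pair from the Hugging Face Hub."""
    # Imported lazily so the pure helpers above work without `datasets` installed.
    from datasets import load_dataset

    parse_config_name(config_name)  # fail early on malformed or truncated names
    return load_dataset("aiana94/polynews-parallel", config_name, split="train")
```

For example, `parse_config_name("deu_Latn-eng_Latn")` yields the tags `("deu", "Latn")` and `("eng", "Latn")`, while a truncated name such as `ita_` raises `ValueError` instead of being guessed at. Column names are not stated in the listing, so inspect the returned dataset's `column_names` before relying on any particular field.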



