
aiana94/polynews-parallel

Hugging Face, updated 2024-06-21, indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/aiana94/polynews-parallel
Summary:
This dataset is a multilingual parallel news corpus that supports translation and text retrieval tasks across many languages. It is drawn from three sources (mafand, wmt-news, and globalvoices) and falls in the 1K<n<10K size category. Its primary use is machine translation and text retrieval for multilingual news.
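As a minimal sketch of how one language pair might be loaded, the helper below uses the `load_dataset` function from the Hugging Face `datasets` library. It assumes that library is installed and that network access to the Hub is available; the helper name `load_pair` and the choice of `deu_Latn-eng_Latn` as the default are illustrative only. Column names are not listed on this page, so the sketch does not assume any.

```python
REPO_ID = "aiana94/polynews-parallel"   # dataset identifier from this page
DEFAULT_CONFIG = "deu_Latn-eng_Latn"    # one of the config names listed on this page


def load_pair(config_name: str = DEFAULT_CONFIG, split: str = "train"):
    """Load one language-pair config (hypothetical helper; needs `pip install datasets`)."""
    # Imported lazily so that defining this helper does not require the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split)


# Example (not run here because it downloads data):
#   pairs = load_pair("deu_Latn-eng_Latn")
#   print(pairs.column_names)  # inspect the columns before relying on any name
#   print(pairs[0])
```

Only a `train` split appears in the listing, which is why it is the default here.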
Provider:
aiana94
Summary of original information

Dataset overview

Basic information

  • License: cc-by-nc-4.0
  • Task categories:
    • translation
    • text retrieval
  • Languages:
    • am, ar, ay, bm, bbj, bn, bg, ca, cs, ku, da, de, el, en, et, ee, fil, fi, fr, fon, gu, ha, he, hi, hu, ig, id, it, ja, kk, km, ko, lv, lt, lg, luo, mk, mos, my, nl, ne, or, pa, pcm, fa, pl, pt, mg, ro, ru, es, sr, sq, sw, sv, tet, tn, tr, tw, ur, wo, yo, zh, zu
  • Multilinguality:
    • translation
    • multilingual
  • Dataset name: PolyNewsParallel
  • Dataset size: 1K<n<10K
  • Source datasets:
    • mafand
    • wmt-news
    • globalvoices
  • Tags:
    • news
    • polynews-parallel
    • mafand
    • globalvoices
    • wmtnews

Configuration details

  • config_name: ces_Latn-tur_Latn
    • data_files:
      • split: train
      • path: data/ces_Latn-tur_Latn/train.parquet.gzip
  • config_name: mya_Mymr-rus_Cyrl
    • data_files:
      • split: train
      • path: data/mya_Mymr-rus_Cyrl/train.parquet.gzip
  • config_name: plt_Latn-nld_Latn
    • data_files:
      • split: train
      • path: data/plt_Latn-nld_Latn/train.parquet.gzip
  • config_name: hun_Latn-jpn_Jpan
    • data_files:
      • split: train
      • path: data/hun_Latn-jpn_Jpan/train.parquet.gzip
  • config_name: bul_Cyrl-swh_Latn
    • data_files:
      • split: train
      • path: data/bul_Cyrl-swh_Latn/train.parquet.gzip
  • config_name: amh_Ethi-deu_Latn
    • data_files:
      • split: train
      • path: data/amh_Ethi-deu_Latn/train.parquet.gzip
  • config_name: cat_Latn-ell_Grek
    • data_files:
      • split: train
      • path: data/cat_Latn-ell_Grek/train.parquet.gzip
  • config_name: cat_Latn-nld_Latn
    • data_files:
      • split: train
      • path: data/cat_Latn-nld_Latn/train.parquet.gzip
  • config_name: deu_Latn-eng_Latn
    • data_files:
      • split: train
      • path: data/deu_Latn-eng_Latn/train.parquet.gzip
  • config_name: ben_Beng-tet_Latn
    • data_files:
      • split: train
      • path: data/ben_Beng-tet_Latn/train.parquet.gzip
  • config_name: bul_Cyrl-srp_Latn
    • data_files:
      • split: train
      • path: data/bul_Cyrl-srp_Latn/train.parquet.gzip
  • config_name: arb_Arab-tur_Latn
    • data_files:
      • split: train
      • path: data/arb_Arab-tur_Latn/train.parquet.gzip
  • config_name: bul_Cyrl-ita_Latn
    • data_files:
      • split: train
      • path: data/bul_Cyrl-ita_Latn/train.parquet.gzip
  • config_name: ayr_Latn-plt_Latn
    • data_files:
      • split: train
      • path: data/ayr_Latn-plt_Latn/train.parquet.gzip
  • config_name: hin_Deva-ita_Latn
    • data_files:
      • split: train
      • path: data/hin_Deva-ita_Latn/train.parquet.gzip
  • config_name: cat_Latn-hun_Latn
    • data_files:
      • split: train
      • path: data/cat_Latn-hun_Latn/train.parquet.gzip
  • config_name: cat_Latn-npi_Deva
    • data_files:
      • split: train
      • path: data/cat_Latn-npi_Deva/train.parquet.gzip
  • config_name: ces_Latn-ind_Latn
    • data_files:
      • split: train
      • path: data/ces_Latn-ind_Latn/train.parquet.gzip
  • config_name: ces_Latn-nld_Latn
    • data_files:
      • split: train
      • path: data/ces_Latn-nld_Latn/train.parquet.gzip
  • config_name: arb_Arab-jpn_Jpan
    • data_files:
      • split: train
      • path: data/arb_Arab-jpn_Jpan/train.parquet.gzip
  • config_name: eng_Latn-ibo_Latn
    • data_files:
      • split: train
      • path: data/eng_Latn-ibo_Latn/train.parquet.gzip
  • config_name: ben_Beng-cat_Latn
    • data_files:
      • split: train
      • path: data/ben_Beng-cat_Latn/train.parquet.gzip
  • config_name: srp_Latn-tur_Latn
    • data_files:
      • split: train
      • path: data/srp_Latn-tur_Latn/train.parquet.gzip
  • config_name: ben_Beng-swh_Latn
    • data_files:
      • split: train
      • path: data/ben_Beng-swh_Latn/train.parquet.gzip
  • config_name: deu_Latn-ron_Latn
    • data_files:
      • split: train
      • path: data/deu_Latn-ron_Latn/train.parquet.gzip
  • config_name: heb_Hebr-ita_Latn
    • data_files:
      • split: train
      • path: data/heb_Hebr-ita_Latn/train.parquet.gzip
  • config_name: pes_Arab-srp_Latn
    • data_files:
      • split: train
      • path: data/pes_Arab-srp_Latn/train.parquet.gzip
  • config_name: eng_Latn-fin_Latn
    • data_files:
      • split: train
      • path: data/eng_Latn-fin_Latn/train.parquet.gzip
  • config_name: ben_Beng-heb_Hebr
    • data_files:
      • split: train
      • path: data/ben_Beng-heb_Hebr/train.parquet.gzip
  • config_name: bul_Cyrl-jpn_Jpan
    • data_files:
      • split: train
      • path: data/bul_Cyrl-jpn_Jpan/train.parquet.gzip
  • config_name: kor_Hang-zho_Hans
    • data_files:
      • split: train
      • path: data/kor_Hang-zho_Hans/train.parquet.gzip
  • config_name: nld_Latn-zho_Hant
    • data_files:
      • split: train
      • path: data/nld_Latn-zho_Hant/train.parquet.gzip
  • config_name: hun_Latn-ron_Latn
    • data_files:
      • split: train
      • path: data/hun_Latn-ron_Latn/train.parquet.gzip
  • config_name: npi_Deva-pol_Latn
    • data_files:
      • split: train
      • path: data/npi_Deva-pol_Latn/train.parquet.gzip
  • config_name: ayr_Latn-bul_Cyrl
    • data_files:
      • split: train
      • path: data/ayr_Latn-bul_Cyrl/train.parquet.gzip
  • config_name: ita_Latn-urd_Arab
    • data_files:
      • split: train
      • path: data/ita_Latn-urd_Arab/train.parquet.gzip
  • config_name: ayr_Latn-mkd_Cyrl
    • data_files:
      • split: train
      • path: data/ayr_Latn-mkd_Cyrl/train.parquet.gzip
  • config_name: ces_Latn-heb_Hebr
    • data_files:
      • split: train
      • path: data/ces_Latn-heb_Hebr/train.parquet.gzip
  • config_name: ayr_Latn-ron_Latn
    • data_files:
      • split: train
      • path: data/ayr_Latn-ron_Latn/train.parquet.gzip
  • config_name: mya_Mymr-sqi_Latn
    • data_files:
      • split: train
      • path: data/mya_Mymr-sqi_Latn/train.parquet.gzip
  • config_name: fil_Latn-urd_Arab
    • data_files:
      • split: train
      • path: data/fil_Latn-urd_Arab/train.parquet.gzip
  • config_name: sqi_Latn-srp_Latn
    • data_files:
      • split: train
      • path: data/sqi_Latn-srp_Latn/train.parquet.gzip
  • config_name: por_Latn-tur_Latn
    • data_files:
      • split: train
      • path: data/por_Latn-tur_Latn/train.parquet.gzip
  • config_name: plt_Latn-por_Latn
    • data_files:
      • split: train
      • path: data/plt_Latn-por_Latn/train.parquet.gzip
  • config_name: ben_Beng-tur_Latn
    • data_files:
      • split: train
      • path: data/ben_Beng-tur_Latn/train.parquet.gzip
  • config_name: khm_Khmr-zho_Hant
    • data_files:
      • split: train
      • path: data/khm_Khmr-zho_Hant/train.parquet.gzip
  • config_name: ory_Orya-urd_Arab
    • data_files:
      • split: train
      • path: data/ory_Orya-urd_Arab/train.parquet.gzip
  • config_name: ben_Beng-mkd_Cyrl
    • data_files:
      • split: train
      • path: data/ben_Beng-mkd_Cyrl/train.parquet.gzip
  • config_name: eng_Latn-lug_Latn
    • data_files:
      • split: train
      • path: data/eng_Latn-lug_Latn/train.parquet.gzip
  • config_name: hun_Latn-swh_Latn
    • data_files:
      • split: train
      • path: data/hun_Latn-swh_Latn/train.parquet.gzip
  • config_name: spa_Latn-ckb_Arab
    • data_files:
      • split: train
      • path: data/spa_Latn-ckb_Arab/train.parquet.gzip
  • config_name: por_Latn-srp_Latn
    • data_files:
      • split: train
      • path: data/por_Latn-srp_Latn/train.parquet.gzip
  • config_name: kor_Hang-nld_Latn
    • data_files:
      • split: train
      • path: data/kor_Hang-nld_Latn/train.parquet.gzip
  • config_name: amh_Ethi-zho_Hans
    • data_files:
      • split: train
      • path: data/amh_Ethi-zho_Hans/train.parquet.gzip
  • config_name: ron_Latn-swe_Latn
    • data_files:
      • split: train
      • path: data/ron_Latn-swe_Latn/train.parquet.gzip
  • config_name: dan_Latn-kor_Hang
    • data_files:
      • split: train
      • path: data/dan_Latn-kor_Hang/train.parquet.gzip
  • config_name: amh_Ethi-nld_Latn
    • data_files:
      • split: train
      • path: data/amh_Ethi-nld_Latn/train.parquet.gzip
  • config_name: ita_Latn-rus_Cyrl
    • data_files:
      • split: train
      • path: data/ita_Latn-rus_Cyrl/train.parquet.gzip
  • config_name: jpn_Jpan-ory_Orya
    • data_files:
      • split: train
      • path: data/jpn_Jpan-ory_Orya/train.parquet.gzip
  • config_name: ayr_Latn-ita_Latn
    • data_files:
      • split: train
      • path: data/ayr_Latn-ita_Latn/train.parquet.gzip
  • config_name: eng_Latn-pcm_Latn
    • data_files:
      • split: train
      • path: data/eng_Latn-pcm_Latn/train.parquet.gzip
  • config_name: ben_Beng-khm_Khmr
    • data_files:
      • split: train
      • path: data/ben_Beng-khm_Khmr/train.parquet.gzip
  • config_name: ita_Latn-ory_Orya
    • data_files:
      • split: train
      • path: data/ita_Latn-ory_Orya/train.parquet.gzip
  • config_name: hin_Deva-mya_Mymr
    • data_files:
      • split: train
      • path: data/hin_Deva-mya_Mymr/train.parquet.gzip
  • config_name: deu_Latn-khm_Khmr
    • data_files:
      • split: train
      • path: data/deu_Latn-khm_Khmr/train.parquet.gzip
  • config_name: nld_Latn-swe_Latn
    • data_files:
      • split: train
      • path: data/nld_Latn-swe_Latn/train.parquet.gzip
  • config_name: spa_Latn-sqi_Latn
    • data_files:
      • split: train
      • path: data/spa_Latn-sqi_Latn/train.parquet.gzip
  • **config_name: ita_
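
The complete entries above follow a regular pattern: each config name joins two `<language>_<script>` tags with a hyphen, and its train file sits at `data/<config_name>/train.parquet.gzip`. The sketch below splits a config name into its two tags and rebuilds the relative path. Reading the tags as a three-letter language code plus a four-letter script code is an interpretation of the names shown here, and `LangTag`, `parse_config_name`, and `data_file_path` are hypothetical helpers, not part of the dataset.

```python
from typing import NamedTuple, Tuple


class LangTag(NamedTuple):
    language: str  # three-letter language code, e.g. "deu"
    script: str    # four-letter script code, e.g. "Latn"


def parse_config_name(config_name: str) -> Tuple[LangTag, LangTag]:
    """Split a name such as 'deu_Latn-eng_Latn' into its two language tags."""
    sides = config_name.split("-")
    if len(sides) != 2:
        raise ValueError(f"expected '<tag>-<tag>', got {config_name!r}")
    tags = []
    for side in sides:
        parts = side.split("_")
        if len(parts) != 2:
            raise ValueError(f"expected '<language>_<script>', got {side!r}")
        tags.append(LangTag(language=parts[0], script=parts[1]))
    return tags[0], tags[1]


def data_file_path(config_name: str, split: str = "train") -> str:
    """Relative path of a split's file, following the pattern in the listing."""
    return f"data/{config_name}/{split}.parquet.gzip"


first, second = parse_config_name("kor_Hang-zho_Hans")
print(first.language, first.script)          # kor Hang
print(second.language, second.script)        # zho Hans
print(data_file_path("kor_Hang-zho_Hans"))   # data/kor_Hang-zho_Hans/train.parquet.gzip
```

Validating the shape of the name up front means a truncated or malformed entry raises a `ValueError` instead of silently producing a wrong path.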