
SEACrowd/tatoeba

Source: Hugging Face. Updated 2024-06-24. Indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/SEACrowd/tatoeba
Overview:
This dataset is a subset of the Tatoeba corpus containing translation pairs between English and each of Indonesian, Vietnamese, Tagalog, Javanese, and Thai. The data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17. For each language, 1000 English sentences and their translations, if available, were selected. Note that the English sentences are not identical across language pairs, so results are not directly comparable across languages. The dataset is primarily used for machine translation tasks.
Provider:
SEACrowd
Summary of original information

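Tatoeba dataset overview

Basic information

  • License: Apache 2.0
  • Languages:
    • Indonesian (ind)
    • Vietnamese (vie)
    • Tagalog (tgl)
    • Javanese (jav)
    • Thai (tha)
    • English (eng)
  • Task category: Machine translation
  • Tags: Machine translation

Dataset description

  • This dataset is a subset of the Tatoeba corpus containing language pairs for Indonesian, Vietnamese, Tagalog, Javanese, and Thai.
  • The source data was extracted from the Tatoeba corpus, dated 2018/11/17.
  • For each language, 1000 English sentences and their translations (where available) were selected.
  • The English sentences are not identical across language pairs, so results are not directly comparable across languages.
  • Sentences in the low-resource languages are less diverse.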
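
The comparability caveat can be checked in practice by measuring how much of the English side two language pairs actually share. The following is a minimal sketch, not part of the dataset's tooling: the toy sentences, the (English, translation) tuple layout, and the helper names `english_side`, `shared_english`, and `restrict_to_shared` are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# (English sentence, translation). This layout is an assumption for the sketch,
# not the dataset's actual schema.
Pair = Tuple[str, str]

# Toy stand-in data invented for illustration; not taken from the dataset.
TOY_PAIRS: Dict[str, List[Pair]] = {
    "ind": [("Thank you.", "Terima kasih."), ("Good morning.", "Selamat pagi.")],
    "vie": [("Thank you.", "Cảm ơn."), ("See you later.", "Hẹn gặp lại.")],
}


def english_side(pairs: List[Pair]) -> Set[str]:
    """Collect the distinct English sentences of one language pair."""
    return {eng for eng, _ in pairs}


def shared_english(data: Dict[str, List[Pair]]) -> Set[str]:
    """English sentences that occur in every language pair."""
    sides = [english_side(pairs) for pairs in data.values()]
    return set.intersection(*sides) if sides else set()


def restrict_to_shared(data: Dict[str, List[Pair]]) -> Dict[str, List[Pair]]:
    """Keep only the pairs whose English side is common to all languages."""
    shared = shared_english(data)
    return {
        lang: [(eng, tgt) for eng, tgt in pairs if eng in shared]
        for lang, pairs in data.items()
    }


common = shared_english(TOY_PAIRS)
comparable = restrict_to_shared(TOY_PAIRS)
print(f"{len(common)} shared English sentence(s)")
for lang, pairs in comparable.items():
    print(lang, len(pairs), "of", len(TOY_PAIRS[lang]), "pairs kept")
```

Scores computed on the restricted subsets are measured on the same English sentences, at the cost of a smaller evaluation set; whether the real data has enough overlap for this to be useful would need to be checked.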

Supported tasks

  • Machine translation

Dataset versions

  • Source version: 1.0.0
  • SEACrowd version: 2024.06.20

License

  • Apache License 2.0 (apache-2.0)

Citation

  • When using the Tatoeba dataset, please cite the following:

    @article{tatoeba, title = {Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond}, author = {Artetxe, Mikel and Schwenk, Holger}, journal = {arXiv:1812.10464v2}, year = {2018} }

    @article{lovenia2024seacrowd, title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages}, author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya}, year={2024}, eprint={2406.10118}, journal={arXiv preprint arXiv: 2406.10118} }
