
DeskDown/ALTDataset_en-to-fil-vi-id-ms-ja-khm

Source: Hugging Face. Updated: 2022-01-03. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/DeskDown/ALTDataset_en-to-fil-vi-id-ms-ja-khm
Resource description:
__Introduction__ The ALT (Asian Language Treebank) project aims to advance state-of-the-art Asian natural language processing (NLP) techniques through open collaboration on developing and using ALT. It was first conducted by NICT and UCSY, as described in Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch and Eiichiro Sumita (2016). It was then developed under ASEAN IVO, as described in this Web page. Building ALT began with sampling about 20,000 sentences from English Wikinews; these sentences were then translated into the other languages. ALT now has 13 languages: Bengali, English, Filipino, Hindi, Bahasa Indonesia, Japanese, Khmer, Lao, Malay, Myanmar (Burmese), Thai, Vietnamese, Chinese (Simplified Chinese). This dataset contains a parallel corpus for the fil, vi, id, ms, ja, and khm languages. The dataset is tokenized using an mbart50-like tokenizer. (To be added soon) Tokens are padded or truncated to a length of 128.
Provider:
DeskDown
Summary of original information

Dataset overview

Dataset introduction

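The ALT project aims to advance the state of the art in Asian natural language processing (NLP) through open collaboration. It was first carried out by NICT and UCSY, as described in Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch and Eiichiro Sumita (2016), and was subsequently developed further under the ASEAN IVO framework.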

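Dataset construction

Construction began with sampling about 20,000 sentences from English Wikinews, which were then translated into the other languages. The ALT project currently covers 13 languages: Bengali, English, Filipino, Hindi, Indonesian, Japanese, Khmer, Lao, Malay, Myanmar (Burmese), Thai, Vietnamese, and Simplified Chinese.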

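Dataset contents

This dataset contains parallel corpora for the following languages: Filipino (fil), Vietnamese (vi), Indonesian (id), Malay (ms), Japanese (ja), and Khmer (khm). The dataset is tokenized with an mbart50-like tokenizer (to be added soon), and token sequences are padded or truncated to a length of 128.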
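
The fixed length of 128 means every tokenized sentence is either cut short or filled with a padding token. The description does not say how this is done, so the following is only a minimal sketch of the general idea in plain Python. The names `pad_or_truncate` and `attention_mask` and the value of `PAD_ID` are illustrative assumptions, not part of the dataset or of any tokenizer library; the real pad token id should be taken from the tokenizer actually used.

```python
from typing import List

MAX_LENGTH = 128  # fixed sequence length stated in the dataset description
PAD_ID = 1        # assumption: placeholder pad id; read the real one from your tokenizer


def pad_or_truncate(token_ids: List[int],
                    max_length: int = MAX_LENGTH,
                    pad_id: int = PAD_ID) -> List[int]:
    """Return a new list of exactly max_length token ids."""
    clipped = token_ids[:max_length]               # drop anything past max_length
    padding = [pad_id] * (max_length - len(clipped))  # empty when already full
    return clipped + padding


def attention_mask(token_ids: List[int],
                   max_length: int = MAX_LENGTH) -> List[int]:
    """Return 1 for positions holding real tokens and 0 for padding."""
    real = min(len(token_ids), max_length)
    return [1] * real + [0] * (max_length - real)


if __name__ == "__main__":
    short = [250004, 17, 42, 2]        # hypothetical ids for a short sentence
    fixed = pad_or_truncate(short)
    print(len(fixed), fixed[:6])
    print(sum(attention_mask(short)))
```

This sketch simply drops tokens beyond position 128. Tokenizer libraries typically also preserve special tokens such as the end-of-sequence marker when truncating, which this example does not attempt.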
