
mutiyama/alt

Hugging Face | Updated 2024-01-09 | Indexed 2024-05-25
Download link:
https://hf-mirror.com/datasets/mutiyama/alt
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
language:
- bn
- en
- fil
- hi
- id
- ja
- km
- lo
- ms
- my
- th
- vi
- zh
license:
- cc-by-4.0
multilinguality:
- multilingual
- translation
size_categories:
- 100K<n<1M
- 10K<n<100K
source_datasets:
- original
task_categories:
- translation
- token-classification
task_ids:
- parsing
paperswithcode_id: alt
pretty_name: Asian Language Treebank
config_names:
- alt-en
- alt-jp
- alt-km
- alt-my
- alt-my-transliteration
- alt-my-west-transliteration
- alt-parallel
dataset_info:
- config_name: alt-en
  features:
  - name: SNT.URLID
    dtype: string
  - name: SNT.URLID.SNTID
    dtype: string
  - name: url
    dtype: string
  - name: status
    dtype: string
  - name: value
    dtype: string
  splits:
  - name: train
    num_bytes: 10075569
    num_examples: 17889
  - name: validation
    num_bytes: 544719
    num_examples: 988
  - name: test
    num_bytes: 567272
    num_examples: 1017
  download_size: 3781814
  dataset_size: 11187560
- config_name: alt-jp
  features:
  - name: SNT.URLID
    dtype: string
  - name: SNT.URLID.SNTID
    dtype: string
  - name: url
    dtype: string
  - name: status
    dtype: string
  - name: value
    dtype: string
  - name: word_alignment
    dtype: string
  - name: jp_tokenized
    dtype: string
  - name: en_tokenized
    dtype: string
  splits:
  - name: train
    num_bytes: 21888277
    num_examples: 17202
  - name: validation
    num_bytes: 1181555
    num_examples: 953
  - name: test
    num_bytes: 1175592
    num_examples: 931
  download_size: 10355366
  dataset_size: 24245424
- config_name: alt-km
  features:
  - name: SNT.URLID
    dtype: string
  - name: SNT.URLID.SNTID
    dtype: string
  - name: url
    dtype: string
  - name: km_pos_tag
    dtype: string
  - name: km_tokenized
    dtype: string
  splits:
  - name: train
    num_bytes: 12015371
    num_examples: 18088
  - name: validation
    num_bytes: 655212
    num_examples: 1000
  - name: test
    num_bytes: 673733
    num_examples: 1018
  download_size: 4344096
  dataset_size: 13344316
- config_name: alt-my
  features:
  - name: SNT.URLID
    dtype: string
  - name: SNT.URLID.SNTID
    dtype: string
  - name: url
    dtype: string
  - name: value
    dtype: string
  splits:
  - name: train
    num_bytes: 20433243
    num_examples: 18088
  - name: validation
    num_bytes: 1111394
    num_examples: 1000
  - name: test
    num_bytes: 1135193
    num_examples: 1018
  download_size: 6569025
  dataset_size: 22679830
- config_name: alt-my-transliteration
  features:
  - name: en
    dtype: string
  - name: my
    sequence: string
  splits:
  - name: train
    num_bytes: 4249316
    num_examples: 84022
  download_size: 2163951
  dataset_size: 4249316
- config_name: alt-my-west-transliteration
  features:
  - name: en
    dtype: string
  - name: my
    sequence: string
  splits:
  - name: train
    num_bytes: 7411911
    num_examples: 107121
  download_size: 2857511
  dataset_size: 7411911
- config_name: alt-parallel
  features:
  - name: SNT.URLID
    dtype: string
  - name: SNT.URLID.SNTID
    dtype: string
  - name: url
    dtype: string
  - name: translation
    dtype:
      translation:
        languages:
        - bg
        - en
        - en_tok
        - fil
        - hi
        - id
        - ja
        - khm
        - lo
        - ms
        - my
        - th
        - vi
        - zh
  splits:
  - name: train
    num_bytes: 68445916
    num_examples: 18088
  - name: validation
    num_bytes: 3710979
    num_examples: 1000
  - name: test
    num_bytes: 3814431
    num_examples: 1019
  download_size: 34707907
  dataset_size: 75971326
configs:
- config_name: alt-en
  data_files:
  - split: train
    path: alt-en/train-*
  - split: validation
    path: alt-en/validation-*
  - split: test
    path: alt-en/test-*
- config_name: alt-jp
  data_files:
  - split: train
    path: alt-jp/train-*
  - split: validation
    path: alt-jp/validation-*
  - split: test
    path: alt-jp/test-*
- config_name: alt-km
  data_files:
  - split: train
    path: alt-km/train-*
  - split: validation
    path: alt-km/validation-*
  - split: test
    path: alt-km/test-*
- config_name: alt-my
  data_files:
  - split: train
    path: alt-my/train-*
  - split: validation
    path: alt-my/validation-*
  - split: test
    path: alt-my/test-*
- config_name: alt-my-transliteration
  data_files:
  - split: train
    path: alt-my-transliteration/train-*
- config_name: alt-my-west-transliteration
  data_files:
  - split: train
    path: alt-my-west-transliteration/train-*
- config_name: alt-parallel
  data_files:
  - split: train
    path: alt-parallel/train-*
  - split: validation
    path: alt-parallel/validation-*
  - split: test
    path: alt-parallel/test-*
  default: true
---

# Dataset Card for Asian Language Treebank (ALT)

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/
- **Leaderboard:**
- **Paper:** [Introduction of the Asian Language Treebank](https://ieeexplore.ieee.org/abstract/document/7918974)
- **Point of Contact:** [ALT info](mailto:alt-info@khn.nict.go.jp)

### Dataset Summary

The ALT project aims to advance the state of the art in Asian natural language processing (NLP) through open collaboration on developing and using ALT. It was first conducted by NICT and UCSY, as described in Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch and Eiichiro Sumita (2016). It was then developed under [ASEAN IVO](https://www.nict.go.jp/en/asean_ivo/index.html) as described in this Web page.

The process of building ALT began with sampling about 20,000 sentences from English Wikinews; these sentences were then translated into the other languages.

### Supported Tasks and Leaderboards

Machine Translation, Dependency Parsing

### Languages

It supports 13 languages:
  * Bengali
  * English
  * Filipino
  * Hindi
  * Bahasa Indonesia
  * Japanese
  * Khmer
  * Lao
  * Malay
  * Myanmar (Burmese)
  * Thai
  * Vietnamese
  * Chinese (Simplified Chinese)

## Dataset Structure

### Data Instances

#### ALT Parallel Corpus
```
{
    "SNT.URLID": "80188",
    "SNT.URLID.SNTID": "1",
    "url": "http://en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal",
    "bg": "[translated sentence]",
    "en": "[translated sentence]",
    "en_tok": "[translated sentence]",
    "fil": "[translated sentence]",
    "hi": "[translated sentence]",
    "id": "[translated sentence]",
    "ja": "[translated sentence]",
    "khm": "[translated sentence]",
    "lo": "[translated sentence]",
    "ms": "[translated sentence]",
    "my": "[translated sentence]",
    "th": "[translated sentence]",
    "vi": "[translated sentence]",
    "zh": "[translated sentence]"
}
```

#### ALT Treebank
```
{
    "SNT.URLID": "80188",
    "SNT.URLID.SNTID": "1",
    "url": "http://en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal",
    "status": "draft/reviewed",
    "value": "(S (S (BASENP (NNP Italy)) (VP (VBP have) (VP (VP (VP (VBN defeated) (BASENP (NNP Portugal))) (ADVP (RB 31-5))) (PP (IN in) (NP (BASENP (NNP Pool) (NNP C)) (PP (IN of) (NP (BASENP (DT the) (NN 2007) (NNP Rugby) (NNP World) (NNP Cup)) (PP (IN at) (NP (BASENP (NNP Parc) (FW des) (NNP Princes)) (COMMA ,) (BASENP (NNP Paris) (COMMA ,) (NNP France))))))))))) (PERIOD .))"
}
```

#### ALT Myanmar transliteration
```
{
    "en": "CASINO",
    "my": [
        "ကက်စီနို",
        "ကစီနို",
        "ကာစီနို",
        "ကာဆီနို"
    ]
}
```

### Data Fields

#### ALT Parallel Corpus
- SNT.URLID: URL link to the source article listed in [URL.txt](https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/ALT-Parallel-Corpus-20191206/URL.txt)
- SNT.URLID.SNTID: index number from 1 to 20000. It identifies a sentence selected from `SNT.URLID`.
- bg, en, fil, hi, id, ja, khm, lo, ms, my, th, vi, zh: the sentence in the corresponding target language

#### ALT Treebank
- status: indicates how a sentence is annotated; `draft` sentences are annotated by one annotator and `reviewed` sentences are annotated by two annotators.

The annotation differs from language to language; please see [their guidelines](https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/) for more detail.

### Data Splits

|             | train | valid | test |
|-------------|-------|-------|------|
| # articles  | 1698  | 98    | 97   |
| # sentences | 18088 | 1000  | 1018 |

## Dataset Creation

### Curation Rationale

The ALT project was initiated by the [National Institute of Information and Communications Technology, Japan](https://www.nict.go.jp/en/) (NICT) in 2014. NICT started to build Japanese and English ALT and worked with the University of Computer Studies, Yangon, Myanmar (UCSY) to build Myanmar ALT in 2014. Then, the Badan Pengkajian dan Penerapan Teknologi, Indonesia (BPPT), the Institute for Infocomm Research, Singapore (I2R), the Institute of Information Technology, Vietnam (IOIT), and the National Institute of Posts, Telecoms and ICT, Cambodia (NIPTICT) joined to make ALT for Indonesian, Malay, Vietnamese, and Khmer in 2015.

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

The dataset is sampled from the English Wikinews in 2014.
These sentences will be annotated with word segmentation, POS tags, and syntax information, in addition to word alignment information, by linguistic experts from:
* National Institute of Information and Communications Technology, Japan (NICT) for Japanese and English
* University of Computer Studies, Yangon, Myanmar (UCSY) for Myanmar
* the Badan Pengkajian dan Penerapan Teknologi, Indonesia (BPPT) for Indonesian
* the Institute for Infocomm Research, Singapore (I2R) for Malay
* the Institute of Information Technology, Vietnam (IOIT) for Vietnamese
* the National Institute of Posts, Telecoms and ICT, Cambodia for Khmer

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

* National Institute of Information and Communications Technology, Japan (NICT) for Japanese and English
* University of Computer Studies, Yangon, Myanmar (UCSY) for Myanmar
* the Badan Pengkajian dan Penerapan Teknologi, Indonesia (BPPT) for Indonesian
* the Institute for Infocomm Research, Singapore (I2R) for Malay
* the Institute of Information Technology, Vietnam (IOIT) for Vietnamese
* the National Institute of Posts, Telecoms and ICT, Cambodia for Khmer

### Licensing Information

[Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)

### Citation Information

Please cite the following if you make use of the dataset:

Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thai, Vichet Chea, Rapid Sun, Sethserey Sam, Sopheap Seng, Khin Mar Soe, Khin Thandar Nwet, Masao Utiyama, Chenchen Ding. (2016) "Introduction of the Asian Language Treebank" Oriental COCOSDA.

BibTeX:
```
@inproceedings{riza2016introduction,
  title={Introduction of the asian language treebank},
  author={Riza, Hammam and Purwoadi, Michael and Uliniansyah, Teduh and Ti, Aw Ai and Aljunied, Sharifah Mahani and Mai, Luong Chi and Thang, Vu Tat and Thai, Nguyen Phuong and Chea, Vichet and Sam, Sethserey and others},
  booktitle={2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA)},
  pages={1--6},
  year={2016},
  organization={IEEE}
}
```

### Contributions

Thanks to [@chameleonTK](https://github.com/chameleonTK) for adding this dataset.
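
The bracketed trees stored in the treebank `value` field (see the ALT Treebank instance under Data Instances) can be read with a few lines of plain Python. The following is a minimal sketch, not the project's own tooling: the helper names `parse_tree` and `tagged_words` are hypothetical, and it assumes that every `value` string uses the parenthesised `(LABEL child ...)` form shown in the English example and that no token contains a literal parenthesis. The input string is a shortened fragment modelled on that example, not a real corpus record.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_tree(text):
    """Parse a bracketed tree into nested lists of the form [label, child, ...].

    Leaves (words) are plain strings. Raises ValueError on malformed input.
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return [label] + children

    try:
        tree = node()
    except IndexError:
        raise ValueError("unbalanced brackets") from None
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root node")
    return tree


def tagged_words(tree):
    """Return (tag, word) pairs for every preterminal node, left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    pairs = []
    for child in children:
        if isinstance(child, list):
            pairs.extend(tagged_words(child))
    return pairs


# Shortened fragment modelled on the card's example (an assumption, not corpus data).
fragment = "(S (BASENP (NNP Italy)) (VP (VBP have) (VP (VBN defeated) (BASENP (NNP Portugal)))))"
print(tagged_words(parse_tree(fragment)))
```

Nested lists keep the sketch dependency-free; a dedicated tree library could be substituted if richer traversal is needed. Annotation conventions differ between languages, so the same reader may not apply unchanged to every treebank configuration.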
Provider:
mutiyama
Summary of Original Information

Dataset Overview

Dataset Name

  • Asian Language Treebank (ALT)

Supported Languages

  • 13 languages: Bengali, English, Filipino, Hindi, Bahasa Indonesia, Japanese, Khmer, Lao, Malay, Myanmar (Burmese), Thai, Vietnamese, Chinese (Simplified Chinese)

License

  • Creative Commons Attribution 4.0 International (CC BY 4.0)

Multilinguality

  • multilingual
  • translation

Task Categories

  • translation
  • token-classification

Dataset Size

  • 100K<n<1M
  • 10K<n<100K

Dataset Creators

  • National Institute of Information and Communications Technology, Japan (NICT)
  • University of Computer Studies, Yangon, Myanmar (UCSY)
  • Badan Pengkajian dan Penerapan Teknologi, Indonesia (BPPT)
  • Institute for Infocomm Research, Singapore (I2R)
  • Institute of Information Technology, Vietnam (IOIT)
  • National Institute of Posts, Telecoms and ICT, Cambodia

Dataset Configurations

  • config_name: alt-en, alt-jp, alt-km, alt-my, alt-my-transliteration, alt-my-west-transliteration, alt-parallel
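
Each configuration listed above is loaded by name. The sketch below assumes the Hugging Face `datasets` library is installed and that the dataset is reachable under the repository id `mutiyama/alt` (taken from the download link on this page). The helper `load_alt` is hypothetical; it only wraps `datasets.load_dataset` and validates the configuration name first.

```python
# Configuration names as listed on this page.
ALT_CONFIGS = (
    "alt-en",
    "alt-jp",
    "alt-km",
    "alt-my",
    "alt-my-transliteration",
    "alt-my-west-transliteration",
    "alt-parallel",
)


def load_alt(config="alt-parallel", split="train"):
    """Load one ALT configuration (hypothetical helper, needs network access)."""
    if config not in ALT_CONFIGS:
        raise ValueError(f"unknown ALT config {config!r}; expected one of {ALT_CONFIGS}")
    # Imported lazily so that the name check above works without the library.
    from datasets import load_dataset

    return load_dataset("mutiyama/alt", config, split=split)


# Example call (commented out because it downloads data):
# validation = load_alt("alt-en", split="validation")
# print(validation[0]["value"])
```

The two transliteration configurations only provide a `train` split, so requesting `validation` or `test` for them is expected to fail.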

Dataset Features

  • alt-en
    • features: SNT.URLID, SNT.URLID.SNTID, url, status, value
    • splits: train, validation, test
  • alt-jp
    • features: SNT.URLID, SNT.URLID.SNTID, url, status, value, word_alignment, jp_tokenized, en_tokenized
    • splits: train, validation, test
  • alt-km
    • features: SNT.URLID, SNT.URLID.SNTID, url, km_pos_tag, km_tokenized
    • splits: train, validation, test
  • alt-my
    • features: SNT.URLID, SNT.URLID.SNTID, url, value
    • splits: train, validation, test
  • alt-my-transliteration
    • features: en, my
    • splits: train
  • alt-my-west-transliteration
    • features: en, my
    • splits: train
  • alt-parallel
    • features: SNT.URLID, SNT.URLID.SNTID, url, translation
    • splits: train, validation, test
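
In `alt-parallel`, the `translation` feature holds one string per language code, so a record can be turned into sentence pairs for a chosen language direction. This is a minimal sketch under stated assumptions: the helper `sentence_pairs` is hypothetical, the records are hand-written stand-ins that reuse the card's placeholder style, and skipping records with a missing or empty side is an illustrative choice rather than documented corpus behavior.

```python
def sentence_pairs(records, src="en", tgt="ja"):
    """Yield (source, target) pairs from alt-parallel style records.

    Records where either side is missing or empty are skipped.
    """
    for record in records:
        translation = record["translation"]
        source, target = translation.get(src), translation.get(tgt)
        if source and target:
            yield source, target


# Hand-written stand-in records, not real corpus text.
demo_records = [
    {
        "SNT.URLID": "80188",
        "SNT.URLID.SNTID": "1",
        "url": "http://en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal",
        "translation": {"en": "[en sentence 1]", "ja": "[ja sentence 1]"},
    },
    {
        "SNT.URLID": "80188",
        "SNT.URLID.SNTID": "2",
        "url": "http://en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal",
        "translation": {"en": "[en sentence 2]", "ja": ""},
    },
]

pairs = list(sentence_pairs(demo_records))
print(pairs)
```

The same function works on a loaded split because each row is a dictionary with the same keys. Note that the language codes in `translation` (`bg`, `khm`) differ from the codes in the `language` metadata (`bn`, `km`).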

Download and Dataset Sizes

  • alt-en
    • download_size: 3781814
    • dataset_size: 11187560
  • alt-jp
    • download_size: 10355366
    • dataset_size: 24245424
  • alt-km
    • download_size: 4344096
    • dataset_size: 13344316
  • alt-my
    • download_size: 6569025
    • dataset_size: 22679830
  • alt-my-transliteration
    • download_size: 2163951
    • dataset_size: 4249316
  • alt-my-west-transliteration
    • download_size: 2857511
    • dataset_size: 7411911
  • alt-parallel
    • download_size: 34707907
    • dataset_size: 75971326
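
The sizes above are byte counts. In the card's metadata, each `dataset_size` equals the sum of that configuration's per-split `num_bytes` (for `alt-en`: 10075569 + 544719 + 567272 = 11187560). The short sketch below converts the listed values to MiB and totals them; the dictionary simply restates the figures from this page, and the name `SIZES` is illustrative.

```python
# (download_size, dataset_size) in bytes, copied from the list above.
SIZES = {
    "alt-en": (3781814, 11187560),
    "alt-jp": (10355366, 24245424),
    "alt-km": (4344096, 13344316),
    "alt-my": (6569025, 22679830),
    "alt-my-transliteration": (2163951, 4249316),
    "alt-my-west-transliteration": (2857511, 7411911),
    "alt-parallel": (34707907, 75971326),
}

MIB = 2**20

for name, (download, dataset) in SIZES.items():
    print(f"{name:28s} download {download / MIB:6.2f} MiB, dataset {dataset / MIB:6.2f} MiB")

total_download = sum(download for download, _ in SIZES.values())
total_dataset = sum(dataset for _, dataset in SIZES.values())
print(f"all configurations: download {total_download / MIB:.2f} MiB, dataset {total_dataset / MIB:.2f} MiB")
```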

Citation Information

@inproceedings{riza2016introduction,
  title={Introduction of the asian language treebank},
  author={Riza, Hammam and Purwoadi, Michael and Uliniansyah, Teduh and Ti, Aw Ai and Aljunied, Sharifah Mahani and Mai, Luong Chi and Thang, Vu Tat and Thai, Nguyen Phuong and Chea, Vichet and Sam, Sethserey and others},
  booktitle={2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA)},
  pages={1--6},
  year={2016},
  organization={IEEE}
}

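Background Overview
The Asian Language Treebank (ALT) is a multilingual dataset of about 20,000 sentences sampled from English Wikinews and translated into the other supported languages, for 13 languages in total. It supports machine translation and dependency parsing tasks. The dataset was created jointly by institutions in several Asian countries and is released under the CC BY 4.0 license.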