ai4bharat/Aksharantar

Name: ai4bharat/Aksharantar
Creator: ai4bharat
Published: 2023-08-31 07:05:34
License: No description available

Source: Hugging Face. Updated 2023-08-31; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ai4bharat/Aksharantar


Resource description:

---
annotations_creators: []
language_creators:
- crowdsourced
- expert-generated
- machine-generated
- found
- other
language:
- asm
- ben
- brx
- doi
- guj
- hin
- kan
- kas
- kok
- mai
- mal
- mar
- mni
- nep
- ori
- pan
- san
- sid
- tam
- tel
- urd
license: cc
multilinguality:
- multilingual
pretty_name: Aksharantar
source_datasets:
- original
task_categories:
- text-generation
task_ids: []
---

# Dataset Card for Aksharantar

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://indicnlp.ai4bharat.org/indic-xlit/
- **Repository:** https://github.com/AI4Bharat/IndicXlit/
- **Paper:** [Aksharantar: Towards building open transliteration tools for the next billion users](https://arxiv.org/abs/2205.03018)
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Aksharantar is the largest publicly available transliteration dataset for 20 Indic languages. The corpus has 26M Indic language-English transliteration pairs.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

|  |  |  |  |  |  |
| -------------- | -------------- | --------------- | ------------- | -------------- | ------------ |
| Assamese (asm) | Hindi (hin) | Maithili (mai) | Marathi (mar) | Punjabi (pan) | Tamil (tam) |
| Bengali (ben) | Kannada (kan) | Malayalam (mal) | Nepali (nep) | Sanskrit (san) | Telugu (tel) |
| Bodo (brx) | Kashmiri (kas) | Manipuri (mni) | Oriya (ori) | Sindhi (snd) | Urdu (urd) |
| Gujarati (guj) | Konkani (kok) | Dogri (doi) |  |  |  |

## Dataset Structure

### Data Instances

```
A random sample from Hindi (hin) Train dataset.

{
  'unique_identifier': 'hin1241393',
  'native word': 'स्वाभिमानिक',
  'english word': 'swabhimanik',
  'source': 'IndicCorp',
  'score': -0.1028788579
}
```

### Data Fields

- `unique_identifier` (string): 3-letter language code followed by a number that is unique within each set (Train, Test, Val).
- `native word` (string): A word in an Indic language.
- `english word` (string): Transliteration of the native word in English (Romanised word).
- `source` (string): Source of the data.
- `score` (num): Character-level log probability of the Indic word given the Roman word, as computed by the IndicXlit model. Pairs with an average threshold of 0.35 are considered.

For created data sources, depending on the destination/sampling method of a pair in a language, `source` will be one of:

- Dakshina Dataset
- IndicCorp
- Samanantar
- Wikidata
- Existing sources
- Named Entities Indian (AK-NEI)
- Named Entities Foreign (AK-NEF)
- Data from Uniform Sampling method. (Ak-Uni)
- Data from Most Frequent words sampling method. (Ak-Freq)

### Data Splits

| Subset | asm-en | ben-en | brx-en | guj-en | hin-en | kan-en | kas-en | kok-en | mai-en | mal-en | mni-en | mar-en | nep-en | ori-en | pan-en | san-en | sid-en | tam-en | tel-en | urd-en |
|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| Training | 179K | 1231K | 36K | 1143K | 1299K | 2907K | 47K | 613K | 283K | 4101K | 10K | 1453K | 2397K | 346K | 515K | 1813K | 60K | 3231K | 2430K | 699K |
| Validation | 4K | 11K | 3K | 12K | 6K | 7K | 4K | 4K | 4K | 8K | 3K | 8K | 3K | 3K | 9K | 3K | 8K | 9K | 8K | 12K |
| Test | 5531 | 5009 | 4136 | 7768 | 5693 | 6396 | 7707 | 5093 | 5512 | 6911 | 4925 | 6573 | 4133 | 4256 | 4316 | 5334 | - | 4682 | 4567 | 4463 |

## Dataset Creation

Information in the paper. [Aksharantar: Towards building open transliteration tools for the next billion users](https://arxiv.org/abs/2205.03018)

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Information in the paper. [Aksharantar: Towards building open transliteration tools for the next billion users](https://arxiv.org/abs/2205.03018)

#### Who are the source language producers?

[More Information Needed]

### Annotations

Information in the paper. [Aksharantar: Towards building open transliteration tools for the next billion users](https://arxiv.org/abs/2205.03018)

#### Annotation process

Information in the paper. [Aksharantar: Towards building open transliteration tools for the next billion users](https://arxiv.org/abs/2205.03018)

#### Who are the annotators?

Information in the paper. [Aksharantar: Towards building open transliteration tools for the next billion users](https://arxiv.org/abs/2205.03018)

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

This data is released under the following licensing scheme:

- Manually collected data: Released under CC-BY license.
- Mined dataset (from Samanantar and IndicCorp): Released under CC0 license.
- Existing sources: Released under CC0 license.

**CC-BY License**

<a rel="license" float="left" href="https://creativecommons.org/about/cclicenses/">
  <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png" style="border-style: none;" alt="CC-BY" width="100"/>
</a>

**CC0 License Statement**

<a rel="license" float="left" href="https://creativecommons.org/about/cclicenses/">
  <img src="https://licensebuttons.net/p/zero/1.0/88x31.png" style="border-style: none;" alt="CC0" width="100"/>
</a>

- We do not own any of the text from which this data has been extracted.
- We license the actual packaging of the mined data under the [Creative Commons CC0 license (“no rights reserved”)](http://creativecommons.org/publicdomain/zero/1.0).
- To the extent possible under law, <a rel="dct:publisher" href="https://indicnlp.ai4bharat.org/aksharantar/">AI4Bharat</a> has waived all copyright and related or neighboring rights to Aksharantar manually collected data and existing sources.
- This work is published from: India.

### Citation Information

```
@misc{madhani2022aksharantar,
  title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users},
  author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
  year={2022},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Contributions

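Provider:

ai4bharat

Summary of original information

Dataset overview

Dataset name

Name: Aksharantar
Description: The largest publicly available transliteration dataset, containing 26 million transliteration pairs between 20 Indic languages and English.

Languages

Supported languages: 20 Indic languages, including Assamese (asm), Bengali (ben), Bodo (brx), Gujarati (guj), Hindi (hin), Kannada (kan), Kashmiri (kas), Konkani (kok), Maithili (mai), Malayalam (mal), Manipuri (mni), Marathi (mar), Nepali (nep), Oriya (ori), Punjabi (pan), Sanskrit (san), Sindhi (snd), Tamil (tam), Telugu (tel), Urdu (urd).

Dataset structure

Data instances: Each instance contains a unique identifier, a native-language word, an English word, a data source, and a score.
Data fields:
- unique_identifier: language code followed by a unique number.
- native word: a word in an Indic language.
- english word: its English (Romanised) transliteration.
- source: source of the data.
- score: character-level log probability.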
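
As a rough illustration of the fields listed above, the following Python sketch parses one record shaped like the Hindi sample in the dataset card, splits `unique_identifier` into its language code and number, and keeps pairs whose `score` is at or above a caller-chosen cutoff. The names `XlitPair`, `parse_pair`, and `filter_by_score` are hypothetical, and the one-JSON-object-per-line layout and the cutoff value are assumptions for illustration; none of them come from the dataset or the IndicXlit tooling.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class XlitPair:
    """One transliteration pair, mirroring the fields in the dataset card."""
    lang: str                # 3-letter code taken from unique_identifier
    number: int              # numeric part of unique_identifier
    native_word: str
    english_word: str
    source: str
    score: Optional[float]   # log probability; treated as optional (assumption)


def parse_pair(line: str) -> XlitPair:
    """Parse one JSON object (assumed: one record per line) into an XlitPair."""
    rec = json.loads(line)
    uid = rec["unique_identifier"]
    return XlitPair(
        lang=uid[:3],
        number=int(uid[3:]),
        native_word=rec["native word"],
        english_word=rec["english word"],
        source=rec["source"],
        score=rec.get("score"),
    )


def filter_by_score(pairs: Iterable[XlitPair], min_score: float) -> List[XlitPair]:
    """Keep pairs whose score is present and >= min_score (a log-probability cutoff)."""
    return [p for p in pairs if p.score is not None and p.score >= min_score]


# The record below reproduces the Hindi sample shown in the dataset card.
sample = json.dumps(
    {
        "unique_identifier": "hin1241393",
        "native word": "स्वाभिमानिक",
        "english word": "swabhimanik",
        "source": "IndicCorp",
        "score": -0.1028788579,
    },
    ensure_ascii=False,
)
pair = parse_pair(sample)
print(pair.lang, pair.number, pair.english_word)  # prints: hin 1241393 swabhimanik
```

Because `score` is a log probability, values closer to zero indicate pairs the model considers more likely; the cutoff passed to `filter_by_score` is a choice left to the caller.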

Data splits

Training/validation/test sets: The dataset card lists the number of training, validation, and test instances for each language.
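
One way to check a downloaded split against the per-language counts in the dataset card is to tally records by the 3-letter prefix of `unique_identifier`. The sketch below assumes the identifiers have already been read into memory; `count_by_language` and the toy identifiers are hypothetical and only stand in for real data.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_language(identifiers: Iterable[str]) -> Dict[str, int]:
    """Count records per 3-letter language prefix of unique_identifier."""
    return dict(Counter(uid[:3] for uid in identifiers))


# Toy identifiers standing in for one split; real splits are far larger.
toy_split = ["hin1", "hin2", "tam1", "ben1", "tam2", "hin3"]
counts = count_by_language(toy_split)
for lang in sorted(counts):
    print(f"{lang}-en: {counts[lang]}")
```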

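License

License: The data is released under either CC-BY or CC0, depending on its type.
- Manually collected data: CC-BY.
- Mined dataset: CC0.
- Existing sources: CC0.

Citation information

Citation format: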

@misc{madhani2022aksharantar, title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users}, author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra}, year={2022}, eprint={}, archivePrefix={arXiv}, primaryClass={cs.CL} }
