ai4bharat/Bhasha-Abhijnaanam

Name: ai4bharat/Bhasha-Abhijnaanam
Creator: ai4bharat
Published: 2023-06-22 08:01:44
License: 暂无描述

Hugging Face2023-06-22 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/ai4bharat/Bhasha-Abhijnaanam

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc0-1.0 annotations_creators: [] language_creators: - crowdsourced - expert-generated - machine-generated - found - other language: - asm - ben - brx - guj - hin - kan - kas - kok - mai - mal - mar - mni - nep - ori - pan - san - sat - sid - snd - tam - tel - urd multilinguality: - multilingual pretty_name: Bhasha-Abhijnaanam size_categories: [] source_datasets: - original task_categories: - text-generation task_ids: [] --- # Dataset Card for Aksharantar ## Table of Contents - [Table of Contents](#table-of-contents) - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** - **Repository:** https://github.com/AI4Bharat/IndicLID - **Paper:** [Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages](https://arxiv.org/abs/2305.15814) - **Leaderboard:** - **Point of Contact:** ### Dataset Summary Bhasha-Abhijnaanam is a language identification test set for native-script as well as Romanized text which spans 22 Indic languages. ### Supported Tasks and Leaderboards [More Information Needed] ### Languages |  |  |  |  |  |  | | -------------- | -------------- | -------------- | --------------- | -------------- | ------------- | | Assamese (asm) | Hindi (hin) | Maithili (mai) | Nepali (nep) | Sanskrit (san) | Tamil (tam) | | Bengali (ben) | Kannada (kan) | Malayalam (mal)| Oriya (ori) | Santali (sat) | Telugu (tel) | | Bodo(brx) | Kashmiri (kas) | Manipuri (mni) | Punjabi (pan) | Sindhi (snd) | Urdu (urd) | | Gujarati (guj) | Konkani (kok) | Marathi (mar) ## Dataset Structure ### Data Instances ``` A random sample from Hindi (hin) Test dataset. { "unique_identifier": "hin1", "native sentence": "", "romanized sentence": "", "language": "Hindi", "script": "Devanagari", "source": "Dakshina", } ``` ### Data Fields - `unique_identifier` (string): 3-letter language code followed by a unique number in Test set. - `native sentence` (string): A sentence in Indic language. - `romanized sentence` (string): Transliteration of native sentence in English (Romanized sentence). - `language` (string): Language of native sentence. - `script` (string): Script in which native sentence is written. - `source` (string): Source of the data. For created data sources, depending on the destination/sampling method of a pair in a language, it will be one of: - Dakshina Dataset - Flores-200 - Manually Romanized - Manually generated ### Data Splits | Subset | asm | ben | brx | guj | hin | kan | kas (Perso-Arabic) | kas (Devanagari) | kok | mai | mal | mni (Bengali) | mni (Meetei Mayek) | mar | nep | ori | pan | san | sid | tam | tel | urd | |:------:|:---:|:---:|:---:|:---:|:---:|:---:|:------------------:|:----------------:|:---:|:---:|:---:|:-------------:|:------------------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| | Native | 1012 | 5606 | 1500 | 5797 | 5617 | 5859 | 2511 | 1012 | 1500 | 2512 | 5628 | 1012 | 1500 | 5611 | 2512 | 1012 | 5776 | 2510 | 2512 | 5893 | 5779 | 5751 | 6883 | | Romanized | 512 | 4595 | 433 | 4785 | 4606 | 4848 | 450 | 0 | 444 | 439 | 4617 | 0 | 442 | 4603 | 423 | 512 | 4765 | 448 | 0 | 4881 | 4767 | 4741 | 4371 | ## Dataset Creation Information in the paper. [Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages](https://arxiv.org/abs/2305.15814) ### Curation Rationale [More Information Needed] ### Source Data #### Initial Data Collection and Normalization Information in the paper. [Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages](https://arxiv.org/abs/2305.15814) #### Who are the source language producers? [More Information Needed] ### Annotations Information in the paper. [Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages](https://arxiv.org/abs/2305.15814) #### Who are the annotators? Information in the paper. [Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages](https://arxiv.org/abs/2305.15814) ## Considerations for Using the Data ### Social Impact of Dataset [More Information Needed] ### Discussion of Biases [More Information Needed] ### Other Known Limitations [More Information Needed] ## Additional Information ### Dataset Curators [More Information Needed] ### Licensing Information  This data is released under the following licensing scheme: - Manually collected data: Released under CC0 license. **CC0 License Statement** <a rel="license" float="left" href="https://creativecommons.org/about/cclicenses/"> <img src="https://licensebuttons.net/p/zero/1.0/88x31.png" style="border-style: none;" alt="CC0" width="100"/> </a> - We do not own any of the text from which this data has been extracted. - We license the actual packaging of manually collected data under the [Creative Commons CC0 license (“no rights reserved”)](http://creativecommons.org/publicdomain/zero/1.0). - To the extent possible under law, <a rel="dct:publisher" href="https://indicnlp.ai4bharat.org/"> AI4Bharat</a> has waived all copyright and related or neighboring rights to Aksharantar manually collected data and existing sources. - This work is published from: India. ### Citation Information ``` @misc{madhani2023bhashaabhijnaanam, title={Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages}, author={Yash Madhani and Mitesh M. Khapra and Anoop Kunchukuttan}, year={2023}, eprint={2305.15814}, archivePrefix={arXiv}, primaryClass={cs.CL} } ``` ### Contributions ---

提供机构：

ai4bharat

原始信息汇总

数据集概述

数据集名称

名称: Bhasha-Abhijnaanam
别名: Aksharantar

数据集描述

概述: Bhasha-Abhijnaanam是一个针对22种印度语言的本地脚本和罗马化文本的语言识别测试集。

支持的任务

任务: 文本生成

语言

支持语言: 包括Assamese (asm), Bengali (ben), Bodo (brx), Gujarati (guj), Hindi (hin), Kannada (kan), Kashmiri (kas), Konkani (kok), Maithili (mai), Malayalam (mal), Marathi (mar), Manipuri (mni), Nepali (nep), Oriya (ori), Punjabi (pan), Sanskrit (san), Sindhi (snd), Tamil (tam), Telugu (tel), Urdu (urd)等22种语言。

数据集结构

数据实例: 每个实例包含唯一标识符、本地语句、罗马化语句、语言、脚本和数据来源。
数据字段: 包括唯一标识符、本地语句、罗马化语句、语言、脚本和数据来源。
数据分割: 数据集分为本地和罗马化两个子集，每个子集包含不同语言的数据量。

数据集创建

数据来源: 原始数据
许可证: CC0-1.0

引用信息

@misc{madhani2023bhashaabhijnaanam, title={Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages}, author={Yash Madhani and Mitesh M. Khapra and Anoop Kunchukuttan}, year={2023}, eprint={2305.15814}, archivePrefix={arXiv}, primaryClass={cs.CL} }

5,000+

优质数据集

54 个

任务类型

进入经典数据集