papluca/language-identification
Source: Hugging Face. Updated: 2022-07-15. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/papluca/language-identification
Resource description:
---
annotations_creators: []
language_creators: []
language:
- ar
- bg
- de
- el
- en
- es
- fr
- hi
- it
- ja
- nl
- pl
- pt
- ru
- sw
- th
- tr
- ur
- vi
- zh
license: []
multilinguality:
- multilingual
pretty_name: Language Identification dataset
size_categories:
- unknown
source_datasets:
- extended|amazon_reviews_multi
- extended|xnli
- extended|stsb_multi_mt
task_categories:
- text-classification
task_ids:
- multi-class-classification
---
# Dataset Card for Language Identification dataset
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
The Language Identification dataset is a collection of 90k samples, each consisting of a text passage and its corresponding language label.
This dataset was created by collecting data from 3 sources: [Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi), [XNLI](https://huggingface.co/datasets/xnli), and [STSb Multi MT](https://huggingface.co/datasets/stsb_multi_mt).
### Supported Tasks and Leaderboards
The dataset can be used to train a model for language identification, which is a **multi-class text classification** task.
The model [papluca/xlm-roberta-base-language-detection](https://huggingface.co/papluca/xlm-roberta-base-language-detection), which is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base), was trained on this dataset and currently achieves 99.6% accuracy on the test set.
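The accuracy figure above could be checked along the following lines. This is only a sketch under stated assumptions: it assumes the `datasets` and `transformers` packages are installed, that the split is exposed as `test`, and that the pipeline's predicted `label` strings are the same language codes stored in the `labels` column. The helper names `accuracy` and `evaluate_on_test_split` are illustrative and do not come from the dataset card.

```python
def accuracy(predictions, references):
    """Fraction of positions where the predicted label equals the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if len(references) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def evaluate_on_test_split(batch_size=32):
    """Sketch: score the fine-tuned model on the test split (downloads data and weights)."""
    # Imported lazily so that `accuracy` stays usable without these packages.
    from datasets import load_dataset
    from transformers import pipeline

    test = load_dataset("papluca/language-identification", split="test")
    clf = pipeline(
        "text-classification",
        model="papluca/xlm-roberta-base-language-detection",
    )
    texts = list(test["text"])
    references = list(test["labels"])
    outputs = clf(texts, batch_size=batch_size, truncation=True)
    # Assumption: each output dict carries the predicted language code under "label".
    return accuracy([o["label"] for o in outputs], references)
```

Calling `evaluate_on_test_split()` requires network access and is slow without a GPU; `accuracy` on its own works on any two equal-length label sequences.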
### Languages
The Language Identification dataset contains text in 20 languages, which are:
`arabic (ar), bulgarian (bg), german (de), modern greek (el), english (en), spanish (es), french (fr), hindi (hi), italian (it), japanese (ja), dutch (nl), polish (pl), portuguese (pt), russian (ru), swahili (sw), thai (th), turkish (tr), urdu (ur), vietnamese (vi), and chinese (zh)`
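For a classifier, the 20 codes above usually need to be mapped to integer class ids. A minimal sketch, assuming ids follow the alphabetical order of the codes; the names `LANGUAGES`, `label2id`, and `id2label` are illustrative, and this ordering is an assumption rather than the mapping used by any particular published model:

```python
# ISO 639-1 codes listed in the dataset card.
LANGUAGES = [
    "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
    "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh",
]

# Assumed convention: class ids follow the alphabetical order of the codes.
label2id = {code: idx for idx, code in enumerate(sorted(LANGUAGES))}
id2label = {idx: code for code, idx in label2id.items()}
```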
## Dataset Structure
### Data Instances
For each instance, there is a string for the text and a string for the label (the language tag). Here is an example:
`{'labels': 'fr', 'text': 'Conforme à la description, produit pratique.'}`
### Data Fields
- **labels:** a string indicating the language label.
- **text:** a string consisting of one or more sentences in one of the 20 languages listed above.
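The two fields described above can be checked with a small validator. This is a sketch only: the function name `is_valid_record` and the rule that `text` must be non-empty are assumptions made for illustration, not constraints stated in the card.

```python
# Language codes listed in the dataset card.
LANGUAGES = {
    "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
    "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh",
}


def is_valid_record(record: dict) -> bool:
    """Return True if `record` has a non-empty string `text` and a known `labels` code."""
    text = record.get("text")
    label = record.get("labels")
    return (
        isinstance(text, str)
        and bool(text.strip())
        and isinstance(label, str)
        and label in LANGUAGES
    )


# The example instance from the card.
example = {"labels": "fr", "text": "Conforme à la description, produit pratique."}
print(is_valid_record(example))
```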
### Data Splits
The Language Identification dataset has 3 splits: *train*, *valid*, and *test*.
The train set contains 70k samples, while the validation and test sets contain 10k each.
All splits are perfectly balanced: the train set contains 3500 samples per language, while the validation and test sets contain 500 per language each.
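The split sizes follow directly from the per-language counts, and the balance claim can be verified on any list of labels. A minimal sketch; the names `PER_LANGUAGE`, `expected_sizes`, and `is_balanced` are illustrative:

```python
from collections import Counter

NUM_LANGUAGES = 20
# Per-language sample counts stated in the card.
PER_LANGUAGE = {"train": 3500, "valid": 500, "test": 500}

# Split sizes implied by the per-language counts:
# 20 * 3500 = 70000 for train, 20 * 500 = 10000 for valid and for test.
expected_sizes = {split: n * NUM_LANGUAGES for split, n in PER_LANGUAGE.items()}
total_samples = sum(expected_sizes.values())  # 90000


def is_balanced(labels, per_language):
    """Check that exactly 20 distinct labels occur, each `per_language` times."""
    counts = Counter(labels)
    return len(counts) == NUM_LANGUAGES and all(
        count == per_language for count in counts.values()
    )
```

For instance, `is_balanced(train_labels, PER_LANGUAGE["train"])` should hold for the label column of the train split, where `train_labels` is a hypothetical list of its `labels` values.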
## Dataset Creation
### Curation Rationale
This dataset was built during *The Hugging Face Course Community Event*, which took place in November 2021, with the goal of collecting a dataset with enough samples for each language to train a robust language detection model.
### Source Data
The Language Identification dataset was created by collecting data from 3 sources: [Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi), [XNLI](https://huggingface.co/datasets/xnli), and [STSb Multi MT](https://huggingface.co/datasets/stsb_multi_mt).
### Personal and Sensitive Information
The dataset does not contain any personal information about the authors or the crowdworkers.
## Considerations for Using the Data
### Social Impact of Dataset
This dataset was developed as a benchmark for evaluating (balanced) multi-class text classification models.
### Discussion of Biases
The possible biases correspond to those of the 3 datasets on which this dataset is based.
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
### Contributions
Thanks to [@LucaPapariello](https://github.com/LucaPapariello) for adding this dataset.
Provider:
papluca
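Summary of Original Information
Dataset Overview
Dataset Name
- Name: Language Identification dataset
- Alias: 语言识别数据集 (Chinese-language name)
Dataset Summary
- Description: The dataset contains 90,000 samples, each consisting of a text passage and its corresponding language label.
- Sources: The dataset is built from three source datasets: Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi MT.
Supported Tasks and Evaluation Metrics
- Task: multi-class text classification
- Evaluation metric: the model papluca/xlm-roberta-base-language-detection, trained on this dataset, reaches 99.6% accuracy on the test set.
Languages
- Languages included: Arabic (ar), Bulgarian (bg), German (de), Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi), and Chinese (zh), 20 languages in total.
Dataset Structure
- Data instances: each instance contains a text field and a label field.
- Data fields:
  - labels: a string indicating the language label.
  - text: a string consisting of one or more sentences in one of the 20 languages listed above.
- Data splits: the dataset is divided into train, validation, and test sets. The train set contains 70,000 samples, and the validation and test sets contain 10,000 samples each. Each language has 3,500 samples in the train set and 500 samples in each of the validation and test sets.
Dataset Creation
- Curation rationale: the dataset was created during the Hugging Face Course Community Event in November 2021, with the goal of collecting enough samples per language to train a robust language detection model.
- Source data: the dataset is composed of data from three datasets: Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi MT.
Considerations for Using the Data
- Social impact: the dataset serves as a benchmark for evaluating balanced multi-class text classification models.
- Discussion of biases: the possible biases correspond to those of the three source datasets.
Contributors
- Contributor: Luca Papariello (@LucaPapariello)
Additional Information
- Licensing information: not provided
- Citation information: not provided
- Dataset curators: not provided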

The content above was collected and summarized by 遇见数据集.



