strombergnlp/nordic_langid

Name: strombergnlp/nordic_langid
Creator: strombergnlp
Published: 2022-10-25 21:42:02
License: CC-BY-SA 3.0

Source: Hugging Face
Updated: 2022-10-25
Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/strombergnlp/nordic_langid

Resource description:

---
annotations_creators:
- found
language_creators:
- found
language:
- da
- nn
- nb
- fo
- is
- sv
license:
- cc-by-sa-3.0
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-classification
task_ids: []
paperswithcode_id: nordic-langid
pretty_name: Nordic Language ID for Distinguishing between Similar Languages
tags:
- language-identification
---

# Dataset Card for nordic_langid

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** [https://github.com/StrombergNLP/NordicDSL](https://github.com/StrombergNLP/NordicDSL)
- **Repository:** [https://github.com/StrombergNLP/NordicDSL](https://github.com/StrombergNLP/NordicDSL)
- **Paper:** [https://aclanthology.org/2021.vardial-1.8/](https://aclanthology.org/2021.vardial-1.8/)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [René Haas](mailto:renha@itu.dk)

### Dataset Summary

Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach for automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools. Concretely, we focus on discrimination between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese and Icelandic.

This is the data for the tasks. Two variants are provided: 10K and 50K, holding 10,000 and 50,000 examples for each language respectively.

For more info, see the paper: [Discriminating Between Similar Nordic Languages](https://aclanthology.org/2021.vardial-1.8/).

### Supported Tasks and Leaderboards

* Language identification (text classification)

### Languages

This dataset is in six similar Nordic languages:

- Danish, `da`
- Faroese, `fo`
- Icelandic, `is`
- Norwegian Bokmål, `nb`
- Norwegian Nynorsk, `nn`
- Swedish, `sv`

## Dataset Structure

The dataset has two parts, one with 10K samples per language and another with 50K per language. The original splits and data allocation used in the paper are presented here.

### Data Instances

[Needs More Information]

### Data Fields

- `id`: the sentence's unique identifier, a `string`
- `sentence`: the text to be classified, a `string`
- `language`: the class, one of `da`, `fo`, `is`, `nb`, `nn`, `sv`

### Data Splits

Train and Test splits are provided, divided using the code provided with the paper.

## Dataset Creation

### Curation Rationale

Data is taken from Wikipedia and Tatoeba for each of these six languages.

### Source Data

#### Initial Data Collection and Normalization

**Data collection** Data was scraped from Wikipedia. We downloaded summaries for randomly chosen Wikipedia articles in each of the languages, saved as raw text to six .txt files of about 10MB each. The 50K section is extended with Tatoeba data, which provides a different register to Wikipedia text, and then topped up with more Wikipedia data.

**Extracting Sentences** The first pass in sentence tokenisation is splitting by line breaks. We then extract shorter sentences with the sentence tokenizer function (`sent_tokenize`) from NLTK (Loper and Bird, 2002). This does a better job than just splitting by '.' because abbreviations, which can appear in a legitimate sentence, typically include a period symbol.

**Cleaning characters** The initial data set has many characters that do not belong to the alphabets of the languages we work with. Often the Wikipedia pages for people or places contain names in foreign languages. For example, a summary might contain Chinese or Russian characters, which are not strong signals for the purpose of discriminating between the target languages. Further, some characters in the target languages may be mis-encoded. These misencodings are also not likely to be intrinsically strong or stable signals. To simplify feature extraction, and to reduce the size of the vocabulary, the raw data is converted to lowercase and stripped of all characters which are not part of the standard alphabet of the six languages using a character whitelist.

#### Who are the source language producers?

The source language is from Wikipedia contributors and Tatoeba contributors.

### Annotations

#### Annotation process

The annotations were found.

#### Who are the annotators?

The annotations were found. They are determined by which language section a contributor posts their content to.

### Personal and Sensitive Information

The data hasn't been checked for PII, and is already all public. Tatoeba is based on translations of synthetic conversational turns and is unlikely to bear personal or sensitive information.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset is intended to help correctly identify content in six minority languages. Existing systems often confuse these, especially Bokmål and Danish or Icelandic and Faroese. However, some dialects are missed (for example Bornholmsk) and the closed nature of the classification task thus excludes speakers of these languages without recognising their existence.

### Discussion of Biases

The text comes from only two genres, so might not transfer well to other domains.

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

The data here is licensed CC-BY-SA 3.0. If you use this data, you MUST state its origin.

### Citation Information

```
@inproceedings{haas-derczynski-2021-discriminating,
    title = "Discriminating Between Similar Nordic Languages",
    author = "Haas, Ren{\'e} and Derczynski, Leon",
    booktitle = "Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects",
    month = apr,
    year = "2021",
    address = "Kiyv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.vardial-1.8",
    pages = "67--75",
}
```

Provider:

strombergnlp

Summary of Original Information

Dataset Overview

Dataset Name

Name: Nordic Language ID for Distinguishing between Similar Languages
Alias: nordic_langid

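Basic Dataset Information

Languages: Danish (da), Faroese (fo), Icelandic (is), Norwegian Bokmål (nb), Norwegian Nynorsk (nn), Swedish (sv)
License: CC-BY-SA-3.0
Multilinguality: multilingual
Size: 100K<n<1M
Source datasets: original
Task categories: text classification
Task IDs: []
Papers with Code ID: nordic-langid
Tags: language identification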

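Dataset Description

Summary: The dataset is for distinguishing between six similar Nordic languages: Danish, Faroese, Icelandic, Norwegian Bokmål, Norwegian Nynorsk and Swedish. Two variants are provided, 10K and 50K, holding 10,000 and 50,000 examples per language respectively.
Supported tasks: language identification
Languages: The dataset covers six Nordic languages.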

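Dataset Structure

Data instances: Each instance has three fields: id (string, the sentence's unique identifier), sentence (string, the text to be classified) and language (string, the class, one of da, fo, is, nb, nn, sv).
Data splits: Train and test splits are provided.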
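
The record layout above can be sketched in Python as follows. This is a minimal illustration, assuming the label is stored as its string code; `LangIdRecord` and `parse_record` are hypothetical names introduced here and are not part of the dataset's own tooling.

```python
from dataclasses import dataclass

# The six language codes listed for the dataset.
LANGUAGES = frozenset({"da", "fo", "is", "nb", "nn", "sv"})


@dataclass(frozen=True)
class LangIdRecord:
    """One example: identifier, sentence text and language label (hypothetical container)."""
    id: str
    sentence: str
    language: str


def parse_record(raw: dict) -> LangIdRecord:
    """Check that a raw mapping has the three fields and a known label, then wrap it."""
    missing = {"id", "sentence", "language"} - raw.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if raw["language"] not in LANGUAGES:
        raise ValueError(f"unknown language code: {raw['language']!r}")
    return LangIdRecord(str(raw["id"]), str(raw["sentence"]), raw["language"])
```

A check like this can catch label drift early, for instance a record labelled with a code outside the six listed ones.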

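Dataset Creation

Data sources: Data was collected from Wikipedia and Tatoeba. The raw Wikipedia text was saved to six .txt files of about 10 MB each, one per language.
Preprocessing: The data was cleaned, converted to lowercase, and stripped of all characters outside the standard alphabets of the six languages.
Annotations: The labels were found rather than assigned by annotators; each label is determined by the language section to which a contributor posted the content.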
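
The preprocessing step (lowercasing plus a character whitelist) might look roughly like the sketch below. The exact whitelist used by the dataset authors is not given here, so the character set is an assumption: a-z, the additional letters of the Danish, Norwegian, Swedish, Faroese and Icelandic alphabets, and the space. The name `clean` is likewise hypothetical.

```python
import string
import unicodedata

# Assumed whitelist: a-z, the extra letters of the six alphabets, and the space.
# The authors' actual character set is not stated in the dataset description.
EXTRA_LETTERS = "æøåäöáéíóúýðþ"
WHITELIST = frozenset(string.ascii_lowercase + EXTRA_LETTERS + " ")


def clean(text: str) -> str:
    """Lowercase the text and drop every character outside the whitelist."""
    # Compose accented letters first so that 'o' + combining acute becomes 'ó'.
    lowered = unicodedata.normalize("NFC", text).lower()
    kept = "".join(ch for ch in lowered if ch in WHITELIST)
    # Collapse the runs of spaces left behind by removed characters.
    return " ".join(kept.split())
```

Under these assumptions, `clean("Þórshavn (Москва) 北京!")` keeps only `þórshavn`: the Cyrillic and Chinese characters and the punctuation fall outside the whitelist.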

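Considerations for Using the Data

Social impact: The dataset is intended to help correctly identify content in six minority languages, but it may not cover all dialects.
Discussion of biases: The text comes from only two genres, so it may not transfer well to other domains.
Other known limitations: Needs more information.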

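Additional Information

Licensing information: The dataset is released under the CC-BY-SA 3.0 license; its origin must be stated when it is used.
Citation information: When citing, refer to the paper "Discriminating Between Similar Nordic Languages" by René Haas and Leon Derczynski.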

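Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

The dataset targets automatic identification of six similar Nordic languages: Danish, Swedish, Norwegian Nynorsk, Norwegian Bokmål, Faroese and Icelandic. It aims to address the misclassification that existing tools often produce for these languages. The data comes from Wikipedia and Tatoeba and is offered in two versions (10K and 50K samples per language), about 359,176 rows in total, with train and test splits. It is suited to text classification and language identification research.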
