
laurievb/open-lid-dataset

Hugging Face · Updated 2023-11-10 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/laurievb/open-lid-dataset
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: text
    dtype: string
  - name: language
    dtype:
      class_label:
        names:
          '0': plt_Latn
          '1': sun_Latn
          '2': ukr_Cyrl
          '3': spa_Latn
          '4': por_Latn
          '5': mya_Mymr
          '6': mkd_Cyrl
          '7': war_Latn
          '8': nso_Latn
          '9': wol_Latn
          '10': kam_Latn
          '11': mal_Mlym
          '12': gle_Latn
          '13': ayr_Latn
          '14': rus_Cyrl
          '15': pbt_Arab
          '16': pag_Latn
          '17': twi_Latn
          '18': als_Latn
          '19': lit_Latn
          '20': amh_Ethi
          '21': tur_Latn
          '22': tel_Telu
          '23': vec_Latn
          '24': zsm_Latn
          '25': ckb_Arab
          '26': tgk_Cyrl
          '27': tha_Thai
          '28': hye_Armn
          '29': deu_Latn
          '30': tat_Cyrl
          '31': swh_Latn
          '32': kac_Latn
          '33': tuk_Latn
          '34': lvs_Latn
          '35': tso_Latn
          '36': fao_Latn
          '37': tpi_Latn
          '38': umb_Latn
          '39': mlt_Latn
          '40': cym_Latn
          '41': ben_Beng
          '42': hat_Latn
          '43': ron_Latn
          '44': tir_Ethi
          '45': ewe_Latn
          '46': ind_Latn
          '47': snd_Arab
          '48': nld_Latn
          '49': urd_Arab
          '50': vie_Latn
          '51': mar_Deva
          '52': fra_Latn
          '53': lug_Latn
          '54': pol_Latn
          '55': ban_Latn
          '56': est_Latn
          '57': srp_Cyrl
          '58': kin_Latn
          '59': nno_Latn
          '60': fur_Latn
          '61': kmr_Latn
          '62': bho_Deva
          '63': fin_Latn
          '64': mri_Latn
          '65': ilo_Latn
          '66': fij_Latn
          '67': slk_Latn
          '68': knc_Arab
          '69': guj_Gujr
          '70': kor_Hang
          '71': tum_Latn
          '72': kab_Latn
          '73': afr_Latn
          '74': eng_Latn
          '75': acq_Arab
          '76': som_Latn
          '77': tgl_Latn
          '78': epo_Latn
          '79': bjn_Arab
          '80': mni_Beng
          '81': sot_Latn
          '82': nob_Latn
          '83': kat_Geor
          '84': ory_Orya
          '85': arb_Arab
          '86': heb_Hebr
          '87': ibo_Latn
          '88': asm_Beng
          '89': uzn_Latn
          '90': sna_Latn
          '91': mos_Latn
          '92': fuv_Latn
          '93': hne_Deva
          '94': apc_Arab
          '95': hun_Latn
          '96': ita_Latn
          '97': bem_Latn
          '98': slv_Latn
          '99': ssw_Latn
          '100': szl_Latn
          '101': nya_Latn
          '102': kir_Cyrl
          '103': hrv_Latn
          '104': pap_Latn
          '105': kik_Latn
          '106': knc_Latn
          '107': lmo_Latn
          '108': hau_Latn
          '109': eus_Latn
          '110': ltz_Latn
          '111': grn_Latn
          '112': lus_Latn
          '113': taq_Latn
          '114': scn_Latn
          '115': kmb_Latn
          '116': azj_Latn
          '117': isl_Latn
          '118': swe_Latn
          '119': uig_Arab
          '120': jpn_Jpan
          '121': sag_Latn
          '122': xho_Latn
          '123': ast_Latn
          '124': kan_Knda
          '125': sin_Sinh
          '126': acm_Arab
          '127': tzm_Tfng
          '128': dan_Latn
          '129': zho_Hant
          '130': zho_Hans
          '131': pes_Arab
          '132': fon_Latn
          '133': tam_Taml
          '134': yor_Latn
          '135': run_Latn
          '136': arz_Arab
          '137': awa_Deva
          '138': pan_Guru
          '139': gaz_Latn
          '140': lao_Laoo
          '141': bos_Latn
          '142': ces_Latn
          '143': bam_Latn
          '144': crh_Latn
          '145': ltg_Latn
          '146': bul_Cyrl
          '147': gla_Latn
          '148': ell_Grek
          '149': prs_Arab
          '150': smo_Latn
          '151': ajp_Arab
          '152': tsn_Latn
          '153': bak_Cyrl
          '154': srd_Latn
          '155': ace_Arab
          '156': kas_Arab
          '157': lua_Latn
          '158': taq_Tfng
          '159': jav_Latn
          '160': cat_Latn
          '161': kon_Latn
          '162': hin_Deva
          '163': lin_Latn
          '164': khk_Cyrl
          '165': cjk_Latn
          '166': mag_Deva
          '167': dik_Latn
          '168': bug_Latn
          '169': bjn_Latn
          '170': yue_Hant
          '171': zul_Latn
          '172': npi_Deva
          '173': kas_Deva
          '174': dzo_Tibt
          '175': ary_Arab
          '176': bel_Cyrl
          '177': kbp_Latn
          '178': khm_Khmr
          '179': ace_Latn
          '180': nus_Latn
          '181': ceb_Latn
          '182': mai_Deva
          '183': san_Deva
          '184': dyu_Latn
          '185': quy_Latn
          '186': lim_Latn
          '187': min_Latn
          '188': oci_Latn
          '189': kaz_Cyrl
          '190': luo_Latn
          '191': sat_Olck
          '192': ydd_Hebr
          '193': shn_Mymr
          '194': ars_Arab
          '195': lij_Latn
          '196': aeb_Arab
          '197': bod_Tibt
          '198': glg_Latn
          '199': kea_Latn
          '200': azb_Arab
  - name: dataset_source
    dtype: string
  splits:
  - name: train
    num_bytes: 21749592609
    num_examples: 118296182
  download_size: 16568412828
  dataset_size: 21749592609
license: other
task_categories:
- text-classification
size_categories:
- 100M<n<1B
---

# Dataset Card for "open-lid-dataset"

## Dataset Description

- **Repository:** [https://github.com/laurieburchell/open-lid-dataset](https://github.com/laurieburchell/open-lid-dataset)
- **Paper:** [An Open Dataset and Model for Language Identification](https://aclanthology.org/2023.acl-short.75/)
- **Point of Contact:** laurie.burchell AT ed.ac.uk

### Dataset Summary

The OpenLID dataset covers 201 languages and is designed for training language identification models.
The majority of the source datasets were derived from news sites, Wikipedia, or religious text, though some come from other domains (e.g. transcribed conversations, literature, or social media). A sample of each language in each source was manually audited to check it was in the attested language (see [the paper](https://aclanthology.org/2023.acl-short.75/) for full details).

### Supported tasks

This dataset is intended for training high-coverage language identification models (e.g. [OpenLID](https://huggingface.co/laurievb/OpenLID)). It is compatible with the [FLORES-200](https://github.com/facebookresearch/flores/tree/main/flores200) evaluation benchmark.

### Languages

There are 201 languages included in the dataset with varying amounts of data: the largest class (English) contains 7.5 million lines of data, and the smallest (South Azerbaijani) contains 532 lines of data. The mean number of lines per language is 602,812. A full breakdown of lines of data per language is available [on the repo](https://github.com/laurieburchell/open-lid-dataset/blob/main/languages.md).

## Dataset Structure

### Data Instances

Each entry in the dataset consists of a line of data, a language label including script information, and a tag indicating the source.

```json
{
  "text": "¿Serás exaltada hasta el cielo?",
  "language": "spa_Latn",
  "dataset_source": "lti"
}
```

### Data Splits

Only a train split is provided. The dataset is designed to be compatible with the [FLORES-200](https://github.com/facebookresearch/flores/tree/main/flores200) evaluation benchmark.

## Dataset Creation

### Curation Rationale

Recent work has found that existing language identification algorithms perform poorly in practice compared to test performance. The problem is particularly acute for low-resource languages: [Kreutzer et al. (2022)](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00447/109285/Quality-at-a-Glance-An-Audit-of-Web-Crawled) found a positive Spearman rank correlation between quality of data and size of language for all of the LID-filtered multilingual datasets they studied. In addition, for a significant fraction of the language corpora they studied, less than half of the sentences were in the correct language. They point out that such low-quality data not only leads to poor performance in downstream tasks, but that it also contributes to 'representation washing', where the community is given a false view of the actual progress of low-resource natural language processing.

There are several open language identification models offering quick classification and high language coverage (e.g. CLD3, No Language Left Behind). However, to the best of our knowledge, none of the commonly used scalable language identification systems make their training data public. This dataset aims to address that gap by curating and combining sources of open training data for language identification and by auditing a sample of all languages in each source to check reliability.

### Source Data

The majority of the source datasets were derived from news sites, Wikipedia, or religious text, though some come from other domains (e.g. transcribed conversations, literature, or social media). We provide a full list at the end of this dataset card along with the licensing information for each source.

#### Initial Data Collection and Normalisation

Our initial aim was to cover the same languages present in the FLORES-200 Evaluation Benchmark so that we could use this dataset for evaluation. However, during the curation process, we decided to exclude three languages. Firstly, though Akan and Twi are both included as separate languages in FLORES-200, Akan is actually a macrolanguage covering a language continuum which includes Twi.
Given the other languages in FLORES-200 are individual languages, we decided to exclude Akan. Secondly, FLORES-200 includes Modern Standard Arabic (MSA) written in Latin script. It is true that Arabic dialects are often written in Latin characters in informal situations (e.g. social media). However, MSA is a form of standardised Arabic which is not usually used in informal situations. Since we could not find any naturally occurring training data, we excluded MSA in Latin script from the dataset. Finally, we excluded Minangkabau in Arabic script because it is now rarely written this way, making it difficult to find useful training data.

The first step in our manual audit was to check and standardise language labels, as these are often inconsistent or idiosyncratic. We chose to copy the language codes in FLORES-200 and reassign macrolanguage or ambiguous language codes in the data sources we found to the dominant individual language. Whilst this resulted in more useful data for some languages, for other languages we had to be more conservative. For example, we originally reassigned text labelled as the macrolanguage Malay (msa_Latn) to Standard Malay, but this led to a large drop in performance as the former covers a very diverse set of languages.

Two of the authors then carried out a manual audit of a random sample of all data sources and languages: one a native Bulgarian speaker (able to read Cyrillic and Latin scripts and Chinese characters), and the other a native English speaker (able to read Latin, Arabic and Hebrew scripts). For languages we knew, we checked the language was what we expected. For unfamiliar languages in a script we could read, we compared the sample to the Universal Declaration of Human Rights or, failing that, to a sample of text on Wikipedia. We compared features of the text which are common in previous language identification algorithms and could be identified easily by humans: similar diacritics, word lengths, common words, loan words matching the right cultural background, similar suffixes and prefixes, and vowel/consonant patterns. For scripts we could not read, we checked that all lines of the sample matched the script in the Universal Declaration of Human Rights.

We kept preprocessing minimal so that the process was as language agnostic as possible. We used the scripts provided with Moses to remove non-printing characters and detokenise the data where necessary. We then filtered the data so that each line contained at least one character in the expected script (as defined by Perl) to allow for borrowings. Finally, we sampled proportionally to $p_l^{0.3}$, where $p_l$ is the fraction of lines in the dataset which are in language $l$. This aims to ameliorate class skew issues.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset covers a number of low-resourced languages. This makes it a potentially useful resource, but due to the limited amount of data and domains, care must be taken not to overclaim performance or coverage.

### Discussion of Biases

Our work aims to broaden natural language processing coverage by allowing practitioners to identify relevant data in more languages. However, we note that language identification is inherently a normative activity that risks excluding minority dialects, scripts, or entire microlanguages from a macrolanguage. Choosing which languages to cover may reinforce power imbalances, as only some groups gain access to language processing technologies. In addition, errors in language identification can have a significant impact on downstream performance, particularly (as is often the case) when a system is used as a 'black box'.
The performance of our classifier is not equal across languages, which could lead to worse downstream performance for particular groups. We mitigate this by providing metrics by class.

## Additional information

The dataset was curated from the sources listed below by Laurie Burchell and Nikolay Bogoychev.

### Licensing Information

License considerations for each source are given below. Open use for non-commercial purposes is covered by all licences. If you view any part of this dataset as a violation of intellectual property rights, please let us know and we will remove it.

| Source | Description | License |
|---|---|---|
|[Arabic Dialects Dataset](https://www.lancaster.ac.uk/staff/elhaj/corpora.html)|Dataset of Arabic dialects for Gulf, Egyptian, Levantine, and Tunisian Arabic dialects plus MSA|No explicit license; website describes data as "some free and useful Arabic corpora that I have created for researchers working on Arabic Natural Language Processing, Corpus and Computational Linguistics."|
|[BLTR](https://github.com/shashwatup9k/bho-resources)|Monolingual Bhojpuri corpus|[CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)|
|[Global Voices](https://opus.nlpl.eu/GlobalVoices-v2015.php)|A parallel corpus of news stories from the web site Global Voices|The website for [Global Voices](https://globalvoices.org/) is licensed as [Creative Commons Attribution 3.0](https://creativecommons.org/licenses/by/3.0/). There is no explicit additional license accompanying the dataset.|
|[Guaraní Parallel Set](https://github.com/sgongora27/giossa-gongora-guarani-2021)|Parallel Guaraní-Spanish news corpus sourced from Paraguayan websites|No explicit license|
|[HKCanCor](https://github.com/fcbond/hkcancor)|Transcribed conversations in Hong Kong Cantonese|[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode)|
|[IADD](https://github.com/JihadZa/IADD)|Arabic dialect identification dataset covering 5 regions (Maghrebi, Levantine, Egypt, Iraq, and Gulf) and 9 countries (Algeria, Morocco, Tunisia, Palestine, Jordan, Syria, Lebanon, Egypt and Iraq). It is created from five corpora: [DART](http://qufaculty.qu.edu.qa/telsay), [SHAMI](https://github.com/GU-CLASP/shami-corpus), [TSAC](https://github.com/fbougares/TSAC), [PADIC](https://sourceforge.net/projects/padic/), and [AOC](https://www.cs.jhu.edu/data-archive/AOC-2010/).|Multiple licenses: [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0) (SHAMI); [GNU Lesser General Public License v3.0](https://github.com/fbougares/TSAC/blob/master/LICENSE) (TSAC); [GNU General Public License v3](https://www.gnu.org/licenses/gpl-3.0.en.html) (PADIC). DART and AOC had no explicit license.|
|[Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download)|A collection of corpora in different languages with an identical format.|The [Terms of Usage](https://wortschatz.uni-leipzig.de/en/usage) states "Permission for use is granted free of charge solely for non-commercial personal and scientific purposes licensed under the [Creative Commons License CC BY-NC](https://creativecommons.org/licenses/by-nc/4.0/)."|
|[LTI](https://www.cs.cmu.edu/~ralf/langid.html)|Training data for language identification|From the README: "With the exception of the contents of the Europarl/, ProjectGutenberg/, and PublicDomain/ directories, all code and text in this corpus are copyrighted. However, they may be redistributed under the terms of various Creative Commons licenses and the GNU GPL. Copying the unmodified archive noncommercially is permitted by all of the licenses. For commercial redistribution or redistribution of modified versions, please consult the individual licenses."|
|[MADAR Shared Task 2019, subtask 1](https://camel.abudhabi.nyu.edu/madar-shared-task-2019/)|Dialectal Arabic in the travel domain|The MADAR Corpus has a custom license, the text of which can be found in this repo.|
|[EM corpus](http://lepage-lab.ips.waseda.ac.jp/en/projects/meiteilon-manipuri-language-resources/)|Parallel Manipuri-English sentences crawled from [The Sangai Express](https://www.thesangaiexpress.com/)|[CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)|
|[MIZAN](https://github.com/omidkashefi/Mizan)|Parallel Persian-English corpus from literature domain|[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)|
|[MT560 v1](https://opus.nlpl.eu/MT560.php)|A machine translation dataset for over 500 languages to English. We have filtered out data from OPUS-100, Europarl, Open Subtitles, Paracrawl, Wikimedia, Wikimatrix, Wikititles, and Common Crawl due to issues with the fidelity of the language labels.|[Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)|
|[NLLB Seed](https://github.com/facebookresearch/flores/blob/main/nllb_seed/README.md)|Around 6000 sentences in 39 languages sampled from Wikipedia, intended to cover languages lacking training data.|[CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)|
|[SETIMES](https://opus.nlpl.eu/SETIMES.php)|A parallel corpus of news articles in the Balkan languages|[CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)|
|[Tatoeba](https://opus.nlpl.eu/Tatoeba.php)|Collaborative sentence translations|[CC BY 2.0 FR](https://creativecommons.org/licenses/by/2.0/fr/)|
|[Tehran English-Persian parallel corpus (TEP)](https://opus.nlpl.eu/TEP.php)|Parallel Persian-English sentences sourced from subtitles|[GNU General Public License](https://www.gnu.org/licenses/gpl-3.0.html)|
|[Turkic Interlingua (TIL) Corpus](https://github.com/turkic-interlingua/til-mt)|A large-scale parallel corpus combining most of the public datasets for 22 Turkic languages|[CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)|
|[WiLI-2018](https://zenodo.org/record/841984)|Wikipedia language identification benchmark containing 235K paragraphs of 235 languages|[Open Data Commons Open Database License (ODbL) v1.0](https://opendatacommons.org/licenses/odbl/1-0/)|
|[XL-Sum](https://github.com/csebuetnlp/xl-sum)|Summarisation dataset covering 44 languages, sourced from BBC News|[CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)|

### Citation Information

If you use this dataset, please cite all the authors [in the citation file](https://github.com/laurieburchell/open-lid-dataset/blob/main/citations.bib) who compiled the source datasets, plus the OpenLID paper:

```bibtex
@inproceedings{burchell-etal-2023-open,
    title = "An Open Dataset and Model for Language Identification",
    author = "Burchell, Laurie and Birch, Alexandra and Bogoychev, Nikolay and Heafield, Kenneth",
    editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-short.75",
    doi = "10.18653/v1/2023.acl-short.75",
    pages = "865--879",
    abstract = "Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033{\%} across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, which we audit manually to ensure reliability. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model{'}s performance, both in comparison to existing open models and by language class.",
}
```

### Contributions

Thanks to @hac541309 and @davanstrien for adding this dataset.
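Provider:
laurievb
Summary of original information

Dataset Description

Dataset Overview

The OpenLID dataset covers 201 languages and is intended for training language identification models. Most of the source datasets come from news sites, Wikipedia, or religious text, though some come from other domains (e.g. transcribed conversations, literature, or social media). A sample of each language in each source was manually audited to confirm it was in the attested language.

Supported Tasks

This dataset is intended for training high-coverage language identification models such as OpenLID, and it is compatible with the FLORES-200 evaluation benchmark.

Languages

The dataset contains 201 languages with varying amounts of data: the largest class (English) contains 7.5 million lines, and the smallest (South Azerbaijani) contains 532 lines. The mean number of lines per language is 602,812.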
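
The class sizes quoted here are heavily skewed, and the dataset card describes sampling proportionally to p_l^0.3, where p_l is the fraction of lines in language l. The sketch below is a minimal illustration of that reweighting, not the authors' code: the function name `sampling_weights` is hypothetical, and the two-class toy input simply reuses the largest and smallest class sizes quoted above.

```python
from typing import Dict


def sampling_weights(line_counts: Dict[str, int], exponent: float = 0.3) -> Dict[str, float]:
    """Return normalised sampling probabilities proportional to p_l ** exponent."""
    total = sum(line_counts.values())
    # p_l: fraction of all lines that belong to language l
    fractions = {lang: n / total for lang, n in line_counts.items()}
    # An exponent below 1 flattens the distribution towards small classes
    unnormalised = {lang: p ** exponent for lang, p in fractions.items()}
    norm = sum(unnormalised.values())
    return {lang: w / norm for lang, w in unnormalised.items()}


# Toy input: only the largest and smallest classes mentioned in the card
counts = {"eng_Latn": 7_500_000, "azb_Arab": 532}
weights = sampling_weights(counts)
for lang, weight in weights.items():
    raw_share = counts[lang] / sum(counts.values())
    print(f"{lang}: raw share {raw_share:.5f}, sampling weight {weight:.3f}")
```

With an exponent of 1 the weights equal the raw shares; smaller exponents move probability mass towards the small classes, which is the stated aim of reducing class skew.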

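Dataset Structure

Data Instances

Each entry consists of a line of text, a language label that includes script information, and a tag indicating the source.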

```json
{
  "text": "¿Serás exaltada hasta el cielo?",
  "language": "spa_Latn",
  "dataset_source": "lti"
}
```
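
As an illustration of this record layout, the sketch below parses the example entry and splits the language label into its language code and script code. It assumes every label takes the form `<language>_<script>`, as the class names in the metadata do; the `split_label` helper is hypothetical.

```python
import json
from typing import Tuple


def split_label(label: str) -> Tuple[str, str]:
    """Split a label such as 'spa_Latn' into (language code, script code)."""
    language, script = label.split("_", 1)
    return language, script


record = json.loads(
    '{"text": "¿Serás exaltada hasta el cielo?", '
    '"language": "spa_Latn", "dataset_source": "lti"}'
)
language, script = split_label(record["language"])
print(language, script, record["dataset_source"])
```

The metadata declares `language` as a class label, so a row loaded from the hosted dataset may carry an integer index rather than the string shown in this example.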

Data Splits

Only a train split is provided. The dataset is designed to be compatible with the FLORES-200 evaluation benchmark.
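
A minimal sketch of reading the train split with the Hugging Face `datasets` library follows. It assumes the library is installed, the dataset is reachable under the id `laurievb/open-lid-dataset`, and the feature schema is resolved from the metadata. Streaming is used to avoid downloading the roughly 16.6 GB archive up front. The helpers `count_languages` and `stream_sample` and the sample size of 1,000 rows are illustrative choices, not part of the dataset.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def count_languages(rows: Iterable[Dict], to_name: Callable = str) -> Counter:
    """Count rows per language; `to_name` maps a stored label to a display name."""
    return Counter(to_name(row["language"]) for row in rows)


def stream_sample(n_rows: int = 1000) -> Counter:
    """Stream the first `n_rows` rows of the train split and count languages."""
    # Imported here so the counting helper works without the library installed
    from datasets import load_dataset

    ds = load_dataset("laurievb/open-lid-dataset", split="train", streaming=True)
    # `language` is declared as a class label, so map integer indices to names
    to_name = ds.features["language"].int2str
    return count_languages(ds.take(n_rows), to_name)


if __name__ == "__main__":
    # Offline demonstration with hand-written rows in the card's format
    demo_rows = [
        {"text": "¿Serás exaltada hasta el cielo?", "language": "spa_Latn", "dataset_source": "lti"},
        {"text": "Hello there", "language": "eng_Latn", "dataset_source": "demo"},
    ]
    print(count_languages(demo_rows))
    # Call stream_sample() to try this against the hosted dataset (needs network access).
```

Because no validation or test split ships with the dataset, any held-out data has to come from elsewhere, for example the FLORES-200 benchmark the card names as compatible.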

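Dataset Creation

Curation Rationale

Existing language identification algorithms perform poorly in practice, especially for low-resource languages. This dataset aims to address that gap by curating and combining sources of open training data and by auditing a sample of every language in each source.

Source Data

Most of the source datasets come from news sites, Wikipedia, or religious text, though some come from other domains. Licensing information is provided for each source.

Initial Data Collection and Normalisation

The dataset initially aimed to cover the same languages as the FLORES-200 evaluation benchmark, but three languages were excluded during curation. The manual audit included checking and standardising language labels, and preprocessing was kept minimal so that the process stayed language agnostic.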
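
One of the preprocessing steps in the dataset card is a filter that keeps a line only if it contains at least one character in the expected script, as defined by Perl. The sketch below approximates that check in Python and is not the authors' pipeline. It uses Unicode character names from the standard `unicodedata` module as a stand-in for Perl's script properties, and the `SCRIPT_NAME_PREFIX` table is a hypothetical, deliberately partial mapping.

```python
import unicodedata

# Hypothetical, partial mapping from the script codes used in the labels to
# the prefix of the Unicode character name (an approximation of script properties)
SCRIPT_NAME_PREFIX = {
    "Latn": "LATIN",
    "Cyrl": "CYRILLIC",
    "Arab": "ARABIC",
    "Hebr": "HEBREW",
    "Grek": "GREEK",
}


def has_expected_script(line: str, script: str) -> bool:
    """Return True if at least one character of `line` belongs to `script`."""
    prefix = SCRIPT_NAME_PREFIX[script]
    # unicodedata.name falls back to "" for characters that have no name
    return any(unicodedata.name(ch, "").startswith(prefix) for ch in line)


lines = [
    ("¿Serás exaltada hasta el cielo?", "spa_Latn"),
    ("12345 !!!", "spa_Latn"),  # made-up line with no Latin letters
]
kept = [
    (text, label)
    for text, label in lines
    if has_expected_script(text, label.split("_")[1])
]
print(kept)
```

Requiring a single matching character, rather than a majority, keeps lines that mix in borrowings from other scripts, which is the reason the card gives for the filter's design.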

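Considerations for Using the Data

Social Impact

This dataset covers many low-resource languages, which makes it a potentially useful resource. Because the amount of data and the range of domains are limited, care must be taken not to overclaim performance or coverage.

Discussion of Biases

Language identification is inherently a normative activity that risks excluding minority dialects, scripts, or entire microlanguages. Choosing which languages to cover may reinforce power imbalances, as only some groups gain access to language processing technologies.

Additional Information

Licensing Information

Licensing information is given for each source in the dataset. All of the licences permit open use for non-commercial purposes.

Citation Information

If you use this dataset, please cite all the authors who compiled the source datasets, plus the OpenLID paper.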