tartuNLP/smugri4-data
Download link:
https://hf-mirror.com/datasets/tartuNLP/smugri4-data
---
pretty_name: "lingrel2025"
license:
- cc-by-nc-sa-4.0
---
# Dataset Card for language_relatives_2025
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Dataset Creation](#dataset-creation)
- [Source Data](#source-data)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:**
- **Repository:**
- **Paper:**
- **Point of Contact:**
### Dataset Summary
The dataset is a collection of monolingual and multilingual text corpora in languages related to standard Estonian, i.e. Finno-Ugric languages and dialects, excluding Finnish and Hungarian. The multilingual corpora include other languages as translation equivalents, among them Estonian, Finnish and Hungarian.
The aim is to provide data for language technology, first and foremost for machine translation.
### Languages
#### Finno-Ugric language relatives
| ISO 639-3 | Language | Tokens |
| --- | --- | --- |
| fit | Tornedalen Finnish (Meänkieli) | 8750 |
| fkv | Kven (Kven Finnish) | 55506 |
| izh | Ingrian (= Izhorian) | 249093 |
| kca | Khanty | 97611 |
| koi | Komi-Permyak | 382071 |
| kpv | Komi-Zyrian | 18194967 |
| krl | Karelian (Proper Karelian) | 981158 |
| liv | Livonian | 15035 |
| lud | Ludian | 291293 |
| mdf | Moksha | 822508 |
| mhr | Meadow Mari | 6360111 |
| mns | Mansi | 263280 |
| mrj | Hill Mari | 1494257 |
| myv | Erzya | 2526670 |
| olo | Livvi-Karelian (Olonets) | 1245935 |
| sjd | Kildin Sami | 1338 |
| sju | Ume Sami | 619 |
| sma | Southern Sami | 1703932 |
| sme | Northern Sami | 21540241 |
| smj | Lule Sami | 950311 |
| smn | Inari Sami | 1217515 |
| sms | Skolt Sami | 380250 |
| udm | Udmurt | 1294508 |
| vep | Veps | 2383752 |
| vot | Votic | 48107 |
| vro | Võro | 3266531 |
#### Estonian dialects
| Dialect | Tokens |
| --- | --- |
| hiiu | 9095 |
| kihnu | 66001 |
| mulgi | 26895 |
| ranna | 9887 |
| setu | 283598 |
#### Languages of translation equivalents
| ISO 639-3 | Language | Tokens |
| --- | --- | --- |
| deu | German | 5729 |
| eng | English | 26160 |
| est | Estonian | 1937069 |
| fin | Finnish | 1690382 |
| fra | French | 6300 |
| hun | Hungarian | 1573 |
| lav | Latvian | 1529 |
| nno | Norwegian Nynorsk | 1734 |
| nob | Norwegian Bokmål | 5466 |
| nor | Norwegian | 4425857 |
| rus | Russian | 23503423 |
| swe | Swedish | 6002 |
## Dataset Structure
Texts are represented in JSON. The structures, keys and values are defined in [pydantic_for_lingrel2025.py](./pydantic_for_lingrel2025.py).
Textual material may be a collection of linguistic units with different granularity and coherence. This dataset differentiates between:
- unrelated words and phrases (e.g. a dictionary or a phrasebook)
- unrelated sentences (e.g. a dictionary or a phrasebook)
- coherent sequence of paragraphs and sentences (e.g. a novel with structural mark-up)
- coherent text without explicit split into sentences
Any of these might be applicable to a monolingual or multilingual source, i.e. a text with translation(s).
In addition to granularity and mono/multilinguality, the dataset records the dialect and orthography of each text, information about the original (author, title, publication year, etc.) and information about the source (corpus, web page, file name, etc.).
Every file in this dataset contains the text itself plus metadata describing that text.
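The distinctions described above can be pictured as a small data model. The sketch below is only an illustration: every class, field and value name in it is hypothetical, and the authoritative definitions are the ones in `pydantic_for_lingrel2025.py`. It uses standard-library dataclasses rather than Pydantic so that it runs without extra dependencies.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Granularity(Enum):
    """The four kinds of textual material the card distinguishes (hypothetical names)."""
    WORDS_AND_PHRASES = "unrelated words and phrases"
    SENTENCES = "unrelated sentences"
    STRUCTURED_TEXT = "coherent sequence of paragraphs and sentences"
    UNSPLIT_TEXT = "coherent text without explicit split into sentences"


@dataclass
class TextRecord:
    """Hypothetical stand-in for one file: the text plus metadata about it."""
    language: str                       # e.g. an ISO 639-3 code such as "vro"
    granularity: Granularity
    multilingual: bool                  # True if translations accompany the text
    text: str
    dialect: Optional[str] = None
    orthography: Optional[str] = None
    original: dict = field(default_factory=dict)  # author, title, publication year, ...
    source: dict = field(default_factory=dict)    # corpus, web page, file name, ...


# Placeholder values only; this is not a real record from the dataset.
record = TextRecord(
    language="vro",
    granularity=Granularity.SENTENCES,
    multilingual=False,
    text="(placeholder text)",
)
```

The actual JSON keys may differ from these field names, so the real schema file should be consulted before parsing any data.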
The data is organised into directories. Each file path consists of the following parts, in this order:
ISO code for language / "mono" or "multi" / name derived from the source corpus name / name derived from the source file name
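As a minimal sketch of how the path layout described above could be taken apart, the following helper splits a dataset-relative path into its four parts. It assumes that the parts are separated by `/` and that everything after the third component belongs to the file name. The function name, the tuple type and the example path are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class PathParts(NamedTuple):
    language: str  # ISO code for the language
    kind: str      # "mono" or "multi"
    corpus: str    # name derived from the source corpus name
    file: str      # name derived from the source file name


def parse_path(path: str) -> PathParts:
    """Split a dataset-relative path into the four parts named in the card."""
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 path components, got {len(parts)}: {path!r}")
    language, kind, corpus = parts[:3]
    if kind not in ("mono", "multi"):
        raise ValueError(f"second component should be 'mono' or 'multi', got {kind!r}")
    # Assumption: anything after the corpus component names the file.
    return PathParts(language, kind, corpus, "/".join(parts[3:]))


# The path below is made up for illustration; it is not a real file in the dataset.
example = parse_path("vro/mono/some_corpus/some_file.json")
```

Grouping files by `language` and `kind` in this way would make it straightforward to select, for instance, only the multilingual material for one language.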
## Dataset Creation
Original texts have been transformed into JSON. Depending on the nature of the source, there are four classes:
1. Unrelated words and phrases
2. Unrelated sentences
3. Coherent sequence of paragraphs and sentences
4. Coherent text without explicit split into sentences
Source data that was impossible to map into any of these classes was left out.
In some cases the language of the source text was additionally checked with [GlotLID](https://github.com/cisnlp/GlotLID), and texts in the wrong language were left out.
The orthography has been neither checked nor modified.
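The language check mentioned above could look roughly like the sketch below. It is not the authors' actual procedure: the function names and the confidence threshold are hypothetical. The predictor is passed in as a callable so that the sketch runs without downloading a model. GlotLID is a fastText model whose labels have the form `__label__vro_Latn`, i.e. an ISO 639-3 code followed by a script code.

```python
from typing import Callable, Iterable, List, Tuple

# A predictor maps a text to (top label, confidence).
Predictor = Callable[[str], Tuple[str, float]]

LABEL_PREFIX = "__label__"


def iso_from_label(label: str) -> str:
    """Extract the ISO 639-3 code from a label such as '__label__vro_Latn'."""
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label.split("_")[0]


def filter_by_language(
    texts: Iterable[str],
    expected_iso: str,
    predict: Predictor,
    min_confidence: float = 0.5,  # hypothetical threshold
) -> List[str]:
    """Keep only texts whose predicted language matches the expected one."""
    kept = []
    for text in texts:
        # fastText classifies a single line at a time, so newlines are replaced.
        label, confidence = predict(text.replace("\n", " "))
        if iso_from_label(label) == expected_iso and confidence >= min_confidence:
            kept.append(text)
    return kept


# Stand-in predictor with made-up outputs, used only to make the sketch runnable.
fake_labels = {
    "text a": ("__label__vro_Latn", 0.93),
    "text b": ("__label__est_Latn", 0.88),
}
kept = filter_by_language(fake_labels.keys(), "vro", lambda t: fake_labels[t])
```

To use the real model, `predict` would wrap the `predict` method of the fastText model distributed by the GlotLID project and return its top label and probability.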
### Source Data
The dataset is built from various pre-existing publications and corpora: [corpus_source.md](./corpus_source.md)
### Licensing Information
All original textual content is licensed under a [Creative Commons License](https://creativecommons.org/share-your-work/cclicenses/)
(depending on the source, either CC-BY, CC-BY-SA, CC-BY-NC or CC-BY-NC-SA) or an equivalently permissive licence, or is in the public domain.
## Citation Information
```
@InProceedings{smugri4mt,
title={SMUGRI-4: Machine-Translating Low-resource Finno-Ugric Languages and Dialects with Care and Caution},
author={Lisa Yankovskaya and Mark Fishel and Elena Markus and Fedor Rozhanskiy and Heiki-Jaan Kaalep and
Idaliia Fedotova and Ilia Moshnikov and Janek Vaab and Joshua Wilbur and Liisa Rätsep and Marili Tomingas and
Michael Rie{\ss}ler and Nikolay Kuznetsov and Taido Purason and Valts Ern\v{s}treits},
year={2026},
booktitle={Proceedings of ACL, the 64th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
pages={submitted},
address={San Diego, California, United States}
}
```
### Contributions
The following people have contributed by collecting or processing the original data:
Britt-Kathleen Mere, Aleksei Ivanov, Tarmo Vaino, Annely-Maria Liivas, Kaire Koljal, Lisa Yankovskaya, Heiki-Jaan Kaalep, Mark Fišel
Provided by: tartuNLP



