tartuNLP/wikipedia-smugri-20251201
Hugging Face. Updated 2026-03-13; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/tartuNLP/wikipedia-smugri-20251201
Resource description:
---
dataset_info:
- config_name: est_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 365748262
    num_examples: 255033
  download_size: 220777348
  dataset_size: 365748262
- config_name: fin_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1022188356
    num_examples: 601263
  download_size: 604794904
  dataset_size: 1022188356
- config_name: fkv_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 207987
    num_examples: 592
  download_size: 101870
  dataset_size: 207987
- config_name: hun_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1218396850
    num_examples: 527080
  download_size: 739124218
  dataset_size: 1218396850
- config_name: izh_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8782
    num_examples: 11
  download_size: 10198
  dataset_size: 8782
- config_name: kca_Cyrl_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 17495
    num_examples: 25
  download_size: 11353
  dataset_size: 17495
- config_name: koi_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2631369
    num_examples: 3511
  download_size: 1061843
  dataset_size: 2631369
- config_name: kpv_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5538293
    num_examples: 6071
  download_size: 2433838
  dataset_size: 5538293
- config_name: krl_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 749792
    num_examples: 1200
  download_size: 399958
  dataset_size: 749792
- config_name: liv_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 427468
    num_examples: 883
  download_size: 206977
  dataset_size: 427468
- config_name: mdf_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2366232
    num_examples: 7696
  download_size: 910852
  dataset_size: 2366232
- config_name: mhr_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8631067
    num_examples: 10903
  download_size: 3339269
  dataset_size: 8631067
- config_name: mns_Cyrl_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 59563
    num_examples: 115
  download_size: 26471
  dataset_size: 59563
- config_name: mrj_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5057251
    num_examples: 10538
  download_size: 1878775
  dataset_size: 5057251
- config_name: myv_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6741822
    num_examples: 7834
  download_size: 2838366
  dataset_size: 6741822
- config_name: olo_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2827589
    num_examples: 4524
  download_size: 1511606
  dataset_size: 2827589
- config_name: sjd_Cyrl_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 132271
    num_examples: 128
  download_size: 65144
  dataset_size: 132271
- config_name: sje_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5119
    num_examples: 11
  download_size: 6240
  dataset_size: 5119
- config_name: sju_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1851
    num_examples: 8
  download_size: 3479
  dataset_size: 1851
- config_name: sma_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 44067
    num_examples: 174
  download_size: 22441
  dataset_size: 44067
- config_name: sme_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2729501
    num_examples: 6695
  download_size: 1375453
  dataset_size: 2729501
- config_name: smj_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6179
    num_examples: 20
  download_size: 6727
  dataset_size: 6179
- config_name: smn_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5160325
    num_examples: 6488
  download_size: 2588720
  dataset_size: 5160325
- config_name: sms_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 36246
    num_examples: 126
  download_size: 18007
  dataset_size: 36246
- config_name: udm_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4024213
    num_examples: 5323
  download_size: 1712803
  dataset_size: 4024213
- config_name: vep_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 11548381
    num_examples: 7046
  download_size: 6331617
  dataset_size: 11548381
- config_name: vot_Latn_incubator
  features: []
  splits:
  - name: train
    num_bytes: 0
    num_examples: 0
  download_size: 324
  dataset_size: 0
- config_name: vro_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3788627
    num_examples: 6885
  download_size: 2135222
  dataset_size: 3788627
configs:
- config_name: est_Latn
  data_files:
  - split: train
    path: est_Latn/train-*
- config_name: fin_Latn
  data_files:
  - split: train
    path: fin_Latn/train-*
- config_name: fkv_Latn_incubator
  data_files:
  - split: train
    path: fkv_Latn_incubator/train-*
- config_name: hun_Latn
  data_files:
  - split: train
    path: hun_Latn/train-*
- config_name: izh_Latn_incubator
  data_files:
  - split: train
    path: izh_Latn_incubator/train-*
- config_name: kca_Cyrl_incubator
  data_files:
  - split: train
    path: kca_Cyrl_incubator/train-*
- config_name: koi_Cyrl
  data_files:
  - split: train
    path: koi_Cyrl/train-*
- config_name: kpv_Cyrl
  data_files:
  - split: train
    path: kpv_Cyrl/train-*
- config_name: krl_Latn_incubator
  data_files:
  - split: train
    path: krl_Latn_incubator/train-*
- config_name: liv_Latn_incubator
  data_files:
  - split: train
    path: liv_Latn_incubator/train-*
- config_name: mdf_Cyrl
  data_files:
  - split: train
    path: mdf_Cyrl/train-*
- config_name: mhr_Cyrl
  data_files:
  - split: train
    path: mhr_Cyrl/train-*
- config_name: mns_Cyrl_incubator
  data_files:
  - split: train
    path: mns_Cyrl_incubator/train-*
- config_name: mrj_Cyrl
  data_files:
  - split: train
    path: mrj_Cyrl/train-*
- config_name: myv_Cyrl
  data_files:
  - split: train
    path: myv_Cyrl/train-*
- config_name: olo_Latn
  data_files:
  - split: train
    path: olo_Latn/train-*
- config_name: sjd_Cyrl_incubator
  data_files:
  - split: train
    path: sjd_Cyrl_incubator/train-*
- config_name: sje_Latn_incubator
  data_files:
  - split: train
    path: sje_Latn_incubator/train-*
- config_name: sju_Latn_incubator
  data_files:
  - split: train
    path: sju_Latn_incubator/train-*
- config_name: sma_Latn_incubator
  data_files:
  - split: train
    path: sma_Latn_incubator/train-*
- config_name: sme_Latn
  data_files:
  - split: train
    path: sme_Latn/train-*
- config_name: smj_Latn_incubator
  data_files:
  - split: train
    path: smj_Latn_incubator/train-*
- config_name: smn_Latn
  data_files:
  - split: train
    path: smn_Latn/train-*
- config_name: sms_Latn_incubator
  data_files:
  - split: train
    path: sms_Latn_incubator/train-*
- config_name: udm_Cyrl
  data_files:
  - split: train
    path: udm_Cyrl/train-*
- config_name: vep_Latn
  data_files:
  - split: train
    path: vep_Latn/train-*
- config_name: vot_Latn_incubator
  data_files:
  - split: train
    path: vot_Latn_incubator/train-*
- config_name: vro_Latn
  data_files:
  - split: train
    path: vro_Latn/train-*
license: cc-by-sa-4.0
language:
- et
- hu
- fi
- vot
- sju
- izh
- sje
- smj
- kca
- mns
- sms
- sjd
- sma
- fkv
- liv
- krl
- koi
- olo
- udm
- kpv
- smn
- sme
- vro
- vep
- mdf
- myv
- mrj
- mhr
- kv
---
# Wikipedia for Finno-Ugric Languages (20251201)
Created using [WikiExtractor](https://github.com/attardi/wikiextractor) from the 20251201 Wikipedia dump (https://dumps.wikimedia.org/).

Language codes are [ISO 639-3](https://iso639-3.sil.org/code_tables/639/data) codes followed by a script code (for example, `est_Latn` or `koi_Cyrl`). Codes for languages from the Wikipedia incubator additionally carry the `_incubator` suffix (for example, `liv_Latn_incubator`).

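
The naming scheme described above can be unpacked mechanically. The following is a minimal sketch, not part of the dataset's own tooling: `ConfigName`, `parse_config_name`, and `load_config` are hypothetical helpers, and the `load_dataset` call assumes that the Hugging Face `datasets` library is installed and that the repository ID matches this page's title.

```python
from typing import NamedTuple


class ConfigName(NamedTuple):
    """Parts of a config name such as "kca_Cyrl_incubator" (hypothetical helper)."""
    language: str    # ISO 639-3 code, e.g. "kca"
    script: str      # script code, e.g. "Cyrl"
    incubator: bool  # True when the "_incubator" suffix is present


def parse_config_name(name: str) -> ConfigName:
    """Split a config name into language code, script, and incubator flag."""
    parts = name.split("_")
    if len(parts) == 2:
        return ConfigName(parts[0], parts[1], False)
    if len(parts) == 3 and parts[2] == "incubator":
        return ConfigName(parts[0], parts[1], True)
    raise ValueError(f"unexpected config name: {name!r}")


def load_config(name: str):
    """Load one language config; needs network access and the `datasets` package."""
    from datasets import load_dataset  # imported lazily so parsing works without it
    parse_config_name(name)  # validate the name before touching the network
    return load_dataset("tartuNLP/wikipedia-smugri-20251201", name, split="train")


print(parse_config_name("est_Latn"))
print(parse_config_name("kca_Cyrl_incubator"))
```

Validating the name first keeps typos from turning into a slow network error; `load_config("sme_Latn")` would then fetch only the Northern Sami split.
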
| Language | Lang code | Wiki code | documents | words | characters |
|----------------|--------------------|-----------|-----------|-----------|------------|
| Votic | vot_Latn_incubator | Wp/vot | 0 | 0 | 0 |
| Ume Sami | sju_Latn_incubator | Wp/sju | 8 | 84 | 803 |
| Ingrian | izh_Latn_incubator | Wp/izh | 11 | 967 | 7075 |
| Pite Sami | sje_Latn_incubator | Wp/sje | 11 | 431 | 3513 |
| Lule Sami | smj_Latn_incubator | Wp/smj | 20 | 442 | 3609 |
| Khanty | kca_Cyrl_incubator | Wp/kca | 25 | 1279 | 7926 |
| Mansi | mns_Cyrl_incubator | Wp/mns | 115 | 3877 | 26013 |
| Skolt Sami | sms_Latn_incubator | Wp/sms | 126 | 2279 | 18234 |
| Kildin Sami | sjd_Cyrl_incubator | Wp/sjd | 128 | 8913 | 64758 |
| Southern Sami | sma_Latn_incubator | Wp/sma | 174 | 2779 | 22750 |
| Kven Finnish | fkv_Latn_incubator | Wp/fkv | 592 | 16971 | 132747 |
| Livonian | liv_Latn_incubator | Wp/liv | 883 | 42231 | 279640 |
| Karelian | krl_Latn_incubator | Wp/krl | 1200 | 68692 | 564640 |
| Komi-Permyak | koi_Cyrl | koi | 3511 | 185675 | 1300632 |
| Livvi | olo_Latn | olo | 4524 | 292592 | 2345992 |
| Udmurt | udm_Cyrl | udm | 5323 | 266558 | 1978952 |
| Komi-Zyrian | kpv_Cyrl | kv | 6071 | 402558 | 2821735 |
| Inari Sami | smn_Latn | smn | 6488 | 544615 | 4177988 |
| Northern Sami | sme_Latn | se | 6695 | 256443 | 2057936 |
| Võro | vro_Latn | fiu_vro | 6885 | 409483 | 2964785 |
| Veps | vep_Latn | vep | 7046 | 1335149 | 10437053 |
| Moksha | mdf_Cyrl | mdf | 7696 | 129887 | 958389 |
| Erzya | myv_Cyrl | myv | 7834 | 439189 | 3350993 |
| Western Mari | mrj_Cyrl | mrj | 10538 | 344050 | 2367657 |
| Eastern Mari | mhr_Cyrl | mhr | 10903 | 602988 | 4268455 |
| Estonian | est_Latn | et | 255033 | 41975936 | 333347119 |
| Hungarian | hun_Latn | hu | 527080 | 141082784 | 1065596229 |
| Finnish | fin_Latn | fi | 601263 | 106390059 | 934426131 |
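
The table does not say how words and characters were counted. As a rough sketch, assuming words are whitespace-separated tokens and characters are Unicode code points of the `text` field (both are assumptions, so the results may not reproduce the figures above exactly), per-config statistics could be computed as follows. `corpus_stats` is a hypothetical helper, and the sample records are invented for illustration.

```python
from typing import Dict, Iterable, Mapping


def corpus_stats(records: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count documents, words, and characters over the `text` field."""
    documents = words = characters = 0
    for record in records:
        text = record["text"]
        documents += 1
        words += len(text.split())  # assumption: whitespace tokenization
        characters += len(text)     # assumption: Unicode code points
    return {"documents": documents, "words": words, "characters": characters}


# Invented sample records using the dataset's five string fields.
sample = [
    {"id": "1", "revid": "10", "url": "https://example.org/1",
     "title": "A", "text": "Tere maailm"},
    {"id": "2", "revid": "11", "url": "https://example.org/2",
     "title": "B", "text": "Tere tulemast Tartusse"},
]
print(corpus_stats(sample))
```

Because the function only iterates over mappings with a `text` key, it should also accept a loaded split from the `datasets` library directly.
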
## Dataset information
The dataset contains one configuration per language, each holding that language's Wikipedia articles. Every configuration has only a train split. Except for `vot_Latn_incubator`, whose feature list is empty, each configuration has the string-typed feature fields `id`, `revid`, `url`, `title`, and `text`. All sizes are in bytes.

| Config name | Language | Script | Incubator | Train bytes | Examples | Download size | Dataset size |
|--------------------|---------------|----------|-----------|-------------|----------|---------------|--------------|
| est_Latn | Estonian | Latin | no | 365748262 | 255033 | 220777348 | 365748262 |
| fin_Latn | Finnish | Latin | no | 1022188356 | 601263 | 604794904 | 1022188356 |
| fkv_Latn_incubator | Kven Finnish | Latin | yes | 207987 | 592 | 101870 | 207987 |
| hun_Latn | Hungarian | Latin | no | 1218396850 | 527080 | 739124218 | 1218396850 |
| izh_Latn_incubator | Ingrian | Latin | yes | 8782 | 11 | 10198 | 8782 |
| kca_Cyrl_incubator | Khanty | Cyrillic | yes | 17495 | 25 | 11353 | 17495 |
| koi_Cyrl | Komi-Permyak | Cyrillic | no | 2631369 | 3511 | 1061843 | 2631369 |
| kpv_Cyrl | Komi-Zyrian | Cyrillic | no | 5538293 | 6071 | 2433838 | 5538293 |
| krl_Latn_incubator | Karelian | Latin | yes | 749792 | 1200 | 399958 | 749792 |
| liv_Latn_incubator | Livonian | Latin | yes | 427468 | 883 | 206977 | 427468 |
| mdf_Cyrl | Moksha | Cyrillic | no | 2366232 | 7696 | 910852 | 2366232 |
| mhr_Cyrl | Eastern Mari | Cyrillic | no | 8631067 | 10903 | 3339269 | 8631067 |
| mns_Cyrl_incubator | Mansi | Cyrillic | yes | 59563 | 115 | 26471 | 59563 |
| mrj_Cyrl | Western Mari | Cyrillic | no | 5057251 | 10538 | 1878775 | 5057251 |
| myv_Cyrl | Erzya | Cyrillic | no | 6741822 | 7834 | 2838366 | 6741822 |
| olo_Latn | Livvi | Latin | no | 2827589 | 4524 | 1511606 | 2827589 |
| sjd_Cyrl_incubator | Kildin Sami | Cyrillic | yes | 132271 | 128 | 65144 | 132271 |
| sje_Latn_incubator | Pite Sami | Latin | yes | 5119 | 11 | 6240 | 5119 |
| sju_Latn_incubator | Ume Sami | Latin | yes | 1851 | 8 | 3479 | 1851 |
| sma_Latn_incubator | Southern Sami | Latin | yes | 44067 | 174 | 22441 | 44067 |
| sme_Latn | Northern Sami | Latin | no | 2729501 | 6695 | 1375453 | 2729501 |
| smj_Latn_incubator | Lule Sami | Latin | yes | 6179 | 20 | 6727 | 6179 |
| smn_Latn | Inari Sami | Latin | no | 5160325 | 6488 | 2588720 | 5160325 |
| sms_Latn_incubator | Skolt Sami | Latin | yes | 36246 | 126 | 18007 | 36246 |
| udm_Cyrl | Udmurt | Cyrillic | no | 4024213 | 5323 | 1712803 | 4024213 |
| vep_Latn | Veps | Latin | no | 11548381 | 7046 | 6331617 | 11548381 |
| vot_Latn_incubator | Votic | Latin | yes | 0 | 0 | 324 | 0 |
| vro_Latn | Võro | Latin | no | 3788627 | 6885 | 2135222 | 3788627 |
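
The size fields listed above can be combined into two quick sanity checks: average bytes per example, and the ratio of download size to dataset size. The sketch below hard-codes three configurations copied from the listing; `size_summary` is a hypothetical helper name.

```python
# (num_bytes, num_examples, download_size) copied from the listing above.
CONFIGS = {
    "est_Latn": (365748262, 255033, 220777348),
    "izh_Latn_incubator": (8782, 11, 10198),
    "vot_Latn_incubator": (0, 0, 324),
}


def size_summary(num_bytes, num_examples, download_size):
    """Return (bytes per example, download/dataset ratio); None where undefined."""
    per_example = num_bytes / num_examples if num_examples else None
    ratio = download_size / num_bytes if num_bytes else None
    return per_example, ratio


for name, fields in CONFIGS.items():
    per_example, ratio = size_summary(*fields)
    print(name, per_example, ratio)
```

For very small configurations such as `izh_Latn_incubator`, the download is larger than the dataset size (10198 versus 8782 bytes), presumably because of fixed file-format overhead. The empty `vot_Latn_incubator` configuration yields no ratios at all.
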
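## Configuration details
Data files for each configuration follow the path pattern `{config_name}/train-*`, and every configuration has only the train split.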
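
To make the path convention concrete, the sketch below builds the glob pattern for a configuration and checks candidate file names against it with the standard-library `fnmatch` module. The shard file name `train-00000-of-00001.parquet` is an assumption based on common Hugging Face Hub naming, not something stated on this page, and `data_files_pattern` is a hypothetical helper.

```python
from fnmatch import fnmatch


def data_files_pattern(config_name: str, split: str = "train") -> str:
    """Build the glob pattern "{config_name}/{split}-*" for a config's data files."""
    return f"{config_name}/{split}-*"


pattern = data_files_pattern("sme_Latn")
print(pattern)  # sme_Latn/train-*

# Assumed shard names, for illustration only.
print(fnmatch("sme_Latn/train-00000-of-00001.parquet", pattern))
print(fnmatch("smn_Latn/train-00000-of-00001.parquet", pattern))
```
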
## License
Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
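## Languages covered
et (Estonian), hu (Hungarian), fi (Finnish), vot (Votic), sju (Ume Sami), izh (Ingrian), sje (Pite Sami), smj (Lule Sami), kca (Khanty), mns (Mansi), sms (Skolt Sami), sjd (Kildin Sami), sma (Southern Sami), fkv (Kven Finnish), liv (Livonian), krl (Karelian), koi (Komi-Permyak), olo (Livvi), udm (Udmurt), kpv (Komi-Zyrian), smn (Inari Sami), sme (Northern Sami), vro (Võro), vep (Veps), mdf (Moksha), myv (Erzya), mrj (Western Mari), mhr (Eastern Mari), kv (Komi-Zyrian; the Wikipedia code used for the `kpv_Cyrl` configuration)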
Provider:
tartuNLP



