
tartuNLP/wikipedia-smugri-20251201

Source: Hugging Face. Updated 2026-03-13; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/tartuNLP/wikipedia-smugri-20251201
Description:
---
dataset_info:
- config_name: est_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 365748262
    num_examples: 255033
  download_size: 220777348
  dataset_size: 365748262
- config_name: fin_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1022188356
    num_examples: 601263
  download_size: 604794904
  dataset_size: 1022188356
- config_name: fkv_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 207987
    num_examples: 592
  download_size: 101870
  dataset_size: 207987
- config_name: hun_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1218396850
    num_examples: 527080
  download_size: 739124218
  dataset_size: 1218396850
- config_name: izh_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8782
    num_examples: 11
  download_size: 10198
  dataset_size: 8782
- config_name: kca_Cyrl_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 17495
    num_examples: 25
  download_size: 11353
  dataset_size: 17495
- config_name: koi_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2631369
    num_examples: 3511
  download_size: 1061843
  dataset_size: 2631369
- config_name: kpv_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5538293
    num_examples: 6071
  download_size: 2433838
  dataset_size: 5538293
- config_name: krl_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 749792
    num_examples: 1200
  download_size: 399958
  dataset_size: 749792
- config_name: liv_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 427468
    num_examples: 883
  download_size: 206977
  dataset_size: 427468
- config_name: mdf_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2366232
    num_examples: 7696
  download_size: 910852
  dataset_size: 2366232
- config_name: mhr_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 8631067
    num_examples: 10903
  download_size: 3339269
  dataset_size: 8631067
- config_name: mns_Cyrl_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 59563
    num_examples: 115
  download_size: 26471
  dataset_size: 59563
- config_name: mrj_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5057251
    num_examples: 10538
  download_size: 1878775
  dataset_size: 5057251
- config_name: myv_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6741822
    num_examples: 7834
  download_size: 2838366
  dataset_size: 6741822
- config_name: olo_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2827589
    num_examples: 4524
  download_size: 1511606
  dataset_size: 2827589
- config_name: sjd_Cyrl_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 132271
    num_examples: 128
  download_size: 65144
  dataset_size: 132271
- config_name: sje_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5119
    num_examples: 11
  download_size: 6240
  dataset_size: 5119
- config_name: sju_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1851
    num_examples: 8
  download_size: 3479
  dataset_size: 1851
- config_name: sma_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 44067
    num_examples: 174
  download_size: 22441
  dataset_size: 44067
- config_name: sme_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2729501
    num_examples: 6695
  download_size: 1375453
  dataset_size: 2729501
- config_name: smj_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 6179
    num_examples: 20
  download_size: 6727
  dataset_size: 6179
- config_name: smn_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5160325
    num_examples: 6488
  download_size: 2588720
  dataset_size: 5160325
- config_name: sms_Latn_incubator
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 36246
    num_examples: 126
  download_size: 18007
  dataset_size: 36246
- config_name: udm_Cyrl
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 4024213
    num_examples: 5323
  download_size: 1712803
  dataset_size: 4024213
- config_name: vep_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 11548381
    num_examples: 7046
  download_size: 6331617
  dataset_size: 11548381
- config_name: vot_Latn_incubator
  features: []
  splits:
  - name: train
    num_bytes: 0
    num_examples: 0
  download_size: 324
  dataset_size: 0
- config_name: vro_Latn
  features:
  - name: id
    dtype: string
  - name: revid
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3788627
    num_examples: 6885
  download_size: 2135222
  dataset_size: 3788627
configs:
- config_name: est_Latn
  data_files:
  - split: train
    path: est_Latn/train-*
- config_name: fin_Latn
  data_files:
  - split: train
    path: fin_Latn/train-*
- config_name: fkv_Latn_incubator
  data_files:
  - split: train
    path: fkv_Latn_incubator/train-*
- config_name: hun_Latn
  data_files:
  - split: train
    path: hun_Latn/train-*
- config_name: izh_Latn_incubator
  data_files:
  - split: train
    path: izh_Latn_incubator/train-*
- config_name: kca_Cyrl_incubator
  data_files:
  - split: train
    path: kca_Cyrl_incubator/train-*
- config_name: koi_Cyrl
  data_files:
  - split: train
    path: koi_Cyrl/train-*
- config_name: kpv_Cyrl
  data_files:
  - split: train
    path: kpv_Cyrl/train-*
- config_name: krl_Latn_incubator
  data_files:
  - split: train
    path: krl_Latn_incubator/train-*
- config_name: liv_Latn_incubator
  data_files:
  - split: train
    path: liv_Latn_incubator/train-*
- config_name: mdf_Cyrl
  data_files:
  - split: train
    path: mdf_Cyrl/train-*
- config_name: mhr_Cyrl
  data_files:
  - split: train
    path: mhr_Cyrl/train-*
- config_name: mns_Cyrl_incubator
  data_files:
  - split: train
    path: mns_Cyrl_incubator/train-*
- config_name: mrj_Cyrl
  data_files:
  - split: train
    path: mrj_Cyrl/train-*
- config_name: myv_Cyrl
  data_files:
  - split: train
    path: myv_Cyrl/train-*
- config_name: olo_Latn
  data_files:
  - split: train
    path: olo_Latn/train-*
- config_name: sjd_Cyrl_incubator
  data_files:
  - split: train
    path: sjd_Cyrl_incubator/train-*
- config_name: sje_Latn_incubator
  data_files:
  - split: train
    path: sje_Latn_incubator/train-*
- config_name: sju_Latn_incubator
  data_files:
  - split: train
    path: sju_Latn_incubator/train-*
- config_name: sma_Latn_incubator
  data_files:
  - split: train
    path: sma_Latn_incubator/train-*
- config_name: sme_Latn
  data_files:
  - split: train
    path: sme_Latn/train-*
- config_name: smj_Latn_incubator
  data_files:
  - split: train
    path: smj_Latn_incubator/train-*
- config_name: smn_Latn
  data_files:
  - split: train
    path: smn_Latn/train-*
- config_name: sms_Latn_incubator
  data_files:
  - split: train
    path: sms_Latn_incubator/train-*
- config_name: udm_Cyrl
  data_files:
  - split: train
    path: udm_Cyrl/train-*
- config_name: vep_Latn
  data_files:
  - split: train
    path: vep_Latn/train-*
- config_name: vot_Latn_incubator
  data_files:
  - split: train
    path: vot_Latn_incubator/train-*
- config_name: vro_Latn
  data_files:
  - split: train
    path: vro_Latn/train-*
license: cc-by-sa-4.0
language:
- et
- hu
- fi
- vot
- sju
- izh
- sje
- smj
- kca
- mns
- sms
- sjd
- sma
- fkv
- liv
- krl
- koi
- olo
- udm
- kpv
- smn
- sme
- vro
- vep
- mdf
- myv
- mrj
- mhr
- kv
---

# Wikipedia for Finno-Ugric Languages (20251201)

Created using [WikiExtractor](https://github.com/attardi/wikiextractor) from the 20251201 Wikipedia dump (https://dumps.wikimedia.org/).

Language codes are [ISO 639-3](https://iso639-3.sil.org/code_tables/639/data) with the script indicated. Wikipedia incubator language codes additionally have the *"_incubator"* suffix.

| Language | Lang code | Wiki code | documents | words | characters |
|---|---|---|---|---|---|
| Votic | vot_Latn_incubator | Wp/vot | 0 | 0 | 0 |
| Ume Sami | sju_Latn_incubator | Wp/sju | 8 | 84 | 803 |
| Ingrian | izh_Latn_incubator | Wp/izh | 11 | 967 | 7075 |
| Pite Sami | sje_Latn_incubator | Wp/sje | 11 | 431 | 3513 |
| Lule Sami | smj_Latn_incubator | Wp/smj | 20 | 442 | 3609 |
| Khanty | kca_Cyrl_incubator | Wp/kca | 25 | 1279 | 7926 |
| Mansi | mns_Cyrl_incubator | Wp/mns | 115 | 3877 | 26013 |
| Skolt Sami | sms_Latn_incubator | Wp/sms | 126 | 2279 | 18234 |
| Kildin Sami | sjd_Cyrl_incubator | Wp/sjd | 128 | 8913 | 64758 |
| Southern Sami | sma_Latn_incubator | Wp/sma | 174 | 2779 | 22750 |
| Kven Finnish | fkv_Latn_incubator | Wp/fkv | 592 | 16971 | 132747 |
| Livonian | liv_Latn_incubator | Wp/liv | 883 | 42231 | 279640 |
| Karelian | krl_Latn_incubator | Wp/krl | 1200 | 68692 | 564640 |
| Komi-Permyak | koi_Cyrl | koi | 3511 | 185675 | 1300632 |
| Livvi | olo_Latn | olo | 4524 | 292592 | 2345992 |
| Udmurt | udm_Cyrl | udm | 5323 | 266558 | 1978952 |
| Komi-Zyrian | kpv_Cyrl | kv | 6071 | 402558 | 2821735 |
| Inari Sami | smn_Latn | smn | 6488 | 544615 | 4177988 |
| Northern Sami | sme_Latn | se | 6695 | 256443 | 2057936 |
| Võro | vro_Latn | fiu_vro | 6885 | 409483 | 2964785 |
| Veps | vep_Latn | vep | 7046 | 1335149 | 10437053 |
| Moksha | mdf_Cyrl | mdf | 7696 | 129887 | 958389 |
| Erzya | myv_Cyrl | myv | 7834 | 439189 | 3350993 |
| Western Mari | mrj_Cyrl | mrj | 10538 | 344050 | 2367657 |
| Eastern Mari | mhr_Cyrl | mhr | 10903 | 602988 | 4268455 |
| Estonian | est_Latn | et | 255033 | 41975936 | 333347119 |
| Hungarian | hun_Latn | hu | 527080 | 141082784 | 1065596229 |
| Finnish | fin_Latn | fi | 601263 | 106390059 | 934426131 |
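
The config naming scheme described in the card (ISO 639-3 code, script, optional `_incubator` suffix) can be taken apart programmatically. The following is a minimal sketch, not part of the original card: `ConfigName`, `parse_config_name`, and `load_config` are hypothetical helper names, and the loader assumes the Hugging Face `datasets` library is installed and that the repository ID matches the page title.

```python
from typing import NamedTuple

# Repository ID as shown in the page title (assumed to be the Hub ID).
REPO_ID = "tartuNLP/wikipedia-smugri-20251201"


class ConfigName(NamedTuple):
    lang: str        # ISO 639-3 language code, e.g. "est"
    script: str      # script tag, e.g. "Latn" or "Cyrl"
    incubator: bool  # True when the name carries the "_incubator" suffix


def parse_config_name(name: str) -> ConfigName:
    """Split a config name such as 'fkv_Latn_incubator' into its parts."""
    parts = name.split("_")
    if len(parts) == 2:
        return ConfigName(parts[0], parts[1], False)
    if len(parts) == 3 and parts[2] == "incubator":
        return ConfigName(parts[0], parts[1], True)
    raise ValueError(f"unexpected config name: {name!r}")


def load_config(name: str, streaming: bool = False):
    """Load the train split of one config (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so parsing works offline

    parse_config_name(name)  # fail early on malformed names
    return load_dataset(REPO_ID, name, split="train", streaming=streaming)


print(parse_config_name("est_Latn"))
print(parse_config_name("fkv_Latn_incubator"))
```

Calling `load_config("est_Latn", streaming=True)` would iterate over records with the `id`, `revid`, `url`, `title`, and `text` fields without downloading the whole split first. The `vot_Latn_incubator` config lists zero examples and no features, so it may need special handling.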

# Wikipedia Dataset for Finno-Ugric Languages (version 20251201)

This dataset was extracted with WikiExtractor from the 1 December 2025 Wikipedia dump (https://dumps.wikimedia.org/). Language codes follow ISO 639-3 with the script indicated; language codes for Wikipedia Incubator projects additionally carry the `_incubator` suffix.

## Dataset information

The dataset contains multiple language configurations, each holding the Wikipedia articles of one language. Every configuration has only a train split. Except for vot_Latn_incubator, whose feature list is empty, every configuration has the feature fields id, revid, url, title, and text, all of string type. The parameters of each configuration are:

| Config name | Language | Script | Train num_bytes | Examples | Download size | Dataset size |
|---|---|---|---|---|---|---|
| est_Latn | Estonian | Latin | 365748262 | 255033 | 220777348 | 365748262 |
| fin_Latn | Finnish | Latin | 1022188356 | 601263 | 604794904 | 1022188356 |
| fkv_Latn_incubator | Kven Finnish | Latin | 207987 | 592 | 101870 | 207987 |
| hun_Latn | Hungarian | Latin | 1218396850 | 527080 | 739124218 | 1218396850 |
| izh_Latn_incubator | Ingrian | Latin | 8782 | 11 | 10198 | 8782 |
| kca_Cyrl_incubator | Khanty | Cyrillic | 17495 | 25 | 11353 | 17495 |
| koi_Cyrl | Komi-Permyak | Cyrillic | 2631369 | 3511 | 1061843 | 2631369 |
| kpv_Cyrl | Komi-Zyrian | Cyrillic | 5538293 | 6071 | 2433838 | 5538293 |
| krl_Latn_incubator | Karelian | Latin | 749792 | 1200 | 399958 | 749792 |
| liv_Latn_incubator | Livonian | Latin | 427468 | 883 | 206977 | 427468 |
| mdf_Cyrl | Moksha | Cyrillic | 2366232 | 7696 | 910852 | 2366232 |
| mhr_Cyrl | Eastern Mari | Cyrillic | 8631067 | 10903 | 3339269 | 8631067 |
| mns_Cyrl_incubator | Mansi | Cyrillic | 59563 | 115 | 26471 | 59563 |
| mrj_Cyrl | Western Mari | Cyrillic | 5057251 | 10538 | 1878775 | 5057251 |
| myv_Cyrl | Erzya | Cyrillic | 6741822 | 7834 | 2838366 | 6741822 |
| olo_Latn | Livvi | Latin | 2827589 | 4524 | 1511606 | 2827589 |
| sjd_Cyrl_incubator | Kildin Sami | Cyrillic | 132271 | 128 | 65144 | 132271 |
| sje_Latn_incubator | Pite Sami | Latin | 5119 | 11 | 6240 | 5119 |
| sju_Latn_incubator | Ume Sami | Latin | 1851 | 8 | 3479 | 1851 |
| sma_Latn_incubator | Southern Sami | Latin | 44067 | 174 | 22441 | 44067 |
| sme_Latn | Northern Sami | Latin | 2729501 | 6695 | 1375453 | 2729501 |
| smj_Latn_incubator | Lule Sami | Latin | 6179 | 20 | 6727 | 6179 |
| smn_Latn | Inari Sami | Latin | 5160325 | 6488 | 2588720 | 5160325 |
| sms_Latn_incubator | Skolt Sami | Latin | 36246 | 126 | 18007 | 36246 |
| udm_Cyrl | Udmurt | Cyrillic | 4024213 | 5323 | 1712803 | 4024213 |
| vep_Latn | Veps | Latin | 11548381 | 7046 | 6331617 | 11548381 |
| vot_Latn_incubator | Votic | Latin | 0 | 0 | 324 | 0 |
| vro_Latn | Võro | Latin | 3788627 | 6885 | 2135222 | 3788627 |

## Configuration details

The data files of each configuration follow the path pattern `{config_name}/train-*`. The only split in every configuration is train.

## License

Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)

## Languages covered

et (Estonian), hu (Hungarian), fi (Finnish), vot (Votic), sju (Ume Sami), izh (Ingrian), sje (Pite Sami), smj (Lule Sami), kca (Khanty), mns (Mansi), sms (Skolt Sami), sjd (Kildin Sami), sma (Southern Sami), fkv (Kven Finnish), liv (Livonian), krl (Karelian), koi (Komi-Permyak), olo (Livvi), udm (Udmurt), kpv (Komi-Zyrian), smn (Inari Sami), sme (Northern Sami), vro (Võro), vep (Veps), mdf (Moksha), myv (Erzya), mrj (Western Mari), mhr (Eastern Mari), kv (Komi-Zyrian)

## Per-language statistics

| Language | Language code | Wiki code | Documents | Total words | Total characters |
|---|---|---|---|---|---|
| Votic | vot_Latn_incubator | Wp/vot | 0 | 0 | 0 |
| Ume Sami | sju_Latn_incubator | Wp/sju | 8 | 84 | 803 |
| Ingrian | izh_Latn_incubator | Wp/izh | 11 | 967 | 7075 |
| Pite Sami | sje_Latn_incubator | Wp/sje | 11 | 431 | 3513 |
| Lule Sami | smj_Latn_incubator | Wp/smj | 20 | 442 | 3609 |
| Khanty | kca_Cyrl_incubator | Wp/kca | 25 | 1279 | 7926 |
| Mansi | mns_Cyrl_incubator | Wp/mns | 115 | 3877 | 26013 |
| Skolt Sami | sms_Latn_incubator | Wp/sms | 126 | 2279 | 18234 |
| Kildin Sami | sjd_Cyrl_incubator | Wp/sjd | 128 | 8913 | 64758 |
| Southern Sami | sma_Latn_incubator | Wp/sma | 174 | 2779 | 22750 |
| Kven Finnish | fkv_Latn_incubator | Wp/fkv | 592 | 16971 | 132747 |
| Livonian | liv_Latn_incubator | Wp/liv | 883 | 42231 | 279640 |
| Karelian | krl_Latn_incubator | Wp/krl | 1200 | 68692 | 564640 |
| Komi-Permyak | koi_Cyrl | koi | 3511 | 185675 | 1300632 |
| Livvi | olo_Latn | olo | 4524 | 292592 | 2345992 |
| Udmurt | udm_Cyrl | udm | 5323 | 266558 | 1978952 |
| Komi-Zyrian | kpv_Cyrl | kv | 6071 | 402558 | 2821735 |
| Inari Sami | smn_Latn | smn | 6488 | 544615 | 4177988 |
| Northern Sami | sme_Latn | se | 6695 | 256443 | 2057936 |
| Võro | vro_Latn | fiu_vro | 6885 | 409483 | 2964785 |
| Veps | vep_Latn | vep | 7046 | 1335149 | 10437053 |
| Moksha | mdf_Cyrl | mdf | 7696 | 129887 | 958389 |
| Erzya | myv_Cyrl | myv | 7834 | 439189 | 3350993 |
| Western Mari | mrj_Cyrl | mrj | 10538 | 344050 | 2367657 |
| Eastern Mari | mhr_Cyrl | mhr | 10903 | 602988 | 4268455 |
| Estonian | est_Latn | et | 255033 | 41975936 | 333347119 |
| Hungarian | hun_Latn | hu | 527080 | 141082784 | 1065596229 |
| Finnish | fin_Latn | fi | 601263 | 106390059 | 934426131 |
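
Document counts alone can mislead when comparing languages, because article length varies widely across the rows of the statistics table. As a rough illustration, the sketch below derives words per document and characters per word for a few rows copied from that table. The `STATS` dictionary and both helper functions are hypothetical names introduced here, and the empty Votic config is handled explicitly to avoid division by zero.

```python
from typing import Optional

# (documents, words, characters) copied from a few rows of the statistics table.
STATS = {
    "vot_Latn_incubator": (0, 0, 0),
    "mdf_Cyrl": (7696, 129887, 958389),
    "vep_Latn": (7046, 1335149, 10437053),
    "est_Latn": (255033, 41975936, 333347119),
}


def words_per_document(documents: int, words: int) -> Optional[float]:
    """Average article length in words, or None for an empty config."""
    return words / documents if documents else None


def characters_per_word(words: int, characters: int) -> Optional[float]:
    """Average characters per word, or None when there are no words."""
    return characters / words if words else None


for config, (docs, words, chars) in STATS.items():
    wpd = words_per_document(docs, words)
    cpw = characters_per_word(words, chars)
    if wpd is None or cpw is None:
        print(f"{config}: empty")
    else:
        print(f"{config}: {wpd:.1f} words/doc, {cpw:.2f} chars/word")
```

On these rows, Moksha and Veps have similar document counts, yet Veps articles average more than ten times as many words, which matters when sampling or weighting by document.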
Provider:
tartuNLP