liu-nlp/unimorph-blimp-200
Source: Hugging Face. Updated 2026-03-18; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/liu-nlp/unimorph-blimp-200
Resource description:
---
language:
- sq
- ar
- hy
- az
- eu
- bn
- ca
- ku
- hr
- cs
- da
- et
- fo
- fi
- fr
- gl
- ka
- de
- el
- he
- hi
- hu
- is
- id
- kn
- lt
- se
- ps
- fa
- pl
- pt
- ro
- ru
- es
- sw
- sv
- te
- tr
- uk
license: apache-2.0
pretty_name: An Experimental BLiMP-style Dataset based on UniMorph Minimal Tag Pairs
dataset_info:
- config_name: albanian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 3551509
    num_examples: 7375
  download_size: 1721278
  dataset_size: 3551509
- config_name: arabic
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 1868302
    num_examples: 1389
  download_size: 571578
  dataset_size: 1868302
- config_name: armenian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 41407687
    num_examples: 7941
  download_size: 12499293
  dataset_size: 41407687
- config_name: azerbaijani
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 2124372
    num_examples: 4848
  download_size: 958296
  dataset_size: 2124372
- config_name: basque
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 207074
    num_examples: 538
  download_size: 122488
  dataset_size: 207074
- config_name: bengali
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 2533627
    num_examples: 261
  download_size: 143364
  dataset_size: 2533627
- config_name: catalan
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 1453106
    num_examples: 3087
  download_size: 825907
  dataset_size: 1453106
- config_name: central_kurdish
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 863332
    num_examples: 907
  download_size: 320219
  dataset_size: 863332
- config_name: croatian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 9365136
    num_examples: 22008
  download_size: 5440538
  dataset_size: 9365136
- config_name: czech
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 8052772
    num_examples: 21347
  download_size: 4693274
  dataset_size: 8052772
- config_name: danish
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 977529
    num_examples: 2485
  download_size: 574074
  dataset_size: 977529
- config_name: estonian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 2201515
    num_examples: 7069
  download_size: 1066526
  dataset_size: 2201515
- config_name: faroese
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 2727480
    num_examples: 8795
  download_size: 1394876
  dataset_size: 2727480
- config_name: finnish
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 3470582
    num_examples: 10033
  download_size: 1756967
  dataset_size: 3470582
- config_name: french
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 1785474
    num_examples: 3048
  download_size: 1068294
  dataset_size: 1785474
- config_name: galician
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 1712420
    num_examples: 3600
  download_size: 1005722
  dataset_size: 1712420
- config_name: georgian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 7410805
    num_examples: 8035
  download_size: 2429439
  dataset_size: 7410805
- config_name: german
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 3784606
    num_examples: 9000
  download_size: 2153238
  dataset_size: 3784606
- config_name: greek
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 7401994
    num_examples: 9484
  download_size: 3290335
  dataset_size: 7401994
- config_name: hebrew
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 1810087
    num_examples: 3044
  download_size: 905779
  dataset_size: 1810087
- config_name: hindi
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 3828137
    num_examples: 872
  download_size: 637038
  dataset_size: 3828137
- config_name: hungarian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 6366275
    num_examples: 14570
  download_size: 3240238
  dataset_size: 6366275
- config_name: icelandic
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 1936100
    num_examples: 5453
  download_size: 879174
  dataset_size: 1936100
- config_name: indonesian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 465739
    num_examples: 1065
  download_size: 238215
  dataset_size: 465739
- config_name: italian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 1783448
    num_examples: 3021
  download_size: 1046705
  dataset_size: 1783448
- config_name: kannada
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 4764170
    num_examples: 5079
  download_size: 1383900
  dataset_size: 4764170
- config_name: kazakh
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 1034106
    num_examples: 1814
  download_size: 305171
  dataset_size: 1034106
- config_name: latvian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 4701946
    num_examples: 13158
  download_size: 2391660
  dataset_size: 4701946
- config_name: lithuanian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 4803347
    num_examples: 13469
  download_size: 2185318
  dataset_size: 4803347
- config_name: maltese
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 727653
    num_examples: 1732
  download_size: 373037
  dataset_size: 727653
- config_name: northern_sami
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 2150915
    num_examples: 6125
  download_size: 911629
  dataset_size: 2150915
- config_name: pashto
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 4336795
    num_examples: 3211
  download_size: 1278790
  dataset_size: 4336795
- config_name: persian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 1739582
    num_examples: 2604
  download_size: 805674
  dataset_size: 1739582
- config_name: polish
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 7905283
    num_examples: 19549
  download_size: 4385393
  dataset_size: 7905283
- config_name: portuguese
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 1900545
    num_examples: 3564
  download_size: 1086492
  dataset_size: 1900545
- config_name: romanian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 2417224
    num_examples: 5000
  download_size: 1400218
  dataset_size: 2417224
- config_name: russian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 11928399
    num_examples: 16804
  download_size: 5946761
  dataset_size: 11928399
- config_name: shona
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 150688
    num_examples: 397
  download_size: 30958
  dataset_size: 150688
- config_name: spanish
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 6266812
    num_examples: 5330
  download_size: 3357644
  dataset_size: 6266812
- config_name: swahili
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 193577
    num_examples: 527
  download_size: 82710
  dataset_size: 193577
- config_name: swedish
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 1781002
    num_examples: 4254
  download_size: 957242
  dataset_size: 1781002
- config_name: telugu
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 252956
    num_examples: 397
  download_size: 43445
  dataset_size: 252956
- config_name: turkish
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 3980531
    num_examples: 9214
  download_size: 2144486
  dataset_size: 3980531
- config_name: ukrainian
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 12342982
    num_examples: 18248
  download_size: 5719783
  dataset_size: 12342982
- config_name: zulu
  features:
  - name: subset
    dtype: string
  - name: language
    dtype: string
  - name: original
    dtype: string
  - name: corrupted
    dtype: string
  splits:
  - name: train
    num_bytes: 473556
    num_examples: 937
  download_size: 189529
  dataset_size: 473556
configs:
- config_name: albanian
  data_files:
  - split: train
    path: albanian/train-*
- config_name: arabic
  data_files:
  - split: train
    path: arabic/train-*
- config_name: armenian
  data_files:
  - split: train
    path: armenian/train-*
- config_name: azerbaijani
  data_files:
  - split: train
    path: azerbaijani/train-*
- config_name: basque
  data_files:
  - split: train
    path: basque/train-*
- config_name: bengali
  data_files:
  - split: train
    path: bengali/train-*
- config_name: catalan
  data_files:
  - split: train
    path: catalan/train-*
- config_name: central_kurdish
  data_files:
  - split: train
    path: central_kurdish/train-*
- config_name: croatian
  data_files:
  - split: train
    path: croatian/train-*
- config_name: czech
  data_files:
  - split: train
    path: czech/train-*
- config_name: danish
  data_files:
  - split: train
    path: danish/train-*
- config_name: estonian
  data_files:
  - split: train
    path: estonian/train-*
- config_name: faroese
  data_files:
  - split: train
    path: faroese/train-*
- config_name: finnish
  data_files:
  - split: train
    path: finnish/train-*
- config_name: french
  data_files:
  - split: train
    path: french/train-*
- config_name: galician
  data_files:
  - split: train
    path: galician/train-*
- config_name: georgian
  data_files:
  - split: train
    path: georgian/train-*
- config_name: german
  data_files:
  - split: train
    path: german/train-*
- config_name: greek
  data_files:
  - split: train
    path: greek/train-*
- config_name: hebrew
  data_files:
  - split: train
    path: hebrew/train-*
- config_name: hindi
  data_files:
  - split: train
    path: hindi/train-*
- config_name: hungarian
  data_files:
  - split: train
    path: hungarian/train-*
- config_name: icelandic
  data_files:
  - split: train
    path: icelandic/train-*
- config_name: indonesian
  data_files:
  - split: train
    path: indonesian/train-*
- config_name: italian
  data_files:
  - split: train
    path: italian/train-*
- config_name: kannada
  data_files:
  - split: train
    path: kannada/train-*
- config_name: kazakh
  data_files:
  - split: train
    path: kazakh/train-*
- config_name: latvian
  data_files:
  - split: train
    path: latvian/train-*
- config_name: lithuanian
  data_files:
  - split: train
    path: lithuanian/train-*
- config_name: maltese
  data_files:
  - split: train
    path: maltese/train-*
- config_name: northern_sami
  data_files:
  - split: train
    path: northern_sami/train-*
- config_name: pashto
  data_files:
  - split: train
    path: pashto/train-*
- config_name: persian
  data_files:
  - split: train
    path: persian/train-*
- config_name: polish
  data_files:
  - split: train
    path: polish/train-*
- config_name: portuguese
  data_files:
  - split: train
    path: portuguese/train-*
- config_name: romanian
  data_files:
  - split: train
    path: romanian/train-*
- config_name: russian
  data_files:
  - split: train
    path: russian/train-*
- config_name: shona
  data_files:
  - split: train
    path: shona/train-*
- config_name: spanish
  data_files:
  - split: train
    path: spanish/train-*
- config_name: swahili
  data_files:
  - split: train
    path: swahili/train-*
- config_name: swedish
  data_files:
  - split: train
    path: swedish/train-*
- config_name: telugu
  data_files:
  - split: train
    path: telugu/train-*
- config_name: turkish
  data_files:
  - split: train
    path: turkish/train-*
- config_name: ukrainian
  data_files:
  - split: train
    path: ukrainian/train-*
- config_name: zulu
  data_files:
  - split: train
    path: zulu/train-*
---
* This experimental [BLiMP](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00321/96452/BLiMP-The-Benchmark-of-Linguistic-Minimal-Pairs)-style dataset contains morphological corruptions based on **minimal tag pairs** from [UniMorph](https://unimorph.github.io).
* A minimal tag pair is a pair differing in exactly one variable, e.g., `N;DEF;GEN;PL` --> `N;DEF;NOM;PL`.
* The source corpora are Wikipedia articles, most of which carry some sort of quality tag (e.g., 'excellent articles').
* The approach is inspired by [MultiBLiMP](https://aclanthology.org/2026.tacl-1.10/), which also uses UniMorph to generate subject–verb agreement minimal pairs. Our approach generalizes this idea to all possible tag pairs and includes no manual checks of whether the corruptions make sense.
* Because our approach is simple, purely data-driven, and includes no syntactic or other checks, this dataset is expected to be much **noisier**.
* We know that some corruptions do not actually corrupt the sentences in the intended ways!
* We are working on evaluating and iteratively refining the dataset.
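
The two ideas above, the minimal tag pair and the BLiMP-style comparison of an original sentence with its corrupted counterpart, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the dataset: `is_minimal_tag_pair` and `preference_accuracy` are hypothetical helper names, the tag check assumes both tag strings list their features in the same order separated by `;`, and `score` stands for any user-supplied function that assigns a higher number to a more acceptable sentence (for example, a language model's log-probability). The `original` and `corrupted` keys are the column names from the dataset's feature schema.

```python
from typing import Callable, Iterable, Mapping


def is_minimal_tag_pair(tags_a: str, tags_b: str) -> bool:
    """Return True if two UniMorph tag strings differ in exactly one position.

    Assumption: both strings list their features in the same order,
    separated by ';', as in `N;DEF;GEN;PL` vs. `N;DEF;NOM;PL`.
    """
    parts_a = tags_a.split(";")
    parts_b = tags_b.split(";")
    if len(parts_a) != len(parts_b):
        # Different numbers of features cannot be compared position by position.
        return False
    differences = sum(a != b for a, b in zip(parts_a, parts_b))
    return differences == 1


def preference_accuracy(
    rows: Iterable[Mapping[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of rows where `original` scores strictly higher than `corrupted`.

    `score` is a hypothetical scoring function supplied by the caller.
    """
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to evaluate")
    hits = sum(score(row["original"]) > score(row["corrupted"]) for row in rows)
    return hits / len(rows)
```

With the example from the list above, `is_minimal_tag_pair("N;DEF;GEN;PL", "N;DEF;NOM;PL")` returns `True`, since only the case feature differs. Because the dataset is known to be noisy, an accuracy computed this way is best read as a rough signal rather than a clean measure of morphological competence.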
Provider:
liu-nlp



