
liu-nlp/unimorph-blimp-200

Hugging Face · Updated 2026-03-18 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/liu-nlp/unimorph-blimp-200
Resource overview:
Metadata:

* Languages (`language`): sq, ar, hy, az, eu, bn, ca, ku, hr, cs, da, et, fo, fi, fr, gl, ka, de, el, he, hi, hu, is, id, kn, lt, se, ps, fa, pl, pt, ro, ru, es, sw, sv, te, tr, uk
* License (`license`): apache-2.0
* Pretty name (`pretty_name`): An Experimental BLiMP-style Dataset based on UniMorph Minimal Tag Pairs
* Features (identical for every config, all of dtype `string`): `subset`, `language`, `original`, `corrupted`
* Splits: every config has a single `train` split whose data files match `<config_name>/train-*`. For every config, `dataset_size` equals `num_bytes`.

| config_name | num_examples | num_bytes | download_size |
|---|---|---|---|
| albanian | 7375 | 3551509 | 1721278 |
| arabic | 1389 | 1868302 | 571578 |
| armenian | 7941 | 41407687 | 12499293 |
| azerbaijani | 4848 | 2124372 | 958296 |
| basque | 538 | 207074 | 122488 |
| bengali | 261 | 2533627 | 143364 |
| catalan | 3087 | 1453106 | 825907 |
| central_kurdish | 907 | 863332 | 320219 |
| croatian | 22008 | 9365136 | 5440538 |
| czech | 21347 | 8052772 | 4693274 |
| danish | 2485 | 977529 | 574074 |
| estonian | 7069 | 2201515 | 1066526 |
| faroese | 8795 | 2727480 | 1394876 |
| finnish | 10033 | 3470582 | 1756967 |
| french | 3048 | 1785474 | 1068294 |
| galician | 3600 | 1712420 | 1005722 |
| georgian | 8035 | 7410805 | 2429439 |
| german | 9000 | 3784606 | 2153238 |
| greek | 9484 | 7401994 | 3290335 |
| hebrew | 3044 | 1810087 | 905779 |
| hindi | 872 | 3828137 | 637038 |
| hungarian | 14570 | 6366275 | 3240238 |
| icelandic | 5453 | 1936100 | 879174 |
| indonesian | 1065 | 465739 | 238215 |
| italian | 3021 | 1783448 | 1046705 |
| kannada | 5079 | 4764170 | 1383900 |
| kazakh | 1814 | 1034106 | 305171 |
| latvian | 13158 | 4701946 | 2391660 |
| lithuanian | 13469 | 4803347 | 2185318 |
| maltese | 1732 | 727653 | 373037 |
| northern_sami | 6125 | 2150915 | 911629 |
| pashto | 3211 | 4336795 | 1278790 |
| persian | 2604 | 1739582 | 805674 |
| polish | 19549 | 7905283 | 4385393 |
| portuguese | 3564 | 1900545 | 1086492 |
| romanian | 5000 | 2417224 | 1400218 |
| russian | 16804 | 11928399 | 5946761 |
| shona | 397 | 150688 | 30958 |
| spanish | 5330 | 6266812 | 3357644 |
| swahili | 527 | 193577 | 82710 |
| swedish | 4254 | 1781002 | 957242 |
| telugu | 397 | 252956 | 43445 |
| turkish | 9214 | 3980531 | 2144486 |
| ukrainian | 18248 | 12342982 | 5719783 |
| zulu | 937 | 473556 | 189529 |

Description:

* This experimental [BLiMP](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00321/96452/BLiMP-The-Benchmark-of-Linguistic-Minimal-Pairs)-style dataset contains morphological corruptions based on **minimal tag pairs** from [UniMorph](https://unimorph.github.io).
* A minimal tag pair is a pair of tags that differ in exactly one variable, e.g., `N;DEF;GEN;PL` --> `N;DEF;NOM;PL`.
* The source corpora are Wikipedia articles, most of which carry some sort of quality tag (e.g., 'excellent articles').
* The approach is inspired by [MultiBLiMP](https://aclanthology.org/2026.tacl-1.10/), which also uses UniMorph to generate subject–verb agreement minimal pairs. We generalize it to all possible tag pairs, without manually checking whether the corruptions make sense.
* Because the approach is simple, purely data-driven, and includes no syntactic or other checks, this dataset is expected to be much **noisier**.
* We know that some corruptions do not actually corrupt the sentences in the intended ways!
* We are working on evaluating and iteratively refining the dataset.
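
The two ideas in the description, a minimal tag pair and a BLiMP-style comparison of `original` against `corrupted`, can be sketched as follows. This is an illustrative sketch, not the authors' pipeline. It assumes that tags are `;`-separated and list their variables in the same order, and that a model is scored by how often it rates the original sentence above the corrupted one. The helper names `is_minimal_tag_pair` and `minimal_pair_accuracy`, the toy rows, and the toy scores are all invented for the example.

```python
from typing import Callable, Iterable, Mapping


def is_minimal_tag_pair(tag_a: str, tag_b: str) -> bool:
    """Return True if two ';'-separated tags differ in exactly one position.

    Assumption: both tags list their variables in the same order, so a
    positional comparison is enough. Tags of different length are rejected.
    """
    parts_a, parts_b = tag_a.split(";"), tag_b.split(";")
    if len(parts_a) != len(parts_b):
        return False
    return sum(a != b for a, b in zip(parts_a, parts_b)) == 1


def minimal_pair_accuracy(
    rows: Iterable[Mapping[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of rows where score(original) is strictly above score(corrupted).

    `score` is any sentence scorer where higher means more acceptable,
    for example a language model's log-probability (hypothetical here).
    """
    items = list(rows)
    if not items:
        raise ValueError("no rows to score")
    wins = sum(score(row["original"]) > score(row["corrupted"]) for row in items)
    return wins / len(items)


# Invented toy rows using the dataset's `original` / `corrupted` field names.
toy_rows = [
    {"original": "the dogs bark", "corrupted": "the dogs barks"},
    {"original": "she sings", "corrupted": "she sing"},
]

# Invented scores standing in for a real model; the second pair is
# deliberately scored the "wrong" way round.
toy_scores = {
    "the dogs bark": -3.0,
    "the dogs barks": -5.0,
    "she sings": -2.5,
    "she sing": -2.0,
}

accuracy = minimal_pair_accuracy(toy_rows, toy_scores.__getitem__)
print(is_minimal_tag_pair("N;DEF;GEN;PL", "N;DEF;NOM;PL"))
print(accuracy)
```

To run the same comparison on real rows, a config can be loaded with the Hugging Face `datasets` library, e.g. `load_dataset("liu-nlp/unimorph-blimp-200", "albanian", split="train")`, and passed to `minimal_pair_accuracy` together with a real scorer. Given the noise the authors describe, such accuracies are best read as rough indicators.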
Provider:
liu-nlp