cjvt/WALS-Bench
收藏Hugging Face2026-04-03 更新2026-04-05 收录
下载链接:
https://hf-mirror.com/datasets/cjvt/WALS-Bench
下载链接
链接失效反馈官方服务:
资源简介:
---
pretty_name: WALS-bench
license: cc-by-4.0
multilinguality: multilingual
configs:
- config_name: WALS-features-format1
data_files:
- split: train
path: "WALS-benchmark-feat.jsonl"
- config_name: WALS-features-with-languages-format2
data_files:
- split: train
path: "WALS-benchmark-feat-with-lang.jsonl"
---
---
WALS-bench: A Metalinguistic Benchmark Based on WALS
### Overview
This is a large-scale multilingual benchmark that evaluates metalinguistic knowledge in large language models using typological features from the World Atlas of Language Structures (WALS). The benchmark covers 192 linguistic features across 2,660 languages.
### Benchmark Format
The benchmark is available in two formats:
Format 1: 192-question version - one question per feature, under which all languages with a corresponding ground truth value for that feature are listed.
Format 2: 76,475-question version - one question per feature-language pair with a corresponding ground truth value, fully expanded across all languages.
### Task Definition
Given a linguistic question derived from a WALS feature with a set of possible answers for a specific language, the model must predict the correct typological category for that language.
### Prompt
The benchmark is evaluated using a single prompt. The prompt template used in our experiment:
{question} The options are {possible_answers}. Answer by choosing one option. Do not provide an explanation.
### Proposed Data Splits
Validation set:
29 features - 5A, 12A, 17A, 21B, 28A, 33A, 35A, 45A, 49A, 56A, 58B, 73A, 80A, 81A, 86A, 89A, 90A, 90D, 92A, 98A, 109B, 111A, 118A, 124A, 131A, 137A, 143A, 144M, 144X
Test set:
29 features - 6A, 10A, 15A, 25A, 26A, 36A, 42A, 46A, 55A, 67A, 71A, 77A, 85A, 87A, 90C, 94A, 97A, 106A, 107A, 112A, 117A, 125A, 127A, 130B, 136A, 139A, 143C, 144Q, 144W
Training set:
134 features - the remaining features
### Data Format
FORMAT 1:
Each feature is stored in JSONL format:
{"feature_id": "1A",
"feature_name": "Consonant Inventories",
"domain": "Phonology",
"question": "How large is the consonant inventory in the `<LANGUAGE>` language?",
"possible_answers": "Small; Moderately small; Average; Moderately large; Large",
"ground_truth": {"Abipón": "Moderately small", "Abkhaz": "Large", "Alabama": "Small", "Aché": "Small" /* additional languages omitted/}}
`<LANGUAGE>` is replaced with a specific language name at inference time.
FORMAT 2:
Each feature is stored in JSONL format:
{"feature_id": "1A",
"feature_name": "Consonant Inventories",
"domain": "Phonology",
"question": "How large is the consonant inventory in the Abipón language?",
"possible_answers": "Small; Moderately small; Average; Moderately large; Large",
"language_name": "Abipón",
"ISO639-3": "axb",
"ground_truth": "Moderately small"}
### Linguistic Feature Coverage
Word Order: 56 features
Nominal Categories: 29 features
Simple Clauses: 26 features
Phonology: 20 features
Verbal Categories: 17 features
Lexicon: 13 features
Morphology: 12 features
Nominal Syntax: 8 features
Complex Sentences: 7 features
Sign Languages: 2 features
Clicks (Other): 1 feature
Writing System (Other): 1 feature
### Language Coverage
Total number of languages covered: 2,660 world languages.
### Evaluation
Predictions are evaluated by comparing model outputs to the WALS ground-truth categories.
### Dataset authors
Tjaša Arčon, Matej Klemen, Marko Robnik-Šikonja, Kaja Dobrovoljc and Luka Terčon (See http://hdl.handle.net/11356/2083 for the full entry.)
### Licence
The original WALS data is licensed under CC BY 4.0. The data has been adapted for use in this benchmark.
Source:
Dryer, Matthew S. & Haspelmath, Martin (eds.).
World Atlas of Language Structures Online.
Max Planck Institute for Evolutionary Anthropology.
https://wals.info
### Citation information
```
@misc{arčon2026evaluatingmetalinguisticknowledgelarge,
title={Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages},
author={Tjaša Arčon and Matej Klemen and Marko Robnik-Šikonja and Kaja Dobrovoljc},
year={2026},
eprint={2602.02182},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.02182},
}
```
提供机构:
cjvt
搜集汇总
数据集介绍

构建方式
WALS-Bench的构建植根于语言类型学领域,其核心数据来源于世界语言结构图谱这一权威资源。该数据集通过系统提取WALS中的192个语言学特征,覆盖了从音系到句法等11个主要语言领域,并依据特征与语言的对应关系设计了两种结构化格式。第一种格式以特征为中心,每个条目囊括了所有相关语言的真值;第二种格式则完全展开,为每个特征-语言对生成独立的问题条目,最终形成了包含76,475个具体问题的庞大基准测试集。
特点
本数据集最显著的特点在于其宏大的跨语言覆盖度与严谨的类型学框架。它涵盖了全球2,660种语言,提供了前所未有的语言多样性样本。在内容组织上,数据集将复杂的类型学知识转化为标准化的选择题形式,每个问题均对应一个明确的WALS特征及预设的有限答案选项。这种设计不仅确保了评估的客观性与可重复性,也使得大规模语言模型对元语言学知识的掌握程度得以被精确量化。
使用方法
使用WALS-Bench进行评估时,研究者需根据任务需求选择两种数据格式之一。评估遵循统一的提示模板:向模型呈现具体语言的特征问题及其选项,要求模型直接选择答案。数据集已预先划分为训练、验证和测试集,共涉及192个特征,其中134个用于训练,其余各29个分别用于验证和测试,这为模型的能力测评与对比研究提供了清晰、标准化的框架。评估的核心在于比较模型输出与WALS提供的标准答案之间的一致性。
背景与挑战
背景概述
WALS-Bench数据集由Tjaša Arčon等研究人员于2026年构建,依托马克斯·普朗克进化人类学研究所发布的《世界语言结构地图集》(WALS)这一权威类型学资源。该数据集旨在系统评估大语言模型在跨语言元语言学知识方面的能力,核心研究问题聚焦于模型能否准确预测全球2660种语言在192个语言学特征上的类型学分类。作为首个基于WALS的大规模多语言基准,它不仅深化了计算语言学与语言类型学的交叉研究,还为模型的语言理解泛化性提供了严谨的评估框架,推动了语言智能向更细致、更全面的跨语言认知方向发展。
当前挑战
WALS-Bench所应对的领域挑战在于,传统语言模型评估往往局限于少数高资源语言,难以衡量模型对全球语言多样性的类型学特征的元语言学推理能力。构建过程中的挑战则体现为:需从WALS的复杂类型学数据库中精准提取并结构化192个特征及其对应语言的真值,同时处理特征定义的多义性与语言样本的不均衡分布;此外,设计既能覆盖广泛语言又保持评估一致性的提示模板,以及将原始数据转化为两种可扩展的基准格式(特征级与特征-语言对级),均需克服数据标准化与评估逻辑设计的双重困难。
常用场景
解决学术问题
该数据集有效解决了计算语言学中关于大型语言模型元语言知识评估的若干核心学术问题。它通过整合WALS中跨越2,660种语言的类型学数据,为研究者提供了衡量模型是否内化了深层语言结构规律的可靠工具。其意义在于突破了以往评估多集中于少数主流语言的局限,促进了语言多样性在人工智能研究中的重视,推动了模型在音系、词序、形态等语言学范畴上的可解释性分析,对构建更具语言普遍性的人工智能系统产生了深远影响。
衍生相关工作
围绕WALS-Bench数据集,已衍生出一系列探索模型类型学知识与多语言性能关联的经典研究工作。这些研究通常利用该基准的评估结果,深入分析模型在不同语言家族或特征域上的表现差异,并尝试将类型学特征作为先验知识注入模型以提升其泛化能力。相关成果进一步推动了如语言类型学驱动的模型适配、元语言知识溯源以及多语言表示对齐等新兴研究方向的发展,丰富了计算语言学与语言类型学交叉领域的学术图景。
以上内容由遇见数据集搜集并总结生成



