multiloko

ModelScope community. Updated 2025-11-27; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/facebook/multiloko
Resource description:
# MultiLoKo: a multilingual local knowledge benchmark for LLMs

MultiLoKo is a multilingual knowledge benchmark, covering 30 languages plus English. The questions are separately sourced for each language, with an annotation protocol designed to target locally relevant topics for the respective language. MultiLoKo contains the original data for each language, as well as both human and machine-authored translations of each non-English subset into English and vice versa, facilitating studies into a variety of research questions relating to multilinguality.

More information about the benchmark design can be found in the release paper of the benchmark:

* Dieuwke Hupkes and Nikolay Bogoychev. [MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages](https://arxiv.org/abs/2504.10356).

```
@article{hupkes2025multiloko,
  title={MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages},
  author = {Dieuwke Hupkes and Nikolay Bogoychev},
  year = {2025},
  journal = {CoRR},
  volume = {abs/2504.10356},
  eprinttype = {arXiv},
  eprint = {2504.10356},
  url = {https://doi.org/10.48550/arXiv.2504.10356},
}
```

## Data

Each language in MultiLoKo has its own subdirectory, containing:

- A jsonl file `dev.jsonl` containing the 250 locally sourced questions for the respective language.
- A jsonl file `knowledge_fewshot.jsonl`, containing five fewshot examples for the language.
- A jsonl file `dev_translated_human_english.jsonl` with the human translations of the file into English (for English, instead there will be 30 files `dev_translated_human_{language}.jsonl`, for each of the respective languages).
- A jsonl file `dev_translated_machine_english.jsonl` with the machine translations of the file into English (for English, instead there will be 30 files `dev_translated_machine_{language}.jsonl`, for each of the respective languages).
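
As a minimal sketch of the directory layout described above, the snippet below builds the expected file path for one language subdirectory and reads a JSON Lines file with the Python standard library. The helper names `split_path` and `read_jsonl`, the root directory `benchmark_data`, and the folder name `dutch` are assumptions made for illustration; check the actual folder names in the downloaded dataset.

```python
import json
from pathlib import Path


def split_path(root, language, split="dev", translation=None, target="english"):
    """Build the path of one MultiLoKo file (hypothetical helper).

    translation: None for the original data, or "human" / "machine".
    target: the language the questions were translated into.
    """
    if translation is None:
        name = f"{split}.jsonl"
    else:
        name = f"{split}_translated_{translation}_{target}.jsonl"
    return Path(root) / language / name


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # "benchmark_data" and "dutch" are assumed names, not confirmed by the README.
    path = split_path("benchmark_data", "dutch", translation="human")
    print(path)
    if path.exists():
        print(len(read_jsonl(path)), "questions")
```

Keeping path construction separate from parsing makes it easy to loop over all languages and translation types with the same reader.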

Multilingual prompts can be found in our GitHub repository, in the file [examples/prompts.py](https://github.com/facebookresearch/multiloko/examples/prompts.py).

### Data format

The benchmark data is stored in jsonl files, containing a separate json object for each question. Each such object has the following fields:

- `text`: the paragraph from which the question was created. The answer to the question can be derived from this text. It is possible to transform the benchmark into a reading comprehension benchmark by preceding the question with this text.
- `question`: the question itself.
- `targets`: a list of acceptable (short) answers to the question.
- `target`: a long answer to the question, which could potentially be used for chain-of-thought (CoT). Note that the long answers have not been checked as extensively as the questions and short answers.
- `id`: the Wikipedia page that the text is sourced from, along with the rank of that page in the relevant locale.
- `output_type`: the expected type of the output (e.g. number, date, year, word, etc.), in the question language.
- `source_language`: the language in which the question was originally sourced.
- `question_language`: the language in which the question is asked.
- `translated`: whether the question is translated.
- `translation_type`: whether the translation is machine- or human-authored.

### MultiLoKo-test

MultiLoKo also has a secret test set with 250 examples for each language, which will be released later on, blindly. The split is similar to the dev split, apart from the fact that the topics it includes are more obscure. If you would like to obtain scores for your model on the set, please upload your model to HuggingFace and send us a request to compute scores by creating an issue on the [MultiLoKo GitHub page](https://github.com/facebookresearch/multiloko). We will return the full set of results to you, and put the test results of your model on the leaderboard on this page.
If it is not possible to upload your model to HuggingFace, please reach out to either one of the authors to discuss other options.

## Evaluation

On our GitHub page, you can find an evaluation script [eval.py](https://github.com/facebookresearch/multiloko/eval.py) that can be run directly on a model's output, provided in jsonl or csv. The script postprocesses and normalises the model answers and computes f1 score, exact match score, sentence chrf, edit distance to the closest target, edit distance to the first target, and max edit similarity between the targets. It also computes five aggregate metrics for each of the metrics above: the average score, the maximum and minimum scores (across languages), and the gap between the best and worst performing language. We support both CSV and JSONL.

### Usage

```
$ ./eval.py --help
usage: eval.py [-h] --dataset_location DATASET_LOCATION --subset SUBSET --predictions PREDICTIONS [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --dataset_location, -d DATASET_LOCATION
                        Dataset root directory where all language folders are located
  --subset, -s SUBSET   subset, ie dev, test, etc. Should match the jsonl filename inside the individual dataset language folders
  --predictions, -p PREDICTIONS
                        Prediction file to evaluate. Could be CSV or JSONL
  --output, -o OUTPUT   Output file (json) to write results to. If not specified, will print to stdout

./eval.py -d benchmark_data -s dev -p examples/test.json -o testscore.json
```

### Metric cheat sheet

| Metric | Description |
| --- | --- |
| Average EM | The first main metric we use to quantify performance for MultiLoKo is the average Exact Match score across languages, which expresses how many of the answers match one of the gold standard answers verbatim (after post-processing the answers). |
| Gap | The second main metric is the gap between a model's best and worst performing language. We use the gap to quantify the extent to which a model has achieved parity across languages. Because a small gap can be achieved both through parity on high scores and through parity on low scores, it is most informative in combination with average benchmark performance. |
| Mother tongue effect (MTE) | MTE expresses the impact of asking questions in a language in which the requested information is locally salient, compared to asking it in English. A positive MTE indicates information is more readily available in the language in which it was (likely) present in the training data, whereas a negative MTE indicates the information is more easily accessible in English. |
| Locality effect (LE) | LE quantifies the effect of using locally sourced vs translated data. It is measured by computing the difference between scores for locally sourced data and translated English-sourced data. A positive LE implies that using translated English data underestimates performance on a language, a negative LE that using translated English data overestimates performance. |
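
To make the cheat sheet concrete, here is a rough sketch of how Average EM, Gap, and a difference-based effect such as LE could be computed from per-language results. It does not reproduce the logic of `eval.py`: the `normalise` function is a simplified stand-in for the script's post-processing, and all function names and toy inputs are invented for illustration.

```python
import string


def normalise(answer):
    """Simplified stand-in for eval.py's post-processing (assumption):
    lowercase, drop ASCII punctuation, collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def exact_match(prediction, targets):
    """1.0 if the normalised prediction equals any normalised target."""
    pred = normalise(prediction)
    return float(any(pred == normalise(t) for t in targets))


def language_em(predictions, gold_targets):
    """Mean exact match over one language's questions."""
    scores = [exact_match(p, t) for p, t in zip(predictions, gold_targets)]
    return sum(scores) / len(scores)


def average_and_gap(em_by_language):
    """Average EM across languages, and the best-minus-worst gap."""
    values = list(em_by_language.values())
    return sum(values) / len(values), max(values) - min(values)


def locality_effect(em_local, em_translated_english):
    """LE as described in the table: score on locally sourced data minus
    score on translated English-sourced data, for the same language."""
    return em_local - em_translated_english


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    em = {
        "english": language_em(["Paris.", "1999"], [["Paris"], ["1998"]]),
        "dutch": language_em(["Den Haag", "12"], [["Den Haag", "'s-Gravenhage"], ["12"]]),
    }
    average, gap = average_and_gap(em)
    print(f"Average EM: {average:.2f}, Gap: {gap:.2f}")
```

With the toy inputs, English scores 0.5 and Dutch 1.0, so the average EM is 0.75 and the gap is 0.50. This shows why the two numbers are read together: the same gap could arise around a much lower average.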

Provider:
maas
Created:
2025-05-20