nayeon212/BLEnD
Source: Hugging Face (updated 2024-07-16, indexed 2024-06-15)
Download link:
https://hf-mirror.com/datasets/nayeon212/BLEnD
Resource overview:
BLEnD is a benchmark dataset designed to evaluate the everyday knowledge of large language models (LLMs) across diverse cultures and languages. The dataset comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. The dataset is divided into two formats: short-answer questions and multiple-choice questions, and provides detailed annotations and question data. The dataset was created to address the lack of culture-specific everyday knowledge in LLMs, particularly across diverse regions and non-English languages.
Provided by:
nayeon212
Summary of original information
Dataset overview
Dataset information
- License: cc-by-sa-4.0
- Task categories:
  - question-answering
- Languages:
  - en
  - zh
  - es
  - id
  - ko
  - el
  - fa
  - ar
  - az
  - su
  - as
  - ha
  - am
- Size:
  - 10K<n<100K
Data configurations
- Config name: annotations
  - Data files:
    - Split: DZ, path: "data/annotations_hf/Algeria_data.json"
    - Split: AS, path: "data/annotations_hf/Assam_data.json"
    - Split: AZ, path: "data/annotations_hf/Azerbaijan_data.json"
    - Split: CN, path: "data/annotations_hf/China_data.json"
    - Split: ET, path: "data/annotations_hf/Ethiopia_data.json"
    - Split: GR, path: "data/annotations_hf/Greece_data.json"
    - Split: ID, path: "data/annotations_hf/Indonesia_data.json"
    - Split: IR, path: "data/annotations_hf/Iran_data.json"
    - Split: MX, path: "data/annotations_hf/Mexico_data.json"
    - Split: KP, path: "data/annotations_hf/North_Korea_data.json"
    - Split: NG, path: "data/annotations_hf/Northern_Nigeria_data.json"
    - Split: KR, path: "data/annotations_hf/South_Korea_data.json"
    - Split: ES, path: "data/annotations_hf/Spain_data.json"
    - Split: GB, path: "data/annotations_hf/UK_data.json"
    - Split: US, path: "data/annotations_hf/US_data.json"
    - Split: JB, path: "data/annotations_hf/West_Java_data.json"
- Config name: short-answer-questions
  - Data files:
    - Split: DZ, path: "data/questions_hf/Algeria_questions.json"
    - Split: AS, path: "data/questions_hf/Assam_questions.json"
    - Split: AZ, path: "data/questions_hf/Azerbaijan_questions.json"
    - Split: CN, path: "data/questions_hf/China_questions.json"
    - Split: ET, path: "data/questions_hf/Ethiopia_questions.json"
    - Split: GR, path: "data/questions_hf/Greece_questions.json"
    - Split: ID, path: "data/questions_hf/Indonesia_questions.json"
    - Split: IR, path: "data/questions_hf/Iran_questions.json"
    - Split: MX, path: "data/questions_hf/Mexico_questions.json"
    - Split: KP, path: "data/questions_hf/North_Korea_questions.json"
    - Split: NG, path: "data/questions_hf/Northern_Nigeria_questions.json"
    - Split: KR, path: "data/questions_hf/South_Korea_questions.json"
    - Split: ES, path: "data/questions_hf/Spain_questions.json"
    - Split: GB, path: "data/questions_hf/UK_questions.json"
    - Split: US, path: "data/questions_hf/US_questions.json"
    - Split: JB, path: "data/questions_hf/West_Java_questions.json"
- Config name: multiple-choice-questions
  - Data files:
    - Split: test, path: "data/mc_questions_hf/mc_questions_file.json"
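The per-region file paths in the two region-split configurations follow a visible naming pattern. The sketch below rebuilds a path from a region file stem; the helper name `region_file_path` and the `PATH_TEMPLATES` table are hypothetical and only mirror the paths listed above, they are not part of the dataset or of any library.

```python
# Hypothetical helper: rebuild the repository paths listed in the
# configurations from a region file stem such as "Algeria" or "West_Java".
PATH_TEMPLATES = {
    "annotations": "data/annotations_hf/{region}_data.json",
    "short-answer-questions": "data/questions_hf/{region}_questions.json",
}


def region_file_path(config_name: str, region: str) -> str:
    """Return the JSON path for one region under a per-region configuration."""
    if config_name not in PATH_TEMPLATES:
        # The multiple-choice configuration uses a single file, not one per region.
        raise ValueError(f"no per-region files for config {config_name!r}")
    return PATH_TEMPLATES[config_name].format(region=region)


print(region_file_path("annotations", "West_Java"))
print(region_file_path("short-answer-questions", "Algeria"))
```

The region stem is the file-name prefix (for example `Northern_Nigeria`), not the two-letter split code.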
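Data access
- Load the dataset:

```python
from datasets import load_dataset

# The second argument is the configuration name and must be a string.
annotations = load_dataset("nayeon212/BLEnD", "annotations")
questions = load_dataset("nayeon212/BLEnD", "short-answer-questions")
mcq = load_dataset("nayeon212/BLEnD", "multiple-choice-questions")
```

- Access data for a specific country/region:

```python
# Annotation data for Assam (splits are keyed by region code)
assam_annotations = annotations["AS"]
# Question data for Assam
assam_questions = questions["AS"]
```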
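Because each configuration exposes one split per region code, a quick sanity check is to count the rows in every split. The sketch below is a minimal illustration that runs on a small stand-in dictionary so it needs no download; `rows_per_split` is a hypothetical helper, and the assumption is that the object returned by `load_dataset` behaves like a mapping from split name to a sized collection.

```python
from typing import Mapping, Sized


def rows_per_split(dataset_dict: Mapping[str, Sized]) -> dict[str, int]:
    """Count the rows in every split of a mapping keyed by split name."""
    return {code: len(split) for code, split in dataset_dict.items()}


# Stand-in data with made-up rows; in practice, pass the loaded
# `annotations` or `questions` object instead (assumed to be mapping-like).
toy_splits = {"AS": [{"ID": "a"}, {"ID": "b"}], "KR": [{"ID": "c"}]}
print(rows_per_split(toy_splits))  # {'AS': 2, 'KR': 1}
```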
Data format
- Annotation data format:

```json
[{
    "ID": "Al-en-06",
    "question": "대한민국 학교 급식에서 흔히 볼 수 있는 음식은 무엇인가요?",
    "en_question": "What is a common school cafeteria food in your country?",
    "annotations": [
        {
            "answers": ["김치"],
            "en_answers": ["kimchi"],
            "count": 4
        },
        {
            "answers": ["밥", "쌀밥", "쌀"],
            "en_answers": ["rice"],
            "count": 3
        },
        ...
    ],
    "idks": {
        "idk": 0,
        "no-answer": 0,
        "not-applicable": 0
    }
}]
```
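To show how a record in this format might be consumed, the sketch below ranks the answer groups of one annotation record by their `count` field. It assumes `count` is the number of annotators who gave an answer in that group, which the format suggests but this page does not state; `ranked_answers` is a hypothetical helper, and the sample record copies the two groups shown above without the elided ones.

```python
# Sample record copied from the format example (elided groups omitted).
sample_record = {
    "ID": "Al-en-06",
    "question": "대한민국 학교 급식에서 흔히 볼 수 있는 음식은 무엇인가요?",
    "en_question": "What is a common school cafeteria food in your country?",
    "annotations": [
        {"answers": ["밥", "쌀밥", "쌀"], "en_answers": ["rice"], "count": 3},
        {"answers": ["김치"], "en_answers": ["kimchi"], "count": 4},
    ],
    "idks": {"idk": 0, "no-answer": 0, "not-applicable": 0},
}


def ranked_answers(record: dict) -> list[tuple[list[str], int]]:
    """Return (English answers, count) pairs, highest count first."""
    groups = sorted(record["annotations"], key=lambda g: g["count"], reverse=True)
    return [(g["en_answers"], g["count"]) for g in groups]


for en_answers, count in ranked_answers(sample_record):
    print(", ".join(en_answers), count)
```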
Country/region codes
| Country/Region | Region Code | Language | Language Code |
|---|---|---|---|
| United States | US | English | en |
| United Kingdom | GB | English | en |
| China | CN | Chinese | zh |
| Spain | ES | Spanish | es |
| Mexico | MX | Spanish | es |
| Indonesia | ID | Indonesian | id |
| South Korea | KR | Korean | ko |
| North Korea | KP | Korean | ko |
| Greece | GR | Greek | el |
| Iran | IR | Persian | fa |
| Algeria | DZ | Arabic | ar |
| Azerbaijan | AZ | Azerbaijani | az |
| West Java | JB | Sundanese | su |
| Assam | AS | Assamese | as |
| Northern Nigeria | NG | Hausa | ha |
| Ethiopia | ET | Amharic | am |
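The table can be carried into code as a plain lookup from region code to language code, which makes it easy to see which regions share a language. The sketch below is one possible encoding; the names `REGION_LANGUAGE` and `regions_by_language` are hypothetical, and the values are copied from the table above.

```python
from collections import defaultdict

# Region code -> language code, copied from the table above.
REGION_LANGUAGE = {
    "US": "en", "GB": "en", "CN": "zh", "ES": "es", "MX": "es", "ID": "id",
    "KR": "ko", "KP": "ko", "GR": "el", "IR": "fa", "DZ": "ar", "AZ": "az",
    "JB": "su", "AS": "as", "NG": "ha", "ET": "am",
}


def regions_by_language(mapping: dict[str, str]) -> dict[str, list[str]]:
    """Group region codes by the language code they map to."""
    grouped = defaultdict(list)
    for region, language in mapping.items():
        grouped[language].append(region)
    return dict(grouped)


# Languages that cover more than one region in the benchmark.
shared = {
    language: regions
    for language, regions in regions_by_language(REGION_LANGUAGE).items()
    if len(regions) > 1
}
print(shared)  # {'en': ['US', 'GB'], 'es': ['ES', 'MX'], 'ko': ['KR', 'KP']}
```

The lookup has 16 region codes and 13 distinct language codes, matching the counts given in the overview.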



