nayeon212/BLEnD

Source: Hugging Face. Updated: 2024-07-16. Indexed: 2024-06-15.
Download link:
https://hf-mirror.com/datasets/nayeon212/BLEnD
Summary:

BLEnD is a benchmark dataset designed to evaluate the everyday knowledge of large language models (LLMs) across diverse cultures and languages. The dataset comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. The dataset is divided into two formats: short-answer questions and multiple-choice questions, and provides detailed annotations and question data. The dataset was created to address the lack of culture-specific everyday knowledge in LLMs, particularly across diverse regions and non-English languages.
Provided by:
nayeon212
Summary of original information

Dataset overview

Dataset information

  • License: cc-by-sa-4.0
  • Task categories:
    • question-answering
  • Languages:
    • en
    • zh
    • es
    • id
    • ko
    • el
    • fa
    • ar
    • az
    • su
    • as
    • ha
    • am
  • Size category:
    • 10K<n<100K

Data configurations

  • Config name: annotations

    • Data files:
      • Split: DZ
        • Path: "data/annotations_hf/Algeria_data.json"
      • Split: AS
        • Path: "data/annotations_hf/Assam_data.json"
      • Split: AZ
        • Path: "data/annotations_hf/Azerbaijan_data.json"
      • Split: CN
        • Path: "data/annotations_hf/China_data.json"
      • Split: ET
        • Path: "data/annotations_hf/Ethiopia_data.json"
      • Split: GR
        • Path: "data/annotations_hf/Greece_data.json"
      • Split: ID
        • Path: "data/annotations_hf/Indonesia_data.json"
      • Split: IR
        • Path: "data/annotations_hf/Iran_data.json"
      • Split: MX
        • Path: "data/annotations_hf/Mexico_data.json"
      • Split: KP
        • Path: "data/annotations_hf/North_Korea_data.json"
      • Split: NG
        • Path: "data/annotations_hf/Northern_Nigeria_data.json"
      • Split: KR
        • Path: "data/annotations_hf/South_Korea_data.json"
      • Split: ES
        • Path: "data/annotations_hf/Spain_data.json"
      • Split: GB
        • Path: "data/annotations_hf/UK_data.json"
      • Split: US
        • Path: "data/annotations_hf/US_data.json"
      • Split: JB
        • Path: "data/annotations_hf/West_Java_data.json"
  • Config name: short-answer-questions

    • Data files:
      • Split: DZ
        • Path: "data/questions_hf/Algeria_questions.json"
      • Split: AS
        • Path: "data/questions_hf/Assam_questions.json"
      • Split: AZ
        • Path: "data/questions_hf/Azerbaijan_questions.json"
      • Split: CN
        • Path: "data/questions_hf/China_questions.json"
      • Split: ET
        • Path: "data/questions_hf/Ethiopia_questions.json"
      • Split: GR
        • Path: "data/questions_hf/Greece_questions.json"
      • Split: ID
        • Path: "data/questions_hf/Indonesia_questions.json"
      • Split: IR
        • Path: "data/questions_hf/Iran_questions.json"
      • Split: MX
        • Path: "data/questions_hf/Mexico_questions.json"
      • Split: KP
        • Path: "data/questions_hf/North_Korea_questions.json"
      • Split: NG
        • Path: "data/questions_hf/Northern_Nigeria_questions.json"
      • Split: KR
        • Path: "data/questions_hf/South_Korea_questions.json"
      • Split: ES
        • Path: "data/questions_hf/Spain_questions.json"
      • Split: GB
        • Path: "data/questions_hf/UK_questions.json"
      • Split: US
        • Path: "data/questions_hf/US_questions.json"
      • Split: JB
        • Path: "data/questions_hf/West_Java_questions.json"
  • Config name: multiple-choice-questions

    • Data files:
      • Split: test
        • Path: "data/mc_questions_hf/mc_questions_file.json"
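The file paths listed above follow a regular pattern: one directory per configuration, and one file per region for the two per-region configurations. The following is a minimal sketch of that pattern, useful for example when fetching a single JSON file instead of a whole configuration. The names `REGION_STEMS`, `CONFIG_LAYOUT`, and `data_file_path` are hypothetical helpers introduced for illustration; only the paths themselves come from the listing.

```python
# File-name stems per split code, copied from the paths listed above.
REGION_STEMS = {
    "DZ": "Algeria", "AS": "Assam", "AZ": "Azerbaijan", "CN": "China",
    "ET": "Ethiopia", "GR": "Greece", "ID": "Indonesia", "IR": "Iran",
    "MX": "Mexico", "KP": "North_Korea", "NG": "Northern_Nigeria",
    "KR": "South_Korea", "ES": "Spain", "GB": "UK", "US": "US",
    "JB": "West_Java",
}

# (directory, file-name suffix) for the two per-region configurations.
CONFIG_LAYOUT = {
    "annotations": ("data/annotations_hf", "data"),
    "short-answer-questions": ("data/questions_hf", "questions"),
}

# The multiple-choice configuration has a single file and a single split.
MCQ_PATH = "data/mc_questions_hf/mc_questions_file.json"


def data_file_path(config: str, split: str) -> str:
    """Return the repository-relative JSON path for a config and split.

    Hypothetical helper: it only reproduces the paths listed in the
    dataset card and raises KeyError for anything not listed there.
    """
    if config == "multiple-choice-questions":
        if split != "test":
            raise KeyError(f"unknown split for {config}: {split}")
        return MCQ_PATH
    directory, suffix = CONFIG_LAYOUT[config]
    return f"{directory}/{REGION_STEMS[split]}_{suffix}.json"


if __name__ == "__main__":
    print(data_file_path("annotations", "AS"))
    print(data_file_path("short-answer-questions", "JB"))
```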

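Data access

  • Load the dataset (Python):

    from datasets import load_dataset

    # The second argument is the configuration name and must be a string.
    annotations = load_dataset("nayeon212/BLEnD", "annotations")
    questions = load_dataset("nayeon212/BLEnD", "short-answer-questions")
    mcq = load_dataset("nayeon212/BLEnD", "multiple-choice-questions")

  • Access data for a specific country/region (Python):

    # Access the annotation data for Assam
    assam_annotations = annotations["AS"]

    # Access the question data for Assam
    assam_questions = questions["AS"]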

Data format

  • Annotation data format (JSON):

    [{
      "ID": "Al-en-06",
      "question": "대한민국 학교 급식에서 흔히 볼 수 있는 음식은 무엇인가요?",
      "en_question": "What is a common school cafeteria food in your country?",
      "annotations": [
        {
          "answers": ["김치"],
          "en_answers": ["kimchi"],
          "count": 4
        },
        {
          "answers": ["밥", "쌀밥", "쌀"],
          "en_answers": ["rice"],
          "count": 3
        },
        ...
      ],
      "idks": {
        "idk": 0,
        "no-answer": 0,
        "not-applicable": 0
      }
    }]
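As a rough illustration of how one annotation record in the format above might be consumed, the sketch below ranks the answer groups by their `count` and checks whether a model response mentions any annotated answer. The sample record contains only the two answer groups visible in the example (the elided ones are not filled in). The functions `ranked_answers` and `mentions_annotated_answer` are hypothetical, and the substring match is an assumption made for illustration, not the benchmark's official scoring procedure.

```python
# Sample record with only the two answer groups shown in the format example.
record = {
    "ID": "Al-en-06",
    "question": "대한민국 학교 급식에서 흔히 볼 수 있는 음식은 무엇인가요?",
    "en_question": "What is a common school cafeteria food in your country?",
    "annotations": [
        {"answers": ["김치"], "en_answers": ["kimchi"], "count": 4},
        {"answers": ["밥", "쌀밥", "쌀"], "en_answers": ["rice"], "count": 3},
    ],
    "idks": {"idk": 0, "no-answer": 0, "not-applicable": 0},
}


def ranked_answers(rec, min_count=1):
    """Return (English answers, count) pairs, most frequent first."""
    groups = [
        (group["en_answers"], group["count"])
        for group in rec["annotations"]
        if group["count"] >= min_count
    ]
    return sorted(groups, key=lambda pair: pair[1], reverse=True)


def mentions_annotated_answer(rec, response):
    """Return True if the response contains any annotated answer.

    Assumption: a case-insensitive substring match over both the
    local-language and the English answers; a simplification only.
    """
    text = response.casefold()
    for group in rec["annotations"]:
        for answer in group["answers"] + group["en_answers"]:
            if answer.casefold() in text:
                return True
    return False


if __name__ == "__main__":
    print(ranked_answers(record))
    print(mentions_annotated_answer(record, "Students often eat Kimchi."))
```

Substring matching is simple but can over-match short answers inside longer words, so a stricter tokenized comparison may be preferable in practice.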

Country/region codes

Country/Region      Region Code   Language      Language Code
United States       US            English       en
United Kingdom      GB            English       en
China               CN            Chinese       zh
Spain               ES            Spanish       es
Mexico              MX            Spanish       es
Indonesia           ID            Indonesian    id
South Korea         KR            Korean        ko
North Korea         KP            Korean        ko
Greece              GR            Greek         el
Iran                IR            Persian       fa
Algeria             DZ            Arabic        ar
Azerbaijan          AZ            Azerbaijani   az
West Java           JB            Sundanese     su
Assam               AS            Assamese      as
Northern Nigeria    NG            Hausa         ha
Ethiopia            ET            Amharic       am
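The code table above can be expressed as a small lookup, which is handy when grouping per-region splits by language (for example, to compare the two English-speaking or the two Korean-speaking regions). This is a minimal sketch: the names `SPLIT_LANGUAGE` and `splits_by_language` are hypothetical, and the values are copied from the table.

```python
from collections import defaultdict

# Split (region) code -> language code, copied from the table above.
SPLIT_LANGUAGE = {
    "US": "en", "GB": "en", "CN": "zh", "ES": "es", "MX": "es", "ID": "id",
    "KR": "ko", "KP": "ko", "GR": "el", "IR": "fa", "DZ": "ar", "AZ": "az",
    "JB": "su", "AS": "as", "NG": "ha", "ET": "am",
}


def splits_by_language(mapping):
    """Group split codes by language code, preserving table order."""
    grouped = defaultdict(list)
    for split, language in mapping.items():
        grouped[language].append(split)
    return dict(grouped)


if __name__ == "__main__":
    grouped = splits_by_language(SPLIT_LANGUAGE)
    # 16 regions collapse into 13 languages, matching the summary.
    print(len(SPLIT_LANGUAGE), len(grouped))
    print(grouped["ko"])
```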