EthioNLP/Amharic_Instruction_dataset

Name: EthioNLP/Amharic_Instruction_dataset
Creator: EthioNLP
Published: 2025-05-30 13:13:32
License: No description available

Source: Hugging Face. Updated: 2025-05-30. Indexed: 2024-06-11.

Download link:

https://hf-mirror.com/datasets/EthioNLP/Amharic_Instruction_dataset

Description:

The Walia dataset is designed to enhance large language models for the Amharic language by converting existing task-specific datasets (e.g., sentiment analysis, QA, NER) into instruction format, creating new generative datasets (e.g., poem generation, religious lyrics, story generation), and translating English instruction datasets (e.g., Alpaca, Dolly) into Amharic for comparative studies. Each data point follows a structured instruction format with a natural language task description, optional input text, and the expected model output in Amharic.
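The three-part structure described above (task description, optional input, expected output) can be illustrated with a minimal sketch. The template labels "Instruction:", "Input:" and "Response:" and the helper name `build_prompt` are assumptions made for illustration; the listing does not state which prompt template the dataset authors use.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Combine a task description with its optional input text.

    The label wording is a hypothetical template, not the dataset's own.
    """
    instruction = instruction.strip()
    input_text = input_text.strip()
    if input_text:
        # Records with an input carry both the task and the text to act on.
        return f"Instruction: {instruction}\nInput: {input_text}\nResponse:"
    # Purely generative tasks (e.g. story generation) may have no input.
    return f"Instruction: {instruction}\nResponse:"


# Placeholder record; real records hold Amharic text in these fields.
example = {
    "instruction": "Classify the sentiment of the following text.",
    "input": "<Amharic text goes here>",
    "output": "<expected Amharic answer>",
}
prompt = build_prompt(example["instruction"], example["input"])
```

During fine-tuning, the `output` field would typically be appended after the prompt as the target text.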

Provider:

EthioNLP

Original information summary

Dataset overview

Dataset features

instruction: string
input: string
output: string
prompt_header: string
datasource: string
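A minimal sketch of a schema check for the five string fields listed above. The field names come from the listing; the helper name `validate_record` and the wording of the error messages are hypothetical.

```python
from typing import Any, Mapping

# Field names as given in the dataset features list; all are strings.
EXPECTED_FIELDS = ("instruction", "input", "output", "prompt_header", "datasource")


def validate_record(record: Mapping[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record is valid."""
    problems = []
    for name in EXPECTED_FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            actual = type(record[name]).__name__
            problems.append(f"field {name} is {actual}, expected str")
    return problems
```

A check like this can be run over a sample of records before training to catch rows with missing or non-string values.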

Dataset splits

Training set (train):
- Number of examples: 122425
- Size: 405544798 bytes
Validation set (validation):
- Number of examples: 16311
- Size: 47050567 bytes
Test set (test):
- Number of examples: 15261
- Size: 56184295 bytes
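One way to load a single split is through the Hugging Face `datasets` library, sketched below. The repository ID and split names are taken from this listing; the wrapper `load_split` is a hypothetical helper, and the sketch assumes the third-party `datasets` package is installed and that network access is available.

```python
# Split names as listed above.
KNOWN_SPLITS = ("train", "validation", "test")


def load_split(split: str):
    """Load one split of the dataset from the Hugging Face Hub."""
    if split not in KNOWN_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {KNOWN_SPLITS}")
    # Imported lazily so the name check above works without the package.
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset("EthioNLP/Amharic_Instruction_dataset", split=split)
```

Calling `load_split("validation")` would return the validation split; the split name is checked first so that a typo fails before any download starts.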

Dataset size

Download size: 204309893 bytes
Total dataset size: 508779660 bytes
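The figures above can be cross-checked with simple arithmetic: the three split sizes should add up to the total dataset size. The sketch below also derives each split's share of the examples and the ratio of download size to total size. All numbers are copied from this listing; the variable names are illustrative.

```python
# (number of examples, size in bytes) per split, as listed above.
SPLITS = {
    "train": (122425, 405544798),
    "validation": (16311, 47050567),
    "test": (15261, 56184295),
}
DOWNLOAD_BYTES = 204309893
TOTAL_BYTES = 508779660

total_examples = sum(count for count, _ in SPLITS.values())
total_bytes = sum(size for _, size in SPLITS.values())

# The per-split sizes should sum to the reported total dataset size.
assert total_bytes == TOTAL_BYTES

for name, (count, size) in SPLITS.items():
    share = count / total_examples
    print(f"{name}: {share:.1%} of examples, {size / count:.0f} bytes per example")

print(f"download size / total size: {DOWNLOAD_BYTES / TOTAL_BYTES:.2f}")
```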

Language

Supported language: Amharic (am)
