nahommohan/tibeb-training-data

Name: nahommohan/tibeb-training-data
Creator: nahommohan
Published: 2026-03-26 21:01:56
License: apache-2.0 (as declared in the dataset card metadata)

Source: Hugging Face. Updated: 2026-03-26. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/nahommohan/tibeb-training-data


Description:

---
license: apache-2.0
language:
- am
- en
tags:
- amharic
- ethiopian
- finance
- financial-literacy
- instruction-tuning
size_categories:
- 100K<n<1M
---

# Tibeb Training Data

Training dataset for **Tibeb AI**, Ethiopia's Amharic financial assistant.

## Dataset Description

~692K rows of Amharic instruction-following data from 10+ sources, designed to fine-tune LLMs for Amharic financial literacy.

## Sources

| Source | ~Rows | Description |
|--------|-------|-------------|
| EthioNLP Instructions | 122K | Amharic instruction-following tasks |
| Amharic MT | 200K | Translation pairs (filtered for Amharic output) |
| Amharic News | 41K | News classification |
| Aya Collection | 100K | Diverse Amharic NLP tasks |
| EthioSenti | 47K | Sentiment analysis |
| Native Amharic (Wiki, C4, etc.) | 78K | 4 corpus sources |
| ALFFA Transcriptions | 10K+ | Voice transcription text |
| Tibeb Synthetic Financial | 1,595 | Generated financial conversations (5x upsampled) |

## Files

- `data/tibeb_unified_train.jsonl`: Merged, deduplicated, normalized dataset (1.6 GB)
- `data/mlx_train/`: Pre-split for MLX training (95/5 train/valid)
- `data/native_amharic/`: Raw native Amharic corpora
- Individual source JSONL files

## Format

Each row in the unified dataset:

```json
{"instruction": "...", "input": "...", "output": "...", "source": "..."}
```

## Usage

```python
from datasets import load_dataset

ds = load_dataset("nahommohan/tibeb-training-data", data_files="data/tibeb_unified_train.jsonl")
```

## Code

Training pipeline: [github.com/nahomar/tibeb-training](https://github.com/nahomar/tibeb-training)
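As a minimal sketch of how rows in the `instruction` / `input` / `output` / `source` format might be consumed, the snippet below renders a row into a single training string and makes a 95/5 train/valid split. The prompt template, the helper names `format_example` and `split_rows`, and the toy rows are illustrative assumptions, not part of the Tibeb pipeline. The 95/5 ratio mirrors the one stated for `data/mlx_train/`, but how that split was actually produced is not documented here.

```python
import json
import random


def format_example(row: dict) -> str:
    """Render one row as a single string (hypothetical prompt template)."""
    instruction = row["instruction"].strip()
    context = row.get("input", "").strip()
    output = row["output"].strip()
    if context:
        prompt = (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n### Response:\n"
        )
    else:
        # Rows with an empty "input" field skip the Input section.
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt + output


def split_rows(rows, valid_fraction=0.05, seed=0):
    """Shuffle deterministically and hold out a fraction for validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_valid = max(1, round(len(rows) * valid_fraction))
    return rows[n_valid:], rows[:n_valid]


# Toy JSONL lines standing in for data/tibeb_unified_train.jsonl (assumed content).
sample_lines = [
    json.dumps(
        {
            "instruction": f"Question {i}",
            "input": "" if i % 2 else f"Context {i}",
            "output": f"Answer {i}",
            "source": "toy",
        }
    )
    for i in range(40)
]
rows = [json.loads(line) for line in sample_lines]

train, valid = split_rows(rows)
print(len(train), len(valid))
print(format_example(rows[1]))
```

With the real file, `rows` could instead be read line by line from the JSONL, keeping the `source` field available for per-source filtering or reweighting.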


Provider:

nahommohan
