welyjesch/alpaca_waray
Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/welyjesch/alpaca_waray
---
language:
- war
- en
license: cc-by-nc-4.0
task_categories:
- text-generation
- question-answering
tags:
- alpaca
- instruction-tuning
- waray
- waray-waray
- philippine-languages
- low-resource
- translation
pretty_name: Waray Alpaca Dataset
size_categories:
- 10K<n<100K
source_datasets:
- tatsu-lab/alpaca
---
# 🇵🇭 Waray Alpaca Dataset
## Dataset Description
- **Point of Contact:** welyjesch@gmail.com
- **Primary Language:** Waray (Waray-Waray)
- **Source Language:** English
### Dataset Summary
This dataset is a **Waray translation** of the original Alpaca instruction-following dataset. It is designed to support research and development of **instruction-tuned language models** for low-resource Philippine languages, particularly Waray.
The dataset retains the original Alpaca structure while providing high-quality translations of instructions, inputs, and outputs.
## Dataset Structure
### Data Instances
Each example follows this JSON format:
```json
{
"instruction": "Waray instruction text",
"input": "Optional context in Waray",
"output": "Expected response in Waray"
}
```
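As a minimal sketch of how a record in this shape might be checked before training, the snippet below parses one JSON object and verifies that the three fields are present as strings. The `AlpacaRecord` class and the `parse_record` helper are hypothetical names introduced for illustration; they are not part of the dataset or of any library.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AlpacaRecord:
    """One example in the Alpaca layout (hypothetical helper type)."""
    instruction: str
    input: str
    output: str


def parse_record(line: str) -> AlpacaRecord:
    """Parse one JSON object and check the three Alpaca fields."""
    obj = json.loads(line)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    for key in ("instruction", "input", "output"):
        # Every field must exist and be a string; `input` may be empty.
        if not isinstance(obj.get(key), str):
            raise ValueError(f"field {key!r} is missing or not a string")
    if not obj["instruction"].strip():
        raise ValueError("instruction must not be empty")
    return AlpacaRecord(obj["instruction"], obj["input"], obj["output"])
```

Rejecting malformed rows early is a design choice for this sketch; a more lenient loader could instead log and skip them.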
### Data Fields
- `instruction`: The task or question in Waray.
- `input`: Additional context (may be empty).
- `output`: The correct expected response in Waray.
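Because `input` may be empty, code that turns a record into a training prompt usually needs two templates. The sketch below assumes a template modeled on the English prompt from the original Stanford Alpaca project; this card does not state which template to use, and a Waray-language header may suit some models better. `build_prompt` is a hypothetical helper name.

```python
# Assumption: English templates modeled on the original Alpaca prompt.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(example: dict) -> str:
    """Return the prompt text for one record, omitting an empty `input`."""
    context = example.get("input", "")
    if context.strip():
        return PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=context
        )
    return PROMPT_NO_INPUT.format(instruction=example["instruction"])
```

During supervised fine-tuning, the `output` field would typically be appended after the `### Response:` marker as the target text.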
### Data Splits
| Split        | Description                           |
|--------------|---------------------------------------|
| `train`      | Main dataset for training             |
| `validation` | Optional validation set (if provided) |
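Since the `validation` split is described as optional, a reproducible holdout can be carved out of `train` when it is absent. The following is a sketch under stated assumptions: the 5% default fraction, the seed, and the `split_holdout` name are illustrative choices, not values prescribed by this dataset.

```python
import random
from typing import Sequence


def split_holdout(records: Sequence, holdout_fraction: float = 0.05, seed: int = 0):
    """Split records into (train, validation) lists with a seeded shuffle."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    if len(records) < 2:
        raise ValueError("need at least two records to split")
    indices = list(range(len(records)))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)
    n_holdout = max(1, int(len(records) * holdout_fraction))
    holdout_idx = indices[:n_holdout]
    train_idx = indices[n_holdout:]
    train = [records[i] for i in train_idx]
    validation = [records[i] for i in holdout_idx]
    return train, validation
```

Fixing the seed keeps the validation examples stable across runs, which makes evaluation numbers comparable between experiments.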
## Dataset Creation
### Source Data
Based on the original **Alpaca dataset** (`tatsu-lab/alpaca`), whose instruction-following examples were generated with OpenAI models.
### Translation Process
Translated from English to Waray using:
- Machine translation + human post-editing *(or specify your actual method)*
- Native speaker validation *(if applicable)*
## Use Cases
This dataset can be used for:
- Instruction tuning of LLMs in Waray
- Multilingual NLP research
- Low-resource language modeling
- Chatbot and assistant development for Waray speakers
## Limitations
- May contain translation artifacts or unnatural phrasing.
- Cultural nuances might not always be preserved.
- Not all instructions may perfectly align with Waray linguistic norms.
- Quality depends on the exact translation method used.
## Ethical Considerations
Ensure responsible use when deploying models trained on this dataset. Be mindful of:
- Bias inherited from the original Alpaca dataset.
- Potential mistranslations or harmful outputs.
- Unsuitability for high-stakes applications without further validation.
## Licensing
The original Alpaca dataset license applies.
**License:** CC BY-NC 4.0 *(Note: Datasets generated from OpenAI models are generally restricted from commercial use competing with OpenAI).*
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{waray_alpaca,
title = {Waray Alpaca Dataset},
author = {Wely Jesch Sabalilag},
year = {2026},
note = {Translated version of the Alpaca dataset}
}
```
## Acknowledgements
- Original [Alpaca dataset creators (Stanford CRFM)](https://crfm.stanford.edu/2023/03/13/alpaca.html).
- Contributors and translators for Waray.
## Contact
For questions or contributions:
- **Name:** Wely Jesch Sabalilag
- **Email:** [welyjesch@gmail.com](mailto:welyjesch@gmail.com)
- **GitHub:** [github.com/welyjesch](https://github.com/welyjesch)
Provider: welyjesch