welyjesch/alpaca_waray

Name: welyjesch/alpaca_waray
Creator: welyjesch
Published: 2026-03-25 10:18:03
License: No description provided

Source: Hugging Face. Updated 2026-03-25. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/welyjesch/alpaca_waray


Description:

---
language:
- war
- en
license: cc-by-nc-4.0
task_categories:
- text-generation
- question-answering
tags:
- alpaca
- instruction-tuning
- waray
- waray-waray
- philippine-languages
- low-resource
- translation
pretty_name: Waray Alpaca Dataset
size_categories:
- 10K<n<100K
source_datasets:
- tatsu-lab/alpaca
---

# 🇵🇭 Waray Alpaca Dataset

## Dataset Description

- **Point of Contact:** welyjesch@gmail.com
- **Primary Language:** Waray (Waray-Waray)
- **Source Language:** English

### Dataset Summary

This dataset is a **Waray translation** of the original Alpaca instruction-following dataset. It is designed to support research and development of **instruction-tuned language models** for low-resource Philippine languages, particularly Waray. The dataset retains the original Alpaca structure while providing high-quality translations of instructions, inputs, and outputs.

## Dataset Structure

### Data Instances

Each example follows this JSON format:

```json
{
  "instruction": "Waray instruction text",
  "input": "Optional context in Waray",
  "output": "Expected response in Waray"
}
```

### Data Fields

- `instruction`: The task or question in Waray.
- `input`: Additional context (may be empty).
- `output`: The correct expected response in Waray.

### Data Splits

| Split        | Description                           |
|--------------|---------------------------------------|
| `train`      | Main dataset for training             |
| `validation` | Optional validation set (if provided) |

## Dataset Creation

### Source Data

Based on the original **Alpaca dataset**, which was generated using instruction-following data derived from OpenAI models.
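The three-field record layout described under Data Instances and Data Fields can be checked with a few lines of Python before training. This is a minimal sketch, not part of the dataset's tooling: the function name `validate_record` and the sample records are hypothetical, and the rule that `instruction` and `output` must be non-empty is an assumption drawn from the field descriptions (only `input` is documented as possibly empty).

```python
from typing import Any

# Field names as documented in the dataset card.
REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"field is not a string: {field}")
    # Assumption: `input` may be empty, but `instruction` and `output`
    # are expected to carry text.
    for field in ("instruction", "output"):
        value = record.get(field)
        if isinstance(value, str) and not value.strip():
            problems.append(f"field is empty: {field}")
    return problems


# Placeholder records for illustration only; not taken from the dataset.
sample_records = [
    {
        "instruction": "Waray instruction text",
        "input": "",
        "output": "Expected response in Waray",
    },
    {"instruction": "Waray instruction text", "output": ""},
]

for index, record in enumerate(sample_records):
    print(index, validate_record(record))
```

Returning a list of problems rather than raising on the first one makes it easier to summarize issues across a large file in a single pass.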
### Translation Process

Translated from English to Waray using:

- Machine translation + human post-editing *(or specify your actual method)*
- Native speaker validation *(if applicable)*

## Use Cases

This dataset can be used for:

- Instruction tuning of LLMs in Waray
- Multilingual NLP research
- Low-resource language modeling
- Chatbot and assistant development for Waray speakers

## Limitations

- May contain translation artifacts or unnatural phrasing.
- Cultural nuances might not always be preserved.
- Not all instructions may perfectly align with Waray linguistic norms.
- Quality depends on the exact translation method used.

## Ethical Considerations

Ensure responsible use when deploying models trained on this dataset. Be mindful of:

- Bias inherited from the original Alpaca dataset.
- Potential mistranslations or harmful outputs.
- Not intended for high-stakes applications without further validation.

## Licensing

The original Alpaca dataset license applies.

**License:** CC BY-NC 4.0

*(Note: Datasets generated from OpenAI models are generally restricted from commercial use competing with OpenAI).*

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{waray_alpaca,
  title  = {Waray Alpaca Dataset},
  author = {Wely Jesch Sabalilag},
  year   = {2026},
  note   = {Translated version of the Alpaca dataset}
}
```

## Acknowledgements

- Original [Alpaca dataset creators (Stanford CRFM)](https://crfm.stanford.edu/2023/03/13/alpaca.html).
- Contributors and translators for Waray.

## Contact

For questions or contributions:

- **Name:** Wely Jesch Sabalilag
- **Email:** [welyjesch@gmail.com](mailto:welyjesch@gmail.com)
- **GitHub:** [github.com/welyjesch](https://github.com/welyjesch)
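For the instruction-tuning use case listed under Use Cases, each record is typically rendered into a single training string. The card does not specify a prompt template, so the sketch below is an assumption: it borrows the English template from the original Stanford Alpaca release, and the helper names `build_prompt` and `build_training_text` are hypothetical. Whether the template wrapper itself should be translated into Waray is a choice left to the user.

```python
# Assumption: English prompt templates from the original Stanford Alpaca
# release. This dataset card does not prescribe a template.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Render the prompt part of a record, skipping the input block if empty."""
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


def build_training_text(record: dict) -> str:
    """Concatenate the prompt and the expected response into one string."""
    return build_prompt(record) + record["output"]


# Placeholder record for illustration only; not taken from the dataset.
example = {
    "instruction": "Waray instruction text",
    "input": "Optional context in Waray",
    "output": "Expected response in Waray",
}
print(build_training_text(example))
```

Keeping prompt construction separate from the full training text makes it straightforward to mask the prompt tokens during loss computation, if the training setup calls for that.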


Provider:

welyjesch
