CMLI-NLP/TIFD
Source: Hugging Face. Updated 2024-11-14; indexed 2025-04-19.
Download link:
https://hf-mirror.com/datasets/CMLI-NLP/TIFD
Description:
---
license: cc-by-4.0
---
# TIFD: Tibetan Instruction-Following Dataset
TIFD (Tibetan Instruction-Following Dataset) is a specialized instruction dataset for supervised fine-tuning of large language models. It contains 11,535 high-quality Tibetan instruction examples, each with four attributes: a unique identifier, an instruction, an input, and an output.
## Dataset Features
- **Scale**: 11,535 high-quality Tibetan instruction examples
- **Format**: JSON with four fields: `id`, `instruction`, `input`, `output`
- **Source**: Generated by GPT-4 and reviewed by professional Tibetan speakers
- **Usage**: Suitable for supervised fine-tuning of large language models
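As a rough illustration of the four-field format, the sketch below parses two records and renders each as a prompt/response pair of the kind commonly used for supervised fine-tuning. The sample records, the Alpaca-style prompt template, and the helper name `format_example` are assumptions made for illustration; they are not taken from the dataset or from the authors' training code.

```python
import json

# Hypothetical records with the four documented fields. The values are
# placeholders, not actual TIFD content, and the type of `id` is assumed.
raw = '''
[
  {"id": 1, "instruction": "<instruction text>", "input": "", "output": "<response text>"},
  {"id": 2, "instruction": "<instruction text>", "input": "<context text>", "output": "<response text>"}
]
'''

def format_example(record):
    """Render one record as a (prompt, response) pair.

    Uses an assumed Alpaca-style template; the input block is
    omitted when the `input` field is empty.
    """
    prompt = f"### Instruction:\n{record['instruction']}\n\n"
    if record.get("input"):
        prompt += f"### Input:\n{record['input']}\n\n"
    prompt += "### Response:\n"
    return prompt, record["output"]

records = json.loads(raw)
pairs = [format_example(r) for r in records]
for prompt, response in pairs:
    print(prompt + response)
    print("-" * 20)
```

Separating the prompt from the response makes it straightforward to mask the prompt tokens in the loss, a common choice in instruction tuning.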
## Data Processing Pipeline
1. **Initial Data Generation**: GPT-4 generates candidate data from 175 seed instructions
2. **Data Selection**: The LaBSE model embeds each instruction as a vector, and the K-Center-Greedy algorithm selects a representative subset
3. **Manual Review**: Multiple Tibetan experts review and verify data quality
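The selection step can be sketched in a few lines. The code below is a minimal, pure-Python version of K-Center-Greedy run on toy 2-D points that stand in for LaBSE sentence embeddings. The Euclidean distance, the choice of starting point, and the function name `k_center_greedy` are assumptions; the source does not state which metric or initialization the authors used.

```python
import math

def k_center_greedy(vectors, k, first=0):
    """Return k indices chosen by the greedy k-center heuristic.

    Starting from `first`, repeatedly add the point that is farthest
    from its nearest already-selected center.
    """
    if not 0 < k <= len(vectors):
        raise ValueError("k must be between 1 and len(vectors)")
    selected = [first]
    # min_dist[i] = distance from point i to its nearest selected center
    min_dist = [math.dist(v, vectors[first]) for v in vectors]
    while len(selected) < k:
        nxt = max(range(len(vectors)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Refresh nearest-center distances with the new center
        for i, v in enumerate(vectors):
            d = math.dist(v, vectors[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected

# Toy "embeddings": three tight clusters of two points each
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.1)]
chosen = k_center_greedy(points, k=3)
print(chosen)
```

On this toy input the heuristic picks one point from each cluster, which is the behavior that makes it useful for keeping a diverse subset of near-duplicate generated instructions.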
## Dataset Access
The complete dataset is available at:
- [TIFD Dataset](https://huggingface.co/datasets/CMLI-NLP/TIFD/tree/main)
## Application Example
TIFD has been applied to supervised fine-tuning of the Tibetan language model TiLamb (based on LLaMA2-7B), significantly improving the model's Tibetan instruction understanding and dialogue capabilities.
## Disclaimer
This dataset/model is for academic research purposes only. Commercial use or unethical applications are prohibited.
## Citation
If you find this project useful for your research, please consider citing:
```bibtex
@article{Zhuang2024TIFD,
title={TIFD: Tibetan Instruction-Following Dataset for Large Language Models Supervised Fine-Tuning},
author={Wenhao Zhuang and Dawa Cairen and Yuan Sun},
journal={Data Intelligence},
year={2024},
url={}
}
```
Provided by:
CMLI-NLP