
CMLI-NLP/TIFD

Source: Hugging Face | Updated: 2024-11-14 | Indexed: 2025-04-19
Download link:
https://hf-mirror.com/datasets/CMLI-NLP/TIFD
Resource description:
---
license: cc-by-4.0
---

# TIFD: Tibetan Instruction-Following Dataset

TIFD (Tibetan Instruction-Following Dataset) is a specialized instruction dataset for supervised fine-tuning of large language models. It contains 11,535 high-quality Tibetan instructions, each with four attributes: unique identifier, instruction, input, and output.

## Dataset Features

- **Scale**: 11,535 high-quality Tibetan instruction examples
- **Format**: JSON with four fields: id, instruction, input, output
- **Source**: Generated by GPT-4 and reviewed by professional Tibetan speakers
- **Usage**: Suitable for supervised fine-tuning of large language models

## Data Processing Pipeline

1. **Initial Data Generation**: GPT-4 generates data from 175 seed instructions
2. **Data Selection**: The LaBSE model vectorizes the instructions, and the K-Center-Greedy algorithm selects a representative subset
3. **Manual Review**: Multiple Tibetan experts review and verify data quality

## Dataset Access

The complete dataset is available at:

- [TIFD Dataset](https://huggingface.co/datasets/CMLI-NLP/TIFD/tree/main)

## Application Example

TIFD has been applied to supervised fine-tuning of the Tibetan language model TiLamb (based on LLaMA2-7B), significantly improving the model's Tibetan instruction understanding and dialogue capabilities.

## Disclaimer

This dataset/model is for academic research purposes only. Commercial use or unethical applications are prohibited.

## Citation

If you find this project useful for your research, please consider citing:

```bibtex
@article{Zhuang2024TIFD,
  title={TIFD: Tibetan Instruction-Following Dataset for Large Language Models Supervised Fine-Tuning},
  author={Wenhao Zhuang and Dawa Cairen and Yuan Sun},
  journal={Data Intelligence},
  year={2024},
  url={}
}
```
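The data selection step described in the pipeline (LaBSE vectors followed by K-Center-Greedy) can be illustrated with a minimal sketch. The description names only the model and the algorithm, so everything else here is an assumption for illustration: the Euclidean distance metric, the choice of the first center, the function name `k_center_greedy`, and the toy 2-D vectors that stand in for real LaBSE embeddings. It is not the authors' implementation.

```python
import math


def k_center_greedy(vectors, k, first=0):
    """Greedily pick k indices whose points spread out over the whole set.

    Assumptions (not from the dataset card): Euclidean distance, and the
    point at index `first` is used as the initial center.
    """
    if not 0 < k <= len(vectors):
        raise ValueError("k must be between 1 and len(vectors)")
    selected = [first]
    # min_dist[i] is the distance from point i to its nearest chosen center.
    min_dist = [math.dist(v, vectors[first]) for v in vectors]
    while len(selected) < k:
        # The next center is the point farthest from all current centers.
        nxt = max(range(len(vectors)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Refresh nearest-center distances against the newly added center.
        for i, v in enumerate(vectors):
            d = math.dist(v, vectors[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy 2-D "embeddings" standing in for LaBSE sentence vectors (hypothetical).
toy = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 5.0)]
picked = k_center_greedy(toy, k=3)
print(picked)  # [0, 3, 4]
```

With near-duplicate points at indices 0/1 and 2/3, the greedy rule keeps one point from each cluster plus the isolated point, which is the behaviour that makes this family of algorithms useful for choosing a diverse, representative subset of instructions.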
Provider:
CMLI-NLP