CMLI-NLP/TIFD

Name: CMLI-NLP/TIFD
Creator: CMLI-NLP
Published: 2024-11-14 14:00:37
License: not listed in the page metadata (the dataset card declares cc-by-4.0)

Source: Hugging Face. Updated 2024-11-14. Indexed 2025-04-19.

Download link:

https://hf-mirror.com/datasets/CMLI-NLP/TIFD


Resource description:

---
license: cc-by-4.0
---

# TIFD: Tibetan Instruction-Following Dataset

TIFD (Tibetan Instruction-Following Dataset) is a specialized instruction dataset for supervised fine-tuning of large language models. The dataset contains 11,535 high-quality Tibetan instructions, each with four attributes: unique identifier, instruction, input, and output.

## Dataset Features

- **Scale**: 11,535 high-quality Tibetan instruction records
- **Format**: JSON with four fields: `id`, `instruction`, `input`, `output`
- **Source**: Generated by GPT-4 and reviewed by professional Tibetan speakers
- **Usage**: Suitable for supervised fine-tuning of large language models

## Data Processing Pipeline

1. **Initial Data Generation**: GPT-4 generates data from 175 seed instructions
2. **Data Selection**: The LaBSE model vectorizes the instructions, and the K-Center-Greedy algorithm selects a representative subset
3. **Manual Review**: Multiple Tibetan experts review and verify data quality

## Dataset Access

The complete dataset is available at:

- [TIFD Dataset](https://huggingface.co/datasets/CMLI-NLP/TIFD/tree/main)

## Application Example

TIFD has been applied to supervised fine-tuning of the Tibetan language model TiLamb (based on LLaMA2-7B), significantly improving the model's Tibetan instruction understanding and dialogue capabilities.

## Disclaimer

This dataset/model is for academic research purposes only. Commercial use or unethical applications are prohibited.

## Citation

If you find this project useful for your research, please consider citing:

```bibtex
@article{Zhuang2024TIFD,
  title={TIFD: Tibetan Instruction-Following Dataset for Large Language Models Supervised Fine-Tuning},
  author={Wenhao Zhuang and Dawa Cairen and Yuan Sun},
  journal={Data Intelligence},
  year={2024},
  url={}
}
```
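
## Sketch: Reading Records for Fine-Tuning

The four-field record format described above (id, instruction, input, output) can be turned into prompt/response pairs for supervised fine-tuning roughly as follows. This is a minimal sketch, not the authors' pipeline. The prompt layout, the helper name `to_sft_pair`, the assumption that records arrive as a JSON array, and the sample records (English placeholders, not real TIFD entries) are all illustrative assumptions.

```python
import json
from typing import Dict, List, Tuple

# The four fields named in the dataset description.
REQUIRED_FIELDS = ("id", "instruction", "input", "output")


def to_sft_pair(record: Dict) -> Tuple[str, str]:
    """Build a (prompt, response) pair from one record.

    Hypothetical helper: the prompt layout (instruction, blank line,
    optional input) is an assumption, not the layout used for TiLamb.
    """
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    prompt = record["instruction"].strip()
    extra = record["input"].strip()
    if extra:  # only append the input when it is non-empty
        prompt += "\n\n" + extra
    return prompt, record["output"].strip()


# Made-up placeholder records; real entries are in Tibetan.
sample_json = """
[
  {"id": 1, "instruction": "Translate the sentence.", "input": "Hello",
   "output": "<placeholder answer>"},
  {"id": 2, "instruction": "Name a colour.", "input": "",
   "output": "<placeholder answer>"}
]
"""

records: List[Dict] = json.loads(sample_json)
pairs = [to_sft_pair(r) for r in records]
```

Checking for the four fields up front makes a malformed record fail loudly instead of silently producing an empty prompt. If the actual files use a different container (for example one JSON object per line), only the loading step would need to change.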
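
## Sketch: K-Center-Greedy Selection

The data selection step in the pipeline above pairs sentence embeddings with K-Center-Greedy selection. The following is a minimal sketch of the standard greedy k-center heuristic, assuming Euclidean distance and a fixed starting point. The source does not state the distance metric, the starting point, or the subset size, so those choices are assumptions. The toy 2-D vectors stand in for LaBSE embeddings, which would come from an embedding model in practice.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_center_greedy(
    vectors: List[Sequence[float]], k: int, first: int = 0
) -> List[int]:
    """Return indices of k points chosen by the greedy k-center heuristic.

    Starting from index `first` (an assumed default), repeatedly pick
    the point that is farthest from its nearest already-selected point.
    """
    if not 0 < k <= len(vectors):
        raise ValueError("k must be between 1 and len(vectors)")
    selected = [first]
    # min_dist[i] = distance from point i to its nearest selected center
    min_dist = [euclidean(v, vectors[first]) for v in vectors]
    while len(selected) < k:
        nxt = max(range(len(vectors)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update nearest-center distances with the newly added center.
        for i, v in enumerate(vectors):
            min_dist[i] = min(min_dist[i], euclidean(v, vectors[nxt]))
    return selected


# Toy stand-in embeddings: two tight clusters plus one outlier.
toy = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
picked = k_center_greedy(toy, k=3)
```

On the toy data the heuristic takes one point from each region rather than two near-duplicates from the same cluster, which is the property that makes it useful for choosing a diverse, representative subset of generated instructions.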

Provider:

CMLI-NLP
