INX-TEXT/Bailong-bench
Source: Hugging Face | Updated: 2024-11-18 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/INX-TEXT/Bailong-bench
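Resource description:
Bailong-bench is a benchmark dataset designed to evaluate how well a model follows instructions in English and Traditional Chinese. It assesses model performance across a range of real-world application scenarios, as well as the model's ability to maintain language consistency.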
Provided by:
INX-TEXT
Summary of original information
Bailong: Bilingual transfer learning based on QLoRA and zip-tie embedding
Overview
Bailong is a project aimed at enhancing Traditional Chinese performance in open-source large language models (LLMs) through bilingual transfer learning. The project utilizes QLoRA and zip-tie embedding techniques to improve model efficiency and performance.
Key Components
- Bailong 7B: An autoregressive language model with 7B parameters, derived from Llama 2 7B. It employs tied embedding and expanded vocabulary, trained with a context length of 2048 tokens primarily on Traditional Chinese data with a minor portion of English data. QLoRA is used during secondary pretraining to reduce computational costs while maintaining model performance.
- Bailong-instruct 7B: A fine-tuned version of Bailong 7B optimized for multi-turn dialogue use cases. It also uses QLoRA for fine-tuning and is released on Hugging Face to facilitate research in Traditional Chinese NLP.
- Bailong-bench: A benchmark dataset designed to evaluate a model's proficiency in following both English and Traditional Chinese instructions, and its ability to maintain language consistency in real-world applications.
- Technical report: A future release providing a detailed overview of the Bailong project.
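To make the idea of language consistency concrete, the following is a minimal sketch of one way to check that a response is written in the same language as its instruction. It is an illustration only and not the scoring method used by Bailong-bench: the function names, the 0.5 threshold, and the character-range heuristic are all assumptions. The heuristic separates Chinese from English text but cannot tell Traditional from Simplified Chinese.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are CJK Unified Ideographs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(letters)


def guess_language(text: str, threshold: float = 0.5) -> str:
    """Label text 'zh' when at least `threshold` of its letters are CJK, else 'en'.

    The threshold is a hypothetical default chosen for illustration.
    """
    return "zh" if cjk_ratio(text) >= threshold else "en"


def is_language_consistent(instruction: str, response: str) -> bool:
    """True when the response appears to use the same language as the instruction."""
    return guess_language(instruction) == guess_language(response)
```

A response in English to a Traditional Chinese instruction would be flagged as inconsistent by this check, which is the kind of failure the benchmark description refers to.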
Features
- Fast and efficient tokenizer: Expands Llama 2's vocabulary size to 59241, enhancing tokenization efficiency for Traditional Chinese sequences.
- Aggressive cleaning: Applies semantic deduplication methods such as SemDeDup to improve pretraining data quality.
- Memory efficient training: Uses QLoRA to save GPU memory during secondary pretraining and supervised fine-tuning.
- Advanced embedding initialization: Proposes zip-tie embedding for initializing the appended vocabulary, which can reduce the number of training steps needed when used with an appropriate learning rate.
- Advanced instruction tuning method for multi-turn dialogue: Leverages TargetLMLoss and training methods from the FireFly project to endow the model with multi-turn dialogue capability.
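The multi-turn instruction tuning item can be illustrated with a small sketch of a target-only loss. This assumes that TargetLMLoss restricts the language-modeling loss to assistant tokens and ignores user tokens; that reading, along with the function names and the (role, token_count) input format, is an assumption for illustration and is not taken from the FireFly or Bailong code.

```python
from typing import List, Sequence, Tuple


def build_target_mask(turns: Sequence[Tuple[str, int]]) -> List[int]:
    """Mark assistant tokens with 1 and all other tokens with 0.

    `turns` is a hypothetical input format: (role, token_count) pairs
    in dialogue order.
    """
    mask: List[int] = []
    for role, n_tokens in turns:
        mask.extend([1 if role == "assistant" else 0] * n_tokens)
    return mask


def target_only_loss(token_nll: Sequence[float], target_mask: Sequence[int]) -> float:
    """Average per-token negative log-likelihood over masked-in positions only."""
    if len(token_nll) != len(target_mask):
        raise ValueError("token_nll and target_mask must have the same length")
    n_target = sum(target_mask)
    if n_target == 0:
        return 0.0  # no assistant tokens, so nothing contributes to the loss
    masked_sum = sum(nll * m for nll, m in zip(token_nll, target_mask))
    return masked_sum / n_target
```

With this masking, a single multi-turn conversation can be used as one training sequence: every assistant reply contributes to the loss, while the user turns serve only as context.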
Applications
- Medical consultation: Demonstrates the model's ability to provide advice and information in medical scenarios.
- Product copywriting generation: Showcases the model's capability to generate promotional content for products.
- Creative writing: Illustrates the model's potential in generating creative content such as Instagram posts and songs.
- Knowledge base QA: Demonstrates the model's ability to answer questions based on provided texts.
- Multi-turn dialogue: Showcases the model's capability in maintaining coherent and context-aware conversations.
- Mail assistant: Illustrates the model's ability to assist in writing various types of emails.
- Summary generation: Demonstrates the model's capability to generate concise summaries from given texts.
- Open QA: Showcases the model's ability to answer open-ended questions.
- English instruct: Demonstrates the model's capability to follow and respond to English instructions.
- Proofreading assistant: Illustrates the model's ability to assist in correcting and improving written content.
Conclusion
Bailong represents a significant advancement in bilingual transfer learning for Traditional Chinese NLP, leveraging innovative techniques to enhance model efficiency and performance. The project's various applications demonstrate its versatility and potential impact in real-world scenarios.



