fluently-sets/ultraset
Hugging Face | Updated: 2024-12-22 | Indexed: 2025-02-15
Download link:
https://hf-mirror.com/datasets/fluently-sets/ultraset
Description:
Ultraset is an all-in-one dataset for training and retraining LLMs with supervised fine-tuning (SFT) in the Alpaca format. It contains 785K rows stored as parquet files and covers multiple languages, including English, Russian, French, Italian, Spanish, German, Chinese, and Korean. The dataset is released under a flexible multi-license, primarily MIT. It targets a common problem: users are overwhelmed by the number of different datasets and approaches available for training LLMs. To address this, it combines text writing, mathematics and code, biology and medicine, finance, chain-of-thought (CoT) data, and multilingual data into a single set suited to basic training. The recommended setup is to train on the instruction, input, and output columns for 1-3 epochs. The dataset is intended to improve a model's skills in text writing, editing, and analysis; mathematics and coding; knowledge of biology, medicine, and finance; and proficiency in a range of popular languages.
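The description recommends training on the instruction, input, and output columns in the Alpaca format. The sketch below shows one way a single row could be turned into a training string. The prompt wording follows the widely used Stanford Alpaca template, which is an assumption here since the page does not state the exact template; the function name `format_alpaca` and the sample row are hypothetical.

```python
# Alpaca-style prompt templates. These follow the commonly used Stanford
# Alpaca wording; the dataset page does not specify a template, so treat
# them as an assumption.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def format_alpaca(row: dict) -> str:
    """Build one SFT training string from a row with the columns
    instruction, input, and output (hypothetical helper)."""
    instruction = (row.get("instruction") or "").strip()
    context = (row.get("input") or "").strip()
    output = (row.get("output") or "").strip()
    if context:
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    else:
        # Rows with an empty input column use the shorter template.
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    return prompt + output


# Made-up sample row for illustration; not taken from the dataset.
example = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
text = format_alpaca(example)
print(text)
```

In practice the rows would come from the parquet files, for example via the Hugging Face `datasets` library, with `format_alpaca` applied to each row before tokenization. The split name and any column handling beyond the three columns named above would need to be checked against the dataset itself.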
Provided by:
fluently-sets



