YueYue Neko (Catgirl) Model Dataset

ModelScope community (魔搭社区). Updated: 2025-10-02. Indexed: 2025-10-04.
Download link:
https://modelscope.cn/datasets/Songwufeng/TT-YueYue-Neko-8B-Dataset
Overview:
TT-YueYue-Neko-8B-Dataset is a supervised fine-tuning (SFT) dataset built by TT Lab specifically for the YueYue Neko 8B model (based on the PyTorch framework). It contains 800 precisely annotated samples with no duplicates and no null values, stored as JSON Lines (.jsonl). The text is primarily Chinese, with a small number of samples that mix Chinese and English technical terms. Every sample went through two rounds of manual review and a model pre-verification pass to check content accuracy, consistency with the catgirl (neko) persona, and compliance.

The 800 samples are divided into five categories by usage scenario and content type, in a roughly balanced distribution (100 to 200 samples each) intended to cover the model's different training scenarios:

1. Knowledge explanation (200 samples): mathematics, physics, popular science and similar topics, built around "soft, cute tone + precise knowledge", for example a scenario-based explanation of Lagrange's theorem.
2. Daily interaction (200 samples): everyday chat such as emotional responses and life advice, reinforcing YueYue's gentle, cute personality.
3. Multi-turn dialogue (150 samples): continuous dialogues of 2 to 5 turns that exercise the model's context understanding, for example a multi-turn exchange about buying dried fish snacks.
4. Instruction following (150 samples): responses to explicit instructions such as summarizing or answering point by point, for example summarizing the characteristics of artificial intelligence in 3 points while keeping the catgirl tone.
5. Safety and compliance (100 samples): refusals of sensitive requests and corrections of misinformation, for example a compliant refusal to teach password cracking.

To ensure data quality, the dataset uses an "authoritative source screening + double annotation review + model pre-verification" process. The raw data comes from subject textbooks, popular science platforms, and manually designed scenarios. Each sample is labeled by an annotator and then checked a second time by a reviewer; about 15% of samples failed this review. In addition, 80 randomly selected samples were used for model testing, and the results were fed back to refine the dataset.

The dataset can be loaded with the datasets library (the sample code reads the local JSON Lines file directly). It is intended mainly for the supervised fine-tuning stage of the YueYue 8B model, and 160 samples can be held out as a validation set; it is not recommended for pre-training. Keeping the "category" field is recommended so that training can be stratified by category. TT Lab plans to evaluate and update the dataset every quarter, adding samples for high-frequency scenarios, refining the interaction wording, and supplementing compliance content so that the data keeps pace with model iterations.
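The description recommends holding out 160 samples for validation and keeping the "category" field for stratified training. A minimal sketch of how such a split might look is given below, assuming each JSON Lines record is an object with a "category" key. The category label values, the `id` field, and the synthetic records are hypothetical stand-ins, since the listing does not show the actual schema. Holding out 20% of each category yields 40 + 40 + 30 + 30 + 20 = 160 validation samples.

```python
import json
import random
from collections import defaultdict
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def stratified_split(records, val_fraction=0.2, key="category", seed=42):
    """Hold out val_fraction of every category so the validation set
    mirrors the category mix of the full dataset."""
    by_category = defaultdict(list)
    for record in records:
        by_category[record[key]].append(record)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for category in sorted(by_category):
        items = by_category[category][:]
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val


# Synthetic stand-in records using the category sizes stated in the description.
# The label strings are hypothetical; the real file may use different values.
CATEGORY_SIZES = {
    "knowledge": 200,
    "daily": 200,
    "multi_turn": 150,
    "instruction": 150,
    "safety": 100,
}
demo_records = [
    {"id": f"{name}-{i}", "category": name}
    for name, size in CATEGORY_SIZES.items()
    for i in range(size)
]

train_set, val_set = stratified_split(demo_records)
print(len(train_set), len(val_set))
```

With the real data, `load_jsonl` would be pointed at the downloaded .jsonl file (its name is not given in the listing) and the result passed to `stratified_split`. The same local file could presumably also be read with the Hugging Face datasets library via `load_dataset("json", data_files=...)`, which is what the description appears to refer to.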
Provider:
maas
Created:
2025-10-02
The content above was collected and summarized by 遇见数据集.