horelulus/DeepSeek_0528_8B_Legal_Distill

Name: horelulus/DeepSeek_0528_8B_Legal_Distill
Creator: horelulus
Published: 2026-03-28 09:01:18
License: no description provided

Source: Hugging Face. Updated 2026-03-28; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/horelulus/DeepSeek_0528_8B_Legal_Distill


Description:

---
license: cc-by-4.0
task_categories:
- text-generation
language:
- en
tags:
- legal
- distillation
- grpo
- deepseek
- rlhf
- reinforcement-learning
---

# ⚖️ DeepSeek-0528-8B Legal Distill Dataset

This repository contains a high-density trajectory dataset generated during the **GRPO (Group Relative Policy Optimization)** training of the **DeepSeek-8B** architecture. It is intended for **Knowledge Distillation** and structured legal reasoning. 🚀

## 💡 The Concept: "Log-as-Distillation"

Training logs are usually treated as temporary metadata. This dataset treats them as training data instead. By capturing the **multi-generation groups** produced during GRPO and pairing each completion with **multi-dimensional reward scores**, it provides a ready-made "map" of model behavior. 🗺️

Instead of seeing only the "best" answer, you see the range of attempts the model made and why certain outputs were favored over others. This allows for:

* 🎯 **Precision Filtering:** Users can extract only the highest-scoring reasoning paths.
* 📉 **Negative Constraint Learning:** Users can analyze low-scoring outputs to understand and prevent common legal hallucinations.

## 🛠️ Complex Scoring Architecture

Unlike simpler setups, the **DeepSeek-0528-8B** run used a reward ensemble to evaluate each generation:

1. **Legal Accuracy Reward:** Measures alignment with statutory references and regulatory language. 📜
2. **Structural Format Reward:** Checks that the model adheres to the strict markdown or JSON schemas required for legal tech integration. 🏗️
3. **Logical Consistency Reward:** Evaluates the internal "Chain of Thought" (CoT) for contradictions. 🧠
4. **Length & Verbosity Penalty:** Incentivizes concise, high-impact legal advice. ✂️

## ☁️ Cloud-to-Cloud (C2C) Pipeline

This dataset was built using an automated workflow:

* **Infrastructure:** Orchestrated across high-performance cloud platforms. ⚡
* **Direct Sync:** A specialized pipeline pulls the base weights and pushes the resulting trajectory logs directly to **Hugging Face** in real time. 🔄
* **Integrity:** Developed using legitimate developer methods, with the aim of clear data lineage and low-noise acquisition. ✅

## 📂 Dataset Structure

Each entry includes:

* **Prompt:** The legal inquiry or regulatory task.
* **Generation Group:** A collection of $N$ completions sampled for relative advantage.
* **Weighted Rewards:** A detailed breakdown of the multi-dimensional scores for each completion.
* **Model Metadata:** Checkpoint information from the DeepSeek-0528-8B training run.

## 🧪 Use Cases

* **Student Distillation:** Train smaller models (1B–3B) to mimic the 8B model's complex reasoning. 🎓
* **RLHF Research:** Test new reward functions against pre-existing model trajectories. 🔬
* **Legal RAG Refinement:** Improve the "reasoning" step in Retrieval-Augmented Generation pipelines. 🔍

## 📜 License & Attribution

This dataset is licensed under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license. 📝

### Attribution

1. **Dataset Curator:** Azzindani (via Hugging Face Datasets).
2. **Base Architecture:** DeepSeek-AI.

---

**Disclaimer:** *These generations are byproducts of an experimental RL run. Users should perform their own safety and fact-checking audits before deploying distilled models in production legal environments.* ⚠️

---
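The entry layout and the filtering ideas described in the card can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the card does not document column names, so the keys `prompt`, `completions`, and `rewards`, the component names, the weights in `WEIGHTS`, and the sample values are all hypothetical. The advantage computation follows the usual GRPO convention of standardizing each reward against its own group, which may differ in detail from what the training run actually used.

```python
from statistics import mean, pstdev

# Hypothetical record: field names and values are illustrative, not the real schema.
record = {
    "prompt": "Summarize the notice requirements for terminating a lease.",
    "completions": ["answer A", "answer B", "answer C"],
    "rewards": [
        {"accuracy": 0.9, "format": 1.0, "consistency": 0.8, "length_penalty": -0.1},
        {"accuracy": 0.4, "format": 1.0, "consistency": 0.5, "length_penalty": -0.3},
        {"accuracy": 0.7, "format": 0.0, "consistency": 0.9, "length_penalty": 0.0},
    ],
}

# Assumed weights for combining the four reward components into one scalar.
WEIGHTS = {"accuracy": 0.5, "format": 0.2, "consistency": 0.3, "length_penalty": 1.0}


def total_reward(components, weights=WEIGHTS):
    """Weighted sum of the per-component scores for one completion."""
    return sum(weights[name] * value for name, value in components.items())


def group_relative_advantages(totals, eps=1e-8):
    """Standardize each total reward against the mean and std of its group."""
    mu = mean(totals)
    sigma = pstdev(totals)
    return [(t - mu) / (sigma + eps) for t in totals]


def best_completion(rec):
    """Precision filtering: return the highest-scoring completion and its score."""
    totals = [total_reward(r) for r in rec["rewards"]]
    best = max(range(len(totals)), key=totals.__getitem__)
    return rec["completions"][best], totals[best]


def preference_pair(rec):
    """Negative constraint learning: pair the best completion with the worst."""
    totals = [total_reward(r) for r in rec["rewards"]]
    order = sorted(range(len(totals)), key=totals.__getitem__)
    return {
        "prompt": rec["prompt"],
        "chosen": rec["completions"][order[-1]],
        "rejected": rec["completions"][order[0]],
    }


if __name__ == "__main__":
    totals = [total_reward(r) for r in record["rewards"]]
    print(group_relative_advantages(totals))
    print(best_completion(record))
    print(preference_pair(record))
```

For student distillation, the output of `best_completion` could serve as a supervised target, while `preference_pair` yields chosen/rejected pairs of the kind used by preference-based objectives. Inspect the real columns before adapting the key names.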


Provider: horelulus
