horelulus/DeepSeek_0528_8B_Legal_Distill
Source: Hugging Face · Updated 2026-03-28 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/horelulus/DeepSeek_0528_8B_Legal_Distill
Description:
---
license: cc-by-4.0
task_categories:
- text-generation
language:
- en
tags:
- legal
- distillation
- grpo
- deepseek
- rlhf
- reinforcement-learning
---
# ⚖️ DeepSeek-0528-8B Legal Distill Dataset
This repository contains a high-density trajectory dataset logged during the **GRPO (Group Relative Policy Optimization)** training of a **DeepSeek-0528-8B** model. It is intended for **Knowledge Distillation** and structured legal reasoning. 🚀
## 💡 The Concept: "Log-as-Distillation"
Traditional training often treats logs as temporary metadata. This dataset flips that script. By capturing the **multi-generation groups** produced during GRPO and pairing them with **complex, multi-dimensional reward scores**, we provide a ready-made "map" of model behavior. 🗺️
Instead of just seeing the "best" answer, you see the variety of attempts the model made and exactly why certain outputs were favored over others. This allows for:
* 🎯 **Precision Filtering:** Users can extract only the highest-scoring reasoning paths.
* 📉 **Negative Constraint Learning:** Analyze low-scoring outputs to understand and prevent common legal hallucinations.
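Both kinds of filtering can be sketched in a few lines. This is a minimal illustration, not the dataset's official loader: the field names `prompt`, `completions`, and `rewards`, along with the `low_threshold` cutoff, are assumptions made for the example, so check them against the actual column names before use.

```python
from typing import TypedDict


class Entry(TypedDict):
    # Hypothetical schema: one prompt, N completions, one scalar reward each.
    prompt: str
    completions: list[str]
    rewards: list[float]


def split_by_reward(
    entry: Entry, top_k: int = 1, low_threshold: float = 0.2
) -> tuple[list[str], list[str]]:
    """Return (best completions, low-scoring completions) for one group."""
    # Rank the group's completions from highest to lowest reward.
    ranked = sorted(
        zip(entry["completions"], entry["rewards"]),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Precision filtering: keep only the top_k highest-scoring paths.
    best = [text for text, _ in ranked[:top_k]]
    # Negative examples: remaining completions at or below the threshold.
    negatives = [text for text, score in ranked[top_k:] if score <= low_threshold]
    return best, negatives


example: Entry = {
    "prompt": "Summarize the notice period requirement.",
    "completions": ["answer A", "answer B", "answer C"],
    "rewards": [0.9, 0.1, 0.5],
}
best, negatives = split_by_reward(example)
```

Here `best` holds the single top-ranked completion and `negatives` holds the completions that fall at or below the threshold. Completions between the two cutoffs are simply left out.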
## 🛠️ Complex Scoring Architecture
Unlike simpler training setups, the **DeepSeek-0528-8B** run used a reward ensemble to evaluate each generation:
1. **Legal Accuracy Reward:** Measures alignment with statutory references and regulatory language. 📜
2. **Structural Format Reward:** Ensures the model adheres to strict markdown or JSON schemas required for legal tech integration. 🏗️
3. **Logical Consistency Reward:** Evaluates the internal "Chain of Thought" (CoT) for contradictions. 🧠
4. **Length & Verbosity Penalty:** Incentivizes concise, high-impact legal advice. ✂️
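One common way to fold such components into a single score is a weighted sum with the penalty subtracted. The sketch below illustrates that idea only. The card does not state how the run combined its rewards, so the weights, the value ranges, and the `RewardBreakdown` type are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RewardBreakdown:
    # Hypothetical per-completion components, each assumed to lie in [0, 1].
    legal_accuracy: float
    structural_format: float
    logical_consistency: float
    length_penalty: float  # non-negative; larger means more verbose


# Illustrative weights only; the actual run's weighting is not documented here.
WEIGHTS = {
    "legal_accuracy": 0.5,
    "structural_format": 0.2,
    "logical_consistency": 0.2,
    "length_penalty": 0.1,
}


def combined_reward(r: RewardBreakdown, weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the positive rewards minus the weighted length penalty."""
    positive = (
        weights["legal_accuracy"] * r.legal_accuracy
        + weights["structural_format"] * r.structural_format
        + weights["logical_consistency"] * r.logical_consistency
    )
    return positive - weights["length_penalty"] * r.length_penalty


concise = combined_reward(RewardBreakdown(1.0, 1.0, 1.0, 0.0))
verbose = combined_reward(RewardBreakdown(1.0, 1.0, 1.0, 1.0))
```

With these example weights, a completion that is perfect on the first three components scores lower when it is verbose than when it is concise, which is the behavior the length penalty is meant to encourage.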
## ☁️ Cloud-to-Cloud (C2C) Pipeline
This dataset was built using a seamless, automated workflow:
* **Infrastructure:** Orchestrated across high-performance cloud platforms. ⚡
* **Direct Sync:** A specialized pipeline pulls the base weights and pushes the resulting trajectory logs directly to **Hugging Face** in real time. 🔄
* **Integrity:** Developed using legitimate developer methods, ensuring high-quality data lineage and zero-noise acquisition. ✅
## 📂 Dataset Structure
Each entry includes:
* **Prompt:** The legal inquiry or regulatory task.
* **Generation Group:** A collection of $N$ completions sampled for relative advantage.
* **Weighted Rewards:** A detailed breakdown of the multi-dimensional reward scores for each completion.
* **Model Metadata:** Checkpoint information from the DeepSeek-0528-8B training run.
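In GRPO, the "relative advantage" of a completion is usually its reward normalized against the other completions sampled for the same prompt. The sketch below recomputes that quantity from one group's scalar rewards. It assumes one combined reward per completion and uses the population standard deviation. Implementations differ on that detail, and the `eps` stabilizer is an added assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every completion got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 2.0, 3.0])
flat_group = group_relative_advantages([0.5, 0.5])
```

Completions scoring above the group mean get a positive advantage and those below get a negative one. A group whose rewards are all identical yields zero advantage everywhere, so it carries no preference signal.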
## 🧪 Use Cases
* **Student Distillation:** Train smaller models (1B–3B) to mimic the 8B model's complex reasoning. 🎓
* **RLHF Research:** Test new reward functions against pre-existing model trajectories. 🔬
* **Legal RAG Refinement:** Improve the "reasoning" step in Retrieval-Augmented Generation pipelines. 🔍
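As a rough illustration of the RLHF research use case, the sketch below re-scores stored completions with a new reward function and reports how often its top pick matches the logged one. The entry layout (`prompt`, `completions`, `rewards`) is an assumed schema, and `brevity_reward` is a toy function invented for this example.

```python
from typing import Callable


def argmax(values: list[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def top_choice_agreement(
    entries: list[dict], reward_fn: Callable[[str, str], float]
) -> float:
    """Fraction of groups where reward_fn picks the same best completion as the log."""
    if not entries:
        return 0.0
    agree = 0
    for entry in entries:
        new_scores = [reward_fn(entry["prompt"], c) for c in entry["completions"]]
        if argmax(new_scores) == argmax(entry["rewards"]):
            agree += 1
    return agree / len(entries)


def brevity_reward(prompt: str, completion: str) -> float:
    # Toy reward: fewer whitespace-separated words is better.
    return -float(len(completion.split()))


logged = [
    {"prompt": "p1", "completions": ["short", "a much longer answer"], "rewards": [0.9, 0.1]},
    {"prompt": "p2", "completions": ["one two three", "x"], "rewards": [0.8, 0.2]},
]
rate = top_choice_agreement(logged, brevity_reward)
```

A low agreement rate suggests the new reward ranks completions differently from the logged ensemble, which can be a cheap first check before launching a new training run.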
## 📜 License & Attribution
This dataset is licensed under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license. 📝
### Attribution
1. **Dataset Curator:** Azzindani (via Hugging Face Datasets).
2. **Base Architecture:** DeepSeek-AI.
---
**Disclaimer:** *These generations are byproducts of an experimental RL run. Users should perform their own safety and fact-checking audits before deploying distilled models in production legal environments.* ⚠️
---
Provider:
horelulus



