Chinese Legal Article Retrieval Dataset (中文法律条文检索数据集)

Name: 中文法律条文检索数据集 (Chinese Legal Article Retrieval Dataset)
Creator: maas
Published: 2026-05-21 17:16:51
License: not specified

Source: ModelScope community (魔搭社区). Updated: 2026-05-21. Indexed: 2025-12-20.

Download link:

https://modelscope.cn/datasets/ByronLeeee/CN-Law-Query-Retrieval-Dataset

Resource description:

# CN-Law-Query-Retrieval-Dataset

## 📖 Dataset Introduction

**CN-Law-Query-Retrieval-Dataset** is a high-quality fine-tuning dataset designed for **Chinese legal RAG (Retrieval-Augmented Generation)**. It targets two weaknesses of general-purpose embedding models in the legal domain: they struggle to connect **colloquial user questions** with **formal legal articles**, and they tend to confuse near-identical clauses in **national laws** with those in **local regulations and administrative rules**.

The dataset contains approximately **60k** `(Anchor, Positive, Negative)` triplets. It is suitable for fine-tuning embedding models such as BERT, RoBERTa, BGE, and Gemma, and is particularly well suited to EmbeddingGemma-300M.

### 💡 Core Features

1. **Colloquial Queries**: Each Anchor (query) is generated by an LLM (DeepSeek/Gemma) from a real legal article. The queries imitate the colloquial, vague, and scenario-based questions that ordinary users ask when seeking legal advice, bridging the gap between formal legal language and everyday speech.
2. **Hard Negatives**: Negatives are not sampled at random. They are "Top-K hard negatives" mined by vector retrieval with the EmbeddingGemma-300M model.
   * *Same-source filtering*: Clauses from the same law as the positive are strictly excluded, which forces the model to learn to distinguish levels of the legal hierarchy (e.g., the Patent Law of the People's Republic of China vs. the Shaanxi Provincial Patent Regulations).
3. **Wide Coverage**: The data source covers the major statutory laws of China, including the Constitution, civil and commercial law, criminal law, and administrative law.

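The same-source filter described under Hard Negatives can be sketched as a comparison of the law names in two document titles. This is a minimal sketch, assuming every document follows the `title: <law name>第<article number>条 | text: ...` layout that appears in the dataset's sample record; `law_name` and `is_same_source` are hypothetical helpers written for illustration, not part of the dataset's tooling.

```python
import re

# Assumed layout: "title: <law name>第<article number>条 | text: <article text>".
# The lazy group stops at the first "第…条" article marker.
_TITLE_RE = re.compile(r"^title:\s*(?P<law>.+?)第[零〇一二三四五六七八九十百千\d]+条")


def law_name(document: str) -> str:
    """Return the law name from a document's title prefix (hypothetical helper)."""
    match = _TITLE_RE.match(document)
    if match is None:
        raise ValueError(f"unexpected document layout: {document[:40]!r}")
    return match.group("law").strip()


def is_same_source(positive: str, candidate: str) -> bool:
    """True when both documents come from the same law, so the candidate
    should be rejected as a negative under same-source filtering."""
    return law_name(positive) == law_name(candidate)


positive = "title: 中华人民共和国专利法第六十九条 | text: ..."
candidate = "title: 陕西省专利条例第二十四条 | text: ..."
print(law_name(positive), law_name(candidate), is_same_source(positive, candidate))
```

Titles that deviate from the assumed layout raise a `ValueError` rather than being silently accepted, which makes layout drift visible during data preparation.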
## 📂 Data Format

The dataset is provided in JSONL format. Each line holds one triplet:

* **anchor**: User query
* **positive**: Correct legal article (Target Document)
* **negative**: Easily confused incorrect legal article (Hard Negative Document)

```json
{
  "anchor": "负责专利执法的部门查处假冒行为时能扣押产品吗？",
  "positive": "title: 中华人民共和国专利法第六十九条 | text: 负责专利执法的部门根据已经取得的证据...可以查封或者扣押。",
  "negative": "title: 陕西省专利条例第二十四条 | text: 负责专利执法的部门根据已经取得的证据...可以查封或者扣押。"
}
```

## 🛠️ Construction Method

1. **Data Cleaning**: Extract clauses at the "national law" level from the full legal database as the source of positives.
2. **Query Generation**: Use models such as DeepSeek-V3, with prompt engineering, to generate a matching colloquial question for each clause.
3. **Negative Sample Mining**:
   * Use a general-purpose embedding model to retrieve candidate articles for each Query.
   * Select top-ranked (Top-K) legal articles that are not the correct answer as negatives.
   * *Strategy Optimization*: Prefer entries with high text similarity that belong to a different legal document (e.g., local regulations), to strengthen the model's fine-grained discrimination.

## 🚀 Usage

### Loading via Hugging Face `datasets`

```python
from datasets import load_dataset

dataset = load_dataset("ByronLeeee/CN-Law-Query-Retrieval-Dataset")

# View the first sample
print(dataset["train"][0])
```

### Training with Sentence-Transformers

```python
from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

dataset = load_dataset("ByronLeeee/CN-Law-Query-Retrieval-Dataset")

# 1. Convert each row into an (anchor, positive, negative) InputExample
train_examples = [
    InputExample(texts=[row["anchor"], row["positive"], row["negative"]])
    for row in dataset["train"]
]

# 2. Define the DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 3. Define the loss function (MultipleNegativesRankingLoss recommended)
model = SentenceTransformer("google/embeddinggemma-300m")
train_loss = losses.MultipleNegativesRankingLoss(model=model)

# 4. Start fine-tuning. The epoch count is a placeholder value,
#    not a setting reported by the dataset author.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```
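The negative-mining steps listed under Construction Method can be sketched on toy vectors. This is a minimal sketch under stated assumptions, not the author's pipeline: the two-dimensional vectors stand in for real embeddings, and `mine_hard_negatives`, the document ids, and the source labels are all hypothetical names chosen for illustration.

```python
import math
from typing import Mapping, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives(
    query_vec: Sequence[float],
    positive_id: str,
    doc_vecs: Mapping[str, Sequence[float]],
    doc_sources: Mapping[str, str],
    top_k: int = 3,
) -> list[str]:
    """Rank documents by similarity to the query, drop the positive and every
    document from the positive's law, and keep the top_k that remain."""
    positive_source = doc_sources[positive_id]
    ranked = sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )
    candidates = [d for d in ranked if doc_sources[d] != positive_source]
    return candidates[:top_k]


# Toy corpus: hypothetical ids, with 2-D vectors standing in for embeddings.
doc_vecs = {
    "patent_law_69": [1.0, 0.0],
    "patent_law_70": [0.9, 0.1],
    "shaanxi_patent_24": [0.95, 0.05],
    "criminal_law_1": [0.0, 1.0],
}
doc_sources = {
    "patent_law_69": "Patent Law",
    "patent_law_70": "Patent Law",
    "shaanxi_patent_24": "Shaanxi Patent Regulations",
    "criminal_law_1": "Criminal Law",
}

print(mine_hard_negatives([1.0, 0.02], "patent_law_69", doc_vecs, doc_sources, top_k=1))
# ['shaanxi_patent_24']
```

Filtering before truncating to `top_k` means a same-law neighbour such as `patent_law_70` never takes a slot from a cross-law candidate. Whether the original pipeline filters before or after the Top-K cut is not stated in the source.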

Provider:

maas

Created:

2025-12-06

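Collected summary

Dataset introduction

Background and challenges

Background overview

This is a high-quality fine-tuning dataset designed for Chinese legal retrieval-augmented generation. It contains approximately 60,000 (anchor, positive, negative) triplets and targets two problems of general-purpose embedding models: difficulty linking colloquial queries to formal legal articles, and confusion between national laws and local regulations. Anchors are generated by large language models to imitate everyday user questions, and hard negatives are mined with EmbeddingGemma-300M. The data covers the major areas of Chinese statutory law and is suitable for fine-tuning embedding models such as BERT and BGE.

The content above was collected and summarized by 遇见数据集.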