Yuel-P/CaseMatch-Agent-data

Name: Yuel-P/CaseMatch-Agent-data
Creator: Yuel-P
Published: 2026-03-24 16:54:53
License: No description available

Hugging Face, updated 2026-03-24, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Yuel-P/CaseMatch-Agent-data

Resource description:

---
pretty_name: CaseMatch-Agent Data
language:
- zh
tags:
- legal
- retrieval
- reranking
- chinese
- llm
task_categories:
- text-retrieval
size_categories:
- 10K<n<100K
---

# Dataset Card for CaseMatch-Agent Data

## Dataset Summary

`CaseMatch-Agent Data` is the public data package used by [`CaseMatch-Agent`](https://github.com/XP-PY/CaseMatch-Agent), an open-source prototype for Chinese criminal similar-case retrieval.

The current release is centered on a processed [`LeCaRD`](https://github.com/myx666/LeCaRD) package and includes:

- normalized query files
- normalized relevance labels
- normalized candidate pools
- a merged case corpus with raw case text and LLM-extracted structured fields
- prebuilt LanceDB and SQLite database artifacts for direct use

This repository is intended to be downloaded into the local `data/` directory of the main codebase, where it is used to build retrieval indexes and run experiments.

## Supported Tasks

This dataset is primarily intended for:

- Chinese legal case retrieval
- candidate recall and reranking experiments
- hybrid retrieval research combining sparse, dense, and structured signals
- LLM-assisted legal retrieval prototypes

## Dataset Structure

Current layout:

```text
README.md
cases.lancedb/
cases.sqlite3
lecard/
  README.md
  candidate_pools.jsonl
  corpus_merged.jsonl
  qrels.jsonl
  queries.jsonl
```

### Main Files

#### `cases.lancedb`

Prebuilt LanceDB database artifacts for the current corpus. These files can be used directly by the main `CaseMatch-Agent` codebase without rebuilding the primary vector-first retrieval database from scratch.

#### `cases.sqlite3`

Prebuilt SQLite fallback database for the current corpus. This file can be used as the fallback candidate store when LanceDB is unavailable, and it also reduces setup cost for local testing.

#### `lecard/queries.jsonl`

Normalized query file derived from the original `LeCaRD` query set. Each line is a query record with fields such as:

- `query_id`
- `query_text`
- `charge_labels`

#### `lecard/qrels.jsonl`

Normalized relevance judgments derived from the original `LeCaRD` relevance annotations. Each line is one labeled `(query_id, case_id)` pair with:

- `query_id`
- `case_id`
- `relevance`

#### `lecard/candidate_pools.jsonl`

Normalized candidate pools used for offline evaluation. Each line contains:

- `query_id`
- `candidate_case_ids`

#### `lecard/corpus_merged.jsonl`

The main case corpus used by the current CaseMatch pipeline. Each line contains:

- `case_id`
- `structured_data`
- `raw_data`

`structured_data` is an LLM-extracted representation used for retrieval and reranking. `raw_data` contains mapped judgment text fields from the underlying case document.

For detailed schema definitions, see [lecard/README.md](lecard/README.md).

## Data Sources

This dataset package is built from two layers of data:

### 1. Original LeCaRD resources

The following components are derived from the original [`LeCaRD`](https://github.com/myx666/LeCaRD) release:

- queries
- relevance labels
- candidate pools
- case document content used to build the merged corpus

### 2. Project-level processing

On top of the original data, this repository applies additional processing for the CaseMatch project:

- normalization into flat `jsonl` files
- reorganization into a cleaner repository structure
- merging raw case text fields into a unified corpus format
- LLM-based extraction of structured legal information for each case

As a result:

- `queries.jsonl`, `qrels.jsonl`, and `candidate_pools.jsonl` are normalized derivatives of `LeCaRD`
- `corpus_merged.jsonl` is a project-specific derived corpus and is not part of the original `LeCaRD` release

## Intended Use

This dataset is designed for research and engineering work on:

- criminal similar-case retrieval
- retrieval system evaluation on a fixed candidate pool
- hybrid ranking pipelines using structured fields, BM25-style sparse signals, and dense embeddings

It is not presented as an authoritative legal database, and it should not be treated as a production legal service by itself.

## Limitations

- The current release is criminal-only.
- The merged corpus contains project-specific LLM-extracted structured fields, which may contain extraction errors or omissions.
- Relevance labels and candidate pools inherit the assumptions and limitations of the original `LeCaRD` benchmark.
- The included `cases.lancedb` and `cases.sqlite3` files are derived from the current corpus release. If the corpus changes, they may need to be rebuilt to stay consistent.

## Repository Usage

In the main `CaseMatch-Agent` repository, this dataset is typically downloaded into:

```text
data/
  README.md
  cases.lancedb/
  cases.sqlite3
  lecard/
    README.md
    corpus_merged.jsonl
    queries.jsonl
    qrels.jsonl
    candidate_pools.jsonl
```

The main codebase then uses this data to:

- directly use the bundled LanceDB / SQLite database artifacts
- rebuild LanceDB indexes when needed
- rebuild SQLite fallback indexes when needed
- run offline retrieval experiments
- support incremental case import workflows
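As a rough illustration of the offline-evaluation use case, the sketch below reads `qrels.jsonl` and `candidate_pools.jsonl` with the standard library and scores a ranked list against the labels. It is not taken from the `CaseMatch-Agent` codebase. The helper names (`read_jsonl`, `load_qrels`, `load_candidate_pools`, `precision_at_k`) and the `min_relevance` threshold are hypothetical, and the code assumes that `relevance` is numeric and that candidates without a label count as non-relevant. IDs are cast to strings so the two files join cleanly whichever JSON type they use. Check `lecard/README.md` for the actual schema before relying on any of this.

```python
import json
from collections import defaultdict
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_qrels(path):
    """Return {query_id: {case_id: relevance}} with IDs normalized to str."""
    qrels = defaultdict(dict)
    for row in read_jsonl(path):
        qrels[str(row["query_id"])][str(row["case_id"])] = row["relevance"]
    return dict(qrels)


def load_candidate_pools(path):
    """Return {query_id: [case_id, ...]} with IDs normalized to str."""
    return {
        str(row["query_id"]): [str(c) for c in row["candidate_case_ids"]]
        for row in read_jsonl(path)
    }


def precision_at_k(ranked_case_ids, judged, k, min_relevance=1):
    """Fraction of the top-k cases whose label is at least min_relevance.

    Assumption: unlabeled cases are treated as relevance 0, and
    min_relevance is an illustrative cutoff, not one the dataset defines.
    """
    top = ranked_case_ids[:k]
    if not top:
        return 0.0
    hits = sum(1 for case_id in top if judged.get(case_id, 0) >= min_relevance)
    return hits / len(top)
```

With the layout shown above, `load_qrels("data/lecard/qrels.jsonl")` and `load_candidate_pools("data/lecard/candidate_pools.jsonl")` give two dictionaries keyed by `query_id`. A reranker under test would reorder each pool, and `precision_at_k(reranked, qrels[query_id], k=5)` would score the result for that query.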

Provider:

Yuel-P
