Yuel-P/CaseMatch-Agent-data
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Yuel-P/CaseMatch-Agent-data
Resource description:
---
pretty_name: CaseMatch-Agent Data
language:
- zh
tags:
- legal
- retrieval
- reranking
- chinese
- llm
task_categories:
- text-retrieval
size_categories:
- 10K<n<100K
---
# Dataset Card for CaseMatch-Agent Data
## Dataset Summary
`CaseMatch-Agent Data` is the public data package used by [`CaseMatch-Agent`](https://github.com/XP-PY/CaseMatch-Agent), an open-source prototype for Chinese criminal similar-case retrieval.
The current release is centered on a processed [`LeCaRD`](https://github.com/myx666/LeCaRD) package and includes:
- normalized query files
- normalized relevance labels
- normalized candidate pools
- a merged case corpus with raw case text and LLM-extracted structured fields
- prebuilt LanceDB and SQLite database artifacts for direct use
This repository is intended to be downloaded into the local `data/` directory of the main codebase, where it is used to build retrieval indexes and run experiments.
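One way to fetch the package into `data/` is the `huggingface_hub` client. The sketch below is a minimal example, not the project's official setup script: the `download_dataset` helper and its default `local_dir` are assumptions made for illustration, and it requires `pip install huggingface_hub`.

```python
REPO_ID = "Yuel-P/CaseMatch-Agent-data"  # dataset repository named on this card


def download_dataset(local_dir: str = "data") -> str:
    """Download the full dataset snapshot and return the local path.

    `download_dataset` is a hypothetical helper, not part of CaseMatch-Agent.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # this is a dataset repository, not a model
        local_dir=local_dir,
    )
```

Calling `download_dataset()` from the root of the main codebase should place the files under `data/`. When a mirror such as hf-mirror.com is used, `huggingface_hub` reads the endpoint from the `HF_ENDPOINT` environment variable.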
## Supported Tasks
This dataset is primarily intended for:
- Chinese legal case retrieval
- candidate recall and reranking experiments
- hybrid retrieval research combining sparse, dense, and structured signals
- LLM-assisted legal retrieval prototypes
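As an illustration of what combining sparse, dense, and structured signals can look like, the sketch below fuses several ranked lists with reciprocal rank fusion (RRF). RRF is a generic fusion technique chosen here for brevity; this card does not state which fusion method `CaseMatch-Agent` uses, and the function name and the constant `k=60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of case IDs into a single ranking.

    Each case receives 1 / (k + rank) from every list it appears in.
    Illustrative only; not necessarily the project's fusion method.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, case_id in enumerate(ranking, start=1):
            scores[case_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], str(cid)))
```

For example, `reciprocal_rank_fusion([bm25_ids, dense_ids, structured_ids])` would merge three hypothetical per-signal rankings without requiring their raw scores to be on the same scale.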
## Dataset Structure
Current layout:
```text
README.md
cases.lancedb/
cases.sqlite3
lecard/
    README.md
    candidate_pools.jsonl
    corpus_merged.jsonl
    qrels.jsonl
    queries.jsonl
```
### Main Files
#### `cases.lancedb`
Prebuilt LanceDB database artifacts for the current corpus.
These files can be used directly by the main `CaseMatch-Agent` codebase without rebuilding the primary vector-first retrieval database from scratch.
#### `cases.sqlite3`
Prebuilt SQLite fallback database for the current corpus.
This file can be used as the fallback candidate store when LanceDB is unavailable, and it also reduces setup cost for local testing.
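The table layout of `cases.sqlite3` is not documented on this card, so a safe first step is to inspect it rather than assume a schema. The sketch below lists the table names using only the Python standard library; the `list_tables` helper and the default path are assumptions for illustration.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str = "data/cases.sqlite3") -> list[str]:
    """Return the names of all tables in a SQLite file, opened read-only."""
    # A read-only URI avoids creating or modifying the file by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Opening the file read-only means a mistyped path raises an error instead of silently creating an empty database.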
#### `lecard/queries.jsonl`
Normalized query file derived from the original `LeCaRD` query set.
Each line is a query record with fields such as:
- `query_id`
- `query_text`
- `charge_labels`
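A minimal sketch of reading the query file, assuming the standard JSON Lines layout (one JSON object per line, UTF-8 encoded). The helper names and the default path are illustrative, and the field types are not specified here, so the code does not depend on them.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSON Lines text."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)


def load_queries(path: str = "data/lecard/queries.jsonl") -> list[dict]:
    """Read all query records; the default path is an assumption."""
    with Path(path).open(encoding="utf-8") as f:
        return list(parse_jsonl(f))
```

Each returned record is expected to expose at least `query_id`, `query_text`, and `charge_labels`.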
#### `lecard/qrels.jsonl`
Normalized relevance judgments derived from the original `LeCaRD` relevance annotations.
Each line is one labeled `(query_id, case_id)` pair with:
- `query_id`
- `case_id`
- `relevance`
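For evaluation it is usually convenient to group these pairs by query. The sketch below builds a nested mapping from JSON Lines input; `build_qrels` is a hypothetical helper, and it relies only on the three field names listed above.

```python
import json
from collections import defaultdict
from typing import Iterable


def build_qrels(lines: Iterable[str]) -> dict:
    """Group qrel lines into {query_id: {case_id: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        qrels[row["query_id"]][row["case_id"]] = row["relevance"]
    return dict(qrels)
```

Passing an open file handle for `lecard/qrels.jsonl` works directly, since a text file iterates line by line.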
#### `lecard/candidate_pools.jsonl`
Normalized candidate pools used for offline evaluation.
Each line contains:
- `query_id`
- `candidate_case_ids`
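When evaluating on a fixed pool, a system's output is typically filtered so that only pool members are scored. The sketch below assumes `candidate_case_ids` is a JSON array of IDs; the helper names are illustrative.

```python
import json
from typing import Iterable


def load_pools(lines: Iterable[str]) -> dict:
    """Map each query_id to its list of candidate case IDs."""
    pools = {}
    for line in lines:
        if line.strip():
            row = json.loads(line)
            pools[row["query_id"]] = list(row["candidate_case_ids"])
    return pools


def restrict_to_pool(ranked_case_ids, pool):
    """Keep only ranked IDs that belong to the pool, preserving rank order."""
    allowed = set(pool)
    return [cid for cid in ranked_case_ids if cid in allowed]
```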
#### `lecard/corpus_merged.jsonl`
The main case corpus used by the current CaseMatch pipeline.
Each line contains:
- `case_id`
- `structured_data`
- `raw_data`
`structured_data` is an LLM-extracted representation used for retrieval and reranking. `raw_data` contains mapped judgment text fields from the underlying case document.
For detailed schema definitions, see [lecard/README.md](lecard/README.md).
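Because each corpus line carries full judgment text, it can be preferable to scan the file and keep only the cases needed rather than load everything into memory. The sketch below does this for a set of IDs; `select_cases` is a hypothetical helper, and it treats `structured_data` and `raw_data` as opaque values since their schema is defined elsewhere.

```python
import json
from typing import Iterable


def select_cases(lines: Iterable[str], wanted_ids) -> dict:
    """Return {case_id: record} for the requested IDs, scanning line by line."""
    wanted = set(wanted_ids)
    found = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if record["case_id"] in wanted:
            found[record["case_id"]] = record
            if len(found) == len(wanted):
                break  # stop early once every requested case is found
    return found
```

A typical call would pass an open handle to `lecard/corpus_merged.jsonl` together with the candidate IDs for a single query.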
## Data Sources
This dataset package is built from two layers of data:
### 1. Original LeCaRD resources
The following components are derived from the original [`LeCaRD`](https://github.com/myx666/LeCaRD) release:
- queries
- relevance labels
- candidate pools
- case document content used to build the merged corpus
### 2. Project-level processing
On top of the original data, this repository applies additional processing for the CaseMatch project:
- normalization into flat `jsonl` files
- reorganization into a cleaner repository structure
- merging raw case text fields into a unified corpus format
- LLM-based extraction of structured legal information for each case
As a result:
- `queries.jsonl`, `qrels.jsonl`, and `candidate_pools.jsonl` are normalized derivatives of `LeCaRD`
- `corpus_merged.jsonl` is a project-specific derived corpus and is not part of the original `LeCaRD` release
## Intended Use
This dataset is designed for research and engineering work on:
- criminal similar-case retrieval
- retrieval system evaluation on a fixed candidate pool
- hybrid ranking pipelines using structured fields, BM25-style sparse signals, and dense embeddings
It is not presented as an authoritative legal database, and it should not be treated as a production legal service by itself.
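To make evaluation on a fixed candidate pool concrete, the sketch below computes NDCG@k for one query. It assumes `relevance` is a numeric grade where larger means more relevant, treats unjudged cases as 0, and uses the linear-gain form of DCG; this is one common metric, not necessarily the one the project reports.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_case_ids, relevance_by_case, k=10):
    """NDCG@k for one query; unjudged cases count as relevance 0."""
    gains = [relevance_by_case.get(cid, 0) for cid in ranked_case_ids[:k]]
    ideal = sorted(relevance_by_case.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Here `relevance_by_case` is the per-query mapping from `case_id` to `relevance`, and `ranked_case_ids` is a system ranking already restricted to that query's candidate pool.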
## Limitations
- The current release is criminal-only.
- The merged corpus contains project-specific LLM-extracted structured fields, which may contain extraction errors or omissions.
- Relevance labels and candidate pools inherit the assumptions and limitations of the original `LeCaRD` benchmark.
- The included `cases.lancedb` and `cases.sqlite3` files are derived from the current corpus release. If the corpus changes, they may need to be rebuilt to stay consistent.
## Repository Usage
In the main `CaseMatch-Agent` repository, this dataset is typically downloaded into:
```text
data/
    README.md
    cases.lancedb/
    cases.sqlite3
    lecard/
        README.md
        corpus_merged.jsonl
        queries.jsonl
        qrels.jsonl
        candidate_pools.jsonl
```
The main codebase then uses this data to:
- load the bundled LanceDB / SQLite database artifacts directly
- rebuild LanceDB indexes when needed
- rebuild SQLite fallback indexes when needed
- run offline retrieval experiments
- support incremental case import workflows
<!-- ## Licensing and Redistribution
Before publishing or redistributing this package, make sure you have the right to redistribute all included and derived files.
If you plan to make this dataset public long-term, it is recommended to add:
- an explicit license
- citation information
- usage restrictions or legal disclaimers where necessary -->
Provider:
Yuel-P



