phoneee/thai-legal-corpus
Source: Hugging Face. Updated 2026-03-10; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/phoneee/thai-legal-corpus
Resource description:
---
license: cc-by-4.0
language:
- th
task_categories:
- text-generation
- fill-mask
- text-classification
tags:
- legal
- thai-law
- continual-pretraining
- supreme-court
- statutory-law
- case-law
- citation-analysis
size_categories:
- 100K<n<1M
---
# Thai Legal Corpus v1
A cleaned and deduplicated corpus of Thai legal text for Continual Pre-Training (CPT), with structured citation metadata for legal analysis.
## Dataset Description
| Metric | Value |
|--------|-------|
| Total records | 176,543 |
| Total size | 6.03 GB |
| Avg doc length | 11,952 chars |
| Sources | 3 |
| Year range | 1874-2026 |
| Splits | train 158,887 / val 8,826 / test 8,830 |
### Sources
| Source | Records |
|--------|---------|
| krisdika | 6,743 |
| supreme_court | 127,012 |
| thailaw | 42,788 |
## Repository Structure
```
data/
train.jsonl, val.jsonl, test.jsonl -- Cleaned text (for CPT)
metadata/
deka_citations.jsonl -- Citation metadata (join by doc_id)
law_section_index.json -- Law-section-case mapping
corpus_stats.json
```
## Data Schema
### Training Data (`data/*.jsonl`)
| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Full legal text (cleaned, 100 to 500,000 chars) |
| `source` | string | `krisdika`, `thailaw`, or `supreme_court` |
| `doc_id` | string | Unique document ID |
| `year` | int (optional) | Publication year (CE) |
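
The schema above can be read with nothing beyond the standard library. The following is a minimal sketch, assuming each line of `data/*.jsonl` is one JSON object carrying the four fields in the table; the helper names `iter_records` and `source_counts` and the sample values are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL input: one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def source_counts(records: Iterable[dict]) -> Counter:
    """Count records per `source` value."""
    return Counter(r["source"] for r in records)


# Tiny in-memory sample with made-up values. For a real file, pass an open
# handle instead: open("data/train.jsonl", encoding="utf-8").
sample = [
    '{"text": "...", "source": "supreme_court", "doc_id": "a1", "year": 2020}',
    '{"text": "...", "source": "krisdika", "doc_id": "b2"}',
]
records = list(iter_records(sample))
print(source_counts(records))
# `year` is optional, so read it with .get rather than indexing.
print([r.get("year") for r in records])
```

Streaming line by line keeps memory use flat, which matters for a corpus of this size.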
### Citation Metadata (`metadata/deka_citations.jsonl`)
Structured fields for Supreme Court rulings. Join with training data on `doc_id`.
| Field | Type | Description |
|-------|------|-------------|
| `doc_id` | string | Matches `doc_id` in training data |
| `case_id` | string | Case number (e.g., `7946/2568`) |
| `case_year` | string | Buddhist Era year |
| `laws_cited` | list[str] | Laws referenced in ruling |
| `sections_cited` | list[str] | Specific sections cited (e.g., `ป.อ. มาตรา 265`) |
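
The join on `doc_id` can be sketched as follows. This assumes at most one citation row per `doc_id` and uses the usual 543-year offset between Buddhist Era and Common Era years; `be_to_ce`, `join_citations`, and the sample `doc_id` values are hypothetical.

```python
from typing import Iterable, Iterator


def be_to_ce(be_year: str) -> int:
    """Convert a Buddhist Era year string to a Common Era integer (BE = CE + 543)."""
    return int(be_year) - 543


def join_citations(docs: Iterable[dict], citations: Iterable[dict]) -> Iterator[dict]:
    """Attach citation metadata to training records that share a doc_id."""
    # Assumption: doc_id is unique in the citation file, so a dict lookup suffices.
    by_id = {c["doc_id"]: c for c in citations}
    for doc in docs:
        meta = by_id.get(doc["doc_id"])
        if meta is not None:
            yield {**doc, **meta}


docs = [
    {"doc_id": "d1", "source": "supreme_court", "text": "..."},
    {"doc_id": "d2", "source": "krisdika", "text": "..."},  # no citation row
]
citations = [
    {
        "doc_id": "d1",
        "case_id": "7946/2568",
        "case_year": "2568",
        "laws_cited": ["ป.อ."],
        "sections_cited": ["ป.อ. มาตรา 265"],
    }
]
joined = list(join_citations(docs, citations))
print(joined[0]["case_id"], be_to_ce(joined[0]["case_year"]))
```

Records from `krisdika` and `thailaw` have no citation row and are dropped by this inner join.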
### Law-Section-Case Index (`metadata/law_section_index.json`)
Structured mapping from Thai laws to sections to Supreme Court case counts.
| Metric | Value |
|--------|-------|
| Total laws | 2,210 |
| Total sections | 18,003 |
| Total case references | 667,613 |
**Top laws by case references:** ป.พ.พ. (275,772), ป.อ. (102,245), ป.วิ.อ. (99,840), ป.วิ.พ. (70,590)
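
The internal layout of `law_section_index.json` is not documented here, so the sketch below derives comparable counts from the `sections_cited` field of the citation records instead. It rests on two assumptions: each entry has the form `<law> มาตรา <section>` as in the schema example, and a "case reference" means one ruling citing one section. The helper names are hypothetical.

```python
from collections import Counter

SECTION_MARK = "มาตรา"  # Thai word for "section"


def split_citation(cited):
    """Split 'ป.อ. มาตรา 265' into ('ป.อ.', '265'); return None if malformed."""
    law, sep, section = cited.partition(SECTION_MARK)
    if not sep or not law.strip() or not section.strip():
        return None
    return law.strip(), section.strip()


def count_references(citations):
    """Return (per_law, per_section) counters of case references."""
    per_section = Counter()
    for row in citations:
        # A set counts each (law, section) pair at most once per ruling.
        pairs = {split_citation(s) for s in row.get("sections_cited", [])}
        pairs.discard(None)
        per_section.update(pairs)
    per_law = Counter()
    for (law, _section), n in per_section.items():
        per_law[law] += n
    return per_law, per_section


rows = [
    {"doc_id": "d1", "sections_cited": ["ป.อ. มาตรา 265", "ป.อ. มาตรา 268"]},
    {"doc_id": "d2", "sections_cited": ["ป.อ. มาตรา 265", "ป.อ. มาตรา 265"]},
]
per_law, per_section = count_references(rows)
print(per_law["ป.อ."])               # 3
print(per_section[("ป.อ.", "265")])  # 2
```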
#### Original Law Full Text Sources
| Source | URL | Coverage |
|--------|-----|----------|
| สำนักงานคณะกรรมการกฤษฎีกา | [krisdika.go.th](https://www.krisdika.go.th) | Official -- all Thai laws |
| HF: ocs-krisdika | [`open-law-data-thailand/ocs-krisdika`](https://huggingface.co/datasets/open-law-data-thailand/ocs-krisdika) | 7,042 laws (structured) |
| HF: thailaw | [`pythainlp/thailaw-v1.0`](https://huggingface.co/datasets/pythainlp/thailaw-v1.0) | 43,012 laws (full text) |
| ราชกิจจานุเบกษา | [ratchakitcha.soc.go.th](http://www.ratchakitcha.soc.go.th) | Original gazette PDFs |
## Source Details
1. **krisdika** (Office of the Council of State) -- Structured laws from [`open-law-data-thailand/ocs-krisdika`](https://huggingface.co/datasets/open-law-data-thailand/ocs-krisdika)
2. **thailaw** (PyThaiNLP) -- Full-text Thai laws from [`pythainlp/thailaw-v1.0`](https://huggingface.co/datasets/pythainlp/thailaw-v1.0)
3. **supreme_court** -- Supreme Court rulings from [deka.supremecourt.or.th](https://deka.supremecourt.or.th)
## Cleaning Pipeline
The pipeline follows patterns from [OpenThaiGPT/data-processing](https://github.com/OpenThaiGPT/data-processing):
1. Unicode NFC normalization + cc_net punctuation normalization + invisible/non-printing char removal
2. Content cleaning: strip URLs, emails, HTML tags, iframes, markup, IPs, hashtags
3. Encoding corruption detection: GHOST (double-encoding), NONECHAR (invalid Thai codepoints), NONE_TONE_MARK (OCR artifacts)
4. Quality filtering: Thai ratio >= 30%, text length 100 to 500,000 chars, garbled ratio < 10%
5. Exact hash deduplication (MD5)
6. Stratified 90/5/5 split by source
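
Steps 1, 4, 5, and 6 can be approximated with the standard library. This is a rough sketch and not the actual pipeline code. In particular, the Thai ratio is assumed to be the share of non-whitespace characters in the Unicode Thai block (U+0E00 to U+0E7F), the garbled-ratio check is omitted, and the function names and seed are hypothetical.

```python
import hashlib
import random
import unicodedata
from collections import defaultdict

MIN_CHARS, MAX_CHARS = 100, 500_000
MIN_THAI_RATIO = 0.30


def thai_ratio(text):
    """Share of non-whitespace characters in the Thai block (assumed definition)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0e00" <= c <= "\u0e7f" for c in chars) / len(chars)


def clean_and_dedup(records):
    """NFC-normalize, apply length and Thai-ratio filters, drop exact duplicates."""
    seen = set()
    for rec in records:
        text = unicodedata.normalize("NFC", rec["text"])
        if not (MIN_CHARS <= len(text) <= MAX_CHARS):
            continue
        if thai_ratio(text) < MIN_THAI_RATIO:
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield {**rec, "text": text}


def stratified_split(records, seed=0, train_frac=0.90, val_frac=0.05):
    """Split each source separately so every split keeps the source mix."""
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for source in sorted(by_source):
        group = by_source[source]
        rng.shuffle(group)
        n_train = int(len(group) * train_frac)
        n_val = int(len(group) * val_frac)
        splits["train"] += group[:n_train]
        splits["val"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]  # remainder goes to test
    return splits
```

Hashing the normalized text means two documents that differ only in Unicode composition collapse to a single record.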
## Intended Use
- Continual Pre-Training (CPT) of Thai language models for the legal domain
- Legal citation analysis and knowledge graph construction
- Legal text analysis and information retrieval
## Limitations
- Supreme Court rulings may contain archaic Thai legal terminology
- Some older statutory records may contain OCR artifacts inherited from the source
- Year metadata is extracted from the text on a best-effort basis
- Citation extraction is regex-based and may miss some references
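
To illustrate how a regex-based extractor can miss references, here is a small sketch. The pattern is an illustrative assumption and not the one used to build this corpus.

```python
import re

# Hypothetical pattern: the word "มาตรา" (section) followed by a number,
# optionally with a slash part such as 193/30.
SECTION_RE = re.compile(r"มาตรา\s*(\d+(?:/\d+)?)")


def find_sections(text: str) -> list[str]:
    """Return section numbers that directly follow the word 'มาตรา'."""
    return SECTION_RE.findall(text)


print(find_sections("ป.อ. มาตรา 265"))          # ['265']
# "และ" means "and": the second number has no marker of its own, so it is missed.
print(find_sections("ป.อ. มาตรา 265 และ 268"))  # ['265']
```

Enumerations like the second example are one plausible reason `sections_cited` may be incomplete for some rulings.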
## Citation
```
@misc{thai-legal-corpus-2026,
title={Thai Legal Corpus v1},
year={2026},
howpublished={HuggingFace},
}
```
Provider:
phoneee



