
phoneee/thai-legal-corpus

Source: Hugging Face. Updated: 2026-03-10. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/phoneee/thai-legal-corpus
Description:
---
license: cc-by-4.0
language:
- th
task_categories:
- text-generation
- fill-mask
- text-classification
tags:
- legal
- thai-law
- continual-pretraining
- supreme-court
- statutory-law
- case-law
- citation-analysis
size_categories:
- 100K<n<1M
---

# Thai Legal Corpus v1

Cleaned and deduplicated Thai legal text corpus for Continual Pre-Training (CPT), with structured citation metadata for legal analysis.

## Dataset Description

| Metric | Value |
|--------|-------|
| Total records | 176,543 |
| Total size | 6.03 GB |
| Avg doc length | 11,952 chars |
| Sources | 3 |
| Year range | 1874-2026 |
| Splits | train 158,887 / val 8,826 / test 8,830 |

### Sources

| Source | Records |
|--------|---------|
| krisdika | 6,743 |
| supreme_court | 127,012 |
| thailaw | 42,788 |

## Repository Structure

```
data/
  train.jsonl, val.jsonl, test.jsonl  -- Cleaned text (for CPT)
metadata/
  deka_citations.jsonl                -- Citation metadata (join by doc_id)
  law_section_index.json              -- Law-section-case mapping
  corpus_stats.json
```

## Data Schema

### Training Data (`data/*.jsonl`)

| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Full legal text (cleaned, 100 to 500,000 characters) |
| `source` | string | `krisdika`, `thailaw`, or `supreme_court` |
| `doc_id` | string | Unique document ID |
| `year` | int (optional) | Publication year (CE) |

### Citation Metadata (`metadata/deka_citations.jsonl`)

Structured fields for Supreme Court rulings. Join with training data on `doc_id`.

| Field | Type | Description |
|-------|------|-------------|
| `doc_id` | string | Matches `doc_id` in training data |
| `case_id` | string | Case number (e.g., `7946/2568`) |
| `case_year` | string | Buddhist Era year |
| `laws_cited` | list[str] | Laws referenced in ruling |
| `sections_cited` | list[str] | Specific sections cited (e.g., `ป.อ. มาตรา 265`) |

### Law-Section-Case Index (`metadata/law_section_index.json`)

Structured mapping from Thai laws to sections to Supreme Court case counts.

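
A mapping of this kind could, in principle, be derived from the `sections_cited` field of the citation metadata. The following is a minimal sketch, not the dataset's actual build script: it assumes every section string follows the `<law abbreviation> มาตรา <number>` shape of the `ป.อ. มาตรา 265` example (มาตรา means "section"), and the `doc_id` values, the `build_index` helper, and the nested-dict output layout are all hypothetical. The real layout of `law_section_index.json` is not documented in this card.

```python
import re
from collections import defaultdict

# Hypothetical sample records that mirror the documented citation fields.
# The doc_id values are invented for illustration.
citations = [
    {"doc_id": "sc-0001", "sections_cited": ["ป.อ. มาตรา 265", "ป.อ. มาตรา 268"]},
    {"doc_id": "sc-0002", "sections_cited": ["ป.อ. มาตรา 265", "ป.พ.พ. มาตรา 420"]},
    {"doc_id": "sc-0003", "sections_cited": ["ป.อ. มาตรา 265", "ป.อ. มาตรา 265"]},
]

# Assumed pattern: "<law name> มาตรา <section number>".
SECTION_RE = re.compile(r"^(?P<law>.+?)\s+มาตรา\s+(?P<section>\S+)$")


def build_index(records):
    """Count, per law and section, how many distinct cases cite it."""
    index = defaultdict(lambda: defaultdict(int))
    for record in records:
        # A set makes each case count once per section, even if cited twice.
        for cited in set(record.get("sections_cited", [])):
            match = SECTION_RE.match(cited.strip())
            if match is None:
                continue  # skip strings that do not fit the assumed pattern
            index[match["law"]][match["section"]] += 1
    # Convert to plain dicts so the result can be serialized as JSON.
    return {law: dict(sections) for law, sections in index.items()}


index = build_index(citations)
```

Counting over a set per record is a design choice: it measures how many cases cite a section rather than how many times the section is mentioned.
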

| Metric | Value |
|--------|-------|
| Total laws | 2,210 |
| Total sections | 18,003 |
| Total case references | 667,613 |

**Top laws:** ป.พ.พ. (Civil and Commercial Code; 275,772 cases), ป.อ. (Criminal Code; 102,245), ป.วิ.อ. (Criminal Procedure Code; 99,840), ป.วิ.พ. (Civil Procedure Code; 70,590)

#### Original Law Full Text Sources

| Source | URL | Coverage |
|--------|-----|----------|
| สำนักงานคณะกรรมการกฤษฎีกา (Office of the Council of State) | [law.go.th](https://www.krisdika.go.th) | Official -- all Thai laws |
| HF: ocs-krisdika | [`open-law-data-thailand/ocs-krisdika`](https://huggingface.co/datasets/open-law-data-thailand/ocs-krisdika) | 7,042 laws (structured) |
| HF: thailaw | [`pythainlp/thailaw-v1.0`](https://huggingface.co/datasets/pythainlp/thailaw-v1.0) | 43,012 laws (full text) |
| ราชกิจจานุเบกษา (Royal Gazette) | [ratchakitcha.soc.go.th](http://www.ratchakitcha.soc.go.th) | Original gazette PDFs |

## Sources

1. **krisdika** (Office of the Council of State) -- Structured laws from [`open-law-data-thailand/ocs-krisdika`](https://huggingface.co/datasets/open-law-data-thailand/ocs-krisdika)
2. **thailaw** (PyThaiNLP) -- Full-text Thai laws from [`pythainlp/thailaw-v1.0`](https://huggingface.co/datasets/pythainlp/thailaw-v1.0)
3. **supreme_court** -- Supreme Court rulings from [deka.supremecourt.or.th](https://deka.supremecourt.or.th)

## Cleaning Pipeline

The pipeline follows patterns from [OpenThaiGPT/data-processing](https://github.com/OpenThaiGPT/data-processing):

1. Unicode NFC normalization + cc_net punctuation normalization + invisible/non-printing char removal
2. Content cleaning: strip URLs, emails, HTML tags, iframes, markup, IPs, hashtags
3. Encoding corruption detection: GHOST (double-encoding), NONECHAR (invalid Thai codepoints), NONE_TONE_MARK (OCR artifacts)
4. Quality filtering: Thai ratio >= 30%, text length between 100 and 500,000 characters, garbled ratio < 10%
5. Exact hash deduplication (MD5)
6. Stratified 90/5/5 split by source

## Intended Use

- Continual Pre-Training (CPT) of Thai language models for the legal domain
- Legal citation analysis and knowledge graph construction
- Legal text analysis and information retrieval

## Limitations

- Supreme Court rulings may contain archaic Thai legal terminology
- Some older statutory records may have OCR artifacts from the source
- Year metadata is a best-effort extraction from the text
- Citation extraction is regex-based and may miss some references

## Citation

```
@misc{thai-legal-corpus-2026,
  title={Thai Legal Corpus v1},
  year={2026},
  howpublished={HuggingFace},
}
```
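
The card says citation metadata joins with the training data on `doc_id`, and that `case_year` is a Buddhist Era year while `year` is CE. A minimal sketch of that join follows, using in-memory JSONL lines in place of the real files. The `doc_id` values, sample records, and the helper names `be_to_ce` and `join_on_doc_id` are hypothetical; only the field names come from the schema tables.

```python
import json

# Hypothetical stand-ins for lines of data/train.jsonl and
# metadata/deka_citations.jsonl; doc_id values are invented.
train_lines = [
    json.dumps({"text": "(ruling text)", "source": "supreme_court",
                "doc_id": "sc-0001", "year": 2025}, ensure_ascii=False),
    json.dumps({"text": "(statute text)", "source": "thailaw",
                "doc_id": "tl-0001"}, ensure_ascii=False),
]
citation_lines = [
    json.dumps({"doc_id": "sc-0001", "case_id": "7946/2568",
                "case_year": "2568", "laws_cited": ["ป.อ."],
                "sections_cited": ["ป.อ. มาตรา 265"]}, ensure_ascii=False),
]


def be_to_ce(be_year: str) -> int:
    """Convert a Buddhist Era year string to CE (BE = CE + 543)."""
    return int(be_year) - 543


def join_on_doc_id(train_lines, citation_lines):
    """Attach citation metadata to training records that share a doc_id."""
    citations = {}
    for line in citation_lines:
        row = json.loads(line)
        citations[row["doc_id"]] = row

    joined = []
    for line in train_lines:
        record = json.loads(line)
        meta = citations.get(record["doc_id"])
        if meta is not None:  # only Supreme Court rulings carry citations
            record["citation"] = meta
            record["case_year_ce"] = be_to_ce(meta["case_year"])
        joined.append(record)
    return joined


joined = join_on_doc_id(train_lines, citation_lines)
```

With the real files, the same function would accept open file handles, for example `join_on_doc_id(open("data/train.jsonl", encoding="utf-8"), open("metadata/deka_citations.jsonl", encoding="utf-8"))`, at the cost of holding every joined record in memory.
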
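
Parts of the cleaning pipeline listed above (NFC normalization, the length and Thai-ratio filters, and exact MD5 deduplication) can be sketched with the standard library alone. This is a simplified illustration, not the OpenThaiGPT code: the function names are hypothetical, and the Thai ratio is assumed to mean the share of non-whitespace characters in the Thai Unicode block (U+0E00 to U+0E7F), which the card does not specify. Content stripping, corruption detection, and the garbled-ratio filter are omitted.

```python
import hashlib
import unicodedata

MIN_CHARS = 100          # lower length bound from the card
MAX_CHARS = 500_000      # upper length bound from the card
MIN_THAI_RATIO = 0.30    # "Thai ratio >= 30%"


def thai_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Thai block (assumed definition)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    thai = sum(1 for c in chars if "\u0e00" <= c <= "\u0e7f")
    return thai / len(chars)


def clean_corpus(docs):
    """Normalize, filter by length and Thai ratio, then drop exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        text = unicodedata.normalize("NFC", doc).strip()
        if not (MIN_CHARS <= len(text) <= MAX_CHARS):
            continue
        if thai_ratio(text) < MIN_THAI_RATIO:
            continue
        # Exact deduplication: identical normalized text gives an identical digest.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


thai_doc = "กฎหมาย" * 20      # 120 Thai characters
english_doc = "a" * 200        # long enough, but contains no Thai
short_doc = "กฎหมาย"          # Thai, but below the length floor
kept = clean_corpus([thai_doc, thai_doc, english_doc, short_doc])
```

Hashing after normalization means two documents that differ only in Unicode composition collapse into one; hashing the raw text would treat them as distinct.
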
Provider:
phoneee