Maathis-com/ohada-ccja-corpus

Name: Maathis-com/ohada-ccja-corpus
Creator: Maathis-com
Published: 2026-03-20 21:49:17
License: CC-BY-4.0

Hugging Face · Updated 2026-03-20 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Maathis-com/ohada-ccja-corpus

Description:

---
language:
- fr
license: cc-by-4.0
size_categories:
- 1K-10K
task_categories:
- text-classification
- summarization
- token-classification
- text-generation
tags:
- legal
- african-nlp
- ohada
- court-decisions
- francophone-africa
- legal-nlp
- low-resource
pretty_name: OHADA CCJA Court Decisions Corpus
dataset_info:
  features:
  - name: case_id
    dtype: string
  - name: case_number
    dtype: string
  - name: date
    dtype: date32
  - name: year
    dtype: int32
  - name: legal_domain
    dtype: string
  - name: case_type
    dtype: string
  - name: jurisdiction
    dtype: string
  - name: formation
    dtype: string
  - name: plaintiff
    dtype: string
  - name: defendant
    dtype: string
  - name: articles_cited
    dtype: string
  - name: dispute_summary
    dtype: string
  - name: reasoning
    dtype: string
  - name: ruling
    dtype: string
  - name: full_text
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 2841
  - name: validation
    num_examples: 609
  - name: test
    num_examples: 609
---

# OHADA-CCJA Court Decisions Corpus

## Dataset Description

A curated corpus of **4,059 court decisions** from the **Cour Commune de Justice et d'Arbitrage (CCJA)**, the supranational court of the Organisation pour l'Harmonisation en Afrique du Droit des Affaires (OHADA). OHADA harmonizes business law across **17 African member states**: Benin, Burkina Faso, Cameroon, Central African Republic, Chad, Comoros, Democratic Republic of Congo, Republic of Congo, Côte d'Ivoire, Equatorial Guinea, Gabon, Guinea, Guinea-Bissau, Mali, Niger, Senegal, and Togo.

This dataset provides structured access to CCJA jurisprudence spanning over two decades (1997–2023), making it a unique resource for African legal NLP research.

### Why This Dataset Matters

Legal NLP is a rapidly growing field, yet virtually all existing benchmarks and datasets focus on Common Law (US, UK) or EU/Continental European legal systems. African legal systems, and in particular pan-African harmonized law, are entirely absent from the research landscape.
This dataset addresses that gap by providing:

- **The first structured, ML-ready corpus of OHADA CCJA decisions** in any language
- **Rich annotation layers**: not just full text, but separately extracted dispute summaries, judicial reasoning, rulings, legal domain labels, and cited articles
- **Pan-African geographic coverage**: cases involving parties and disputes from all 17 OHADA member states
- **Temporal depth**: decisions spanning from 1997 to 2023, enabling longitudinal legal analysis

### Supported Tasks

| Task | Input | Target | Metric |
|------|-------|--------|--------|
| **Legal domain classification** | `full_text` or `dispute_summary` | `legal_domain` (16 classes) | F1-macro |
| **Legal judgment summarization** | `full_text` | `ruling` or `dispute_summary` | ROUGE-L |
| **Legal reasoning extraction** | `dispute_summary` + `ruling` | `reasoning` | ROUGE-L, BERTScore |
| **Legal NER** | `full_text` | Parties, jurisdictions, legal articles | Entity-level F1 |
| **Cited article prediction** | `full_text` or `dispute_summary` | `articles_cited` | Recall@k |

### Languages

French (fr), the working language of the OHADA CCJA.
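The Recall@k metric listed for cited article prediction in the Supported Tasks table can be computed per case as sketched below. This is a minimal illustration, not the authors' evaluation code: the function name `recall_at_k` and the article labels are made up for the example, and it assumes `articles_cited` has already been split into a list of article identifiers (the card does not specify the delimiter used inside that string field).

```python
from typing import Sequence


def recall_at_k(predicted: Sequence[str], gold: Sequence[str], k: int) -> float:
    """Fraction of gold articles found among the top-k ranked predictions."""
    gold_set = set(gold)
    if not gold_set:
        # No gold articles for this case: define recall as 0.0 (an assumption;
        # such cases could equally be skipped when averaging).
        return 0.0
    top_k = set(predicted[:k])
    return len(gold_set & top_k) / len(gold_set)


# Hypothetical labels, for illustration only.
gold_articles = ["Art. 10", "Art. 49"]
ranked_predictions = ["Art. 49", "Art. 28", "Art. 10"]

print(recall_at_k(ranked_predictions, gold_articles, k=2))  # 0.5
print(recall_at_k(ranked_predictions, gold_articles, k=3))  # 1.0
```

Averaging this value over all test cases that have a non-empty `articles_cited` field gives a corpus-level Recall@k.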
## Dataset Structure

### Data Fields

| Field | Type | Description | Completeness |
|-------|------|-------------|--------------|
| `case_id` | string | Unique identifier (e.g., `OHADA-CCJA-00001`) | 100% |
| `case_number` | string | Official case number (Numéro d'arrêt) | 94.8% |
| `date` | date | Date of the decision (ISO 8601) | 95.3% |
| `year` | int | Year extracted from date | 95.3% |
| `legal_domain` | string | Area of OHADA law | 99.8% |
| `case_type` | string | Subject matter (e.g., Saisie immobilière) | 90.5% |
| `jurisdiction` | string | Court (CCJA) | 100% |
| `formation` | string | Chamber (Première/Deuxième/Troisième chambre) | 1.0% |
| `plaintiff` | string | Name(s) of the plaintiff(s) | 86.9% |
| `defendant` | string | Name(s) of the defendant(s) | 85.6% |
| `articles_cited` | string | Legal articles referenced in the decision | 85.4% |
| `dispute_summary` | string | Summary of the dispute (Exposé du litige) | 99.8% |
| `reasoning` | string | Court's reasoning (Motif) | 27.1% |
| `ruling` | string | Final ruling (Dispositif) | 99.8% |
| `full_text` | string | Complete text of the decision | 100% |
| `source` | string | Provenance: `file1`, `file2`, or `both` | 100% |

**Note on field completeness:** This dataset was compiled from two complementary sources with different annotation depths. The `reasoning` field (court's motif) is available for approximately 1,100 cases from Source 1. The `articles_cited`, `plaintiff`, and `defendant` fields are primarily available from Source 2 (approximately 3,500 cases). The `source` column indicates provenance, allowing researchers to filter for task-specific subsets. See "Source Data" below.

### Data Splits

| Split | Cases | Purpose |
|-------|-------|---------|
| `train` | 2,841 | Model training |
| `validation` | 609 | Hyperparameter tuning |
| `test` | 609 | Final evaluation |

Splits are stratified by `legal_domain` to preserve class proportions across all splits.
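The completeness percentages in the Data Fields table can be re-derived for any split with a small helper like the one below. This is a sketch under stated assumptions: the helper name `field_completeness` is hypothetical, the toy rows are invented for the example, and it treats both `None` and blank strings as missing because the card does not say how absent values are encoded.

```python
from typing import Any, Iterable, Mapping


def is_missing(value: Any) -> bool:
    """Assumption: None and empty/whitespace-only strings both count as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def field_completeness(rows: Iterable[Mapping[str, Any]], field: str) -> float:
    """Percentage of rows in which `field` holds a non-missing value."""
    rows = list(rows)
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if not is_missing(row.get(field)))
    return 100.0 * filled / len(rows)


# Invented toy rows that only mimic the schema.
toy_rows = [
    {"case_id": "OHADA-CCJA-00001", "reasoning": "Attendu que", "formation": None},
    {"case_id": "OHADA-CCJA-00002", "reasoning": None, "formation": None},
    {"case_id": "OHADA-CCJA-00003", "reasoning": "", "formation": "Première chambre"},
    {"case_id": "OHADA-CCJA-00004", "reasoning": None, "formation": None},
]

for name in ("case_id", "reasoning", "formation"):
    print(f"{name}: {field_completeness(toy_rows, name):.1f}%")
```

Because iterating over a Hugging Face `Dataset` yields one dictionary per row, the same call should work on a loaded split, for example `field_completeness(dataset["train"], "reasoning")`.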
### Legal Domain Distribution

The `legal_domain` field covers **16 categories** across the major branches of OHADA harmonized law:

| Legal Domain | Count | % |
|---|---|---|
| Droit des voies d'exécution (Enforcement law) | 2,144 | 52.8% |
| Droit des sociétés commerciales et GIE (Commercial companies) | 479 | 11.8% |
| Droit commercial général (General commercial law) | 428 | 10.5% |
| Droit des procédures collectives (Insolvency) | 256 | 6.3% |
| Droit des sûretés (Security interests law) | 195 | 4.8% |
| Règlement de procédure de la CCJA (CCJA procedural rules) | 188 | 4.6% |
| Droit des sociétés coopératives (Cooperative law) | 155 | 3.8% |
| Droit de l'arbitrage (Arbitration law) | 136 | 3.4% |
| Droit des contrats de transport par route (Road transport contracts) | 41 | 1.0% |
| Droit des contrats (Contract law) | 13 | 0.3% |
| Droit des assurances (Insurance law) | 8 | 0.2% |
| Other rare categories | 16 | 0.4% |

## Dataset Creation

### Source Data

The corpus was compiled from two complementary sources of publicly available CCJA decisions:

- **Source 1** (1,115 unique cases after deduplication): Decisions with extracted judicial reasoning (`reasoning`/motif), dispute summaries, and rulings. These are typically original court decision texts scraped from OHADA legal databases.
- **Source 2** (3,642 unique cases after deduplication from 10,410 raw records): Decisions with cited legal articles (`articles_cited`), detailed party names (`plaintiff`/`defendant`), and descriptive case type labels (`case_type`). These include annotated case analyses with structured metadata.

**548 cases were present in both sources** and were merged to combine the richest available annotations. The final dataset contains 4,059 unique cases.
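The merge of overlapping cases can be pictured as a field-by-field coalesce: take the preferred source's value when it is populated, otherwise fall back to the other source. The sketch below is an illustration only, not the authors' pipeline. The function name `merge_case`, the `PREFER_SOURCE_2` set, and the example records are assumptions made for the example; the card states only that `reasoning` comes from Source 1, `articles_cited` from Source 2, and that `plaintiff`/`defendant` prefer Source 2 where populated.

```python
from typing import Any, Dict

# Assumption: fields for which Source 2 wins when it has a value.
PREFER_SOURCE_2 = {"articles_cited", "plaintiff", "defendant"}


def _is_missing(value: Any) -> bool:
    """Treat None and blank strings as missing (an assumption)."""
    return value is None or (isinstance(value, str) and not value.strip())


def merge_case(source1: Dict[str, Any], source2: Dict[str, Any]) -> Dict[str, Any]:
    """Coalesce two records describing the same decision into one."""
    merged: Dict[str, Any] = {}
    for field in sorted(set(source1) | set(source2)):
        first, second = source1.get(field), source2.get(field)
        if field in PREFER_SOURCE_2:
            first, second = second, first
        # Keep the preferred value unless it is missing.
        merged[field] = second if _is_missing(first) else first
    merged["source"] = "both"  # provenance label used by the corpus
    return merged


# Invented records for illustration.
record_1 = {"case_id": "X-1", "reasoning": "Motif du juge", "plaintiff": "", "articles_cited": None}
record_2 = {"case_id": "X-1", "reasoning": None, "plaintiff": "Société A", "articles_cited": "Art. 1"}

print(merge_case(record_1, record_2))
```

Keeping the preference rule in a single set makes it easy to adjust which source wins for a given field without touching the merge loop.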
**Field availability by source:**

| Field | Source 1 only (`file1`) | Source 2 only (`file2`) | Merged (`both`) |
|-------|------------------------|------------------------|------------------|
| `reasoning` | ✅ | ❌ | ✅ |
| `articles_cited` | ❌ | ✅ | ✅ |
| `plaintiff` / `defendant` | Sparse (~3%) | ✅ (~95%) | ✅ |
| `case_type` (descriptive) | Generic | ✅ Descriptive | ✅ Descriptive |
| `dispute_summary` | ✅ | ✅ | ✅ (best of both) |
| `ruling` | ✅ | ✅ | ✅ (best of both) |

### Preprocessing

1. **Deduplication**: Content-hash-based deduplication removed 6,768 duplicates from Source 2 and 33 from Source 1, plus 3 cross-source duplicates
2. **Label normalization**: Spelling and accent variants in `legal_domain` were harmonized (e.g., "suretés" → "sûretés"), reducing 18 raw labels to 16 clean categories
3. **Date parsing**: Dates were converted from mixed formats (dd/mm/yyyy and French text like "27 avril 2015") to ISO 8601. A small number of implausible dates (pre-1995 or post-2024) resulting from parsing errors were set to null
4. **Schema unification**: Columns were standardized to English names with consistent types
5. **Cross-source merge**: For the 548 overlapping cases, the most complete value for each field was retained using a coalesce strategy (e.g., `reasoning` from Source 1, `articles_cited` from Source 2, `plaintiff`/`defendant` preferring Source 2 where populated)

### Ethical Considerations

- **Public records**: All CCJA decisions are matters of public record, publicly accessible through official OHADA channels
- **Party names**: Names of litigants appear as published in official court records.
  Researchers working with this data should consider whether their downstream applications require further anonymization
- **Jurisdictional scope**: OHADA law governs business disputes; this corpus does not contain criminal cases or cases involving minors
- **Class imbalance**: The corpus reflects the CCJA's actual caseload, which skews toward enforcement law (~53%) and commercial disputes. This distribution mirrors real litigation patterns but may not represent the full breadth of legal issues in OHADA member states. Researchers should account for this imbalance in model training and evaluation

### Licensing

This dataset is released under **CC-BY-4.0**. OHADA court decisions are public legal documents. The added value of this dataset lies in its structuring, cleaning, annotation, and packaging for ML research.

## Usage

### Loading with HuggingFace Datasets

```python
from datasets import load_dataset

dataset = load_dataset("Maathis-com/ohada-ccja-corpus")

# Access splits
train = dataset["train"]
print(f"Training examples: {len(train)}")
print(train[0])
```

### Example: Legal Domain Classification

```python
from datasets import load_dataset

dataset = load_dataset("Maathis-com/ohada-ccja-corpus")

# Use dispute_summary as input, legal_domain as label
train_texts = dataset["train"]["dispute_summary"]
train_labels = dataset["train"]["legal_domain"]
```

### Example: Filter for Cases with Court Reasoning

```python
# Assumes `dataset` was loaded as shown above.
# ~1,100 cases across all splits have the court's reasoning (motif);
# bool() treats both None and empty strings as missing.
reasoning_subset = dataset["train"].filter(lambda x: bool(x["reasoning"]))
print(f"Train cases with reasoning: {len(reasoning_subset)}")
```

### Example: Filter for Cases with Cited Articles

```python
# Assumes `dataset` was loaded as shown above.
# ~3,500 cases across all splits have cited legal articles
articles_subset = dataset["train"].filter(lambda x: bool(x["articles_cited"]))
print(f"Train cases with cited articles: {len(articles_subset)}")
```

### Example: Full-Feature Subset (Merged Cases)

```python
# Assumes `dataset` was loaded as shown above.
# About 546 merged cases (from both sources) carry the full set of annotation
# layers; that figure is corpus-wide, so the train split holds only a subset.
full_subset = dataset["train"].filter(lambda x: x["source"] == "both")
print(f"Merged cases in train: {len(full_subset)}")
```

## Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{ohada_ccja_corpus_2026,
  title={OHADA-CCJA Court Decisions Corpus: A Dataset for African Legal NLP},
  author={Foutse Yuehgoh and Priyanka N and Patrick NGUETCHOUESSI},
  year={2026},
  url={https://huggingface.co/datasets/Maathis-com/ohada-ccja-corpus},
  note={Submitted at Deep Learning Indaba 2026, Nigeria}
}
```

## Contact

For questions about this dataset, please contact the dataset creator or open an issue on the [HuggingFace repository](https://huggingface.co/datasets/Maathis-com/ohada-ccja-corpus/discussions).
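For readers who want to reproduce something like the deduplication and date-parsing steps listed under Preprocessing, the following is a minimal sketch and not the authors' code. The helper names, the choice of SHA-256 over whitespace-normalized, lowercased `full_text`, and the exact regular expressions are all assumptions; the card states only that deduplication was content-hash-based, that dates arrived as dd/mm/yyyy or French text, and that dates before 1995 or after 2024 were set to null.

```python
import hashlib
import re
import unicodedata
from datetime import date
from typing import Dict, List, Optional

FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5, "juin": 6,
    "juillet": 7, "août": 8, "septembre": 9, "octobre": 10,
    "novembre": 11, "décembre": 12,
}


def parse_decision_date(raw: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if unparseable or implausible."""
    text = unicodedata.normalize("NFC", raw).strip().lower()
    numeric = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text)
    worded = re.fullmatch(r"(\d{1,2})(?:er)?\s+(\w+)\s+(\d{4})", text)
    if numeric:
        day, month, year = (int(part) for part in numeric.groups())
    elif worded and worded.group(2) in FRENCH_MONTHS:
        day, year = int(worded.group(1)), int(worded.group(3))
        month = FRENCH_MONTHS[worded.group(2)]
    else:
        return None
    try:
        parsed = date(year, month, day)
    except ValueError:  # e.g. 31 February
        return None
    # Plausibility window taken from the Preprocessing description.
    if not 1995 <= parsed.year <= 2024:
        return None
    return parsed.isoformat()


def content_hash(text: str) -> str:
    """Hash of whitespace-collapsed, lowercased text (normalization is an assumption)."""
    normalized = " ".join(unicodedata.normalize("NFC", text).split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record seen for each distinct full_text hash."""
    seen, unique = set(), []
    for record in records:
        digest = content_hash(record["full_text"])
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


print(parse_decision_date("27 avril 2015"))  # 2015-04-27
print(parse_decision_date("27/04/2015"))     # 2015-04-27
```

Returning `None` for out-of-range years mirrors the card's choice to null implausible dates instead of guessing a correction.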

Provider:

Maathis-com
