
antonhome/indian-legal-supervised-fine-tuning-data

Hugging Face · Updated 2025-12-21 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/antonhome/indian-legal-supervised-fine-tuning-data
Resource description:
---
dataset_info:
  features:
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: response
    dtype: string
  splits:
  - name: train
    num_bytes: 16325944571
    num_examples: 6055371
  download_size: 9101140222
  dataset_size: 16325944571
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
language:
- hi
- en
- ta
- mr
- bn
- or
- te
---

# 🇮🇳 LegalBrain Indic Legal Corpus

A large-scale **multilingual Indian legal dataset** curated to support research in:

- Domain-specific **LLM training**
- **Legal question answering**
- **Policy reasoning & case retrieval**
- Agentic systems for **legal workflow automation**

This dataset contains text drawn from publicly available legal sources across multiple Indian languages, including: **English, Hindi, Marathi, Bengali, Kannada, Tamil, Telugu, Odia**, and others.

The corpus is structured and processed to be directly usable for **supervised fine-tuning**, **RAG pipelines**, and **conversational legal assistants**.

---

## 📦 Dataset Structure

After preprocessing and supervised alignment, the dataset is provided in the format:

| context | question | response |
|--------|----------|-----------|
| Multi-paragraph legal text (case summary, statute, commentary) | A legal or interpretation-type query derived from the context | Answer grounded in the specific information contained in *context* |

This enables training of:

- Legal chatbots
- Agentic reasoning systems
- Legal retrieval-augmented QA models
- Court case summarizers
- Argumentation-based LLM pipelines

---

## 🏛️ **Data Sources**

Data was collected only from **publicly and legally accessible sources**, including:

- Supreme Court judgments
- High Court decisions
- Law Commission reports
- Public legal textbooks & commentaries
- Open legal news archives
- Public domain legal Q&A portals
- Government acts, rules, and notifications

**No proprietary or licensed content** was used.
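As an illustration of how the `context` / `question` / `response` columns described under Dataset Structure might feed a supervised fine-tuning job, the sketch below turns one record into a prompt/completion pair. The prompt template, the `to_sft_example` helper, and the `prompt` / `completion` output keys are assumptions made for this example; the dataset card does not prescribe a template.

```python
# Minimal sketch: map one (context, question, response) record to an SFT pair.
# The template and the function name are hypothetical, not part of the dataset.

REQUIRED_FIELDS = ("context", "question", "response")


def to_sft_example(record):
    """Return {"prompt": ..., "completion": ...} for a single dataset record.

    Raises ValueError if a required field is missing, not a string, or blank.
    """
    missing = [
        name
        for name in REQUIRED_FIELDS
        if not isinstance(record.get(name), str) or not record[name].strip()
    ]
    if missing:
        raise ValueError(f"record is missing usable fields: {missing}")

    # Assumed template: context first, then the question, then an answer cue.
    prompt = (
        f"Context:\n{record['context'].strip()}\n\n"
        f"Question:\n{record['question'].strip()}\n\n"
        "Answer:\n"
    )
    return {"prompt": prompt, "completion": record["response"].strip()}


if __name__ == "__main__":
    sample = {
        "context": "The right to constitutional remedies allows citizens to "
                   "approach the Supreme Court under Article 32...",
        "question": "Which constitutional provision allows an individual to "
                    "directly move the Supreme Court?",
        "response": "Article 32 of the Constitution grants the right to "
                    "constitutional remedies.",
    }
    pair = to_sft_example(sample)
    print(pair["prompt"] + pair["completion"])
```

Keeping the context inside the prompt mirrors the card's stated goal that answers stay grounded in the supplied text; a different template may suit a particular base model better.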
---

## 🧹 Cleaning & Normalization Pipeline

Large-scale legal data is noisy. The following steps were used:

1. **HTML + Boilerplate Removal**
   Removal of menus, footers, ads, repeated headers, legal boilerplate markers.

2. **OCR + Text Correction**
   OCR applied to scanned PDFs using **Tesseract + custom normalization**, followed by regex-based cleanup for:
   - Section markers
   - Citations
   - Case line references

3. **Language Detection & Segmentation**
   Auto-sharding by language → Sentence & clause-level segmentation using spaCy + Indic NLP.

4. **De-duplication**
   Removed near-duplicate clauses across multiple case reports using MinHash (LSH).

---

## ✨ Argilla-Based Supervised Dataset Construction

To transform unstructured text into **(context, question, response)** triplets, the dataset was processed using **Argilla** for human feedback and model-assisted annotation.

### Workflow:

1. Select meaningful legal text chunks (150–600 words).
2. Use a prompting pipeline to generate **candidate questions and answers**.
3. Load candidate examples into **Argilla workspace**.
4. Curate + refine:
   - Fix hallucinations
   - Improve citation grounding
   - Ensure all responses **strictly reference context**
5. Export final reviewed dataset to HF.

**This ensures the dataset trains models that cite the law instead of hallucinating it.**

---

## 🏗️ Example Entry

```json
{
  "context": "The right to constitutional remedies allows citizens to approach the Supreme Court under Article 32...",
  "question": "Which constitutional provision allows an individual to directly move the Supreme Court?",
  "response": "Article 32 of the Constitution grants the right to constitutional remedies, enabling citizens to directly approach the Supreme Court."
}
```
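The de-duplication step of the cleaning pipeline names MinHash (LSH) for finding near-duplicate clauses. The card does not publish its implementation, so the following is only a minimal pure-Python sketch of the MinHash part, under stated assumptions: 3-word shingles, 128 hash functions simulated by seeding SHA-1, and direct pairwise comparison of signatures. The LSH banding that makes this scale to millions of clauses is omitted, and all function names are hypothetical.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of lowercase k-word shingles of a whitespace-split text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, item):
    """64-bit integer hash of `item` under a given seed (stand-in for a permutation)."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_perm=128):
    """MinHash signature: for each seed, the minimum hash over all items."""
    if not items:
        raise ValueError("cannot build a signature from an empty set")
    return [min(_seeded_hash(seed, item) for item in items)
            for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    clause = ("The petitioner approached the High Court under Article 226 "
              "seeking a writ of mandamus against the respondent authority")
    near_copy = clause.replace("authority", "board")
    sig_1 = minhash_signature(shingles(clause))
    sig_2 = minhash_signature(shingles(near_copy))
    # A high estimate flags the pair as near-duplicates; the cutoff is a tuning choice.
    print(round(estimated_jaccard(sig_1, sig_2), 2))
```

In practice a pipeline of this size would more likely use an LSH index to avoid comparing every pair of clauses; the similarity threshold at which a clause counts as a duplicate is not stated in the card.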
Provider:
antonhome