Azri-Muhsin/sangraha-tamil-cleaned

Name: Azri-Muhsin/sangraha-tamil-cleaned
Creator: Azri-Muhsin
Published: 2026-01-05 06:27:48
License: cc-by-4.0 (as declared in the dataset card metadata)

Source: Hugging Face. Updated 2026-01-05; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Azri-Muhsin/sangraha-tamil-cleaned


Description:

---
language:
- ta
license: cc-by-4.0
task_categories:
- text-generation
- language-modeling
tags:
- nlp
- tamil
- sangraha
- continuous-pretraining
size_categories:
- 1M<n<10M
---

# Sangraha Tamil (Cleaned & Verified)

## Dataset Description

This dataset is a cleaned, processed subset of the **[AI4Bharat Sangraha (Verified)](https://huggingface.co/datasets/ai4bharat/sangraha)** dataset, restricted to the **Tamil** language. It was prepared for Continuous Pre-Training (CPT) of Large Language Models (LLMs) such as Llama 3 and Qwen 2.5/3, with the goal of improving their performance on Indic languages.

* **Original Source:** AI4Bharat Sangraha (Verified Split)
* **Language:** Tamil (ta)
* **Format:** Parquet
* **Approximate Size:** ~7.8 million rows (processed)

## Data Processing Pipeline

To ensure high-quality training tokens, the following cleaning heuristics were applied to the raw data. Most of the heavy-duty preprocessing had already been done by the AI4Bharat Sangraha team.

1. **Unicode Normalization:** Applied NFC normalization.
2. **HTML Removal:** All HTML tags and artifacts were stripped.
3. **Script Density Filter:** Rows were dropped if **Tamil characters constituted < 50%** of the text. This filters out English-heavy content mixed into the dataset.
4. **Length Filtering:** Sentences with fewer than 20 characters were removed to reduce noise.
5. **Heuristic Noise Filtering:** Removed common web-scrape noise phrases (e.g., "home screen", "click here", "amazon card").
6. **Deduplication:** Relied on the original Sangraha "Verified" pipeline (MinHash LSH).

## Usage

Because the dataset is large, it is recommended to load it with `streaming=True`, or otherwise to load it in chunks via `load_dataset`.

```python
from datasets import load_dataset

# Load in streaming mode (recommended for local PCs):
# rows are fetched lazily instead of downloading the full dataset first.
dataset = load_dataset(
    "Azri-Muhsin/sangraha-tamil-cleaned",
    split="train",
    streaming=True,
)

# Print the text of the first three rows.
for sample in dataset.take(3):
    print(sample["text"])
```
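The cleaning heuristics listed under "Data Processing Pipeline" could be sketched roughly as follows. This is an illustrative reconstruction, not the author's actual code: the function names (`tamil_ratio`, `clean_row`), the regex-based HTML stripping, the choice to ignore whitespace when computing the Tamil ratio, and the choice to drop whole rows that contain a noise phrase are all assumptions. Only the thresholds (50% Tamil, 20 characters) and the three example phrases come from the dataset card. Deduplication is not shown, since the card states it was inherited from the upstream Sangraha pipeline.

```python
import re
import unicodedata
from typing import Optional

# Tamil Unicode block: U+0B80 to U+0BFF.
TAMIL_RE = re.compile(r"[\u0B80-\u0BFF]")
# Crude tag matcher (assumption; the card does not say how HTML was stripped).
HTML_TAG_RE = re.compile(r"<[^>]+>")
# Example phrases quoted in the card; the full list is not published.
NOISE_PHRASES = ("home screen", "click here", "amazon card")

MIN_CHARS = 20         # length threshold from the card
MIN_TAMIL_RATIO = 0.5  # script-density threshold from the card


def tamil_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Tamil block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if TAMIL_RE.match(c)) / len(chars)


def clean_row(text: str) -> Optional[str]:
    """Return the cleaned text, or None if the row should be dropped."""
    text = unicodedata.normalize("NFC", text)   # 1. Unicode normalization
    text = HTML_TAG_RE.sub(" ", text)           # 2. HTML removal
    text = re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace
    if len(text) < MIN_CHARS:                   # 4. length filter
        return None
    if tamil_ratio(text) < MIN_TAMIL_RATIO:     # 3. script density filter
        return None
    lowered = text.lower()
    if any(phrase in lowered for phrase in NOISE_PHRASES):  # 5. noise filter
        return None
    return text
```

A function like this could be applied row by row, for example inside a `datasets` `filter`/`map` step, keeping only rows for which `clean_row` returns a string.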

Provider:

Azri-Muhsin
