
Azri-Muhsin/sangraha-tamil-cleaned

Source: Hugging Face | Updated: 2026-01-05 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/Azri-Muhsin/sangraha-tamil-cleaned
Resource description:
---
language:
- ta
license: cc-by-4.0
task_categories:
- text-generation
- language-modeling
tags:
- nlp
- tamil
- sangraha
- continuous-pretraining
size_categories:
- 1M<n<10M
---

# Sangraha Tamil (Cleaned & Verified)

## Dataset Description

This dataset is a cleaned, processed subset of the **[AI4Bharat Sangraha (Verified)](https://huggingface.co/datasets/ai4bharat/sangraha)** dataset, restricted to the **Tamil** language. It was prepared for Continuous Pre-Training (CPT) of Large Language Models (LLMs) such as Llama 3 and Qwen 2.5/3, with the goal of improving their performance on Indic languages.

* **Original Source:** AI4Bharat Sangraha (Verified Split)
* **Language:** Tamil (ta)
* **Format:** Parquet
* **Approximate Size:** 7.8 million rows (processed)

## Data Processing Pipeline

To ensure high-quality training tokens, the following cleaning heuristics were applied to the raw data. (Most of the heavy-duty preprocessing had already been done by the AI4Bharat/Sangraha team.)

1. **Unicode Normalization:** Applied NFC normalization.
2. **HTML Removal:** All HTML tags and artifacts were stripped.
3. **Script Density Filter:** Rows were dropped if **Tamil characters constituted < 50%** of the text. This filters out English-heavy content mixed into the dataset.
4. **Length Filtering:** Sentences with fewer than 20 characters were removed to reduce noise.
5. **Heuristic Noise Filtering:** Removed common web-scrape noise phrases (e.g., "home screen", "click here", "amazon card").
6. **Deduplication:** Relied on the original Sangraha "Verified" pipeline (MinHash LSH).

## Usage

Because the dataset is large, it is recommended to load it with `streaming=True`, or to read it via `load_dataset` in chunks.

```python
from datasets import load_dataset

# Load in streaming mode (recommended for local PCs): rows are fetched
# lazily instead of downloading the whole dataset first.
dataset = load_dataset("Azri-Muhsin/sangraha-tamil-cleaned", split="train", streaming=True)

# Print the text of the first three rows.
for sample in dataset.take(3):
    print(sample["text"])
```
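Steps 1 to 5 of the Data Processing Pipeline can be illustrated with a minimal sketch. This is not the author's actual code, which is not included in the dataset card. The helper names (`tamil_ratio`, `clean_row`), the regular expressions, the choice to measure the Tamil share over non-whitespace characters, and the choice to delete noise phrases rather than drop whole rows are all assumptions made for illustration; only the thresholds (50%, 20 characters) and the three example phrases come from the card.

```python
import re
import unicodedata
from typing import Optional

# Thresholds and phrases taken from the dataset card; everything else is assumed.
MIN_TAMIL_RATIO = 0.5
MIN_CHARS = 20
NOISE_PHRASES = ("home screen", "click here", "amazon card")

# Simplified patterns (assumption): a real pipeline may use an HTML parser.
HTML_TAG = re.compile(r"<[^>]+>")
NOISE = re.compile("|".join(re.escape(p) for p in NOISE_PHRASES), re.IGNORECASE)


def tamil_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Tamil Unicode block (U+0B80 to U+0BFF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tamil = sum(1 for c in chars if "\u0b80" <= c <= "\u0bff")
    return tamil / len(chars)


def clean_row(text: str) -> Optional[str]:
    """Return the cleaned text, or None if the row should be dropped."""
    text = unicodedata.normalize("NFC", text)   # step 1: Unicode normalization
    text = HTML_TAG.sub(" ", text)              # step 2: strip HTML tags
    text = NOISE.sub(" ", text)                 # step 5: delete noise phrases
    text = re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace
    if len(text) < MIN_CHARS:                   # step 4: length filter
        return None
    if tamil_ratio(text) < MIN_TAMIL_RATIO:     # step 3: script density filter
        return None
    return text
```

Calling `clean_row` on each raw row and keeping only the non-`None` results would approximate the filtering described above. Deduplication (step 6) is left out because the card states it was inherited from the upstream Sangraha pipeline.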
Provided by:
Azri-Muhsin