Azri-Muhsin/sangraha-tamil-cleaned
Source: Hugging Face. Updated 2026-01-05; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Azri-Muhsin/sangraha-tamil-cleaned
Dataset description:
---
language:
- ta
license: cc-by-4.0
task_categories:
- text-generation
- language-modeling
tags:
- nlp
- tamil
- sangraha
- continuous-pretraining
size_categories:
- 1M<n<10M
---
# Sangraha Tamil (Cleaned & Verified)
## Dataset Description
This dataset is a cleaned, processed subset of the **[AI4Bharat Sangraha (Verified)](https://huggingface.co/datasets/ai4bharat/sangraha)** dataset, restricted to the **Tamil** language. It was prepared for Continuous Pre-Training (CPT) of Large Language Models (LLMs) such as Llama 3 and Qwen 2.5/3, with the aim of improving their performance on Indic languages.
* **Original Source:** AI4Bharat Sangraha (Verified Split)
* **Language:** Tamil (ta)
* **Format:** Parquet
* **Approximate Size:** 7.8 million rows (after processing)
## Data Processing Pipeline
To ensure high-quality training tokens, the following cleaning heuristics were applied to the raw data. Most of the heavy-duty preprocessing had already been handled upstream by the AI4Bharat Sangraha team.
1. **Unicode Normalization:** Applied NFC normalization.
2. **HTML Removal:** All HTML tags and artifacts stripped.
3. **Script Density Filter:** Rows were dropped if **Tamil characters constituted < 50%** of the text. This filters out English-heavy content mixed into the dataset.
4. **Length Filtering:** Sentences with fewer than 20 characters were removed to reduce noise.
5. **Heuristic Noise Filtering:** Removed common web-scrape noise phrases (e.g., "home screen", "click here", "amazon card").
6. **Deduplication:** Relied on the original Sangraha "Verified" pipeline (MinHash LSH).
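
Steps 1 to 5 above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the script actually used to build the dataset: the function names `tamil_ratio` and `clean_text` are hypothetical, the regex-based tag stripping is a naive stand-in, the density ratio is computed over non-whitespace characters (the card does not say which denominator was used), and step 5 is read here as deleting the phrases rather than dropping the row. Only the 50% and 20-character thresholds and the three example phrases come from the list above. Deduplication is not reproduced because it was inherited from Sangraha.

```python
import html
import re
import unicodedata
from typing import Optional

# Thresholds taken from the list above; everything else is an assumption.
MIN_TAMIL_RATIO = 0.5
MIN_CHARS = 20
# Only the three example phrases named in the card; the full list is unknown.
NOISE_PHRASES = ("home screen", "click here", "amazon card")

TAG_RE = re.compile(r"<[^>]+>")  # naive HTML tag matcher (assumption)
NOISE_RE = re.compile("|".join(map(re.escape, NOISE_PHRASES)), re.IGNORECASE)


def tamil_ratio(text: str) -> float:
    """Share of non-whitespace characters inside the Unicode Tamil block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tamil = sum(1 for c in chars if "\u0b80" <= c <= "\u0bff")
    return tamil / len(chars)


def clean_text(text: str) -> Optional[str]:
    """Return the cleaned text, or None if the row should be dropped."""
    # Step 1: Unicode NFC normalization.
    text = unicodedata.normalize("NFC", text)
    # Step 2: strip tags, decode entities, collapse leftover whitespace.
    text = html.unescape(TAG_RE.sub(" ", text))
    text = re.sub(r"\s+", " ", text).strip()
    # Step 3: drop rows where Tamil characters make up less than 50%.
    if tamil_ratio(text) < MIN_TAMIL_RATIO:
        return None
    # Step 4: drop texts shorter than 20 characters.
    if len(text) < MIN_CHARS:
        return None
    # Step 5: delete known web-scrape noise phrases (one possible reading).
    text = NOISE_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

With a `datasets` object, a function like this could be applied through `.map()` followed by `.filter()` to discard the rows that come back as `None`.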
## Usage
Because the dataset is large, loading it with `streaming=True` is recommended; alternatively, it can be loaded in chunks with `load_dataset`.
```python
from datasets import load_dataset

# Load in streaming mode (recommended for local PCs)
dataset = load_dataset("Azri-Muhsin/sangraha-tamil-cleaned", split="train", streaming=True)

# Print the text of the first three samples
for sample in dataset.take(3):
    print(sample["text"])
```
Provider:
Azri-Muhsin



