ItzSiden/TLumina_Pretrain_Dataset
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ItzSiden/TLumina_Pretrain_Dataset
Resource description:
# T-Lumina Pretrain Dataset
## 📊 Overview
- Size: ~3.3GB
- Total lines: ~5.6M
- Language:
  - Bangla: ~89%
  - English: ~11%
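
The card does not say whether the language split is measured by lines, characters, or tokens. As a rough way to sanity-check it on a sample, the sketch below labels each line by its dominant script, assuming the Bengali Unicode block (U+0980 to U+09FF) stands in for Bangla and ASCII letters stand in for English. The function names are hypothetical and this is not the method used to produce the figures above.

```python
def script_counts(text):
    """Count Bengali-block characters and ASCII letters in a string."""
    bangla = sum(1 for ch in text if "\u0980" <= ch <= "\u09ff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return bangla, latin


def classify_line(line):
    """Label a line by its dominant script; ties and letterless lines are 'other'."""
    bangla, latin = script_counts(line)
    if bangla > latin:
        return "bangla"
    if latin > bangla:
        return "english"
    return "other"


def language_share(lines):
    """Return the fraction of lines carrying each label."""
    totals = {"bangla": 0, "english": 0, "other": 0}
    n = 0
    for line in lines:
        totals[classify_line(line)] += 1
        n += 1
    return {label: (count / n if n else 0.0) for label, count in totals.items()}
```

Because `language_share` accepts any iterable of strings, it can be fed a file handle or a sampled subset instead of the full corpus.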
## 🎯 Purpose
This dataset is designed for pretraining a Bangla-focused educational language model (T-Lumina).
## 📚 Data Sources
- Wikipedia-style articles
- News text
- Informational and educational content
- Mixed Bangla-English contextual data
## 🧹 Processing
- Cleaned (removed noise, refs, junk)
- Filtered short/incomplete lines
- Shuffled for better generalization
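
The three steps above could look roughly like the following. This is a minimal sketch, not the actual pipeline: the card gives no rules or thresholds, so the bracketed-number reference pattern, the `min_chars=20` cutoff, and the fixed shuffle seed are all assumptions chosen for illustration.

```python
import random
import re

# Assumed noise pattern: bracketed numeric citation markers such as "[12]".
REF_MARKER = re.compile(r"\[\d+\]")
WHITESPACE = re.compile(r"\s+")


def clean_line(line):
    """Drop citation markers, collapse whitespace runs, and trim the ends."""
    line = REF_MARKER.sub("", line)
    return WHITESPACE.sub(" ", line).strip()


def preprocess(lines, min_chars=20, seed=0):
    """Clean each line, drop lines shorter than min_chars, then shuffle."""
    cleaned = (clean_line(line) for line in lines)
    kept = [line for line in cleaned if len(line) >= min_chars]
    # A seeded generator keeps the shuffle reproducible across runs.
    random.Random(seed).shuffle(kept)
    return kept
```

The shuffle here holds all kept lines in memory, which suits a sample; a corpus of several gigabytes may need a chunked or on-disk shuffle instead.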
## ⚠️ Notes
- Contains formal and semi-formal language
- Not conversation-focused
- Best used for base pretraining, followed by fine-tuning
## 🚀 Usage
Suitable for:
- Bangla LLM pretraining
- Educational AI systems
- Multilingual experiments
---
license: mit
---
Provided by:
ItzSiden



