munzurul/bangla-corpus
Hugging Face · Updated 2026-02-20 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/munzurul/bangla-corpus
Resource description:
---
language:
- bn
license: cc-by-4.0
task_categories:
- text-generation
pretty_name: TituLM Bangla Corpus
dataset_info:
- config_name: common_crawl
  features:
  - name: document_id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 147466205018
    num_examples: 24310843
  download_size: 50490079447
  dataset_size: 147466205018
- config_name: romanized
  features:
  - name: text
    dtype: string
  - name: document_id
    dtype: string
  splits:
  - name: train
    num_bytes: 12117078927
    num_examples: 5170442
  download_size: 7564096164
  dataset_size: 12117078927
- config_name: translated
  features:
  - name: text
    dtype: string
  - name: document_id
    dtype: string
  splits:
  - name: train
    num_bytes: 16287904499
    num_examples: 1744165
  download_size: 6194606598
  dataset_size: 16287904499
configs:
- config_name: default
  data_files:
  - split: train
    path: '**/train-*.parquet'
- config_name: common_crawl
  data_files:
  - split: train
    path: common_crawl/train-*
- config_name: romanized
  data_files:
  - split: train
    path: romanized/train-*
- config_name: translated
  data_files:
  - split: train
    path: translated/train-*
---
## TituLM Bangla Corpus
This dataset is associated with the paper [TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking](https://huggingface.co/papers/2502.11187).
**TituLM Bangla Corpus** is one of the largest clean Bangla corpora, prepared for pretraining, continual pretraining, or fine-tuning large language models (LLMs) to improve Bangla text generation.
This dataset draws on diverse sources and categories of Bangla text. The largest part is filtered Common Crawl data. Existing Common Crawl datasets have problems with text extraction from HTML pages and with Bangla-specific filtering, because they were built for multilingual purposes.
With that in mind, we used the [Trafilatura](https://trafilatura.readthedocs.io/en/latest/) tool to extract text from Common Crawl web pages; compared with existing extraction approaches, we found that it performs better. We then generated several Bangla-specific quality signals over the dataset and filtered documents using thresholds on those signals.
We also fine-tuned the [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) model to translate English text to Bangla and to convert Bangla text to Romanized Bangla. We hope this dataset helps the Bangla research community build better Bangla language models.
## Getting Started
To download the full dataset:
```py
from datasets import load_dataset
dataset = load_dataset("hishab/titulm-bangla-corpus")
```
To download a subset:
```py
from datasets import load_dataset
dataset = load_dataset("hishab/titulm-bangla-corpus", data_dir="<subset_name>")
# Example:
# dataset = load_dataset("hishab/titulm-bangla-corpus", data_dir="common_crawl")
```
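The `common_crawl` config alone is roughly 50 GB to download (see `download_size` in the card metadata), so it can help to inspect a few records before fetching everything. The sketch below assumes the `streaming=True` mode of `load_dataset` and the config names listed in the metadata (`common_crawl`, `romanized`, `translated`). `open_stream` and `preview` are hypothetical helper names used only for illustration.

```python
from itertools import islice


def open_stream(config="common_crawl", repo="hishab/titulm-bangla-corpus"):
    """Open one config lazily; records are fetched only as they are iterated."""
    # Imported here so that preview() can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(repo, config, split="train", streaming=True)


def preview(records, n=3, width=80):
    """Return the first n records with `text` truncated to `width` characters."""
    return [
        {"document_id": r["document_id"], "text": r["text"][:width]}
        for r in islice(records, n)
    ]


# Usage (needs network access):
# for row in preview(open_stream("translated")):
#     print(row)
```

`preview` accepts any iterable of records with `document_id` and `text` fields, so it works equally on a streamed split and on a fully downloaded one.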
## Datasets Summary
TituLM Bangla Corpus contains three different categories:
- **Common Crawl**:
  - **Filtered**: Contains filtered Common Crawl data. We queried the Common Crawl dump with Amazon Athena, selecting pages by Bangla language and language-specific keywords. We then extracted text using [Trafilatura](https://trafilatura.readthedocs.io/en/latest/), a tool well suited to web text extraction, and applied several filtering methods. This is the cleanest version of the dataset.
- **Translation**: Contains Bangla translations of English news articles. We used a fine-tuned [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) model to translate the data. In our observation, the fine-tuned **nllb** model performs better than Google or other available translators. We generated the fine-tuning data using GPT-4 and GPT-4o models.
- **Romanized**: Contains transliterated Bangla data from Bangla Common Crawl pages and news articles. We used a fine-tuned [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) model to transliterate the data. We generated the fine-tuning data using GPT-4 and GPT-4o models.
## Datasets Statistics
- **Document counts**: The total number of documents, where a document is the text of one web page, for example a news article.
- **Word counts**: The total number of words as counted by the [basic tokenizer](https://sagorbrur.github.io/bnlp/docs/tokenization#basic-tokenizer).
- **Token counts**: We trained a Tiktoken tokenizer on a large chunk of Bangla text. Token counts here are the number of tokens produced by the [hishab/titulm-llama-3.2-3b-v2.0](https://huggingface.co/hishab/titulm-llama-3.2-3b-v2.0) tokenizer, which is the original Llama 3.1 tokenizer extended with 48k Bangla tokens.
| Category | Total Documents (In Millions) | Total Words (In Billions) | Total Tokens (In Billions) |
|----------------|-----------------|-------------|------------------------|
| Common Crawl Filtered | 24.3 | 9.94 | 14.80 |
| Translated | 1.74 | 1.08 | 1.47 |
| Romanized | 5.17 | 1.89 | 3.87 |
| **Total** | **31.21** | **12.91** | **20.14** |
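As a quick sanity check on the table, the totals row and the tokens-per-word ratio of each category can be recomputed from the per-category figures. The snippet below is only an arithmetic sketch over the rounded numbers in the table; `STATS`, `totals`, and `tokens_per_word` are illustrative names.

```python
# (documents in millions, words in billions, tokens in billions), copied from the table
STATS = {
    "Common Crawl Filtered": (24.3, 9.94, 14.80),
    "Translated": (1.74, 1.08, 1.47),
    "Romanized": (5.17, 1.89, 3.87),
}


def totals(stats):
    """Sum each column across categories, rounded to two decimals."""
    return tuple(round(sum(column), 2) for column in zip(*stats.values()))


def tokens_per_word(stats):
    """Average number of tokenizer tokens per word for each category."""
    return {
        name: round(tokens / words, 2)
        for name, (_, words, tokens) in stats.items()
    }


print(totals(STATS))
print(tokens_per_word(STATS))
```

The column sums reproduce the **Total** row. The ratios come out at about 1.5 tokens per word for Common Crawl, 1.4 for Translated, and 2.0 for Romanized, so the Romanized text is tokenized less compactly than the Bangla-script text.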
## Datasets Preparation in Details
### Common Crawl
- We used Amazon Athena to query the Common Crawl datasets. We queried by content language and URL host TLD, and dumped the query results.
- We used [Trafilatura](https://trafilatura.readthedocs.io/en/latest/) to extract text from the HTML pages returned by the queries. We found that Trafilatura extracts text from web HTML pages better than the alternatives we tried.
- We generated a total of 20 quality signals for each document, such as word count, character count, sentence count, lines ending with terminal punctuation, and adult content.
- In the final step, we set a threshold for each quality signal following the **Gopher rules**: for example, the word count must be between 50 and 10000, the adult-content flag must be false, and the sentence count must be greater than 5. We applied those thresholds and separated the documents into passed and failed sets.
- After filtering, **36.76%** of documents passed and **62.54%** failed.
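The thresholding step above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it computes only five signals rather than 20, the sentence splitter (danda `।`, `?`, `!`) and the keyword-based `adult_terms` check are simplifying assumptions, and the names `QualitySignals`, `compute_signals`, and `passes_thresholds` are hypothetical. The three thresholds are the ones quoted in the list.

```python
import re
from dataclasses import dataclass

# Assumption: Bangla sentences end with the danda "।", "?" or "!".
TERMINAL_CHARS = "।?!"
SENTENCE_END = re.compile(r"[।?!]+")


@dataclass
class QualitySignals:
    word_count: int
    char_count: int
    sentence_count: int
    terminal_line_ratio: float  # share of non-empty lines ending in terminal punctuation
    is_adult: bool


def compute_signals(text, adult_terms=frozenset()):
    """Compute a small, illustrative subset of per-document quality signals."""
    words = text.split()
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    terminal = sum(1 for ln in lines if ln[-1] in TERMINAL_CHARS)
    return QualitySignals(
        word_count=len(words),
        char_count=len(text),
        sentence_count=len(sentences),
        terminal_line_ratio=terminal / len(lines) if lines else 0.0,
        # Placeholder check: flag a document if any word is in a blocklist.
        is_adult=any(w in adult_terms for w in words),
    )


def passes_thresholds(sig):
    """Apply the thresholds quoted in the text: 50-10000 words, >5 sentences, not adult."""
    return (
        50 <= sig.word_count <= 10000
        and sig.sentence_count > 5
        and not sig.is_adult
    )
```

Keeping signal computation separate from thresholding means the signals can be stored once per document and different threshold settings tried afterwards without reprocessing the text.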
### Translated
- We prepared custom English-to-Bangla translation datasets using OpenAI GPT-4 and GPT-4o models, and had the datasets reviewed by human annotators.
- We fine-tuned the [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) model on that dataset, and our evaluation shows promising results on test datasets. Compared to Google Translate, our fine-tuned model's translations seem more natural. We hope to publish the model soon.
- Finally, we selected an English newspaper dataset and translated the full dataset to Bangla using the fine-tuned model.
### Romanized
- We prepared custom Bangla-to-Romanized-Bangla datasets using OpenAI GPT-4 and GPT-4o models, and had the datasets reviewed by human annotators.
- We fine-tuned the [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) model on that dataset, and our evaluation shows promising results on test datasets. We hope to publish the model soon.
- Finally, we romanized a selected Common Crawl Bangla dataset using the fine-tuned model.
## Citation
```
@misc{nahin2025titullmsfamilybanglallms,
title={TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking},
author={Shahriar Kabir Nahin and Rabindra Nath Nandi and Sagor Sarker and Quazi Sarwar Muhtaseem and Md Kowsher and Apu Chandraw Shill and Md Ibrahim and Mehadi Hasan Menon and Tareq Al Muntasir and Firoj Alam},
year={2025},
eprint={2502.11187},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.11187},
}
```
Provider:
munzurul



