munzurul/bangla-corpus

Name: munzurul/bangla-corpus
Creator: munzurul
Published: 2026-02-20 19:47:07
License: No description available

Hugging Face · Updated 2026-02-20 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/munzurul/bangla-corpus

Resource description:

---
language:
- bn
license: cc-by-4.0
task_categories:
- text-generation
pretty_name: TituLM Bangla Corpus
dataset_info:
- config_name: common_crawl
  features:
  - name: document_id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 147466205018
    num_examples: 24310843
  download_size: 50490079447
  dataset_size: 147466205018
- config_name: romanized
  features:
  - name: text
    dtype: string
  - name: document_id
    dtype: string
  splits:
  - name: train
    num_bytes: 12117078927
    num_examples: 5170442
  download_size: 7564096164
  dataset_size: 12117078927
- config_name: translated
  features:
  - name: text
    dtype: string
  - name: document_id
    dtype: string
  splits:
  - name: train
    num_bytes: 16287904499
    num_examples: 1744165
  download_size: 6194606598
  dataset_size: 16287904499
configs:
- config_name: default
  data_files:
  - split: train
    path: '**/train-*.parquet'
- config_name: common_crawl
  data_files:
  - split: train
    path: common_crawl/train-*
- config_name: romanized
  data_files:
  - split: train
    path: romanized/train-*
- config_name: translated
  data_files:
  - split: train
    path: translated/train-*
---

## TituLM Bangla Corpus

This dataset is associated with the paper [TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking](https://huggingface.co/papers/2502.11187).

**TituLM Bangla Corpus** is one of the largest clean Bangla corpora, prepared for pretraining, continual pretraining, or fine-tuning large language models (LLMs) to improve Bangla text generation. The dataset draws on diverse sources and categories of Bangla text. Its largest part is filtered Common Crawl data. Existing Common Crawl datasets have problems with text extraction from HTML pages and with Bangla-specific filtering, because they were built for multilingual purposes. With that in mind, we applied the [Trafilatura](https://trafilatura.readthedocs.io/en/latest/) tool to extract text from Common Crawl web pages.
Compared with the text extraction used in existing datasets, we found that this tool performs better. We generated several Bangla-specific quality signals over the dataset and filtered it using thresholds on those signals. We also prepared a fine-tuned [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) model to translate English text to Bangla and to convert Bangla text to Romanized Bangla. We hope this dataset helps the Bangla research community build better Bangla language models.

## Getting Started

To download the full dataset:

```py
from datasets import load_dataset

dataset = load_dataset("hishab/titulm-bangla-corpus")
```

To download a subset:

```py
from datasets import load_dataset

dataset = load_dataset("hishab/titulm-bangla-corpus", data_dir="<subset_name>")
# example
# dataset = load_dataset("hishab/titulm-bangla-corpus", data_dir="common_crawl")
```

## Datasets Summary

TituLM Bangla Corpus contains three different categories:

- **Common Crawl**:
  - **Filtered**: Contains filtered Common Crawl data. We queried the Common Crawl dump with Amazon Athena, selecting pages by Bangla language and language-specific keywords. We then extracted text using [Trafilatura](https://trafilatura.readthedocs.io/en/latest/), which is a good tool for web text extraction, and applied several filtering methods. This is the cleanest version of the dataset.
- **Translation**: Contains Bangla translations of English news articles. We used a fine-tuned [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) model to translate the data. In our observation, the fine-tuned **nllb** model performs better than Google or other available translators. We generated the fine-tuning data using GPT-4 and GPT-4o models.
- **Romanized**: Contains transliterated Bangla data from Bangla Common Crawl pages and news articles. We used a fine-tuned [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) model to transliterate the data. We generated the fine-tuning data using GPT-4 and GPT-4o models.
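The card's metadata also declares the three subsets as named configs (`common_crawl`, `romanized`, `translated`), and the `common_crawl` download alone is about 50 GB, so it can be convenient to stream a few records before committing to a full download. The following is a minimal sketch rather than the authors' recommended method: `SUBSETS` and `preview_subset` are hypothetical names introduced here, the default repo id is copied from the snippet above and is assumed to be reachable, and the `datasets` library is assumed to be installed.

```python
from itertools import islice

# Config names as declared in the dataset card's YAML metadata.
SUBSETS = ("common_crawl", "romanized", "translated")


def preview_subset(subset, n=3, repo_id="hishab/titulm-bangla-corpus"):
    """Return the first n records of one subset, streamed instead of fully downloaded.

    repo_id is an assumption taken from the card's own example; pass another
    repo id (for example a mirror) if that is where the data lives.
    """
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")

    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    # streaming=True yields records on demand rather than fetching every shard.
    stream = load_dataset(repo_id, subset, split="train", streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    # Each record is a dict with the `document_id` and `text` fields.
    for record in preview_subset("translated", n=2):
        print(record["document_id"], record["text"][:80])
```

Selecting a config by name and selecting a directory with `data_dir` should reach the same Parquet files here, since each config maps to a directory of the same name.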
## Datasets Statistics

- **Document counts**: The total number of documents, that is, web pages or page texts. A news article web page is one example.
- **Word Counts**: The total number of words as counted by the [basic tokenizer](https://sagorbrur.github.io/bnlp/docs/tokenization#basic-tokenizer).
- **Token Counts**: We trained a Tiktoken tokenizer on a large chunk of Bangla text. Token counts here are the number of tokens produced by the [https://huggingface.co/hishab/titulm-llama-3.2-3b-v2.0](https://huggingface.co/hishab/titulm-llama-3.2-3b-v2.0) tokenizer, which is the original Llama 3.1 tokenizer extended with 48k Bangla tokens.

| Category | Total Documents (In Millions) | Total Words (In Billions) | Total Tokens (In Billions) |
|----------------|-----------------|-------------|------------------------|
| Common Crawl Filtered | 24.3 | 9.94 | 14.80 |
| Translated | 1.74 | 1.08 | 1.47 |
| Romanized | 5.17 | 1.89 | 3.87 |
| **Total** | **31.21** | **12.91** | **20.14** |

## Datasets Preparation in Detail

### Common Crawl

- We used Amazon Athena to query the Common Crawl datasets. We queried by content language and URL host TLD, then dumped the query results.
- We used [Trafilatura](https://trafilatura.readthedocs.io/en/latest/) to extract text from the HTML pages returned by those queries. We found that Trafilatura works better for extracting text from web HTML pages.
- We generated quality signals such as document word count, character count, sentence count, lines ending with terminal punctuation, adult content, and so on. In total we generated 20 quality signals for each document.
- In the final step, we set a threshold for each quality signal following the **Gopher rules**: for example, word count must be between 50 and 10000, the adult flag must be false, and sentence count must be greater than 5. We applied those thresholds and separated the documents into passed and failed.
- After filtering, **36.76%** of documents passed and **62.54%** failed.

### Translated

- We prepared custom English-to-Bangla translation datasets using OpenAI GPT-4 and GPT-4o models, and human annotators reviewed the datasets.
- We fine-tuned the [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) model on that dataset, and our evaluation shows promising results on test datasets. Compared to Google Translate, our fine-tuned translation seems more natural. We hope to publish the model soon.
- Finally, we selected an English newspaper dataset and translated the full dataset to Bangla using the fine-tuned model.

### Romanized

- We prepared custom Bangla-to-Romanized-Bangla datasets using OpenAI GPT-4 and GPT-4o models, and human annotators reviewed the datasets.
- We fine-tuned the [NLLB](https://huggingface.co/facebook/nllb-200-distilled-600M) model on that dataset, and our evaluation shows promising results on test datasets. We hope to publish the model soon.
- Finally, we romanized a selected Common Crawl Bangla dataset using the fine-tuned model.

## Citation

```
@misc{nahin2025titullmsfamilybanglallms,
      title={TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking},
      author={Shahriar Kabir Nahin and Rabindra Nath Nandi and Sagor Sarker and Quazi Sarwar Muhtaseem and Md Kowsher and Apu Chandraw Shill and Md Ibrahim and Mehadi Hasan Menon and Tareq Al Muntasir and Firoj Alam},
      year={2025},
      eprint={2502.11187},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.11187},
}
```
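As a note on the Common Crawl filtering step described earlier, the pass/fail decision can be sketched with the three thresholds the card names explicitly (word count between 50 and 10000, sentence count greater than 5, adult flag false). This is a hedged illustration, not the authors' pipeline: the card mentions 20 quality signals but spells out only a few. `QualitySignals`, `compute_signals`, and `passes_filters` are hypothetical names. Whitespace word splitting and sentence splitting on the danda, `?`, and `!` are simplifying assumptions, and `is_adult` is assumed to come from an upstream classifier.

```python
import re
from dataclasses import dataclass

# Assumed sentence terminators: the Bangla danda plus '?' and '!'.
_SENTENCE_END = re.compile(r"[।?!]+")


@dataclass
class QualitySignals:
    word_count: int
    sentence_count: int
    is_adult: bool


def compute_signals(text: str, is_adult: bool = False) -> QualitySignals:
    """Compute three illustrative signals; is_adult is supplied by the caller."""
    words = text.split()  # whitespace tokenization, a stand-in for a real tokenizer
    sentences = [s for s in _SENTENCE_END.split(text) if s.strip()]
    return QualitySignals(len(words), len(sentences), is_adult)


def passes_filters(sig: QualitySignals,
                   min_words: int = 50,
                   max_words: int = 10_000,
                   min_sentences: int = 5) -> bool:
    """Apply the thresholds quoted in the card: a document must meet all of them."""
    return (
        min_words <= sig.word_count <= max_words
        and sig.sentence_count > min_sentences
        and not sig.is_adult
    )


if __name__ == "__main__":
    sentence = " ".join(["শব্দ"] * 10) + "।"   # one 10-word sentence
    document = " ".join([sentence] * 6)         # 60 words, 6 sentences
    print(passes_filters(compute_signals(document)))
```

Keeping signal computation separate from the threshold check means the thresholds can be tuned without recomputing signals for the whole corpus.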

Provider:

munzurul
