hishab/titulm-bangla-corpus
Source: Hugging Face | Updated: 2025-06-20 | Indexed: 2025-04-08
Download link:
https://hf-mirror.com/datasets/hishab/titulm-bangla-corpus
Description:
TituLM Bangla Corpus is one of the largest clean Bangla corpora, prepared for pretraining, continual pretraining, or fine-tuning large language models (LLMs) to improve their Bangla text generation. The dataset contains Bangla text from diverse sources and categories; the largest portion consists of filtered Common Crawl data. Because Bangla text extraction in existing Common Crawl datasets has known issues, the text was re-extracted with the Trafilatura tool and then filtered using thresholds on several quality signals. In addition, a fine-tuned NLLB model was prepared for translating English text into Bangla and for romanizing Bangla text.
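The quality-signal filtering mentioned in the description can be illustrated with a minimal sketch. The listing does not say which signals or thresholds the authors used, so everything below is an assumption for illustration only: the two signals (word count and share of characters in the Unicode Bengali block, U+0980 to U+09FF), the threshold values, and the names `bangla_ratio`, `Thresholds`, `passes_filter`, and `filter_documents` are all hypothetical. Text is assumed to have already been extracted from HTML.

```python
from dataclasses import dataclass

# Unicode Bengali block, used here as a simple script-detection signal.
BANGLA_START, BANGLA_END = 0x0980, 0x09FF


def bangla_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Bengali block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bangla = sum(1 for c in chars if BANGLA_START <= ord(c) <= BANGLA_END)
    return bangla / len(chars)


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values, not the thresholds used for the actual corpus.
    min_words: int = 5
    min_bangla_ratio: float = 0.7


def passes_filter(text: str, t: Thresholds = Thresholds()) -> bool:
    """Keep a document only if every quality signal meets its threshold."""
    enough_words = len(text.split()) >= t.min_words
    mostly_bangla = bangla_ratio(text) >= t.min_bangla_ratio
    return enough_words and mostly_bangla


def filter_documents(docs: list[str], t: Thresholds = Thresholds()) -> list[str]:
    """Return only the documents that pass all thresholds, preserving order."""
    return [d for d in docs if passes_filter(d, t)]
```

A real pipeline would likely combine many more signals (for example repetition or boilerplate measures); the point of the sketch is only the pattern of computing per-document signals and dropping documents that fall below a threshold.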
Provider:
hishab



