sayurio/bangla-wikipedia
Source: Hugging Face | Updated: 2026-04-08 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/sayurio/bangla-wikipedia
Resource description:
---
language:
- bn
license: cc-by-sa-4.0
task_categories:
- text-generation
- fill-mask
- text-classification
- feature-extraction
tags:
- wikipedia
- bengali
- bangla
- jsonl
- nlp
- text-mining
pretty_name: Bangla Wikipedia Dataset
size_categories:
- 100K<n<1M
---
# Bangla (Bengali) Wikipedia Articles Dataset
[Request More Scrapes](https://docs.google.com/forms/d/e/1FAIpQLSdQhqM-YE-1KvLh8E2CKknVBySZh6c58p5SfgfjdSpDUnDdtg/viewform?usp=publish-editor)
[Order Private Scrapes](https://discord.gg/eZ92ZVcDyC)
## Current Progress: Approx 20%
## Dataset Summary
This dataset contains articles extracted from the Bangla (Bengali) Wikipedia. The extraction is still in progress, so coverage is partial. It is intended for Natural Language Processing (NLP) tasks, linguistic research, and training Large Language Models (LLMs) to understand and generate Bengali.
## Copyright and Fair Use
I do not own the copyrights to any of the Wikipedia articles or texts included in this repository. All materials are extracted and uploaded under the principles of fair use, and this dataset is strictly intended for educational and research purposes only (such as machine learning, data analysis, and academic research).
## Use Cases
This dataset can be used for:
* **Language Modeling:** Pre-training or fine-tuning foundation models for Bengali text generation.
* **Masked Language Modeling:** Training models to understand Bengali context and grammar.
* **Text Classification & NLP Research:** Topic modeling, semantic analysis, and linguistic studies.
## Data Structure
This dataset is formatted using **JSON Lines (.jsonl)**.
Each line in the dataset files is a valid, standalone JSON object representing a single Wikipedia article. This format suits machine learning workflows because the text can be streamed line by line without loading the entire dataset into memory.
*(Depending on the exact extraction, typical fields inside each JSON object include the article ID, URL, title, and the main text body.)*
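The line-by-line streaming described above can be sketched with the standard library alone. This is a minimal illustration, not the dataset's official loader: the helper name `iter_articles` is hypothetical, and the field names in the usage note (`title`, `text`) are assumptions based on the "typical fields" mentioned above, so check them against an actual record.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Union


def iter_articles(path: Union[str, Path]) -> Iterator[Dict]:
    """Yield one article dict per non-empty line of a JSON Lines file.

    Only one line is held in memory at a time, so the file can be far
    larger than available RAM.
    """
    # Bengali text is stored as UTF-8, so set the encoding explicitly.
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            yield json.loads(line)
```

For example, `for article in iter_articles("articles.jsonl"): print(article.get("title"))` prints each title in turn. Here `articles.jsonl` is a placeholder path rather than a file name from the repository, and `.get()` is used because the field names are not confirmed.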
## How to Download
You have a few straightforward ways to access this data:
* **Hugging Face Datasets Library:** You can load the dataset directly into your machine learning environment. The library natively understands `.jsonl` files and will parse them for you automatically.
* **Hugging Face CLI:** You can use the command line to download specific `.jsonl` files or the entire repository to your local machine.
* **Direct Download:** You can navigate to the "Files and versions" tab on this repository's web page and manually download the `.jsonl` files directly through your browser.
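The first two options can be sketched in Python as follows. This assumes the `datasets` and `huggingface_hub` packages are installed and that the repository ID is `sayurio/bangla-wikipedia`. The split name `train` is an assumption (it is the usual default when a repository defines no explicit splits), and the function names and the `local_dir` value are hypothetical.

```python
REPO_ID = "sayurio/bangla-wikipedia"


def stream_articles(split: str = "train"):
    """Return a streaming iterable over the dataset without a full download."""
    from datasets import load_dataset  # pip install datasets

    # split="train" is an assumption; adjust it if the repository
    # defines different split names.
    return load_dataset(REPO_ID, split=split, streaming=True)


def download_jsonl_files(local_dir: str = "bangla-wikipedia") -> str:
    """Download only the .jsonl files and return the local directory path."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",           # the repository is a dataset, not a model
        allow_patterns=["*.jsonl"],    # skip README and other non-data files
        local_dir=local_dir,           # hypothetical target directory
    )
```

Both functions need network access when called. `next(iter(stream_articles()))` fetches a single record, which is a quick way to inspect the actual field names before committing to a full download.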
## Licensing and Attribution
All text content in this dataset is sourced from Wikipedia and is licensed under the **Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)** and the **GNU Free Documentation License (GFDL)**.
If you use or redistribute this dataset, you must provide appropriate credit to the Wikimedia Foundation and the original Wikipedia contributors, and you must distribute your contributions under the same license.
* Learn more about Wikipedia's Copyright guidelines: [Wikipedia:Copyrights](https://en.wikipedia.org/wiki/Wikipedia:Copyrights)
Provider:
sayurio



