sayurio/bangla-wikipedia

Name: sayurio/bangla-wikipedia
Creator: sayurio
Published: 2026-04-08 06:25:49
License: No description available

Hugging Face | Updated: 2026-04-08 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/sayurio/bangla-wikipedia


Description:

---
language:
- bn
license: cc-by-sa-4.0
task_categories:
- text-generation
- fill-mask
- text-classification
- feature-extraction
tags:
- wikipedia
- bengali
- bangla
- jsonl
- nlp
- text-mining
pretty_name: Bangla Wikipedia Dataset
size_categories:
- 100K<n<1M
---

# Bangla (Bengali) Wikipedia Articles Dataset

[Request More Scrapes](https://docs.google.com/forms/d/e/1FAIpQLSdQhqM-YE-1KvLh8E2CKknVBySZh6c58p5SfgfjdSpDUnDdtg/viewform?usp=publish-editor)

[Order Private Scrapes](https://discord.gg/eZ92ZVcDyC)

## Current Progress: Approx 20%

## Dataset Summary

This dataset contains articles extracted from the Bangla (Bengali) Wikipedia. It is designed for Natural Language Processing (NLP) tasks, linguistic research, and training Large Language Models (LLMs) to better understand and generate the Bengali language.

## Copyright and Fair Use

I do not own the copyrights to any of the Wikipedia articles or texts included in this repository. All materials are extracted and uploaded under the principles of fair use, and this dataset is intended strictly for educational and research purposes (such as machine learning, data analysis, and academic research).

## Use Cases

This dataset can be used for:

* **Language Modeling:** Pre-training or fine-tuning foundation models for Bengali text generation.
* **Masked Language Modeling:** Training models to understand Bengali context and grammar.
* **Text Classification & NLP Research:** Topic modeling, semantic analysis, and linguistic studies.

## Data Structure

This dataset is formatted as **JSON Lines (.jsonl)**. Each line in the dataset files is a valid, standalone JSON object representing a single Wikipedia article. This format suits machine learning workloads because it lets you stream large amounts of text line by line without loading the entire dataset into memory at once.
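
The line-by-line streaming described above can be sketched in plain Python. This is a minimal illustration, not the author's tooling: the field names `title` and `text`, the helper `iter_articles`, and the two sample records are assumptions made up for the example, so check the actual files before relying on any key.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_articles(path: Path) -> Iterator[dict]:
    """Yield one article dict per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


# Made-up stand-in records; the real files may use different field names.
sample_records = [
    {"title": "ঢাকা", "text": "ঢাকা বাংলাদেশের রাজধানী।"},
    {"title": "নদী", "text": "নদী একটি প্রাকৃতিক জলধারা।"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.jsonl"
    # ensure_ascii=False keeps the Bengali text readable in the file.
    sample_path.write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in sample_records) + "\n",
        encoding="utf-8",
    )
    # Only one record is held in memory at a time while iterating.
    titles = [article["title"] for article in iter_articles(sample_path)]
    total_chars = sum(len(article["text"]) for article in iter_articles(sample_path))
```

Because `iter_articles` is a generator, the same loop works unchanged on a multi-gigabyte file.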
*(Depending on the exact extraction, typical fields inside each JSON object include the article ID, URL, title, and the main text body.)*

## How to Download

You have a few straightforward ways to access this data:

* **Hugging Face Datasets Library:** You can load the dataset directly into your machine learning environment. The library natively understands `.jsonl` files and will parse them for you automatically.
* **Hugging Face CLI:** You can use the command line to download specific `.jsonl` files or the entire repository to your local machine.
* **Direct Download:** You can navigate to the "Files and versions" tab on this repository's web page and manually download the `.jsonl` files directly through your browser.

## Licensing and Attribution

All text content in this dataset is sourced from Wikipedia and is licensed under the **Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)** and the **GNU Free Documentation License (GFDL)**.

If you use or redistribute this dataset, you must provide appropriate credit to the Wikimedia Foundation and the original Wikipedia contributors, and you must distribute your contributions under the same license.

* Learn more about Wikipedia's Copyright guidelines: [Wikipedia:Copyrights](https://en.wikipedia.org/wiki/Wikipedia:Copyrights)
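
For the Hugging Face Datasets Library option listed under "How to Download", a minimal sketch might look like the following. It assumes the `datasets` package is installed and that the repository's `.jsonl` files are exposed as the default `train` split; neither is confirmed by this card. The wrapper name `load_bangla_wikipedia` is hypothetical.

```python
REPO_ID = "sayurio/bangla-wikipedia"  # repository name as given on this page


def load_bangla_wikipedia(streaming: bool = True):
    """Return the dataset; needs `pip install datasets` and network access."""
    # Imported lazily so this module can be imported without the dependency.
    from datasets import load_dataset

    # Assumption: the files load as a single default "train" split.
    # streaming=True iterates over records without downloading everything first.
    return load_dataset(REPO_ID, split="train", streaming=streaming)
```

Calling `next(iter(load_bangla_wikipedia()))` should then return the first article as a dict, whose keys show which fields the files actually contain.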

Provider:

sayurio
