sayurio/pratilipi-bengali-webscrape

Name: sayurio/pratilipi-bengali-webscrape
Creator: sayurio
Published: 2026-03-20 20:52:57
License: No description available

Hugging Face | Updated 2026-03-20 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/sayurio/pratilipi-bengali-webscrape


Resource description:

---
license: mit
task_categories:
- text-generation
- text-classification
- token-classification
language:
- bn
tags:
- bengali
- literature
- stories
- fiction
- web-scraped
- non-ai
pretty_name: Pratilipi Bengali Literature Archive
size_categories:
- 1K<n<10K
---

# Pratilipi Bengali Literature Archive

## Overview

This repository contains a large-scale text dataset scraped from [bengali.pratilipi.com](https://bengali.pratilipi.com/), a leading storytelling and self-publishing platform for Bengali literature. The primary goal of this archive is to preserve a vast collection of purely human-written Bengali fiction, serials, poems, and essays, creating a distinct record of human creativity and storytelling.

## Purpose and Usage

This dataset is published publicly and strictly for **educational, research, linguistic analysis, and archival purposes**. It is an invaluable resource for Natural Language Processing (NLP) researchers, data scientists, and linguists working on:

* Pre-training or fine-tuning Bengali Large Language Models (LLMs) on creative and conversational text.
* Sentiment analysis and narrative structure modeling.
* Studying modern Bengali literary trends, vocabulary, and dialect variations.

## Dataset Details

* **Source:** bengali.pratilipi.com
* **Collection Method:** Web scraping
* **Content Type:** Text (Bengali stories, novels, and literature written by human authors without the use of AI).
* **Repository:** `sayurio/pratilipi-bengali-webscrape`

## Copyright and Fair Use Disclaimer

This archive is created under the principles of **Fair Use** (under Section 107 of the Copyright Act) for purposes such as criticism, comment, teaching, scholarship, and research.

* **No Ownership Claimed:** The creator of this repository does not claim any ownership, authorship, or copyright over the original stories or literary works. All rights, title, and interest in the original text remain entirely with their respective authors and Nasadiya Technologies Pvt. Ltd. (Pratilipi).
* **Non-Commercial:** This dataset is provided completely free of charge and is strictly not intended for commercial gain, monetization, or profit.
* **Transformative Use:** The data has been aggregated, extracted from its original web formatting, and compiled specifically for computational analysis, archiving, and educational study. This represents a transformative use of the original publicly available material.

**Takedown Requests:** If you are a copyright holder or an author whose work is included in this dataset and wish for it to be removed from this archive, please open an issue or contact the repository owner directly. Please submit a removal request specifying the exact story titles, URLs, or text snippets you wish to have taken down so they can be accurately located within the dataset and removed.

## How to Use

You can load this dataset directly into your Python environment using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("sayurio/pratilipi-bengali-webscrape")

# View the structure of the first literary entry
print(dataset['train'][0])
```
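Because the text is web-scraped, a quick sanity check after loading can help confirm that an entry is predominantly Bengali script before it is used for training or linguistic analysis. The sketch below is one possible approach, not part of the dataset's tooling. The names `bengali_ratio` and `keep_mostly_bengali` and the `0.8` threshold are hypothetical. Since this card does not document the column names, the helpers take plain strings rather than assuming a field.

```python
def bengali_ratio(text: str) -> float:
    """Share of base letters that fall in the Bengali Unicode block (U+0980-U+09FF).

    Only characters for which str.isalpha() is true are counted, so combining
    vowel signs, digits, punctuation, and whitespace are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    bengali = sum(1 for ch in letters if "\u0980" <= ch <= "\u09ff")
    return bengali / len(letters)


def keep_mostly_bengali(texts, threshold=0.8):
    """Return the strings whose Bengali letter share is at least `threshold`.

    The default threshold is an illustrative assumption, not a recommended value.
    """
    return [t for t in texts if bengali_ratio(t) >= threshold]


# Small in-memory samples; replace with strings taken from the loaded dataset.
samples = ["বাংলা সাহিত্য", "Click here to subscribe", "বাংলা abc"]
print(keep_mostly_bengali(samples))
```

To apply this to the loaded dataset, first inspect `dataset['train'][0]` to find which field holds the story text, then pass those strings to `keep_mostly_bengali`.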

Provider:

sayurio
