sayurio/pratilipi-bengali-webscrape
Source: Hugging Face | Updated: 2026-03-20 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/sayurio/pratilipi-bengali-webscrape
Resource description:
---
license: mit
task_categories:
- text-generation
- text-classification
- token-classification
language:
- bn
tags:
- bengali
- literature
- stories
- fiction
- web-scraped
- non-ai
pretty_name: Pratilipi Bengali Literature Archive
size_categories:
- 1K<n<10K
---
# Pratilipi Bengali Literature Archive
## Overview
This repository contains a text dataset scraped from [bengali.pratilipi.com](https://bengali.pratilipi.com/), a storytelling and self-publishing platform for Bengali literature. The card metadata places its size in the 1K to 10K range. The archive aims to preserve human-written Bengali fiction, serials, poems, and essays as a record of human creativity and storytelling.
## Purpose and Usage
This dataset is published publicly, strictly for **educational, research, linguistic analysis, and archival purposes**. It is intended for Natural Language Processing (NLP) researchers, data scientists, and linguists working on:
* Pre-training or fine-tuning Bengali Large Language Models (LLMs) on creative and conversational text.
* Sentiment analysis and narrative structure modeling.
* Studying modern Bengali literary trends, vocabulary, and dialect variations.
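As a starting point for the vocabulary studies mentioned above, the following is a minimal sketch that counts Bengali word frequencies. It assumes a token is any maximal run of characters from the Unicode Bengali block (U+0980 to U+09FF). The helper names `bengali_tokens` and `vocabulary_counts` and the sample sentences are illustrative and not part of the dataset.

```python
import re
from collections import Counter

# The Unicode "Bengali" block spans U+0980 to U+09FF.
# Assumption: a "token" is a maximal run of characters from that block.
BENGALI_TOKEN = re.compile(r"[\u0980-\u09FF]+")


def bengali_tokens(text: str) -> list[str]:
    """Return the runs of Bengali-block characters found in text."""
    return BENGALI_TOKEN.findall(text)


def vocabulary_counts(texts) -> Counter:
    """Count token frequencies across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(bengali_tokens(text))
    return counts


# Illustrative sample sentences, not taken from the dataset.
sample = ["আমি ভাত খাই।", "আমি বই পড়ি।"]
counts = vocabulary_counts(sample)
print(counts.most_common(3))
```

This tokenizer is deliberately crude: the danda (।) and Latin text fall outside the Bengali block and are dropped, and zero-width joiners inside a word would split it. A proper linguistic study would likely need a dedicated Bengali tokenizer.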
## Dataset Details
* **Source:** bengali.pratilipi.com
* **Collection Method:** Web scraping
* **Content Type:** Text (Bengali stories, novels, and literature written by human authors without the use of AI).
* **Repository:** `sayurio/pratilipi-bengali-webscrape`
## Copyright and Fair Use Disclaimer
This archive is created under the principles of **Fair Use** (Section 107 of the U.S. Copyright Act) for purposes such as criticism, comment, teaching, scholarship, and research.
* **No Ownership Claimed:** The creator of this repository does not claim any ownership, authorship, or copyright over the original stories or literary works. All rights, title, and interest in the original text remain entirely with their respective authors and Nasadiya Technologies Pvt. Ltd. (Pratilipi).
* **Non-Commercial:** This dataset is provided completely free of charge and is strictly not intended for commercial gain, monetization, or profit.
* **Transformative Use:** The data has been aggregated, extracted from its original web formatting, and compiled specifically for computational analysis, archiving, and educational study. This represents a transformative use of the original publicly available material.
**Takedown Requests:** If you are a copyright holder or an author whose work is included in this dataset and want it removed, please open an issue or contact the repository owner directly. Specify the exact story titles, URLs, or text snippets so the material can be located in the dataset and removed.
## How to Use
You can load this dataset directly into your Python environment using the Hugging Face `datasets` library:
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("sayurio/pratilipi-bengali-webscrape")
# View the structure of the first literary entry
print(dataset['train'][0])
```
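This card does not document the split names or column schema, so it may help to inspect them before indexing into `dataset['train']`. The sketch below uses a hypothetical helper, `summarize_splits`, which relies only on each split supporting `len()` and exposing `column_names`, as `datasets.Dataset` objects do.

```python
def summarize_splits(dataset_dict):
    """Map each split name to its row count and column names."""
    return {
        name: {"rows": len(split), "columns": list(split.column_names)}
        for name, split in dataset_dict.items()
    }


def main():
    # Imported here so the helper above works without the library installed.
    from datasets import load_dataset

    # Downloads the dataset; requires network access.
    dataset = load_dataset("sayurio/pratilipi-bengali-webscrape")
    for name, info in summarize_splits(dataset).items():
        print(name, info["rows"], info["columns"])
```

Call `main()` to print one line per split. If no `train` split is listed, adjust the indexing in the example above accordingly.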
***
Provider:
sayurio



