sayurio/storymirror.com-web-scrape
Hugging Face · Updated 2026-03-20 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sayurio/storymirror.com-web-scrape
Resource description:
---
license: mit
task_categories:
- text-generation
- text-classification
- token-classification
language:
- en
- hi
- bn
- mr
- gu
- ta
- te
tags:
- literature
- stories
- poetry
- multilingual
- indian-languages
- web-scraped
- non-ai
pretty_name: StoryMirror Multilingual Literature Archive
size_categories:
- 10K<n<100K
---
# StoryMirror Multilingual Literature Archive
## Overview
This repository contains a text dataset of between 10,000 and 100,000 entries scraped from [storymirror.com](https://storymirror.com/), a digital platform for Indian literature. The goal of the archive is to preserve a multilingual collection of human-written stories, poems, and quotes as a record of creative writing across several Indian languages.
## Purpose and Usage
This dataset is published for **educational, research, linguistic analysis, and archival purposes** only. It is intended for Natural Language Processing (NLP) researchers, data scientists, and linguists who want to:
* Pre-train or fine-tune Large Language Models (LLMs) on multilingual creative writing, especially for underrepresented or low-resource Indian languages.
* Study code-switching, regional dialects, and narrative structures in modern Indian digital literature.
* Perform cross-lingual sentiment analysis and text classification.
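As a starting point for the code-switching use case above, the following is a minimal sketch that counts characters per Unicode script and flags texts that mix scripts. The function names and the `min_share` threshold are illustrative assumptions, not part of the dataset. Script detection is only a rough proxy: Hindi and Marathi both use Devanagari, so it cannot tell them apart, and romanized Hindi would be counted as Latin.

```python
from collections import Counter

# Unicode block ranges for the scripts used by the languages listed in the card.
# Hindi and Marathi share Devanagari, so script alone cannot separate them.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gujarati": (0x0A80, 0x0AFF),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
}


def script_of(ch):
    """Return the script name for one character, or None if unclassified."""
    if ch.isascii() and ch.isalpha():
        return "Latin"
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    return None  # digits, punctuation, whitespace, other scripts


def script_counts(text):
    """Count classified characters per script in a text."""
    return Counter(s for s in map(script_of, text) if s is not None)


def is_mixed_script(text, min_share=0.1):
    """True if at least two scripts each hold min_share of classified characters.

    min_share is a hypothetical threshold chosen for illustration.
    """
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return False
    significant = [s for s, n in counts.items() if n / total >= min_share]
    return len(significant) >= 2
```

For example, `is_mixed_script("मैं office जा रहा हूँ")` flags a Hindi sentence with an embedded English word, while a sentence in a single script is not flagged.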
## Dataset Details
* **Source:** storymirror.com
* **Collection Method:** Web scraping
* **Content Type:** Text (Short stories, micro-tales, poems, and creative writing authored by humans without the use of AI).
* **Repository:** `sayurio/storymirror.com-web-scrape`
## Copyright and Fair Use Disclaimer
This archive is created under the principles of **Fair Use** (Section 107 of the U.S. Copyright Act, 17 U.S.C. § 107) for purposes such as criticism, comment, teaching, scholarship, and research.
* **No Ownership Claimed:** The creator of this repository does not claim any ownership, authorship, or copyright over the original literary works. All rights, title, and interest in the original text remain entirely with their respective authors and StoryMirror Infotech Pvt. Ltd.
* **Non-Commercial:** This dataset is provided free of charge and is not intended for commercial gain, monetization, or profit.
* **Transformative Use:** The data has been aggregated, extracted from its original web formatting, and compiled specifically for computational analysis, archiving, and educational study. This represents a transformative use of the original publicly available material.
**Takedown Requests:** If you are a copyright holder or an author whose work is included in this dataset and you want it removed, please open an issue or contact the repository owner directly. Specify the exact story or poem titles, URLs, or text snippets so the entries can be located in the dataset and removed.
## How to Use
You can load this dataset directly into your Python environment using the Hugging Face `datasets` library:
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("sayurio/storymirror.com-web-scrape")
# View the structure of the first literary entry
print(dataset['train'][0])
```
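This card does not document the column names, so it may help to inspect a record before writing any processing code. The helper below is a small sketch for that purpose. `describe_record` and `preview_chars` are hypothetical names introduced here, and the helper assumes only that a record behaves like a dictionary, which is how the `datasets` library returns rows.

```python
def describe_record(record, preview_chars=80):
    """Summarize each field of a record as (type name, short text preview).

    Works on any dict-like row; no field names are assumed.
    """
    summary = {}
    for key, value in record.items():
        # Flatten newlines so long stories and poems fit on one preview line.
        text = str(value).replace("\n", " ")
        if len(text) > preview_chars:
            text = text[:preview_chars] + "..."
        summary[key] = (type(value).__name__, text)
    return summary
```

After loading the dataset as shown above, calling `describe_record(dataset['train'][0])` returns the actual field names with a preview of each value, which is easier to read than printing a full story.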
Provider:
sayurio



