sayurio/storymirror.com-web-scrape

Name: sayurio/storymirror.com-web-scrape
Creator: sayurio
Published: 2026-03-20 21:54:44
License: No description available

Hugging Face | Updated 2026-03-20 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/sayurio/storymirror.com-web-scrape

Resource description:

---
license: mit
task_categories:
- text-generation
- text-classification
- token-classification
language:
- en
- hi
- bn
- mr
- gu
- ta
- te
tags:
- literature
- stories
- poetry
- multilingual
- indian-languages
- web-scraped
- non-ai
pretty_name: StoryMirror Multilingual Literature Archive
size_categories:
- 10K<n<100K
---

# StoryMirror Multilingual Literature Archive

## Overview

This repository contains a large-scale text dataset scraped from [storymirror.com](https://storymirror.com/), a prominent digital platform for Indian literature. The primary goal of this archive is to preserve a massive, multilingual collection of purely human-written stories, poems, and quotes, creating a distinct record of human creativity and storytelling across various Indian languages.

## Purpose and Usage

This dataset is published publicly and strictly for **educational, research, linguistic analysis, and archival purposes**. It is an invaluable resource for Natural Language Processing (NLP) researchers, data scientists, and linguists looking to:

* Pre-train or fine-tune Large Language Models (LLMs) on multilingual creative writing, especially for underrepresented or low-resource Indian languages.
* Study code-switching, regional dialects, and narrative structures in modern Indian digital literature.
* Perform cross-lingual sentiment analysis and text classification.

## Dataset Details

* **Source:** storymirror.com
* **Collection Method:** Web scraping
* **Content Type:** Text (Short stories, micro-tales, poems, and creative writing authored by humans without the use of AI).
* **Repository:** `sayurio/storymirror.com-web-scrape`

## Copyright and Fair Use Disclaimer

This archive is created under the principles of **Fair Use** (under Section 107 of the Copyright Act) for purposes such as criticism, comment, teaching, scholarship, and research.

* **No Ownership Claimed:** The creator of this repository does not claim any ownership, authorship, or copyright over the original literary works. All rights, title, and interest in the original text remain entirely with their respective authors and StoryMirror Infotech Pvt. Ltd.
* **Non-Commercial:** This dataset is provided completely free of charge and is strictly not intended for commercial gain, monetization, or profit.
* **Transformative Use:** The data has been aggregated, extracted from its original web formatting, and compiled specifically for computational analysis, archiving, and educational study. This represents a transformative use of the original publicly available material.

**Takedown Requests:** If you are a copyright holder or an author whose work is included in this dataset and wish for it to be removed from this archive, please open an issue or contact the repository owner directly. Please submit a removal request specifying the exact story/poem titles, URLs, or text snippets you wish to have taken down so they can be accurately located within the dataset and removed.

## How to Use

You can load this dataset directly into your Python environment using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("sayurio/storymirror.com-web-scrape")

# View the structure of the first literary entry
print(dataset['train'][0])
```
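The card does not document the dataset's column names, so a first step after loading is usually to check which fields exist and how often they are populated. The sketch below is one possible way to do that, and to locate entries by a text snippet (for example when preparing a takedown request). The helper names `summarize_columns` and `find_records`, and the columns `title`, `text`, and `language` in the toy rows, are assumptions made for illustration; they are not taken from the dataset itself.

```python
from collections import Counter


def summarize_columns(records, sample_size=100):
    """Count, per column, how many of the first `sample_size` records hold a non-empty value."""
    counts = Counter()
    for index, record in enumerate(records):
        if index >= sample_size:
            break
        for column, value in record.items():
            if value is not None and value != "":
                counts[column] += 1
    return dict(counts)


def find_records(records, snippet):
    """Yield records in which any string field contains `snippet` (case-insensitive)."""
    needle = snippet.casefold()
    for record in records:
        if any(isinstance(v, str) and needle in v.casefold() for v in record.values()):
            yield record


# Toy rows standing in for dataset entries; the column names are hypothetical.
sample = [
    {"title": "Monsoon", "text": "Rain fell on the old town.", "language": "en"},
    {"title": "Dawn", "text": "", "language": "en"},
]

print(summarize_columns(sample))
print([row["title"] for row in find_records(sample, "old town")])
```

Both helpers accept any iterable of dict-like rows, so they should also work on `dataset['train']` from the snippet above, or on a split loaded with `streaming=True` to avoid downloading everything before the first look.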

Provided by:

sayurio
