sayurio/somewhereinblog-article

Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/sayurio/somewhereinblog-article
Resource description:
---
license: other
task_categories:
- text-generation
- text-classification
language:
- bn
tags:
- bengali
- blog
- community-posts
- articles
- web-scraped
- non-ai
pretty_name: Somewhereinblog Article Archive
size_categories:
- 10K<n<100K
---

# Somewhereinblog Article Archive

## Overview

This repository contains a large-scale text dataset scraped from [m.somewhereinblog.net](https://m.somewhereinblog.net/), the largest and first-ever Bengali community blogging platform. The primary goal of this archive is to preserve a massive collection of purely human-written blog posts, personal stories, socio-political opinions, and community discussions, creating a distinct record of human-authored text separate from AI-generated content.

## Purpose and Usage

This dataset is published publicly and strictly for **educational, research, linguistic analysis, and archival purposes**. It serves as an invaluable resource for Natural Language Processing (NLP) researchers, data scientists, and linguists working on:

* Pre-training or fine-tuning Bengali Large Language Models (LLMs) on informal, conversational, and opinionated text.
* Sentiment analysis, stance detection, and topic modeling on user-generated community discussions.
* Studying the evolution of Bengali internet slang, regional dialects, and digital sociolinguistics.

## Dataset Details

* **Source:** m.somewhereinblog.net
* **Collection Method:** Web scraping
* **Content Type:** Text (User-generated Bengali blog posts, articles, and community entries written by humans without the use of AI).
* **Repository:** `sayurio/somewhereinblog-article`

## Copyright and Fair Use Disclaimer

This archive is created under the principles of **Fair Use** (under Section 107 of the Copyright Act) for purposes such as criticism, comment, teaching, scholarship, and research.

* **No Ownership Claimed:** The creator of this repository does not claim any ownership, authorship, or copyright over the original blog posts, articles, or comments. All rights, title, and interest in the original text remain entirely with their respective bloggers, authors, and Somewhere In Net Ltd.
* **Non-Commercial:** This dataset is provided completely free of charge and is strictly not intended for commercial gain, monetization, or profit.
* **Transformative Use:** The data has been aggregated, extracted from its original web formatting, and compiled specifically for computational analysis, archiving, and educational study. This represents a transformative use of the original publicly available community material.

**Takedown Requests:** If you are a copyright holder, a blogger whose work is included in this dataset, or a representative of the platform, and wish for specific content to be removed from this archive, please open an issue or contact the repository owner directly. Please submit a removal request specifying the exact blog titles, URLs, or text snippets you wish to have taken down so they can be accurately located within the dataset and removed.
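
A takedown request of the kind described above (exact titles, URLs, or text snippets) could be matched against the archive roughly as follows. This is only a minimal sketch, not the repository owner's actual procedure: the card does not document the dataset schema, so the record fields `title`, `url`, and `text`, the `TakedownRequest` class, and the `apply_takedown` helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TakedownRequest:
    """Identifiers supplied by a rights holder (hypothetical structure)."""
    titles: set = field(default_factory=set)
    urls: set = field(default_factory=set)
    snippets: list = field(default_factory=list)


def _norm(value) -> str:
    # Collapse whitespace and ignore case so small formatting
    # differences do not prevent a match.
    return " ".join(str(value or "").split()).casefold()


def matches(record: dict, request: TakedownRequest) -> bool:
    """Return True if the record matches any title, URL, or snippet."""
    # Field names "title", "url", "text" are assumptions, not the real schema.
    title = _norm(record.get("title"))
    url = _norm(record.get("url")).rstrip("/")
    text = _norm(record.get("text"))

    if title and title in {_norm(t) for t in request.titles}:
        return True
    if url and url in {_norm(u).rstrip("/") for u in request.urls}:
        return True
    # Skip empty snippets: an empty string is a substring of every text.
    return any(_norm(s) in text for s in request.snippets if _norm(s))


def apply_takedown(records, request: TakedownRequest):
    """Split records into (kept, removed) according to the request."""
    kept, removed = [], []
    for record in records:
        (removed if matches(record, request) else kept).append(record)
    return kept, removed
```

Returning the removed records alongside the kept ones lets a maintainer review exactly what a request would delete before committing the change. Matching is exact for titles and URLs (after normalization) and substring-based for snippets, which is why a request with precise identifiers is easier to honor.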
Provider:
sayurio