trentmkelly/sharty-1-million

Source: Hugging Face. Updated 2026-04-08. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/trentmkelly/sharty-1-million
Resource description:
---
pretty_name: Sharty 1 Million
language:
- en
license: other
size_categories:
- 1M<n<10M
tags:
- web-scrape
- imageboard
- forum
- toxic-content
---

# Sharty 1 Million

`sharty-1-million` is a raw JSONL export of public posts scraped from `soyjak.st`.

This snapshot contains:

- 1,013,821 posts
- 65,296 threads
- 20 boards
- 980,579 posts with extracted plaintext in `body_nomarkup`
- 335,716 posts with attached-file metadata

Export goes up to April 8th, 2026. Based on the raw Unix timestamps in the dataset, the usable post time range runs from September 21, 2020 through April 8, 2026 UTC.

## What Is Included

The dataset is a post-level dump of the scraper's `posts` table. It includes:

- Board and thread/post identifiers
- Raw HTML post bodies
- Plaintext post bodies when available
- Thread flags such as `sticky`, `locked`, and `cyclical`
- Reply/image counts when present in the source JSON
- Attached-file metadata such as filename, extension, dimensions, and MD5
- A per-file `nsfw_score` field when present in the source JSON

The dataset does not include image binaries or any extra normalization beyond what was already exposed by the source site.

## Files

- `sharty.jsonl`: newline-delimited JSON export of all posts

Each line is one post record.
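Because each line is an independent JSON object, the file can be processed as a stream instead of being loaded whole. The following is a minimal sketch of that pattern; the helper names `iter_posts` and `count_by_board` are hypothetical, and the sample lines are made-up stand-ins for real records.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_posts(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one post record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_board(posts: Iterable[dict]) -> Counter:
    """Count posts per board using the `board` field."""
    return Counter(post["board"] for post in posts)


# Tiny in-memory stand-in for sharty.jsonl; the values are made up.
sample_lines = [
    '{"board": "soy", "no": 1, "resto": 0}',
    '{"board": "soy", "no": 2, "resto": 1}',
    '{"board": "pol", "no": 7, "resto": 0}',
]
board_counts = count_by_board(iter_posts(sample_lines))
print(board_counts.most_common())
```

To run this against the real export, pass an open file handle in place of `sample_lines`, for example `open("sharty.jsonl", encoding="utf-8")`, assuming the file has been downloaded to the working directory.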
## Schema

Each record contains the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `board` | string | Board name, for example `soy` or `pol` |
| `no` | integer | Post number |
| `resto` | integer | Parent thread number; `0` means the post is the thread OP |
| `name` | string or null | Display name |
| `capcode` | string or null | Capcode if present |
| `com` | string or null | Raw HTML comment body |
| `body_nomarkup` | string or null | Plaintext body from the source JSON |
| `time` | integer or null | Unix timestamp from the source site |
| `sticky` | integer | Sticky flag |
| `locked` | integer | Locked flag |
| `cyclical` | integer | Cyclical thread flag |
| `looped` | integer | Looped media flag |
| `images_only` | integer | Images-only thread flag |
| `desktop_only` | integer | Desktop-only thread flag |
| `last_modified` | integer or null | Thread-level last-modified timestamp from the source site |
| `replies` | integer or null | Reply count when present |
| `images` | integer or null | Image count when present |
| `omitted_posts` | integer or null | Omitted post count from board catalog responses |
| `omitted_images` | integer or null | Omitted image count from board catalog responses |
| `filename` | string or null | Original uploaded filename without extension |
| `ext` | string or null | File extension |
| `tim` | string or null | Site media identifier |
| `fsize` | integer or null | File size in bytes |
| `w` | integer or null | Media width |
| `h` | integer or null | Media height |
| `md5` | string or null | Base64-encoded MD5 from the source JSON |
| `nsfw_score` | float or null | NSFW score supplied by the source site |
| `filedeleted` | integer | File deleted flag |
| `spoiler` | integer | Spoiler flag |

## Collection Method

This dataset was generated with a simple two-step pipeline:

1. `scrape.py` fetches board catalogs from `https://soyjak.st/{board}/threads.json` and thread JSON from `https://soyjak.st/{board}/thread/{thread_no}.json`, then upserts posts into a SQLite database.
2. `dump.py` exports the `posts` table to `sharty.jsonl`, ordered by `board, no`.

The scraper targets these 20 boards:

`soy`, `qa`, `raid`, `r`, `craft`, `int`, `pol`, `a`, `an`, `asp`, `biz`, `mtv`, `r9k`, `tech`, `v`, `sude`, `x`, `q`, `news`, `chive`

Board distribution in this snapshot:

| Board | Posts |
| --- | ---: |
| `soy` | 473,682 |
| `pol` | 259,112 |
| `chive` | 137,694 |
| `qa` | 31,001 |
| `raid` | 18,645 |
| `mtv` | 15,633 |
| `int` | 13,594 |
| `a` | 12,399 |
| `v` | 9,236 |
| `r9k` | 7,093 |
| `news` | 6,030 |
| `tech` | 6,010 |
| `q` | 3,970 |
| `craft` | 3,593 |
| `x` | 3,193 |
| `sude` | 3,157 |
| `r` | 3,144 |
| `asp` | 2,492 |
| `an` | 2,146 |
| `biz` | 1,997 |

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("trentmkelly/sharty-1-million", split="train")
print(dataset[0])
```

If you only need text, prefer `body_nomarkup` over `com`. The `com` field contains raw HTML fragments.

## Caveats

- This is a raw scrape, not a cleaned or deduplicated corpus.
- Content is frequently obscene, hateful, sexually explicit, and otherwise unsafe for general-purpose use.
- The dataset may contain slurs, harassment, extremist content, and personal information posted by users.
- Deleted posts or files that were gone before scraping are not recoverable here.
- Timestamps and counters are taken directly from the source JSON and may contain missing or anomalous values.
- `last_modified` is thread metadata from the site, not a verified post edit timestamp.
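The field conventions in the schema and the caveats above (nullable `body_nomarkup` and `time`, `resto == 0` for thread OPs, Base64 `md5`) can be handled with a few small helpers. This is a sketch under stated assumptions, not part of the dataset's tooling: all function names are hypothetical, the regex-based tag stripping is a crude stand-in for a real HTML parser, and the example record is made up.

```python
import base64
import html
import re
from datetime import datetime, timezone
from typing import Optional

_BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
_TAG_RE = re.compile(r"<[^>]+>")


def post_text(post: dict) -> str:
    """Prefer `body_nomarkup`; fall back to a crude tag-strip of `com`."""
    if post.get("body_nomarkup"):
        return post["body_nomarkup"]
    raw = post.get("com") or ""
    # Turn <br> into newlines, drop remaining tags, then decode entities.
    raw = _BR_RE.sub("\n", raw)
    return html.unescape(_TAG_RE.sub("", raw)).strip()


def post_time_utc(post: dict) -> Optional[datetime]:
    """Convert the Unix `time` field to an aware UTC datetime, or None."""
    ts = post.get("time")
    if ts is None:
        return None
    try:
        return datetime.fromtimestamp(ts, tz=timezone.utc)
    except (OverflowError, OSError, ValueError):
        # Anomalous timestamps are treated as missing.
        return None


def thread_no(post: dict) -> int:
    """`resto == 0` marks a thread OP, whose own `no` is the thread number."""
    return post["no"] if post["resto"] == 0 else post["resto"]


def md5_hex(post: dict) -> Optional[str]:
    """Decode the Base64 `md5` field to a hex digest, or None if absent."""
    value = post.get("md5")
    return base64.b64decode(value).hex() if value else None


# Made-up record for illustration; not taken from the dataset.
example = {
    "board": "soy", "no": 101, "resto": 100,
    "com": "first line<br>second &amp; last",
    "body_nomarkup": None,
    "time": 1600646400,
    "md5": "1B2M2Y8AsgTpgAmY7PhCfg==",
}
print(post_text(example), post_time_utc(example), thread_no(example), md5_hex(example))
```

Grouping records by `(board, thread_no(post))` then reconstructs threads, since post numbers are only meaningful within a board.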
## Recommended Use

This dataset is most appropriate for:

- Research on imageboards, anonymous forums, or internet subcultures
- Toxicity, moderation, or robustness experiments with strong safety controls
- Historical archiving of public board metadata and post text

## Source

Original source: `https://soyjak.st`

This repository redistributes a structured export of publicly accessible site data. Check the source site's policies and your local legal requirements before reusing or republishing the data.
Provider:
trentmkelly