trentmkelly/scored_co_2025

Hugging Face, updated 2026-05-12, indexed 2025-11-15
Download link:
https://hf-mirror.com/datasets/trentmkelly/scored_co_2025
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- text-generation
- text-classification
language:
- en
tags:
- reddit
- social-media
- scored.co
- web-scrape
- parquet
size_categories:
- 10M<n<100M
pretty_name: Scored.co 2025 scrape
---

# Scored.co 2025 scrape

This dataset is a full scrape of public content from [Scored.co](https://scored.co/), a right-wing Reddit-style social media site. It contains **73,045,361 rows** of posts and comments, stored as parquet.

The dataset is intended for research into online communities, political discussion, social media moderation, misinformation, platform migration, network dynamics, and large-scale text analysis.

## Data format

The main dataset is partitioned by entity type:

- `entity_type=post/`: post records
- `entity_type=comment/`: comment records

Hugging Face also provides converted parquet shards on the `refs/convert/parquet` branch, which can be useful for lightweight inspection or streaming workflows without downloading the full dataset.

Example:

```python
from datasets import load_dataset

# Streams records without downloading the full dataset first.
ds = load_dataset(
    "trentmkelly/scored_co_2025",
    split="train",
    streaming=True,
)

for row in ds.take(3):
    print(row)
```

You can also work directly with parquet URLs or the `hf://` filesystem interface if using tools such as DuckDB, Polars, PyArrow, or pandas.

## Columns

A sample shard from `refs/convert/parquet` had the following fields:

| Column | Description |
| --- | --- |
| `post_id` | Numeric Scored.co post ID. |
| `post_uuid` | Scored.co post UUID/string identifier. |
| `comment_id` | Numeric comment ID, when the row is a comment. |
| `comment_uuid` | Comment UUID/string identifier, when available. |
| `parent_id` | Parent post or comment identifier. |
| `author` | Public username associated with the record. |
| `community` | Scored.co community name. |
| `title` | Post title. For comments, this may repeat the parent post title. |
| `body` | Plain-text body/content extracted from the record. |
| `created_ms` | Creation timestamp in Unix milliseconds. |
| `created_iso` | Creation timestamp as an ISO-8601 UTC string. |
| `score` | Total score. |
| `score_up` | Upvote score/count, when available. |
| `score_down` | Downvote score/count, when available. |
| `comments_total` | Total comment count for post rows, when available. |
| `top_level_comments` | Top-level comment count for post rows, when available. |
| `depth` | Comment nesting depth, when available. |
| `is_deleted` | Whether the item was marked deleted. |
| `is_removed` | Whether the item was marked removed. |
| `is_nsfw` | Whether the item was marked NSFW. |
| `is_image` | Whether the item is an image post. |
| `is_video` | Whether the item is a video post. |
| `vote_state` | Vote-state value from the source record. |
| `pro_tier` | Source account/profile tier value. |
| `link` | Linked URL for link posts, when present. |
| `domain` | Domain extracted from `link`, when present. |
| `source_file` | Original scrape/source JSON filename. |
| `raw_json` | Raw source JSON payload retained as a string. |

Some fields are entity-specific and may be null for posts or comments.

## Notes and limitations

- This is scraped public web data and may include offensive, hateful, explicit, false, or otherwise sensitive content.
- Usernames and other public identifiers are included. Handle the data carefully and avoid attempts to identify, contact, harass, or profile individual users.
- Text may contain HTML-derived artifacts, links, markdown, quoted material, deleted/removed content markers, or source-platform metadata.
- The dataset reflects what was available to the scraper at collection time; it should not be treated as a complete or current representation of Scored.co.
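
The column descriptions above suggest a few routine per-row steps: turning `created_ms` into a UTC datetime, skipping rows flagged as deleted or removed, and decoding the `raw_json` string. The following is a minimal sketch of those steps, not part of the dataset's own tooling. The helper names (`created_ms_to_utc`, `is_visible`, `parse_raw_json`) and the sample rows are hypothetical, and the code assumes that a null flag can be treated as unset and that `raw_json`, when present, holds valid JSON.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def created_ms_to_utc(created_ms: int) -> datetime:
    """Convert `created_ms` (Unix milliseconds) to a timezone-aware UTC datetime."""
    # Integer timedelta arithmetic avoids float rounding on the millisecond part.
    return EPOCH + timedelta(milliseconds=created_ms)


def is_visible(row: dict) -> bool:
    """Return True when the row is flagged neither deleted nor removed.

    Assumption: a null (None) flag is treated the same as False.
    """
    return not (row.get("is_deleted") or row.get("is_removed"))


def parse_raw_json(row: dict):
    """Decode the `raw_json` string, or return None when it is missing or empty."""
    raw = row.get("raw_json")
    return json.loads(raw) if raw else None


if __name__ == "__main__":
    # Hypothetical rows with made-up values; real records carry many more columns.
    sample_rows = [
        {"created_ms": 1735689600000, "is_deleted": False, "is_removed": False,
         "raw_json": '{"id": 1}'},
        {"created_ms": 1735689600123, "is_deleted": True, "is_removed": None,
         "raw_json": None},
    ]
    for row in sample_rows:
        if is_visible(row):
            print(created_ms_to_utc(row["created_ms"]).isoformat(), parse_raw_json(row))
```

Because the helpers take plain dicts, they should also work on rows yielded by a streaming `load_dataset` loop, provided the columns arrive with the types described in the table.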

## Citation

If you use this dataset, cite the Hugging Face dataset repository:

```bibtex
@dataset{kelly_scored_co_2025,
  author = {Kelly, Trent M.},
  title = {Scored.co 2025 scrape},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/trentmkelly/scored_co_2025}
}
```
Provider:
trentmkelly