trentmkelly/scored_co_2025
Source: Hugging Face. Updated 2026-05-12. Indexed 2025-11-15.
Download link:
https://hf-mirror.com/datasets/trentmkelly/scored_co_2025
Dataset description:
---
license: cc-by-sa-4.0
task_categories:
- text-generation
- text-classification
language:
- en
tags:
- reddit
- social-media
- scored.co
- web-scrape
- parquet
size_categories:
- 10M<n<100M
pretty_name: Scored.co 2025 scrape
---
# Scored.co 2025 scrape
This dataset is a full scrape of public content from [Scored.co](https://scored.co/), a right-wing Reddit-style social media site. It contains **73,045,361 rows** of posts and comments, stored as parquet.
The dataset is intended for research into online communities, political discussion, social media moderation, misinformation, platform migration, network dynamics, and large-scale text analysis.
## Data format
The main dataset is partitioned by entity type:
- `entity_type=post/`: post records
- `entity_type=comment/`: comment records
Hugging Face also provides converted parquet shards on the `refs/convert/parquet` branch, which can be useful for lightweight inspection or streaming workflows without downloading the full dataset.
Example:
```python
from datasets import load_dataset

# Streams records without downloading the full dataset first.
ds = load_dataset(
    "trentmkelly/scored_co_2025",
    split="train",
    streaming=True,
)

for row in ds.take(3):
    print(row)
```
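Because a streamed dataset yields one plain dictionary per row, ordinary iterator tools are enough for light filtering. The sketch below is a hypothetical helper (`first_matching` is not part of the `datasets` library) that collects the first few rows from one community. The community name in the usage comment is a placeholder.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def first_matching(
    rows: Iterable[Dict[str, Any]],
    community: str,
    limit: int = 3,
) -> List[Dict[str, Any]]:
    """Return up to `limit` rows whose `community` field equals `community`."""
    matches = (row for row in rows if row.get("community") == community)
    # islice stops pulling from the stream once `limit` rows have been found.
    return list(islice(matches, limit))


# Usage with the streamed dataset `ds` from the example above
# ("ExampleCommunity" is a placeholder, not a real community name):
# sample = first_matching(ds, "ExampleCommunity", limit=3)
```

Note that a rare community can force the helper to scan a large part of the stream before `limit` rows turn up.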
You can also work directly with parquet URLs or the `hf://` filesystem interface if using tools such as DuckDB, Polars, PyArrow, or pandas.
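As a rough sketch of the `hf://` route, the helper below builds a glob for one `entity_type` partition and counts its rows with DuckDB. Two things here are assumptions rather than facts from this card: that the partition folders sit at the repository root and contain `*.parquet` files, and that the installed DuckDB version supports `hf://` paths. The names `partition_glob` and `count_rows` are hypothetical.

```python
REPO_ID = "trentmkelly/scored_co_2025"


def partition_glob(entity_type: str) -> str:
    """Build an hf:// glob for one entity_type partition.

    ASSUMPTION: partition folders live at the repository root and hold
    *.parquet files; check the repository file listing before relying on it.
    """
    if entity_type not in {"post", "comment"}:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    return f"hf://datasets/{REPO_ID}/entity_type={entity_type}/*.parquet"


def count_rows(entity_type: str) -> int:
    """Count rows in one partition with DuckDB (needs network access)."""
    import duckdb  # imported lazily so the path helper works without it

    path = partition_glob(entity_type)
    result = duckdb.sql(f"SELECT count(*) FROM read_parquet('{path}')").fetchone()
    return result[0]


# Example (performs remote reads, so it is left commented out):
# print(count_rows("post"))
```

The same glob string should work with other readers that understand `hf://` paths, but the exact file layout is worth confirming first.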
## Columns
A sample shard from `refs/convert/parquet` had the following fields:
| Column | Description |
| --- | --- |
| `post_id` | Numeric Scored.co post ID. |
| `post_uuid` | Scored.co post UUID/string identifier. |
| `comment_id` | Numeric comment ID, when the row is a comment. |
| `comment_uuid` | Comment UUID/string identifier, when available. |
| `parent_id` | Parent post or comment identifier. |
| `author` | Public username associated with the record. |
| `community` | Scored.co community name. |
| `title` | Post title. For comments, this may repeat the parent post title. |
| `body` | Plain-text body/content extracted from the record. |
| `created_ms` | Creation timestamp in Unix milliseconds. |
| `created_iso` | Creation timestamp as an ISO-8601 UTC string. |
| `score` | Total score. |
| `score_up` | Upvote score/count, when available. |
| `score_down` | Downvote score/count, when available. |
| `comments_total` | Total comment count for post rows, when available. |
| `top_level_comments` | Top-level comment count for post rows, when available. |
| `depth` | Comment nesting depth, when available. |
| `is_deleted` | Whether the item was marked deleted. |
| `is_removed` | Whether the item was marked removed. |
| `is_nsfw` | Whether the item was marked NSFW. |
| `is_image` | Whether the item is an image post. |
| `is_video` | Whether the item is a video post. |
| `vote_state` | Vote-state value from the source record. |
| `pro_tier` | Source account/profile tier value. |
| `link` | Linked URL for link posts, when present. |
| `domain` | Domain extracted from `link`, when present. |
| `source_file` | Original scrape/source JSON filename. |
| `raw_json` | Raw source JSON payload retained as a string. |
Some fields are entity-specific and may be null for posts or comments.
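A few of these columns usually need light post-processing. The sketch below converts `created_ms` to a timezone-aware datetime, decodes `raw_json`, and guesses whether a row is a post or a comment. The helper names are hypothetical, and the post/comment check assumes that `comment_id` is null on post rows, which the table suggests but does not guarantee.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def created_at(row: Dict[str, Any]) -> Optional[datetime]:
    """Convert `created_ms` (Unix milliseconds) to an aware UTC datetime."""
    ms = row.get("created_ms")
    if ms is None:
        return None
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def entity_kind(row: Dict[str, Any]) -> str:
    """ASSUMPTION: `comment_id` is null on post rows and set on comment rows."""
    return "comment" if row.get("comment_id") is not None else "post"


def raw_payload(row: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the `raw_json` string; return {} when it is missing or invalid."""
    try:
        payload = json.loads(row.get("raw_json") or "{}")
    except (json.JSONDecodeError, TypeError):
        return {}
    # Only dict payloads are returned; any other JSON value is treated as empty.
    return payload if isinstance(payload, dict) else {}
```

When the partitioned files are read directly, the `entity_type` partition is a more reliable signal than the `comment_id` check.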
## Notes and limitations
- This is scraped public web data and may include offensive, hateful, explicit, false, or otherwise sensitive content.
- Usernames and other public identifiers are included. Handle the data carefully and avoid attempts to identify, contact, harass, or profile individual users.
- Text may contain HTML-derived artifacts, links, markdown, quoted material, deleted/removed content markers, or source-platform metadata.
- The dataset reflects what was available to the scraper at collection time; it should not be treated as a complete or current representation of Scored.co.
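One way to act on the notes above before analysis is to drop deleted or removed rows and replace usernames with keyed hashes. This is a minimal sketch with hypothetical helper names, not a procedure recommended by the dataset author. Dropping `raw_json` assumes the raw payload may repeat identifiers. Pseudonymizing `author` does not remove usernames mentioned inside `body`.

```python
import hashlib
import hmac
from typing import Any, Dict, Iterable, Iterator


def pseudonymize(username: str, secret: bytes) -> str:
    """Map a username to a keyed HMAC-SHA256 digest, truncated to 16 hex chars."""
    digest = hmac.new(secret, username.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def clean_rows(
    rows: Iterable[Dict[str, Any]],
    secret: bytes,
) -> Iterator[Dict[str, Any]]:
    """Yield copies of rows that are neither deleted nor removed."""
    for row in rows:
        if row.get("is_deleted") or row.get("is_removed"):
            continue
        cleaned = dict(row)  # shallow copy, so the input row is left untouched
        if cleaned.get("author") is not None:
            cleaned["author"] = pseudonymize(cleaned["author"], secret)
        # ASSUMPTION: the raw payload may repeat identifiers, so it is dropped.
        cleaned.pop("raw_json", None)
        yield cleaned
```

Using a keyed hash keeps the mapping stable within one analysis while making it harder to reverse than a plain hash, provided the secret stays private.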
## Citation
If you use this dataset, cite the Hugging Face dataset repository:
```bibtex
@dataset{kelly_scored_co_2025,
author = {Kelly, Trent M.},
title = {Scored.co 2025 scrape},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/trentmkelly/scored_co_2025}
}
```
Provider:
trentmkelly



