barrydeen/nspam-corpus
Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/barrydeen/nspam-corpus
Description:
---
license: mit
language:
- multilingual
task_categories:
- text-classification
tags:
- nostr
- spam-classification
- social-media
pretty_name: nspam Nostr Spam Corpus
size_categories:
- 10K<n<100K
configs:
- config_name: authors
  data_files:
  - split: train
    path: authors.jsonl
- config_name: notes
  data_files:
  - split: train
    path: notes.jsonl
---
# nspam — Nostr spam corpus (v2.2)
Labeled dataset for training an on-device spam classifier for Nostr `kind:1`
notes. Labels are at the **pubkey level**: a human reviewer judged each account
holistically, and every note from that account inherits the author's label.
Released under **MIT**.
## Contents
| file | rows | schema |
|---|---|---|
| `authors.jsonl` | 201 (123 real, 78 bot) | `pubkey`, `label`, `labeled_at` |
| `notes.jsonl` | 16620 (11758 real, 4862 bot) | `id`, `pubkey`, `label`, `content`, `tags`, `created_at` |
Labels are one of `real` (human user) or `bot` (automated spam / bot account).
Accounts the labeler was unsure about were *skipped* and are not present in this
release.
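
After loading either file, the rows can be checked against the counts published in the table. The sketch below is one possible sanity check, not part of the dataset tooling; the names `EXPECTED`, `label_counts`, and `bot_share` are hypothetical helpers introduced here for illustration.

```python
from collections import Counter

# Row counts copied from the Contents table (v2.2).
EXPECTED = {
    "authors": {"real": 123, "bot": 78},
    "notes": {"real": 11758, "bot": 4862},
}
ALLOWED_LABELS = {"real", "bot"}


def label_counts(rows):
    """Count rows per label, rejecting any label outside the documented set."""
    counts = Counter(row["label"] for row in rows)
    unexpected = set(counts) - ALLOWED_LABELS
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    return dict(counts)


def bot_share(counts):
    """Fraction of rows labeled `bot`."""
    return counts.get("bot", 0) / sum(counts.values())


# Class balance implied by the published counts.
author_bot_share = bot_share(EXPECTED["authors"])  # roughly 0.39
note_bot_share = bot_share(EXPECTED["notes"])      # roughly 0.29
```

With the raw file loaded into a list of dicts, `label_counts(notes) == EXPECTED["notes"]` should hold for an unmodified v2.2 download. Both classes are well represented, but `bot` is the minority at the author level and more so at the note level.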
## How labels were assigned
1. A live feed of `kind:1` notes was streamed from public relays
(`damus.io`, `nos.lol`, `primal.net`, `nostr.wine`, `relay.nostr.band`,
`snort.social`).
2. A reviewer saw recent notes grouped by author and tagged each author as
`real` / `bot` / `skip`.
3. All `kind:1` notes by labeled authors were fetched and associated with the
author's label.
This is subjective human judgement. Some accounts genuinely blend bot-like and
real behavior and could reasonably be labeled either way.
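
Step 3 above amounts to a join on `pubkey`. The following is a minimal sketch of that propagation, assuming author rows and fetched notes are plain dicts with the fields listed under Contents. It is not the author's actual pipeline; `attach_labels` and the sample pubkeys are hypothetical.

```python
def attach_labels(authors, raw_notes):
    """Give every note the label of its author; drop notes by unlabeled authors."""
    label_by_pubkey = {a["pubkey"]: a["label"] for a in authors}
    labeled = []
    for note in raw_notes:
        label = label_by_pubkey.get(note["pubkey"])
        if label is None:
            continue  # author was skipped or never reviewed
        labeled.append({**note, "label": label})
    return labeled


# Hypothetical sample data, shortened for readability.
sample_authors = [
    {"pubkey": "aa11", "label": "real"},
    {"pubkey": "bb22", "label": "bot"},
]
sample_notes = [
    {"id": "n1", "pubkey": "aa11", "content": "gm"},
    {"id": "n2", "pubkey": "bb22", "content": "claim your airdrop"},
    {"id": "n3", "pubkey": "cc33", "content": "author not labeled"},
]
labeled_notes = attach_labels(sample_authors, sample_notes)
```

Because the label is attached per author, a single note from a `bot` account may look perfectly ordinary, and the reverse can hold for a `real` account.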
## How to load
```python
from datasets import load_dataset

# Each call returns a DatasetDict with a single "train" split.
authors = load_dataset("barrydeen/nspam-corpus", "authors")
notes = load_dataset("barrydeen/nspam-corpus", "notes")
```
Or raw:
```python
import json

# Read one JSON object per line (JSONL), skipping blank lines.
with open("notes.jsonl", encoding="utf-8") as f:
    notes = [json.loads(line) for line in f if line.strip()]
```
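
Since every note inherits its author's label, a random note-level train/test split would place notes from the same account on both sides and likely inflate evaluation scores. One way to avoid that is to split by `pubkey`. The sketch below assumes `notes` is a list of dicts as produced by the raw loader; `split_by_author` and its `test_fraction` parameter are hypothetical, and hashing is just one deterministic way to assign authors to a side.

```python
import hashlib


def split_by_author(notes, test_fraction=0.2):
    """Split notes so that each pubkey lands entirely in train or entirely in test."""
    train, test = [], []
    for note in notes:
        # Map the pubkey to a stable number in [0, 1) via SHA-256.
        digest = hashlib.sha256(note["pubkey"].encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else train).append(note)
    return train, test
```

With only 201 authors, the realized test share can differ noticeably from `test_fraction`, so it is worth checking the label balance of each side after splitting.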
## Intended use
- Training spam / abuse / bot classifiers for Nostr and similar open social
networks.
- Research on text-based automated-account detection.
- Benchmarking on-device text classification.
## Not for
- Targeting individual accounts for harassment, doxing, or retaliation.
- Making high-stakes moderation decisions without human review.
- Claims about specific accounts being definitively "bots" — these are
judgment calls at a single point in time.
## Ethical considerations
- **Content is public.** Nostr events are broadcast to open relays. This
dataset distributes a curated snapshot; anyone could have collected it.
- **Labels are not.** The `bot` label is one reviewer's judgement and encodes
a value assessment about a public identity. Mis-labeling is possible.
- **Accounts evolve.** A pubkey labeled `bot` in v2.2 may be compromised,
handed off, or change behavior later. Labels are not permanent truth.
- **Pubkeys are public identifiers.** They appear in every note a user has
posted. Including them here does not expose any new information.
- If you believe a specific pubkey is mis-labeled, please open an issue on the
dataset repo.
## Collection caveats
- Single labeler. No inter-annotator-agreement measurements.
- Relay biases: some relays index specific communities / regions more heavily.
- Point-in-time: notes reflect what relays returned at collection time; events
may have been deleted or unreplicated since.
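
To see which period a snapshot actually covers, the `created_at` field can be summarized. The sketch below assumes `created_at` holds Unix timestamps in seconds, as Nostr events do per NIP-01; whether this release stores them as integers is an assumption, and `collection_window` is a hypothetical helper.

```python
from datetime import datetime, timezone


def to_utc(seconds):
    """Convert a Unix timestamp in seconds to an aware UTC datetime."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


def collection_window(notes):
    """Return (earliest, latest) note times; raises ValueError on an empty list."""
    stamps = [int(note["created_at"]) for note in notes]
    return to_utc(min(stamps)), to_utc(max(stamps))
```

The window describes when the notes were written, not when they were fetched from the relays.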
## Citation
```
@misc{nspam,
title = {nspam: Nostr spam corpus},
version = {v2.2},
year = {2026},
license = {MIT},
}
```
Provided by:
barrydeen



