
barrydeen/nspam-corpus

Hugging Face | Updated 2026-04-18 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/barrydeen/nspam-corpus
Resource description:
---
license: mit
language:
- multilingual
task_categories:
- text-classification
tags:
- nostr
- spam-classification
- social-media
pretty_name: nspam Nostr Spam Corpus
size_categories:
- 10K<n<100K
configs:
- config_name: authors
  data_files:
  - split: train
    path: authors.jsonl
- config_name: notes
  data_files:
  - split: train
    path: notes.jsonl
---

# nspam: Nostr spam corpus (v2.2)

Labeled dataset for training an on-device spam classifier for Nostr `kind:1` notes. Labels are at the **pubkey level**: a human reviewer judged each account holistically, and every note from that account inherits the author's label.

Released under **MIT**.

## Contents

| file | rows | schema |
|---|---|---|
| `authors.jsonl` | 201 (123 real, 78 bot) | `pubkey`, `label`, `labeled_at` |
| `notes.jsonl` | 16620 (11758 real, 4862 bot) | `id`, `pubkey`, `label`, `content`, `tags`, `created_at` |

Labels are one of `real` (human user) or `bot` (automated spam / bot account). Accounts the labeler was unsure about were *skipped* and are not present in this release.

## How labels were assigned

1. A live feed of `kind:1` notes was streamed from public relays (`damus.io`, `nos.lol`, `primal.net`, `nostr.wine`, `relay.nostr.band`, `snort.social`).
2. A reviewer saw recent notes grouped by author and tagged each author as `real` / `bot` / `skip`.
3. All `kind:1` notes by labeled authors were fetched and associated with the author's label.

This is subjective human judgement. Some accounts genuinely blend bot-like and real behavior and could reasonably be labeled either way.

## How to load

```python
from datasets import load_dataset

authors = load_dataset("barrydeen/nspam-corpus", "authors")
notes = load_dataset("barrydeen/nspam-corpus", "notes")
```

Or raw:

```python
import json

with open("notes.jsonl") as f:
    notes = [json.loads(line) for line in f]
```

## Intended use

- Training spam / abuse / bot classifiers for Nostr and similar open social networks.
- Research on text-based automated-account detection.
- Benchmarking on-device text classification.

## Not for

- Targeting individual accounts for harassment, doxing, or retaliation.
- Making high-stakes moderation decisions without human review.
- Claims about specific accounts being definitively "bots"; these are judgment calls at a single point in time.

## Ethical considerations

- **Content is public.** Nostr events are broadcast to open relays. This dataset distributes a curated snapshot; anyone could have collected it.
- **Labels are not.** The `bot` label is one reviewer's judgement and encodes a value assessment about a public identity. Mis-labeling is possible.
- **Accounts evolve.** A pubkey labeled `bot` in v2.2 may be compromised, handed off, or change behavior later. Labels are not permanent truth.
- **Pubkeys are public identifiers.** They appear in every note a user has posted. Including them here does not expose any new information.
- If you believe a specific pubkey is mis-labeled, please open an issue on the dataset repo.

## Collection caveats

- Single labeler. No inter-annotator-agreement measurements.
- Relay biases: some relays index specific communities / regions more heavily.
- Point-in-time: notes reflect what relays returned at collection time; events may have been deleted or unreplicated since.

## Citation

```
@misc{nspam,
  title   = {nspam: Nostr spam corpus},
  version = {v2.2},
  year    = {2026},
  license = {MIT},
}
```
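
## Example: splitting by author

Because labels are assigned at the pubkey level and every note inherits its author's label, a random per-note train/test split would put notes from the same account on both sides and likely inflate evaluation scores. The following is a minimal sketch of one way to avoid that, assuming note records are dicts with the `pubkey` field from `notes.jsonl`. The helper name `split_by_author`, its parameters, and the demo records are hypothetical and not part of the dataset or its tooling.

```python
import random
from collections import defaultdict


def split_by_author(notes, test_fraction=0.2, seed=0):
    """Split note records into train/test so that no pubkey appears in both.

    Hypothetical helper: `notes` is any iterable of dicts with a "pubkey" key.
    """
    by_author = defaultdict(list)
    for note in notes:
        by_author[note["pubkey"]].append(note)

    # Sort before shuffling so the result depends only on the seed.
    pubkeys = sorted(by_author)
    random.Random(seed).shuffle(pubkeys)

    # Hold out at least one author for the test side.
    n_test = max(1, round(len(pubkeys) * test_fraction))

    test = [n for pk in pubkeys[:n_test] for n in by_author[pk]]
    train = [n for pk in pubkeys[n_test:] for n in by_author[pk]]
    return train, test


# Made-up records that only mimic the `notes.jsonl` field names;
# they are not taken from the corpus.
demo_notes = [
    {"id": "n1", "pubkey": "pk_a", "label": "real", "content": "gm"},
    {"id": "n2", "pubkey": "pk_a", "label": "real", "content": "coffee time"},
    {"id": "n3", "pubkey": "pk_b", "label": "bot", "content": "claim your prize"},
    {"id": "n4", "pubkey": "pk_b", "label": "bot", "content": "claim your prize"},
    {"id": "n5", "pubkey": "pk_c", "label": "real", "content": "new relay is up"},
    {"id": "n6", "pubkey": "pk_d", "label": "bot", "content": "free sats here"},
]

train, test = split_by_author(demo_notes, test_fraction=0.25, seed=0)
```

This sketch does not stratify by label. With only 201 authors in the corpus, it may be worth checking that both `real` and `bot` accounts end up on each side of the split.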
Provider:
barrydeen