
nyuuzyou/nntp-text-387m

Hugging Face, updated 2026-01-24, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/nyuuzyou/nntp-text-387m
Resource description:
---
annotations_creators:
- found
language:
- en
- de
- fr
- it
- nl
- pl
- ru
- multilingual
language_creators:
- found
license: other
multilinguality:
- multilingual
pretty_name: NNTP Discussion Archives
size_categories:
- 100M<n<1B
source_datasets:
- original
task_categories:
- text-generation
- text-classification
- question-answering
tags:
- discussions
- historical
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/articles_*.parquet"
dataset_info:
  features:
  - name: message_id
    dtype: string
  - name: newsgroups
    dtype: string
  - name: author
    dtype: string
  - name: subject
    dtype: string
  - name: date
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
---

# NNTP Discussion Archives

A large-scale collection of text discussions from public NNTP (Network News Transfer Protocol) newsgroups spanning over two decades.

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total messages | 386,629,949 |
| Unique newsgroups | 159,345 |
| Date range | 2002 - 2026 |
| Total size | ~191 GB (compressed) |
| File format | Parquet (ZSTD) |
| Number of files | 256 |
| Average content length | ~1,400 characters |

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `message_id` | `string` | Original message identifier (unchanged) |
| `newsgroups` | `string` | Target newsgroup(s), comma-separated if cross-posted |
| `author` | `string` | Message author with email addresses redacted as `[email]` |
| `subject` | `string` | Subject line |
| `date` | `string` | RFC 2822 formatted date string |
| `content` | `string` | Message body with email addresses redacted as `[email]` |

## Top Newsgroups by Volume

| Newsgroup | Messages |
|-----------|----------|
| alt.atheism | 5,658,023 |
| free.usenet | 4,691,561 |
| alt.fan.rush-limbaugh | 4,659,639 |
| alt.politics | 3,919,772 |
| fr.soc.politique | 3,554,434 |
| it.sport.calcio.milan | 2,961,804 |
| it.politica | 2,802,687 |
| alt.politics.bush | 2,786,316 |
| talk.politics.misc | 2,784,668 |
| Other (159,336 groups) | 475,430,274 |

*Cross-posted messages are counted once per newsgroup, so totals exceed the 386M unique messages.*

## Data Processing

**Filtering:** The following are excluded: binary-focused groups (`*.binaries.*`, `*.pictures.*`, `*.multimedia.*`), binary posts with file-sharing indicators, messages exceeding 500KB, and messages with unrecoverable encoding errors. Spam is mostly **not filtered**: the dataset includes advertisements, phishing, and low-quality posts present in raw newsgroups.

**Encoding:** Messages are normalized to UTF-8 with the following decoding pipeline:

- Quoted-Printable: MIME-encoded content decoded to text
- Base64: Text base64 content decoded; binary base64 excluded
- Legacy encodings: Invalid UTF-8 sequences re-encoded using Windows-1252, ISO-8859-*, KOI8-R, Shift-JIS, GBK, and other legacy encoding detection
- MIME encoded-word headers decoded to UTF-8

**Deduplication:** Exact content duplicates removed via xxHash64 hashing (first occurrence retained).

**Privacy:** Email addresses in `author` and `content` fields redacted as `[email]`; `message_id` unchanged.

## Considerations

- Messages were posted to public newsgroups
- Content reflects unmoderated discussions and may contain controversial opinions
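
Since every column is stored as a string, consumers will likely need to split `newsgroups` and parse `date` themselves. The following is a minimal sketch of that step using only the Python standard library. The helper names `split_newsgroups` and `parse_date` and the sample record are hypothetical and are not part of the dataset or its tooling.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def split_newsgroups(value: str) -> list[str]:
    """Split the comma-separated `newsgroups` field into group names."""
    return [g.strip() for g in value.split(",") if g.strip()]


def parse_date(value: str) -> Optional[datetime]:
    """Parse the RFC 2822 `date` string; return None if it is malformed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


# Hypothetical record shaped like the schema above (not real dataset content).
row = {
    "message_id": "<example-1@news.invalid>",
    "newsgroups": "alt.atheism,talk.politics.misc",
    "author": "Example User <[email]>",
    "subject": "Re: example",
    "date": "Tue, 05 Mar 2002 14:30:00 +0100",
    "content": "Body text; contact [email] for details.",
}

groups = split_newsgroups(row["newsgroups"])  # one entry per cross-posted group
posted = parse_date(row["date"])              # timezone-aware datetime or None
```

Real rows could be obtained with the Hugging Face `datasets` library, for example `load_dataset("nyuuzyou/nntp-text-387m", split="train", streaming=True)`, which avoids downloading the full ~191 GB up front. Returning `None` for unparseable dates is a design choice of this sketch, made on the assumption that some raw posts carry malformed headers.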
Provider:
nyuuzyou