nyuuzyou/nntp-text-387m
Source: Hugging Face. Updated 2026-01-24. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/nyuuzyou/nntp-text-387m
Description:
---
annotations_creators:
- found
language:
- en
- de
- fr
- it
- nl
- pl
- ru
- multilingual
language_creators:
- found
license: other
multilinguality:
- multilingual
pretty_name: NNTP Discussion Archives
size_categories:
- 100M<n<1B
source_datasets:
- original
task_categories:
- text-generation
- text-classification
- question-answering
tags:
- discussions
- historical
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/articles_*.parquet"
dataset_info:
  features:
  - name: message_id
    dtype: string
  - name: newsgroups
    dtype: string
  - name: author
    dtype: string
  - name: subject
    dtype: string
  - name: date
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
---
# NNTP Discussion Archives
A large-scale collection of text discussions from public NNTP (Network News Transfer Protocol) newsgroups spanning over two decades.
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Total messages | 386,629,949 |
| Unique newsgroups | 159,345 |
| Date range | 2002 - 2026 |
| Total size | ~191 GB (compressed) |
| File format | Parquet (ZSTD) |
| Number of files | 256 |
| Average content length | ~1,400 characters |
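Given the size listed above, streaming is likely more practical than a full download. The following is a minimal sketch, assuming the third-party `datasets` library is installed and the repository id matches the download link; `first_rows` is a hypothetical helper name, not part of the dataset.

```python
from itertools import islice

# Repository id taken from the dataset's download link.
REPO_ID = "nyuuzyou/nntp-text-387m"


def first_rows(n: int = 3) -> list[dict]:
    """Stream the first n rows of the train split without a full download."""
    # Imported lazily so this module loads even when `datasets` is absent.
    from datasets import load_dataset  # third-party: pip install datasets

    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return list(islice(stream, n))
```

Calling `first_rows(3)` needs network access and should return three dictionaries keyed by the columns described in the schema.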
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `message_id` | `string` | Original message identifier (unchanged) |
| `newsgroups` | `string` | Target newsgroup(s), comma-separated if cross-posted |
| `author` | `string` | Message author with email addresses redacted as `[email]` |
| `subject` | `string` | Subject line |
| `date` | `string` | RFC 2822 formatted date string |
| `content` | `string` | Message body with email addresses redacted as `[email]` |
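Because `newsgroups` and `date` are stored as plain strings, most analyses will want to parse them first. A small sketch using only the Python standard library; the helper names and the sample row are illustrative and not taken from the dataset.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def split_newsgroups(value: str) -> list[str]:
    """Split the comma-separated `newsgroups` field into a clean list."""
    return [g.strip() for g in value.split(",") if g.strip()]


def parse_date(value: str) -> Optional[datetime]:
    """Parse an RFC 2822 date string; return None if it is malformed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


# Made-up row for illustration only.
row = {
    "newsgroups": "alt.atheism,talk.politics.misc",
    "date": "Tue, 15 Mar 2005 14:30:00 +0000",
}
groups = split_newsgroups(row["newsgroups"])
posted = parse_date(row["date"])
```

Returning `None` for unparseable dates is a design choice of this sketch; real headers from raw newsgroups may not all follow RFC 2822 strictly.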
## Top Newsgroups by Volume
| Newsgroup | Messages |
|-----------|----------|
| alt.atheism | 5,658,023 |
| free.usenet | 4,691,561 |
| alt.fan.rush-limbaugh | 4,659,639 |
| alt.politics | 3,919,772 |
| fr.soc.politique | 3,554,434 |
| it.sport.calcio.milan | 2,961,804 |
| it.politica | 2,802,687 |
| alt.politics.bush | 2,786,316 |
| talk.politics.misc | 2,784,668 |
| Other (159,336 groups) | 475,430,274 |
*Cross-posted messages are counted once per newsgroup, so totals exceed the 386M unique messages.*
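The counting rule in the note above can be illustrated with a toy example. This is a sketch on three made-up messages, not the script used to produce the table.

```python
from collections import Counter

# Toy messages; ids and group lists are invented for illustration.
messages = [
    {"message_id": "<a@example>", "newsgroups": "alt.atheism,alt.politics"},
    {"message_id": "<b@example>", "newsgroups": "alt.atheism"},
    {"message_id": "<c@example>", "newsgroups": "alt.politics,talk.politics.misc"},
]

per_group = Counter()
for msg in messages:
    # A set guards against the same group being listed twice in one header.
    for group in {g.strip() for g in msg["newsgroups"].split(",") if g.strip()}:
        per_group[group] += 1

unique_messages = len(messages)           # each message counted once
total_postings = sum(per_group.values())  # each message counted per group
```

Here three unique messages produce five per-group postings, which is why the per-group column sums to more than the number of unique messages.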
## Data Processing
**Filtering:** Binary-focused groups (`*.binaries.*`, `*.pictures.*`, `*.multimedia.*`), binary posts with file-sharing indicators, messages exceeding 500KB, and messages with unrecoverable encoding errors are excluded. Spam is largely **not filtered**: the dataset includes advertisements, phishing, and low-quality posts present in raw newsgroups.
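The group-name and size rules could look roughly like the sketch below. Several details are assumptions rather than documented behavior: "500KB" is read as 500 × 1024 bytes, size is measured on the UTF-8 body, and a cross-post to any binary group drops the message. The check for file-sharing indicators is not covered.

```python
from fnmatch import fnmatchcase

# Glob patterns quoted from the filtering description.
BINARY_GROUP_PATTERNS = ("*.binaries.*", "*.pictures.*", "*.multimedia.*")
MAX_BYTES = 500 * 1024  # assumption: "500KB" interpreted as 500 KiB


def is_binary_group(name: str) -> bool:
    """Return True if a newsgroup name matches a binary-focused pattern."""
    return any(fnmatchcase(name, pattern) for pattern in BINARY_GROUP_PATTERNS)


def keep_message(newsgroups: str, content: str) -> bool:
    """Apply the group-name and size rules to one message."""
    groups = [g.strip() for g in newsgroups.split(",") if g.strip()]
    # Assumption: one binary group among the targets excludes the message.
    if any(is_binary_group(g) for g in groups):
        return False
    return len(content.encode("utf-8")) <= MAX_BYTES
```

For example, `keep_message("alt.binaries.pictures.misc", "x")` is rejected by the group rule, while a short post to `alt.atheism` passes both checks.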
**Encoding:** Messages are normalized to UTF-8 with the following decoding pipeline:
- Quoted-Printable: MIME-encoded content decoded to text
- Base64: Text base64 content decoded; binary base64 excluded
- Legacy encodings: Text that is not valid UTF-8 is decoded with a detected legacy encoding (Windows-1252, ISO-8859-*, KOI8-R, Shift-JIS, GBK, and others) and re-encoded as UTF-8
- MIME encoded-word headers decoded to UTF-8
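The decoding steps above can be sketched with the Python standard library. This is not the pipeline used to build the dataset: a fixed try-in-order list stands in for real charset detection, which would score candidate encodings instead, and the binary-base64 exclusion is omitted. All function names are hypothetical.

```python
import base64
import quopri
from email.header import decode_header, make_header
from typing import Optional


def decode_transfer(raw: bytes, cte: str) -> bytes:
    """Undo a Content-Transfer-Encoding (quoted-printable or base64)."""
    cte = cte.strip().lower()
    if cte == "quoted-printable":
        return quopri.decodestring(raw)
    if cte == "base64":
        return base64.b64decode(raw)
    return raw  # 7bit, 8bit, binary: nothing to undo


def to_text(data: bytes, fallbacks=("cp1252", "koi8_r")) -> Optional[str]:
    """Decode as UTF-8, then try legacy encodings in a fixed order.

    Naive stand-in for encoding detection: the first codec that does not
    raise wins, so the order of `fallbacks` strongly affects the result.
    """
    for encoding in ("utf-8", *fallbacks):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None  # treated as unrecoverable in this sketch


def decode_subject(value: str) -> str:
    """Decode MIME encoded-word headers such as =?iso-8859-1?q?...?=."""
    return str(make_header(decode_header(value)))
```

A quoted-printable body would pass through `to_text(decode_transfer(raw, "quoted-printable"))`, and a message for which `to_text` returns `None` would be dropped.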
**Deduplication:** Exact content duplicates removed via xxHash64 hashing (first occurrence retained).
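A first-occurrence deduplication pass of this kind might look like the sketch below. It assumes the third-party `xxhash` package; when that package is missing it falls back to an 8-byte BLAKE2b digest, which is a stand-in and not xxHash64. Hashing the UTF-8 bytes of `content` is also an assumption.

```python
import hashlib

try:
    import xxhash  # third-party: pip install xxhash

    def content_hash(text: str) -> int:
        """64-bit xxHash of the UTF-8 encoded content."""
        return xxhash.xxh64(text.encode("utf-8")).intdigest()

except ImportError:

    def content_hash(text: str) -> int:
        """Stand-in 64-bit hash (BLAKE2b, NOT xxHash64) if xxhash is missing."""
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")


def deduplicate(rows):
    """Yield rows whose content hash has not been seen; first occurrence wins."""
    seen = set()
    for row in rows:
        h = content_hash(row["content"])
        if h in seen:
            continue
        seen.add(h)
        yield row
```

Keeping only 64-bit hashes in memory instead of full message bodies is what makes this approach workable at the scale of hundreds of millions of messages, at the cost of a small chance of hash collisions.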
**Privacy:** Email addresses in `author` and `content` fields redacted as `[email]`; `message_id` unchanged.
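The redaction step can be approximated with a regular expression. The pattern below is an assumption chosen for illustration; the actual pattern used for the dataset is not documented here and may match differently.

```python
import re

# Hypothetical pattern: local part, "@", then a dotted domain name.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def redact_emails(text: str) -> str:
    """Replace every email-like substring with the literal token [email]."""
    return EMAIL_RE.sub("[email]", text)
```

Such a function would be applied to `author` and `content` only; `message_id` values often resemble email addresses but are left unchanged in the dataset.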
## Considerations
- Messages were posted to public newsgroups
- Content reflects unmoderated discussions and may contain controversial opinions
Provider:
nyuuzyou



