
OwnedByDanes/Usenet-Corpus-1980-2013

Source: Hugging Face | Updated: 2026-04-24 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/OwnedByDanes/Usenet-Corpus-1980-2013
Description:
This dataset contains deduplicated, sanitized Usenet posts from 1980 through 2013, drawn from one of the largest privately held Usenet corpora in existence. It covers thousands of newsgroups across virtually every major hierarchy (e.g., `talk.*`, `sci.*`, `comp.*`) and captures the full arc of pre-social-media internet discourse. It comprises 103.1B tokens in 408M records and spans more than 100 languages, with English accounting for 96.6%. The data has been rigorously cleaned and validated, and is intended for uses such as language model pre-training, domain adaptation, and linguistic research.
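Given the size of the corpus described above, a reader would probably want to inspect a few records before downloading everything. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository exposes a split named `train`; the split name, the helper names `stream_corpus` and `peek`, and the shape of each record are assumptions, not details confirmed by the listing.

```python
from itertools import islice

# Repository identifier taken from the listing's download URL.
REPO_ID = "OwnedByDanes/Usenet-Corpus-1980-2013"


def peek(records, n=3):
    """Return the first n items of any iterable without consuming the rest eagerly."""
    return list(islice(records, n))


def stream_corpus(split="train"):
    """Open the corpus in streaming mode so nothing is downloaded up front.

    The split name "train" is an assumption; check the dataset card for
    the splits and configurations that are actually published.
    """
    # Imported lazily so the helper above stays usable without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=True)
```

Calling `peek(stream_corpus())` should return the first few records as dictionaries; printing their keys is a quick way to discover the real column names, which this listing does not state.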
Provider:
OwnedByDanes