Blog Authorship Corpus

Indexed from arXiv on 2025-09-30

Download link:

https://u.cs.biu.ac.il/~koppel/blogcorpus.htm

Overview:

This dataset comprises 681,288 blog posts written by 19,320 bloggers, with a combined word count exceeding 140 million. It is designed for predicting a blogger's age and gender from their writing. Bloggers are divided into three age groups (teens, 20s, 30s), and male and female bloggers are evenly balanced. Every post is attributed to its author through a unique blogger ID, and each blog contains at least 200 occurrences of common English words.
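The scale figures above imply some rough per-blogger and per-post averages, which can help when sizing a model or choosing a sequence length. The short Python sketch below only does that arithmetic; the constant and function names are illustrative and not part of the corpus, and the word total is treated as a lower bound because the description only says "exceeding 140 million".

```python
# Figures quoted in the dataset description.
NUM_BLOGGERS = 19_320
NUM_POSTS = 681_288
MIN_TOTAL_WORDS = 140_000_000  # "exceeding 140 million", so a lower bound


def corpus_averages(bloggers: int, posts: int, words: int) -> dict[str, float]:
    """Return average posts per blogger, words per post, and words per blogger."""
    return {
        "posts_per_blogger": posts / bloggers,
        "words_per_post": words / posts,
        "words_per_blogger": words / bloggers,
    }


if __name__ == "__main__":
    averages = corpus_averages(NUM_BLOGGERS, NUM_POSTS, MIN_TOTAL_WORDS)
    for name, value in averages.items():
        print(f"{name}: {value:.1f}")
```

On these numbers, a blogger contributes roughly 35 posts on average and a post runs a little over 200 words. These are means only; they say nothing about how posts or words are distributed across individual bloggers.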

Collected Summary

Background and Challenges

Background Overview

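The Blog Authorship Corpus is a large-scale text dataset containing 681,288 posts by 19,320 bloggers, more than 140 million words in total, built for predicting a blogger's age and gender from text. Bloggers are divided into three age groups (teens, 20s, 30s) with a balanced male/female distribution. Every post is tied to a unique blogger ID, and each blog is guaranteed to contain at least 200 occurrences of common English words, which makes the corpus well suited to machine-learning tasks that predict age and gender.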

The content above was collected and summarized by 遇见数据集.
