Blog-1K

Indexed from NIAID Data Ecosystem on 2026-03-14
Download link:
https://zenodo.org/record/7455622
Description:
The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ training posts (about 20K posts across all splits), and a predefined data split (train/dev/test in a ratio of roughly 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 checksum for Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.

1. Preprocessing

We first filter out texts shorter than 1,000 characters. Then we select one thousand authors whose writings meet the following criteria:

- at least 10,000 characters in total,
- at most 49,410 characters in total,
- at least 16 posts in total,
- at most 40 posts in total, and
- each text has at least 50 function words found in the Koppel512 list (to filter out non-English prose).

Blog-1K has three columns: 'id', 'text', and 'split'. The 'id' column holds the identifier used in the parent corpus and serves as the author label.

2. Statistics

The creation procedure and the statistics are documented in the Jupyter Notebook.

Split       # Authors  # Posts  # Characters  Avg. Characters Per Author (Std.)  Avg. Characters Per Post (Std.)
Train       1,000      16,132   30,092,057    30,092 (5,884)                     1,865 (1,007)
Validation  935        2,017    3,755,362     4,016 (2,269)                      1,862 (999)
Test        924        2,017    3,732,448     4,039 (2,188)                      1,850 (936)

3. Usage

    import pandas as pd

    df = pd.read_csv('blog1000.csv.gz', compression='infer')

    # read in training data: one tuple of texts and one tuple of author ids
    train = df.loc[df['split'] == 'train', ['text', 'id']]
    train_text, train_label = zip(*train.itertuples(index=False))

4. License

All materials are licensed under the ISC License.

5. Contact

Please contact the maintainer with any questions.
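
The selection criteria listed under Preprocessing can be sketched in plain Python as follows. This is a minimal illustration and not the maintainer's notebook code. The function names, the `(author_id, text)` input shape, and the tiny `TOY_FUNCTION_WORDS` set (a stand-in for the Koppel512 list, which is not reproduced here) are all assumptions. The sketch also assumes that "at least 50 function words" counts word tokens rather than distinct words, and that an author is dropped if any of their remaining texts falls below that threshold. It returns every qualifying author; how the final 1,000 authors are chosen among them is not covered.

```python
import re
from collections import defaultdict

# Thresholds quoted in the description.
MIN_POST_CHARS = 1_000
MIN_AUTHOR_CHARS, MAX_AUTHOR_CHARS = 10_000, 49_410
MIN_AUTHOR_POSTS, MAX_AUTHOR_POSTS = 16, 40
MIN_FUNCTION_WORDS = 50

# Hypothetical stand-in for the Koppel512 function word list.
TOY_FUNCTION_WORDS = frozenset(
    {"the", "of", "and", "a", "to", "in", "is", "that", "it", "for"}
)


def count_function_words(text, function_words):
    """Count the tokens of `text` found in `function_words` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(token in function_words for token in tokens)


def select_authors(posts, function_words=TOY_FUNCTION_WORDS):
    """Return {author_id: [texts]} for authors that meet every criterion.

    `posts` is an iterable of (author_id, text) pairs.
    """
    by_author = defaultdict(list)
    for author_id, text in posts:
        if len(text) >= MIN_POST_CHARS:  # step 1: drop short texts
            by_author[author_id].append(text)

    selected = {}
    for author_id, texts in by_author.items():
        total_chars = sum(len(t) for t in texts)
        if not MIN_AUTHOR_CHARS <= total_chars <= MAX_AUTHOR_CHARS:
            continue
        if not MIN_AUTHOR_POSTS <= len(texts) <= MAX_AUTHOR_POSTS:
            continue
        # Assumed reading: every remaining text must pass the function word check.
        if any(
            count_function_words(t, function_words) < MIN_FUNCTION_WORDS
            for t in texts
        ):
            continue
        selected[author_id] = texts
    return selected
```

With the real Koppel512 list passed as `function_words`, the same function could be applied to (author, text) pairs read from the parent corpus.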
Created:
2022-12-21