FICSIT: A large-scale cross-topic authorship attribution corpus

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/records/7478179
Description:
The Florida Institute for Cyber Security research Inter Topic (FICSIT) corpus is precisely controlled for cross-topic samples. It was compiled from the data dumps provided by StackExchange, a network of question-and-answer forums spanning 176 sites with over three million users. From all topics available on the network, cross-topic data was extracted for users who contributed to two or more topics. This requirement was satisfied by 293,415 users, who were then further restricted to those with at least 70 samples each. The resulting cross-topic corpus contains 308 topics and 188,077 text samples from 1,237 distinct authors. No other pre-processing was performed on the collected data.
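
The two selection criteria described above (at least two topics per user, at least 70 samples per user) can be illustrated with a minimal sketch. This is not the authors' extraction code. The function name `filter_cross_topic`, the `(author, topic, text)` tuple layout, and the toy data are all hypothetical, and the sketch assumes the 70-sample threshold applies to a user's total across topics, which the description does not state explicitly.

```python
from collections import defaultdict


def filter_cross_topic(samples, min_topics=2, min_samples=70):
    """Keep samples whose author posted in >= min_topics topics
    and has >= min_samples samples in total (assumed interpretation)."""
    topics_by_author = defaultdict(set)
    count_by_author = defaultdict(int)
    for author, topic, _text in samples:
        topics_by_author[author].add(topic)
        count_by_author[author] += 1

    # Authors that satisfy both thresholds.
    kept_authors = {
        author
        for author, topics in topics_by_author.items()
        if len(topics) >= min_topics and count_by_author[author] >= min_samples
    }
    return [s for s in samples if s[0] in kept_authors]


# Toy data (hypothetical): only "alice" has two topics AND three samples.
toy = [
    ("alice", "physics", "text a1"),
    ("alice", "cooking", "text a2"),
    ("alice", "physics", "text a3"),
    ("bob", "physics", "text b1"),
    ("bob", "physics", "text b2"),
    ("bob", "physics", "text b3"),
    ("carol", "physics", "text c1"),
    ("carol", "cooking", "text c2"),
]

# A lowered sample threshold is used so the toy data is not filtered out entirely.
kept = filter_cross_topic(toy, min_topics=2, min_samples=3)
print(sorted({author for author, _, _ in kept}))
```

With the default thresholds from the description (`min_topics=2`, `min_samples=70`), the same function would be applied to the full set of StackExchange posts rather than to toy data.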

Created:
2024-07-15