
Pretrained Models and Reproducibility Archive for "From Characters to Syntax: Characterizing the Accuracy–Robustness Trade-off in Cross-Domain Authorship Verification"

Source: DataCite Commons | Updated: 2026-03-27 | Indexed: 2026-05-04
Download link:
https://data.mendeley.com/datasets/8nbmbdtwpn/1
Description:
This repository contains the pretrained model weights, vectorizer pipelines, and ablation study checkpoints for the paper "From Characters to Syntax: Characterizing the Accuracy–Robustness Trade-off in Cross-Domain Authorship Verification". The models were trained and evaluated across three distinct text domains (fanfiction, personal blogs, and corporate emails) to investigate the vulnerability of stylometric systems to semantic-preserving paraphrase attacks.

Contents of this archive:

Robust Siamese: The best-performing cross-domain model, using character n-gram features (includes .pth weights, vectorizer.pkl, and scaler.pkl).
Cross-Domain (CD) Siamese: The baseline generalist character 4-gram model trained across all domains.
Robust DANN (Domain-Adversarial Neural Network): The multi-view syntactic feature model trained for high adversarial robustness.
BERT Baseline: The contextual baseline model used for comparative evaluation.
Syntactic Ablation Models: Pretrained checkpoints isolating Part-of-Speech (POS) trigrams, function word frequencies, and readability metrics to show which stylometric features drive robustness.

Usage:

These weights are intended to be used directly with the PyTorch and Scikit-Learn inference pipelines provided in the official GitHub repository. Researchers can use this archive to reproduce the cross-domain accuracy (up to 86.2%) and the attack success rate evaluations presented in the manuscript.
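To make the character 4-gram features mentioned in the description concrete, here is a minimal sketch using only the Python standard library. It is not the archive's pipeline: the function names (`char_ngrams`, `cosine_similarity`, `same_author_score`) are hypothetical, and plain cosine similarity stands in for the trained Siamese network, which instead relies on the provided vectorizer.pkl, scaler.pkl, and .pth weights.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    # Texts shorter than n yield an empty Counter.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no n-grams to compare
    return dot / (norm_a * norm_b)


def same_author_score(text_a: str, text_b: str, n: int = 4) -> float:
    """Hypothetical stand-in for a verification score (assumption:
    cosine similarity replaces the trained Siamese model)."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))
```

A pair of texts with more shared character 4-grams scores closer to 1.0. Surface features like these tend to change when a text is paraphrased, which is the kind of sensitivity the archive's robustness experiments examine.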
Provider:
Mendeley Data
Created:
2026-03-27