Pretrained Models and Reproducibility Archive for "From Characters to Syntax: Characterizing the Accuracy–Robustness Trade-off in Cross-Domain Authorship Verification"
Source: DataCite Commons. Updated: 2026-03-27. Indexed: 2026-05-04.
Download link:
https://data.mendeley.com/datasets/8nbmbdtwpn/1
Description:
This repository contains the pretrained model weights, vectorizer pipelines, and ablation study checkpoints for the paper "From Characters to Syntax: Characterizing the Accuracy–Robustness Trade-off in Cross-Domain Authorship Verification".
The provided models were trained and evaluated across three distinct text domains (fanfiction, personal blogs, and corporate emails) to investigate the vulnerability of stylometric systems to semantics-preserving paraphrase attacks.
Contents of this archive:
Robust Siamese: The best-performing cross-domain model utilizing character n-gram features (includes .pth weights, vectorizer.pkl, and scaler.pkl).
Cross-Domain (CD) Siamese: The baseline generalist character 4-gram model trained across all domains.
Robust DANN (Domain-Adversarial Neural Network): The multi-view syntactic feature model trained for high adversarial robustness.
BERT Baseline: The contextual baseline model used for comparative evaluation.
Syntactic Ablation Models: Pretrained checkpoints isolating Part-of-Speech (POS) trigrams, function word frequencies, and readability metrics to demonstrate the specific drivers of robustness in stylometric features.
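To make the feature families named in the list above concrete, the following is a minimal sketch of two of them: overlapping character n-gram counts (with n = 4, as in the CD Siamese baseline) and relative function-word frequencies. It is an illustration under stated assumptions, not the authors' feature extractor: the function names, the whitespace tokenization, and the six-word function-word list are all hypothetical, and the released vectorizer.pkl should be used for any actual reproduction.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list for illustration only;
# the list used in the paper is not given in this archive description.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a")


def char_ngrams(text, n=4):
    """Count overlapping character n-grams (n=4 mirrors the 4-gram baseline)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def function_word_freqs(text, vocab=FUNCTION_WORDS):
    """Relative frequency of each function word among whitespace tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in vocab}


if __name__ == "__main__":
    sample = "The cat and the dog"
    print(char_ngrams(sample).most_common(3))
    print(function_word_freqs(sample))
```

Character n-grams depend on the exact surface form of the text, whereas function-word frequencies depend only on a small closed vocabulary, which is one plausible reason the two families could respond differently to paraphrasing.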
Usage: These weights are intended to be used directly with the PyTorch and scikit-learn inference pipelines provided in the official GitHub repository. Researchers can use this archive to reproduce the cross-domain accuracy (up to 86.2%) and attack success rate evaluations presented in the manuscript.
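As a rough illustration of how the three files listed for the Robust Siamese model might be loaded, here is a minimal sketch. Several details are assumptions rather than facts from the archive: the directory layout, the placeholder weights filename model.pth (the actual .pth name is not stated), that the .pkl files were written with the standard pickle module rather than joblib, and that the .pth file holds a state dict. The official repository's own loading code takes precedence where it differs.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Unpickle one preprocessing artifact, e.g. vectorizer.pkl or scaler.pkl.

    Only unpickle files from a source you trust: pickle can run arbitrary code.
    """
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def load_checkpoint(path):
    """Load a .pth checkpoint onto the CPU; torch is imported lazily."""
    import torch  # assumed to be installed alongside the official pipeline

    return torch.load(path, map_location="cpu")


def load_artifacts(model_dir, weights_name="model.pth"):
    """Collect the weights, vectorizer, and scaler from one model directory.

    `weights_name` is a placeholder; substitute the real .pth filename.
    """
    model_dir = Path(model_dir)
    return {
        "vectorizer": load_pickle(model_dir / "vectorizer.pkl"),
        "scaler": load_pickle(model_dir / "scaler.pkl"),
        "checkpoint": load_checkpoint(model_dir / weights_name),
    }
```

The loaded checkpoint would still need to be passed to the matching model class from the official repository (for a state dict, via that model's load_state_dict method), which this sketch does not attempt to reconstruct.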
Provider:
Mendeley Data
Created:
2026-03-27



