Replication Data for: Topic Classification for Political Texts with Pretrained Language Models
Source: DataONE. Updated: 2023-09-17. Indexed: 2024-06-08.
Download link:
https://search.dataone.org/view/sha256:a6c02093052921e5d5be844368def1c25dbf8363f01328904453b4263fc3c4ea
Description:
Supervised topic classification requires labeled data. This often becomes a bottleneck, as high-quality labeled data is expensive to acquire. To overcome the data scarcity problem, scholars have recently proposed using cross-domain topic classification to take advantage of pre-existing labeled datasets. Cross-domain topic classification requires only limited annotation in the target domain to verify its cross-domain accuracy. In this letter, we propose supervised topic classification with pre-trained language models as an alternative. We show that language models fine-tuned with 70% of the small annotated dataset in the target corpus could outperform models trained using large cross-domain datasets by 27%, and that models fine-tuned with 10% of the annotated dataset could already outperform the cross-domain classifiers. Our models are competitive in terms of training time and inference time. Researchers interested in supervised learning with limited labeled data should find our results useful. Our code and data are publicly available.
Created:
2023-11-08



