
Replication Data for: Topic Classification for Political Texts with Pretrained Language Models

DataONE | Updated 2023-09-17 | Indexed 2024-06-08
Download link:
https://search.dataone.org/view/sha256:a6c02093052921e5d5be844368def1c25dbf8363f01328904453b4263fc3c4ea
Description:
Supervised topic classification requires labeled data. This often becomes a bottleneck as high-quality labeled data is expensive to acquire. To overcome the data scarcity problem, scholars have recently proposed to use cross-domain topic classification to take advantage of pre-existing labeled datasets. Cross-domain topic classification only requires limited annotation in the target domain to verify its cross-domain accuracy. In this letter, we propose supervised topic classification with pre-trained language models as an alternative. We show that language models finetuned with 70% of the small annotated dataset in the target corpus could outperform models trained using large cross-domain datasets by 27% and that models finetuned with 10% of the annotated dataset could already outperform the cross-domain classifiers. Our models are competitive in terms of training time and inference time. Researchers interested in supervised learning with limited labeled data should find our results useful. Our code and data are publicly available.
Created:
2023-11-08