Credibility corpus fine-tuned ELMo contextual language model for early rumor detection on social media
Source: Figshare. Updated: 2020-01-14. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Credibility_corpus_fine-tuned_ELMo_contextual_language_model_for_early_rumor_detection_on_social_media/11591775
Description:
This repository contains a rumor-task-specific contextual neural language model fine-tuned on a large, credibility-focused social media dataset. The model file contains fine-tuned and fixed bidirectional Language Model (biLM) weights that can be used to compute sentence representations of candidate rumor tweets. This release is intended for research only and for reproducing the results in our papers.

Contextual language models such as ELMo provide deep, contextualised, character-based word representations by using bidirectional language models. Previous research shows that fine-tuning Neural Language Models (NLMs) on domain-specific data allows them to learn more meaningful word representations and yields a performance gain.

In our research, we fine-tuned the pre-trained ELMo model for early rumor detection on social media, using a dataset generated from CREDBANK. Sentences in the original corpus were shuffled and split into training and hold-out sets, with about 0.02% of the original data used as the hold-out set. We also generated a test set from the PHEME data, containing 6,162 tweets related to 9 events, in the hope that it offers an independent and robust evaluation of our hypothesis.

The model fine-tuned on the CREDBANK dataset (denoted "elmo_credbank") was trained for more than 800 hours on an Intel E5-2630-v3 CPU, using at most 50 GiB of RAM. For a comparative evaluation of its effectiveness, we also fine-tuned the pre-trained ELMo model on the SNAP corpus (denoted "elmo_snap"), which was trained for more than 500 hours on an NVIDIA Kepler K40M GPU. Our results show a large improvement in perplexity on both the hold-out set and the test set for the model fine-tuned on CREDBANK, compared with the model fine-tuned on the SNAP corpus.

Our research shows that state-of-the-art NLMs and large credibility-focused Twitter corpora can be employed to learn context-sensitive representations of rumor tweets.

For more details, please refer to our papers listed below.
Version "12262018.hdf5" was used in [2] and version "10052019.hdf5" was used in [1]. The code using this language model can be found on GitHub (https://github.com/soojihan/Multitask4Veracity).

[1] Han, S., Gao, J., Ciravegna, F. (2019). "Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection", The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August 2019.
[2] Han, S., Gao, J., Ciravegna, F. (2019). "Data Augmentation for Rumor Detection Using Context-Sensitive Neural Language Model With Large-Scale Credibility Corpus", Seventh International Conference on Learning Representations (ICLR) LLD, New Orleans, Louisiana, US.
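
The description mentions two mechanical steps that are easy to illustrate: shuffling a corpus and reserving a small hold-out fraction, and comparing language models by perplexity. The Python sketch below is a minimal illustration under stated assumptions, not the authors' preprocessing or evaluation code. The function names `split_holdout` and `perplexity`, the fixed seed, and the toy inputs are all hypothetical, and perplexity is taken in its standard form, the exponential of the negative mean natural-log probability per token.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_holdout(
    sentences: Sequence[str],
    holdout_fraction: float,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Shuffle sentences and reserve a fraction of them as a hold-out set.

    Returns (training_sentences, holdout_sentences). At least one
    sentence is always placed in the hold-out set.
    """
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    shuffled = list(sentences)
    # A seeded generator keeps the split reproducible (an assumption here).
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Computed as exp(-mean(log p)); lower values mean the model assigns
    higher probability to the evaluated text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


if __name__ == "__main__":
    # Toy corpus standing in for the real sentences.
    corpus = [f"sentence {i}" for i in range(10_000)]
    # 0.02% expressed as a fraction is 0.0002.
    train, holdout = split_holdout(corpus, holdout_fraction=0.0002)
    print(len(train), len(holdout))

    # Hypothetical log-probabilities that two models assign to the same tokens.
    model_a = [math.log(0.20), math.log(0.10), math.log(0.25)]
    model_b = [math.log(0.05), math.log(0.02), math.log(0.10)]
    print(perplexity(model_a) < perplexity(model_b))
```

In practice the per-token log-probabilities would come from the language model being evaluated on the hold-out or test sentences; how the forward and backward directions of a biLM are combined into one score is a design choice this sketch does not cover.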
Created: 2020-01-14



