Credibility corpus fine-tuned ELMo contextual language model for early rumor detection on social media
Source: orda.shef.ac.uk | Updated: 2020-01-14 | Indexed: 2025-03-25
Download link:
https://orda.shef.ac.uk/articles/dataset/Credibility_corpus_fine-tuned_ELMo_contextual_language_model_for_early_rumor_detection_on_social_media/11591775/1
Description:
This repository contains a rumor-task-specific contextual neural language model that was fine-tuned on a large, credibility-focused social media dataset. The model file contains fine-tuned and fixed bidirectional language model (biLM) weights that can be used to compute sentence representations of candidate rumor tweets. This release is intended for research use only and for reproducing the results in our papers.

Contextual language models such as ELMo provide deep, contextualised, character-based word representations by using bidirectional language models. Previous research shows that fine-tuning neural language models (NLMs) on domain-specific data allows them to learn more meaningful word representations and yields a performance gain.

In our research, we fine-tuned pre-trained ELMo for the task of early rumor detection on social media, using a dataset that we generated from CREDBANK. Sentences in the original corpus were shuffled and split into training and hold-out sets; about 0.02% of the original data is used as the hold-out set. We also generated a test set from the PHEME data, containing 6,162 tweets related to 9 events, in the hope that it offers an independent and robust evaluation of our hypothesis.

The model fine-tuned on the CREDBANK dataset (denoted "elmo_credbank") was trained for more than 800 hours on an Intel E5-2630-v3 CPU, using at most 50 GiB of RAM. For a comparative evaluation of its effectiveness, we also fine-tuned the pre-trained ELMo model on the SNAP corpus (denoted "elmo_snap"), which was trained for more than 500 hours on an NVIDIA Kepler K40M GPU. Our results show a large improvement in perplexity on both the hold-out set and the test set for the model fine-tuned on CREDBANK, compared with the model fine-tuned on the SNAP corpus.

Our research shows that state-of-the-art NLMs and large, credibility-focused Twitter corpora can be employed to learn context-sensitive representations of rumor tweets.

For more details, please refer to our papers listed below.
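The description compares the two fine-tuned models by perplexity on a small hold-out set. As a rough illustration only, the sketch below shows how a shuffled hold-out split and a perplexity score could be computed. The function names, the fixed seed, and the use of natural-log token probabilities are assumptions made for this example; they are not taken from the authors' code.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_holdout(
    sentences: Sequence[str],
    holdout_fraction: float = 0.0002,  # about 0.02%, as stated in the description
    seed: int = 0,                     # hypothetical seed, for reproducibility only
) -> Tuple[List[str], List[str]]:
    """Shuffle the sentences and return (training set, hold-out set)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one hold-out sentence even for tiny corpora.
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    Assumes natural-log probabilities, one per predicted token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    corpus = [f"sentence {i}" for i in range(10_000)]
    train, holdout = split_holdout(corpus)
    print(len(train), len(holdout))

    # A model that assigns probability 0.25 to every token is as uncertain
    # as a uniform choice among four tokens.
    print(perplexity([math.log(0.25)] * 3))
```

Lower perplexity means the model assigns higher probability to the held-out text, which is the sense in which the CREDBANK fine-tuned model is reported to improve on the SNAP fine-tuned one.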
Version "12262018.hdf5" was used in [2] and version "10052019.hdf5" was used in [1]. The code that uses this language model is available on GitHub (https://github.com/soojihan/Multitask4Veracity).

[1] Han, S., Gao, J., Ciravegna, F. (2019). "Neural Language Model Based Training Data Augmentation for Weakly Supervised Early Rumor Detection", The 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2019), Vancouver, Canada, 27-30 August 2019.
[2] Han, S., Gao, J., Ciravegna, F. (2019). "Data Augmentation for Rumor Detection Using Context-Sensitive Neural Language Model With Large-Scale Credibility Corpus", Seventh International Conference on Learning Representations (ICLR) LLD, New Orleans, Louisiana, US.
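The description states that the biLM weights can be used to compute sentence representations of candidate rumor tweets, but it does not say how token-level outputs are pooled. The sketch below shows one common option, offered purely as an assumption: average the biLM layers for each token, then average over tokens. Loading the HDF5 weights with an ELMo implementation is not shown, and the helper names here are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def sentence_representation(layers: Sequence[Sequence[Vector]]) -> List[float]:
    """Pool biLM outputs into one sentence vector.

    layers[l][t] is the vector for token t at biLM layer l.
    Assumption: plain averaging over layers, then over tokens.
    """
    # zip(*layers) regroups the data per token: one vector from each layer.
    per_token = [mean_pool(token_layers) for token_layers in zip(*layers)]
    return mean_pool(per_token)


if __name__ == "__main__":
    # Toy input: 2 layers, 2 tokens, 2 dimensions.
    toy_layers = [
        [[1.0, 2.0], [3.0, 4.0]],  # layer 0
        [[3.0, 4.0], [5.0, 6.0]],  # layer 1
    ]
    print(sentence_representation(toy_layers))
```

Other pooling choices, such as using only the top layer or a learned weighted sum of layers, fit the same interface; the repository linked above is the place to check which one the authors actually used.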
Provider:
orda.shef.ac.uk



