Tran et al. Final_Dataset.xlsx
Source: DataCite Commons | Updated: 2024-11-12 | Indexed: 2025-01-06
Download link:
https://figshare.com/articles/dataset/Tran_et_al_Final_Dataset_xlsx/27619839
Description:
Artificial Intelligence (AI) has emerged as a critical challenge to the authenticity of journalistic content, raising concerns over the ease with which artificially generated articles can mimic human-written news. This study focuses on using machine learning to identify distinguishing features, or “stylistic fingerprints,” of AI-generated and human-authored journalism. By analyzing these unique characteristics, we aim to classify news pieces with high accuracy, enhancing our ability to verify the authenticity of digital news.

To conduct this study, we gathered a balanced dataset of 150 original journalistic articles and their 150 AI-generated counterparts, sourced from popular news websites. A variety of lexical, syntactic, and readability features were extracted from each article to serve as input data for training machine learning models. Five classifiers were then trained to evaluate how accurately they could distinguish between authentic and artificial articles, with each model learning specific patterns and variations in writing style.

In addition to model training, BERTopic, a topic modeling technique, was applied to extract salient keywords from the journalistic articles. These keywords were used to prompt Google’s Gemini, an AI text generation model, to create artificial articles on the same topics as the original human-written pieces. This ensured a high level of relevance between authentic and AI-generated articles, which added complexity to the classification task.

Among the five classifiers tested, the Random Forest model delivered the best performance, achieving an accuracy of 98.3% along with high precision (0.984), recall (0.983), and F1-score (0.983). Feature importance analyses were conducted using methods like Random Forest Feature Importance, Analysis of Variance (ANOVA), Mutual Information, and Recursive Feature Elimination. This analysis revealed that the top five discriminative features were sentence length range, paragraph length coefficient of variation, verb ratio, sentence complexity tags, and paragraph length range. These features appeared to encapsulate subtle but meaningful stylistic differences between human and AI-generated content.

This research makes a significant contribution to combating disinformation by offering a robust method for authenticating journalistic content. By employing machine learning to identify subtle linguistic patterns, this study not only advances our understanding of AI in journalism but also enhances the tools available to ensure the credibility of news in the digital age.
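As a rough illustration of three of the features named above (sentence length range, paragraph length range, and paragraph length coefficient of variation), the following minimal sketch computes them from raw text. It is not the authors' extraction code: the function names, the regex-based sentence splitter, the blank-line paragraph convention, and the use of the population standard deviation are all assumptions made for this example.

```python
import re
import statistics


def split_sentences(text):
    # Assumption: a naive splitter that breaks after ., ! or ? followed by
    # whitespace. A real pipeline would likely use an NLP tokenizer.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def stylistic_features(article):
    # Assumption: paragraphs are separated by blank lines and lengths are
    # measured in whitespace-delimited words. Expects non-empty text.
    paragraphs = [p for p in article.split("\n\n") if p.strip()]
    para_lengths = [len(p.split()) for p in paragraphs]
    sent_lengths = [
        len(s.split()) for p in paragraphs for s in split_sentences(p)
    ]
    mean_para = statistics.mean(para_lengths)
    return {
        "sentence_length_range": max(sent_lengths) - min(sent_lengths),
        "paragraph_length_range": max(para_lengths) - min(para_lengths),
        # Coefficient of variation = standard deviation / mean
        # (population standard deviation is an assumption here).
        "paragraph_length_cv": statistics.pstdev(para_lengths) / mean_para,
    }


sample = (
    "Short one. This sentence is a little bit longer than that.\n\n"
    "Another paragraph here."
)
features = stylistic_features(sample)
print(features)
```

A feature dictionary like this, computed per article, could serve as one row of the input table for a classifier; the remaining features in the description (verb ratio, sentence complexity tags) would need part-of-speech tagging and are not covered by this sketch.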
Provider:
figshare
Created:
2024-11-12



