andyP/fake_news_en_opensources

Name: andyP/fake_news_en_opensources
Creator: andyP
Published: 2024-02-12 21:04:30
License: no description provided

Hugging Face · Updated 2024-02-12 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/andyP/fake_news_en_opensources


Resource description:

---
license: apache-2.0
annotations_creators:
- expert-generated
language_creators:
- found
task_categories:
- text-classification
language:
- en
multilinguality:
- monolingual
source_datasets:
- Opensources https://github.com/BigMcLargeHuge/opensources
- FakeNews Corpus https://github.com/several27/FakeNewsCorpus
tags:
- fake-news-detection
- fake news
- english
- nlp
task_ids:
- topic-classification
- fact-checking
pretty_name: Fake News Opensources
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: type
    dtype: string
  - name: domain
    dtype: string
  - name: scraped_at
    dtype: string
  - name: url
    dtype: string
  - name: authors
    dtype: string
  - name: title
    dtype: string
  - name: content
    dtype: string
---

# Dataset Card for "Fake News Opensources"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/AndyTheFactory/FakeNewsDataset](https://github.com/AndyTheFactory/FakeNewsDataset)
- **Repository:** [https://github.com/AndyTheFactory/FakeNewsDataset](https://github.com/AndyTheFactory/FakeNewsDataset)
- **Point of Contact:** [Andrei Paraschiv](https://github.com/AndyTheFactory)

### Dataset Summary

A consolidated and cleaned-up version of the opensources Fake News dataset.

The Fake News Corpus comprises 8,529,090 individual articles, classified into 12 classes: reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate and unknown. The articles were scraped between the end of 2017 and the beginning of 2018 from various news websites, totaling 647 distinct sources, and date from various years leading up to the 2016 US elections and the year after. Documents were classified by source, using the curated website list provided by opensources.co, which leads to a highly imbalanced class distribution. The source classification method proposed by opensources.co was based on six criteria:

- title and domain name analysis,
- “About Us” analysis,
- source or study mentioning,
- writing style analysis,
- aesthetic analysis,
- social media analysis.

After extensive data cleaning and duplicate removal we retain **5,915,569** records.

### Languages

English

## Dataset Structure

### Data Instances

An example record looks as follows.

```
{
  'id': 4059480,
  'type': 'political',
  'domain': 'dailycaller.com',
  'scraped_at': '2017-11-27',
  'url': 'http://dailycaller.com/buzz/massachusettsunited-states/page/2/',
  'authors': 'Jeff Winkler, Jonathan Strong, Ken Blackwell, Pat Mcmahon, Julia Mcclatchy, Admin, Matt Purple',
  'title': 'The Daily Caller',
  'content': 'New Hampshire is the state with the highest median income in the nation, according to the U.S. Census Bureau’s report on income, poverty and health insurance',
}
```

### Data Fields

- `id`: the unique article ID
- `type`: the label of the record (one of: reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate)
- `scraped_at`: date of the original scrape run
- `url`: original article URL
- `authors`: comma-separated list of scraped authors
- `title`: original scraped article title
- `content`: full article text

### Data Splits

| Label | Nr Records |
| :--- | :---: |
| reliable | 1807323 |
| political | 968205 |
| bias | 769874 |
| fake | 762178 |
| conspiracy | 494184 |
| rumor | 375963 |
| unknown | 230532 |
| clickbait | 174176 |
| unreliable | 104537 |
| satire | 84735 |
| junksci | 79099 |
| hate | 64763 |
| total | 5915569 |

## Dataset Creation

### Source Data

News articles from various sites

#### Who are the source language producers?

News articles, blogs

### Annotations

#### Who are the annotators?

Journalists

### Other Known Limitations

The dataset was not manually filtered, so some of the labels might be incorrect and some of the URLs might point not to the actual articles but to other pages on the website. However, because the corpus is intended for training machine learning algorithms, those problems should not pose a practical issue.

Additionally, once the dataset is finalised (so far only about 80% has been cleaned and published), I do not intend to update it, so it might quickly become outdated for purposes other than content-based algorithms. However, any contributions are welcome!

### Licensing Information

This data is available and distributed under the Apache-2.0 license.

### Citation Information

```
tbd
```

The Fake News Opensources dataset is a consolidated and cleaned-up version of the opensources Fake News dataset, containing 5,915,569 articles classified into 12 classes. The articles were scraped from various news websites and classified based on their source using a method involving six criteria. The dataset is intended for text classification tasks such as topic classification and fact-checking. It is in English and is monolingual. The dataset includes fields such as id, type, domain, scraped_at, url, authors, title, and content. The dataset is licensed under the Apache-2.0 license.

Provided by:

andyP

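Original information summary

Dataset card: "Fake News Opensources"

Dataset description

Dataset summary

"Fake News Opensources" is a consolidated and cleaned-up version of the opensources Fake News dataset. The underlying Fake News Corpus comprises 8,529,090 articles in 12 classes: reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate and unknown. The articles were scraped from various news websites between the end of 2017 and the beginning of 2018, from 647 distinct sources, and date from various years leading up to the 2016 US elections and the year after. Documents are classified by source, using the curated website list provided by opensources.co, which leads to a highly imbalanced class distribution. The opensources.co source classification method is based on six criteria:

Title and domain name analysis
"About Us" analysis
Source or study mentioning
Writing style analysis
Aesthetic analysis
Social media analysis

After extensive data cleaning and duplicate removal, 5,915,569 records are retained.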
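The card does not say how duplicates were detected, so the following is only a minimal sketch of one common approach: exact-match deduplication on a hash of the whitespace-normalized, lowercased article text. The helper names `content_key` and `drop_duplicates` and the sample records are hypothetical, not part of the dataset's tooling.

```python
import hashlib
import re


def content_key(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing (an assumed normalization)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(records):
    """Keep the first record seen for each normalized content hash."""
    seen = set()
    kept = []
    for record in records:
        key = content_key(record["content"])
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# Illustrative records only; they are not taken from the dataset.
sample = [
    {"id": 1, "content": "Breaking  news:\nsomething happened."},
    {"id": 2, "content": "breaking news: something happened."},
    {"id": 3, "content": "A different article."},
]
deduplicated = drop_duplicates(sample)
print([r["id"] for r in deduplicated])

# Share of the original corpus retained, from the two figures quoted above.
retained_share = 5_915_569 / 8_529_090
print(f"{retained_share:.1%}")
```

With the figures quoted above, roughly 69% of the original 8,529,090 articles survive cleaning and deduplication.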

Supported tasks and leaderboards

Text classification
Fact-checking

Languages

English

Dataset structure

Data instances

An example record looks as follows:

    id: 4059480
    type: political
    domain: dailycaller.com
    scraped_at: 2017-11-27
    url: http://dailycaller.com/buzz/massachusettsunited-states/page/2/
    authors: Jeff Winkler, Jonathan Strong, Ken Blackwell, Pat Mcmahon, Julia Mcclatchy, Admin, Matt Purple
    title: The Daily Caller
    content: New Hampshire is the state with the highest median income in the nation, according to the U.S. Census Bureau’s report on income, poverty and health insurance

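Data fields

id: unique identifier of the article
type: label of the record (one of: reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate)
scraped_at: date of the original scrape run
url: original article URL
authors: comma-separated list of scraped authors
title: original scraped article title
content: full article text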
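Since `authors` and `scraped_at` are stored as plain strings, downstream code usually has to parse them. The sketch below is based only on the example record shown above; it assumes `scraped_at` is always an ISO-style `YYYY-MM-DD` date, which the card does not guarantee, and the helper names are hypothetical.

```python
from datetime import date


def parse_authors(authors: str) -> list[str]:
    """Split the comma-separated `authors` string into trimmed, non-empty names."""
    return [name.strip() for name in authors.split(",") if name.strip()]


def parse_scraped_at(scraped_at: str) -> date:
    """Parse a date such as '2017-11-27' (assumes ISO format, as in the example record)."""
    return date.fromisoformat(scraped_at)


# Fields copied from the example record in the dataset card.
record = {
    "id": 4059480,
    "type": "political",
    "domain": "dailycaller.com",
    "scraped_at": "2017-11-27",
    "authors": "Jeff Winkler, Jonathan Strong, Ken Blackwell, Pat Mcmahon, "
               "Julia Mcclatchy, Admin, Matt Purple",
    "title": "The Daily Caller",
}

authors = parse_authors(record["authors"])
scraped = parse_scraped_at(record["scraped_at"])
print(len(authors), authors[0], scraped.year)
```

Note that scraped author lists can contain non-person entries such as `Admin`, so further filtering may be needed before treating them as bylines.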

Data splits

Label	Records
reliable	1807323
political	968205
bias	769874
fake	762178
conspiracy	494184
rumor	375963
unknown	230532
clickbait	174176
unreliable	104537
satire	84735
junksci	79099
hate	64763
Total	5915569
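The label counts above make the class imbalance concrete. As a hedged illustration (the card does not prescribe any weighting scheme), the snippet below checks that the counts add up to the stated total and derives "balanced" inverse-frequency class weights, `total / (n_classes * count)`, which is one common way to compensate for imbalance during training.

```python
# Label counts copied from the table above.
counts = {
    "reliable": 1_807_323,
    "political": 968_205,
    "bias": 769_874,
    "fake": 762_178,
    "conspiracy": 494_184,
    "rumor": 375_963,
    "unknown": 230_532,
    "clickbait": 174_176,
    "unreliable": 104_537,
    "satire": 84_735,
    "junksci": 79_099,
    "hate": 64_763,
}

total = sum(counts.values())

# Inverse-frequency weights: rare classes get proportionally larger weights.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label:<11} {counts[label]:>9,} {weight:6.2f}")
print(f"total={total:,} largest/smallest={imbalance_ratio:.1f}")
```

The largest class (`reliable`) is about 28 times the size of the smallest (`hate`), so accuracy alone is a weak metric here; per-class or macro-averaged scores are more informative.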

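Dataset creation

Source data

News articles from various websites.

Who are the source language producers?

News articles, blogs

Annotations

Who are the annotators?

Journalists

Other known limitations

The dataset was not manually filtered, so some labels may be incorrect and some URLs may point not to the actual article but to other pages on the website. However, because the corpus is intended for training machine learning algorithms, these problems should not pose a practical issue.

Additionally, once the dataset is finalised (so far only about 80% has been cleaned and published), the author does not intend to update it, so it may quickly become outdated for purposes other than content-based algorithms. However, any contributions are welcome!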
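Records whose URL points to a listing or index page rather than an article, as noted in the limitations above, can be flagged with a simple heuristic. The patterns below are hypothetical and chosen only because the example record's URL ends in `/page/2/`; they are a starting point under that assumption, not a filter used by the dataset's author.

```python
import re
from urllib.parse import urlparse

# Hypothetical path patterns for pagination, tag, category and author index pages.
LISTING_PATTERNS = [
    re.compile(r"/page/\d+/?$"),
    re.compile(r"/(tag|tags|category|author)/[^/]+/?$"),
]


def looks_like_listing_page(url: str) -> bool:
    """Return True when the URL path resembles a site root or an index page."""
    path = urlparse(url).path
    if path in ("", "/"):
        return True
    return any(pattern.search(path) for pattern in LISTING_PATTERNS)


urls = [
    "http://dailycaller.com/buzz/massachusettsunited-states/page/2/",  # from the example record
    "http://example.com/2017/11/some-article-title/",  # made-up article URL
    "http://example.com",  # made-up site root
]
for url in urls:
    print(looks_like_listing_page(url), url)
```

A heuristic like this will miss site-specific index layouts and may flag a few genuine articles, so it is better suited to estimating noise or down-weighting records than to hard filtering.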

Licensing information

The dataset is available and distributed under the Apache-2.0 license.

Citation information

tbd

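Collected summary

Dataset introduction

Construction

The dataset was built by consolidating and cleaning the raw data from Opensources and the FakeNews Corpus. The source corpus contains 8,529,090 articles scraped from 647 distinct news websites between the end of 2017 and the beginning of 2018, dating from various years leading up to the 2016 US elections and the year after. Articles are labeled according to the classification of their source website, which is based on six criteria: title and domain name analysis, "About Us" analysis, source or study mentioning, writing style analysis, aesthetic analysis, and social media analysis. After data cleaning and deduplication, 5,915,569 records remain.

Features

The dataset's main characteristics are its broad content coverage and fine-grained classification scheme. Its 12 classes range from reliable news to fake content, political bias, rumor and satire, providing ample training data for fake news detection and fact-checking tasks. The class distribution is highly imbalanced because labels are assigned per source website, which makes model training more challenging.

Usage

The dataset is suited to text classification and fact-checking tasks, in particular training and evaluating fake news detection models. Users can obtain the data through the dataset's GitHub repository and use the provided fields (article ID, type, domain, scrape date, URL, authors, title and content) for analysis and modeling. Preprocessing the data before use is recommended to safeguard data quality and model performance.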
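As a rough sketch of that workflow, the records can also be streamed from the Hugging Face Hub with the `datasets` library instead of downloading all 5.9 million rows first. The repository id comes from this page; the split name `train` and the helper names are assumptions, and combining `title` and `content` into one text field is just one possible preprocessing choice.

```python
def load_article_stream(split: str = "train"):
    """Stream records from the Hub (needs `pip install datasets` and network access).

    The split name "train" is an assumption; check the dataset viewer if it fails.
    """
    from datasets import load_dataset  # imported lazily so the helpers below work offline

    return load_dataset("andyP/fake_news_en_opensources", split=split, streaming=True)


def select_fields(record: dict) -> dict:
    """Reduce a record to the text and label used for classification."""
    text = f"{record['title']}\n\n{record['content']}"
    return {"text": text, "label": record["type"]}


# Offline demonstration on a shortened copy of the card's example record.
example = {
    "id": 4059480,
    "type": "political",
    "title": "The Daily Caller",
    "content": "New Hampshire is the state with the highest median income in the nation",
}
prepared = select_fields(example)
print(prepared["label"], len(prepared["text"]))

# Online usage (not executed here):
# for record in load_article_stream().take(3):
#     print(select_fields(record)["label"])
```

Streaming keeps memory use flat, at the cost of sequential access; for repeated experiments a local copy is usually more practical.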

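Background and challenges

Background overview

In an era of information overload, the spread of fake news has become a global problem, with a particularly strong impact on political and social events. The andyP/fake_news_en_opensources dataset was created by Andrei Paraschiv to give researchers and developers a consolidated and cleaned fake news dataset for understanding and countering fake news. The source corpus contains 8,529,090 articles scraped from 647 distinct news websites between the end of 2017 and the beginning of 2018; 5,915,569 records remain after cleaning and deduplication. The articles fall into 12 classes: reliable, unreliable, political, bias, fake, conspiracy, rumor, clickbait, junk science, satire, hate and unknown. Labels follow the website list provided by opensources.co, which classifies sources using six criteria: title and domain name analysis, "About Us" analysis, source or study mentioning, writing style analysis, aesthetic analysis, and social media analysis.

Current challenges

The main challenges of the andyP/fake_news_en_opensources dataset concern data quality and timeliness. First, because the data was not manually filtered, some labels may be inaccurate and some URLs may point to the wrong page, which can affect the dataset's reliability. Second, the final version of the dataset will not be updated, so it may become outdated over time, particularly for purposes other than content-based algorithms. In addition, the class distribution is highly imbalanced, which complicates model training and evaluation. Finally, creating and maintaining the dataset requires substantial resources and expertise, so continued updates and quality control remain a long-term challenge.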

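Common scenarios

Typical use case

In natural language processing, the andyP/fake_news_en_opensources dataset is used for fake news detection. By consolidating and cleaning articles from many news websites, it provides a large body of text spanning classes from reliable news to fake content. Researchers can use it to train and evaluate fake news detection models that draw on an article's title, content, authors and other fields to distinguish between types of news.

Academic problems addressed

The dataset supports research on fake news detection. Its large scale and multiple classes let researchers study the characteristics of fake news, and its varied labels and full article text support the development of more accurate detection algorithms, which matters for information reliability and public trust.

Related derived work

The dataset lends itself to several lines of work: multi-class classification that explores the characteristics of different types of fake news, detection methods based on writing style and content structure, and broader studies of how fake news spreads and what influences it.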

The content above was collected and summarized by 遇见数据集.
