community-datasets/clickbait_news_bg

Name: community-datasets/clickbait_news_bg
Creator: community-datasets
Published: 2024-01-18 14:25:02
License: Not specified

Hugging Face | Updated 2024-01-18 | Indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/community-datasets/clickbait_news_bg

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- bg
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- fact-checking
pretty_name: Clickbait/Fake News in Bulgarian
dataset_info:
  features:
  - name: fake_news_score
    dtype:
      class_label:
        names:
          '0': legitimate
          '1': fake
  - name: click_bait_score
    dtype:
      class_label:
        names:
          '0': normal
          '1': clickbait
  - name: content_title
    dtype: string
  - name: content_url
    dtype: string
  - name: content_published_time
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 24480386
    num_examples: 2815
  - name: validation
    num_bytes: 6752226
    num_examples: 761
  download_size: 11831065
  dataset_size: 31232612
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---

# Dataset Card for Clickbait/Fake News in Bulgarian

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Data Science Society / Case Fake News](https://gitlab.com/datasciencesociety/case_fake_news)
- **Repository:** [Data Science Society / Case Fake News / Data](https://gitlab.com/datasciencesociety/case_fake_news/-/tree/master/data)
- **Paper:** [This paper uses the dataset.](https://www.acl-bg.org/proceedings/2017/RANLP%202017/pdf/RANLP045.pdf)
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

This is a corpus of Bulgarian news over a fixed period of time, whose factuality had been questioned. The news items come from 377 different sources from various domains, including politics, interesting facts and tips&tricks.

The dataset was prepared for the Hack the Fake News hackathon. It was provided by the [Bulgarian Association of PR Agencies](http://www.bapra.bg/) and is available in [Gitlab](https://gitlab.com/datasciencesociety/).

The corpus was automatically collected, and then annotated by students of journalism.

The training dataset contains 2,815 examples, where 1,940 (i.e., 69%) are fake news and 1,968 (i.e., 70%) are click-baits. There are 761 testing examples.

There is 98% correlation between fake news and clickbaits.

One important aspect about the training dataset is that it contains many repetitions. This should not be surprising as it attempts to represent a natural distribution of factual vs. fake news on-line over a period of time. As publishers of fake news often have a group of websites that feature the same deceiving content, we should expect some repetition.

In particular, the training dataset contains 434 unique articles with duplicates. These articles have three reposts each on average, with the most reposted article appearing 45 times. If we take into account the labels of the reposted articles, we can see that if an article is reposted, it is more likely to be fake news. The number of fake news articles that have a duplicate in the training dataset is 1,018, whereas the number of articles with genuine content that have a duplicate article in the training set is 322.

(The dataset description is from the following [paper](https://www.acl-bg.org/proceedings/2017/RANLP%202017/pdf/RANLP045.pdf).)

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

Bulgarian

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

Each entry in the dataset consists of the following elements:

* `fake_news_score` - a label indicating whether the article is fake or not
* `click_bait_score` - another label indicating whether it is a click-bait
* `content_title` - article heading
* `content_url` - URL of the original article
* `content_published_time` - date of publication
* `content` - article content

### Data Splits

The **training dataset** contains 2,815 examples, where 1,940 (i.e., 69%) are fake news and 1,968 (i.e., 70%) are click-baits.

The **validation dataset** contains 761 testing examples.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

Thanks to [@tsvm](https://github.com/tsvm), [@lhoestq](https://github.com/lhoestq) for adding this dataset.

Provided by:

community-datasets

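Summary of original information

Dataset card: Clickbait/Fake News in Bulgarian

Dataset description

Dataset summary

This is a corpus of Bulgarian news collected over a fixed period of time, whose factuality has been questioned. The news items come from 377 different sources covering various domains, including politics, interesting facts, and tips and tricks.

The dataset was prepared for the Hack the Fake News hackathon. It was provided by the Bulgarian Association of PR Agencies and is available on Gitlab.

The corpus was collected automatically and then annotated by journalism students.

The training dataset contains 2,815 examples, of which 1,940 (69%) are fake news and 1,968 (70%) are clickbait; there are 761 testing examples.

There is a 98% correlation between fake news and clickbait.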
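
The card does not say which statistic the 98% figure refers to. As a minimal sketch, assuming it is something like a correlation or agreement between the two binary labels, the following computes both the phi coefficient and the plain agreement rate. The function names and the toy labels are illustrative only and are not taken from the dataset.

```python
import math


def phi_coefficient(xs, ys):
    """Phi coefficient (Pearson correlation) of two equal-length 0/1 label lists."""
    if len(xs) != len(ys):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(xs, ys))
    n11 = sum(1 for x, y in pairs if x == 1 and y == 1)
    n10 = sum(1 for x, y in pairs if x == 1 and y == 0)
    n01 = sum(1 for x, y in pairs if x == 0 and y == 1)
    n00 = sum(1 for x, y in pairs if x == 0 and y == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # undefined when one label is constant
    return (n11 * n00 - n10 * n01) / denom


def agreement_rate(xs, ys):
    """Share of rows where both labels take the same value."""
    return sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)


# Toy labels (hypothetical, not real dataset rows).
toy_fake = [1, 1, 1, 0, 0, 1, 0, 1]
toy_clickbait = [1, 1, 1, 0, 0, 1, 1, 1]

toy_phi = phi_coefficient(toy_fake, toy_clickbait)
toy_agreement = agreement_rate(toy_fake, toy_clickbait)
print(f"phi={toy_phi:.3f} agreement={toy_agreement:.3f}")
```

On the real data, the two lists would be the `fake_news_score` and `click_bait_score` columns of the training split.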

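One important aspect of the training dataset is that it contains many repetitions. This is not surprising, as it attempts to represent the natural distribution of factual vs. fake news online over a period of time. Because publishers of fake news often run a group of websites that carry the same deceptive content, some repetition is to be expected. In particular, the training dataset contains 434 unique articles that have duplicates. These articles have three reposts each on average, and the most reposted article appears 45 times. Taking the labels of the reposted articles into account, an article that is reposted is more likely to be fake news: 1,018 fake news articles have a duplicate in the training dataset, compared with 322 articles with genuine content.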
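
The card does not state how duplicates were identified. The sketch below assumes the simplest criterion, exact match of the article body after whitespace and case normalization, and tallies the same kinds of figures quoted above. The helper names and toy rows are hypothetical. As a sanity check on the quoted numbers, 1,018 + 322 = 1,340 duplicated rows over 434 articles is about 3.1 copies per article, in line with "three reposts each on average".

```python
from collections import defaultdict


def normalize(text):
    """Collapse whitespace and lowercase (an assumed duplicate criterion)."""
    return " ".join(text.split()).lower()


def duplicate_stats(rows):
    """Group rows by normalized content and summarize groups with more than one row."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize(row["content"])].append(row["fake_news_score"])
    dup_groups = [labels for labels in groups.values() if len(labels) > 1]
    return {
        "articles_with_duplicates": len(dup_groups),
        "max_copies": max((len(labels) for labels in dup_groups), default=0),
        "fake_rows_in_duplicates": sum(1 for g in dup_groups for label in g if label == 1),
        "legit_rows_in_duplicates": sum(1 for g in dup_groups for label in g if label == 0),
    }


# Toy rows (hypothetical): one fake article posted three times,
# one legitimate article posted twice, one unique article.
toy_rows = [
    {"content": "A  b c", "fake_news_score": 1},
    {"content": "a b c", "fake_news_score": 1},
    {"content": "a b c ", "fake_news_score": 1},
    {"content": "x y", "fake_news_score": 0},
    {"content": "X y", "fake_news_score": 0},
    {"content": "unique", "fake_news_score": 0},
]

toy_stats = duplicate_stats(toy_rows)
print(toy_stats)
```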

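Supported tasks and leaderboards

[More Information Needed]

Languages

Bulgarian

Dataset structure

Data instances

[More Information Needed]

Data fields

Each entry in the dataset consists of the following elements:

fake_news_score - a label indicating whether the article is fake news
click_bait_score - a label indicating whether the article is clickbait
content_title - article heading
content_url - URL of the original article
content_published_time - date of publication
content - article content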
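
The field list can be mirrored as a small typed record. This is only an illustrative sketch: the `NewsRecord` class and the example row are hypothetical, while the field names and the label names (`legitimate`/`fake`, `normal`/`clickbait`) come from the card's metadata.

```python
from dataclasses import dataclass, fields

# Label names as listed in the card's dataset_info metadata.
FAKE_NEWS_NAMES = {0: "legitimate", 1: "fake"}
CLICK_BAIT_NAMES = {0: "normal", 1: "clickbait"}


@dataclass(frozen=True)
class NewsRecord:
    """One dataset entry; all non-label fields are stored as strings."""
    fake_news_score: int
    click_bait_score: int
    content_title: str
    content_url: str
    content_published_time: str
    content: str

    @classmethod
    def from_row(cls, row):
        """Build a record from a dict, ignoring any extra keys."""
        return cls(**{f.name: row[f.name] for f in fields(cls)})

    def label_names(self):
        """Return the human-readable names of both labels."""
        return (
            FAKE_NEWS_NAMES[self.fake_news_score],
            CLICK_BAIT_NAMES[self.click_bait_score],
        )


# Placeholder row (hypothetical values, not taken from the dataset).
example = NewsRecord.from_row({
    "fake_news_score": 1,
    "click_bait_score": 1,
    "content_title": "placeholder title",
    "content_url": "https://example.org/article",
    "content_published_time": "placeholder time",
    "content": "placeholder body",
})
print(example.label_names())
```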

Data splits

Training dataset: contains 2,815 examples, of which 1,940 (69%) are fake news and 1,968 (70%) are clickbait.
Validation dataset: contains 761 testing examples.
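
The quoted shares and totals can be re-derived from the counts and byte sizes in the card's metadata. The snippet below is a small arithmetic check using only those published numbers; the variable names are illustrative.

```python
# Counts and sizes as published in the dataset card metadata.
TRAIN_EXAMPLES = 2815
VALIDATION_EXAMPLES = 761
FAKE_IN_TRAIN = 1940
CLICKBAIT_IN_TRAIN = 1968
TRAIN_BYTES = 24480386
VALIDATION_BYTES = 6752226
DATASET_SIZE = 31232612


def percent(part, whole):
    """Share of `part` in `whole`, rounded to the nearest whole percent."""
    return round(100 * part / whole)


fake_pct = percent(FAKE_IN_TRAIN, TRAIN_EXAMPLES)
clickbait_pct = percent(CLICKBAIT_IN_TRAIN, TRAIN_EXAMPLES)
total_examples = TRAIN_EXAMPLES + VALIDATION_EXAMPLES
total_bytes = TRAIN_BYTES + VALIDATION_BYTES

print(fake_pct, clickbait_pct, total_examples, total_bytes == DATASET_SIZE)
```

The rounded shares match the 69% and 70% stated in the card, and the two split sizes add up to the reported `dataset_size`.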

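Dataset creation

Curation rationale

[More Information Needed]

Source data

Initial data collection and normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and sensitive information

[More Information Needed]

Considerations for using the data

Social impact of dataset

[More Information Needed]

Discussion of biases

[More Information Needed]

Other known limitations

[More Information Needed]

Additional information

Dataset curators

[More Information Needed]

Licensing information

[More Information Needed]

Citation information

[More Information Needed]

Contributions

Thanks to @tsvm and @lhoestq for adding this dataset.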

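Collected summary

Dataset introduction

Construction

The dataset was built for the analysis of Bulgarian-language news text. The articles come from 377 sources across different domains, covering politics, interesting facts, and tips and tricks, and were first collected automatically. Journalism students then annotated the corpus by hand, labelling each article for fake news and for clickbait. The dataset contains 2,815 training examples and 761 validation examples, and its construction aims to reflect the natural distribution of online news.

Characteristics

The dataset's defining characteristic is that every article carries two labels, one for fake news and one for clickbait, and the two are 98% correlated, which shows how often the two kinds of misleading content occur together. Each record includes the article title, body text, publication time, and source URL, which supports analysis along several dimensions. The training set also contains a substantial amount of repeated content. This is not a defect: it preserves the way fake news is distributed across multiple sites in practice, so that models are trained on a distribution close to the real one.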

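Usage

The dataset is mainly intended for text classification, in particular fake news detection and clickbait identification. Researchers can use the provided training and validation sets to develop or evaluate natural language processing models. When using it, check that the data fields are complete, combine `content_title` and `content` for content analysis, and use the two labels `fake_news_score` and `click_bait_score` for supervised learning or joint-task modelling. Because the repeated examples reflect real propagation patterns, take care when splitting the training set or applying data augmentation, so that evaluation results stay objective.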
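
One way to act on the caution about repeated examples is to split by article rather than by row, so that copies of the same article never land on both sides of a split. The sketch below is an assumption-laden illustration, not the dataset authors' procedure: `content_key`, `group_split`, and the toy rows are hypothetical, and duplicates are assumed to be exact matches after whitespace and case normalization. `load_train_rows` relies on the Hugging Face `datasets` library being installed and the hub being reachable; it is not called here.

```python
import random
from collections import defaultdict


def content_key(text):
    """Normalize whitespace and case so trivially different reposts share a key."""
    return " ".join(text.split()).lower()


def group_split(rows, holdout_fraction=0.2, seed=0):
    """Split rows so that all copies of an article stay on the same side."""
    groups = defaultdict(list)
    for row in rows:
        groups[content_key(row["content"])].append(row)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    target = holdout_fraction * len(rows)
    train, holdout = [], []
    for key in keys:
        # Fill the holdout side first, one whole group at a time.
        bucket = holdout if len(holdout) < target else train
        bucket.extend(groups[key])
    return train, holdout


def load_train_rows():
    """Load the training split as a list of dicts (needs `datasets` and network access)."""
    from datasets import load_dataset
    return list(load_dataset("community-datasets/clickbait_news_bg", split="train"))


# Toy rows (hypothetical) with two duplicated articles.
toy_rows = [
    {"content": "a b c", "fake_news_score": 1, "click_bait_score": 1},
    {"content": "A b  c", "fake_news_score": 1, "click_bait_score": 1},
    {"content": "x y", "fake_news_score": 0, "click_bait_score": 0},
    {"content": "x  y", "fake_news_score": 0, "click_bait_score": 0},
    {"content": "only once", "fake_news_score": 0, "click_bait_score": 1},
    {"content": "another one", "fake_news_score": 1, "click_bait_score": 1},
]

train_rows, holdout_rows = group_split(toy_rows, holdout_fraction=0.3, seed=0)
print(len(train_rows), len(holdout_rows))
```

Because whole groups are assigned at once, the holdout side can end up somewhat larger than the requested fraction when a heavily reposted article is drawn.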

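Background and challenges

Background overview

In the digital media era, the spread of fake news and clickbait has become a global social problem that threatens the information ecosystem and public understanding. The Bulgarian fake news and clickbait dataset was provided by the Bulgarian Association of PR Agencies and published through the Data Science Society's GitLab; the paper linked from the dataset card appeared at RANLP 2017. It targets fact-checking and content-credibility research on Bulgarian-language news. The dataset gathers news items from 377 different sources across domains such as politics and tips and tricks. Journalism students annotated the articles by hand, separating fake news from legitimate content and clickbait from normal articles, which makes the dataset a useful resource for text classification in natural language processing. It was prepared for the Hack the Fake News hackathon and supports work on media-credibility analysis and automated detection for a less widely covered language.

Current challenges

The dataset addresses two tasks on Bulgarian-language news, fact-checking and clickbait detection. The core problem is how to identify accurately the mixed kind of misinformation that combines misleading content with an enticing headline. The main challenges in building and using the dataset are as follows. First, data collection has to cover a broad and varied set of news sources to be representative while also dealing with repeated content: the dataset contains many duplicated articles, which reflects how fake news naturally spreads online but introduces a risk of sample bias in model training. Second, annotation depends on human annotators, and because Bulgarian-language resources are relatively scarce, annotation consistency and quality assurance are difficult. Third, the fake news and clickbait labels are highly correlated (98%), so a model may struggle to learn features that are specific to each task, which complicates multi-task learning. Together, these challenges illustrate the particular difficulty of media-credibility research in a low-resource language.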

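Common scenarios

Typical use cases

In digital media and natural language processing, the Bulgarian clickbait/fake news dataset is a resource for text classification. Its expert-annotated fake news and clickbait labels support the training of machine learning models that automatically assess the credibility of news content. The typical use case is building classifiers that use the article title and body text to separate genuine information from misleading reports, for use in information-verification research.

Derived related work

Several studies have built on this dataset. For example, related academic work has used it to examine the statistical association between fake news and clickbait and to develop model improvements based on duplicate-content detection. This work has been extended to frameworks for multilingual misinformation detection and to information-credibility research across cultural contexts, and it serves as a reference for later dataset creation and algorithm development for low-resource languages.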

Recent research on the dataset