SemEvalWorkshop/hyperpartisan_news_detection

Name: SemEvalWorkshop/hyperpartisan_news_detection
Creator: SemEvalWorkshop
Published: 2023-06-13 07:46:19
License: CC-BY-4.0

Source: Hugging Face. Updated 2023-06-13; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/SemEvalWorkshop/hyperpartisan_news_detection

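Overview:

The Hyperpartisan News Detection dataset was created for PAN @ SemEval 2019 Task 4 and is used to detect whether a news article is hyperpartisan. It has two parts, byarticle and bypublisher. Articles in byarticle were labeled individually through crowdsourcing, while articles in bypublisher inherit their label from the overall bias of their publisher. Fields include the text, title, URL, and publication date. byarticle provides a train split only; bypublisher provides train and validation splits. The dataset was created to support the automated detection of hyperpartisan news.

Provided by: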

SemEvalWorkshop

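Summary of original information

Dataset overview

Basic information

Dataset name: HyperpartisanNewsDetection
Language: English
License: CC-BY-4.0
Multilinguality: monolingual
Size category: 1M<n<10M
Source data: original
Task category: text classification
Tags: bias classification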

Dataset configurations

byarticle

Features:
- text: string
- title: string
- hyperpartisan: bool
- url: string
- published_at: string
Splits:
- train: 645 examples, 2803943 bytes
Download size: 1000352 bytes
Dataset size: 2803943 bytes

bypublisher

Features:
- text: string
- title: string
- hyperpartisan: bool
- url: string
- published_at: string
- bias: class label, with possible values right (0), right-center (1), least (2), left-center (3), left (4)
Splits:
- train: 600000 examples, 2805711609 bytes
- validation: 150000 examples, 960356598 bytes
Download size: 1003195420 bytes
Dataset size: 5611423218 bytes
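
The two configurations listed above could be loaded roughly as follows. This is a minimal sketch, assuming the Hugging Face `datasets` package is installed and that the installed version is able to load this repository. The names `EXPECTED_SPLITS`, `check_request`, and `load_config` are illustrative helpers, not part of the dataset; the example counts are copied from the figures on this page.

```python
# Split names and example counts as listed on this page.
EXPECTED_SPLITS = {
    "byarticle": {"train": 645},
    "bypublisher": {"train": 600_000, "validation": 150_000},
}

REPO_ID = "SemEvalWorkshop/hyperpartisan_news_detection"


def check_request(config: str, split: str) -> None:
    """Raise ValueError if the config/split pair is not listed on this page."""
    if config not in EXPECTED_SPLITS:
        raise ValueError(f"unknown configuration: {config!r}")
    if split not in EXPECTED_SPLITS[config]:
        raise ValueError(f"configuration {config!r} has no split {split!r}")


def load_config(config: str, split: str = "train"):
    """Hypothetical helper: load one split of one configuration."""
    check_request(config, split)
    # Imported lazily so the checks above work without the package installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

Calling `load_config("byarticle")` would fetch the small configuration (about 1 MB to download), whereas `load_config("bypublisher", "validation")` would trigger the download of roughly 1 GB, so checking the request before importing anything keeps typos cheap.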

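Dataset creation

Dataset summary

The Hyperpartisan News Detection dataset was created for PAN @ SemEval 2019 Task 4. It consists of two parts:

byarticle: labeled per article through crowdsourcing. It contains only articles on which the crowdworkers reached a consensus.
bypublisher: labeled by the overall bias of the publisher, as provided by BuzzFeed journalists or MediaBiasFactCheck.com.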
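
Publisher-level labeling of this kind is a form of distant supervision: one rating per publisher is copied to every article from that publisher. The sketch below illustrates the idea on toy data. The publisher names, the `publisher` key, and the helper `label_by_publisher` are hypothetical, and the rule that only the `left` and `right` ratings count as hyperpartisan is an assumption made for this example, not something stated on this page.

```python
# Assumption for this sketch: only ratings at either end of the spectrum
# are treated as hyperpartisan.
HYPERPARTISAN_BIASES = frozenset({"left", "right"})


def label_by_publisher(articles, publisher_bias):
    """Copy each publisher's bias rating onto all of its articles."""
    labeled = []
    for article in articles:
        bias = publisher_bias[article["publisher"]]
        labeled.append(
            {
                "url": article["url"],
                "bias": bias,
                "hyperpartisan": bias in HYPERPARTISAN_BIASES,
            }
        )
    return labeled


# Toy publisher ratings and articles (made up for illustration).
publisher_bias = {"left.example": "left", "center.example": "least"}
articles = [
    {"url": "https://left.example/a", "publisher": "left.example"},
    {"url": "https://left.example/b", "publisher": "left.example"},
    {"url": "https://center.example/c", "publisher": "center.example"},
]

labeled = label_by_publisher(articles, publisher_bias)
for row in labeled:
    print(row["url"], row["bias"], row["hyperpartisan"])
```

Because every article inherits its publisher's rating, labels obtained this way are cheap to produce at scale but noisier than per-article annotation.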

Dataset structure

Data instances

byarticle

Download size: 1.00 MB
Generated dataset size: 2.80 MB
Total disk usage: 3.81 MB

bypublisher

Download size: 1.00 GB
Generated dataset size: 5.61 GB
Total disk usage: 6.61 GB

Data fields

byarticle

text: string
title: string
hyperpartisan: bool
url: string
published_at: string

bypublisher

text: string
title: string
hyperpartisan: bool
url: string
published_at: string
bias: class label, with possible values right (0), right-center (1), least (2), left-center (3), left (4)
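
Since `bias` is stored as an integer class label, code that reads the bypublisher configuration usually needs to translate between ids and names. The following is a minimal sketch of that mapping using the id order listed above; the helper names `bias_int2str` and `bias_str2int` are illustrative. When the data is loaded through the `datasets` library, the `ClassLabel` feature provides equivalent `int2str` and `str2int` methods.

```python
# Position in the list is the integer id, in the order given on this page.
BIAS_NAMES = ["right", "right-center", "least", "left-center", "left"]


def bias_int2str(label_id: int) -> str:
    """Return the bias name for an integer id, rejecting out-of-range ids."""
    if not 0 <= label_id < len(BIAS_NAMES):
        raise ValueError(f"bias id out of range: {label_id}")
    return BIAS_NAMES[label_id]


def bias_str2int(name: str) -> int:
    """Return the integer id for a bias name."""
    try:
        return BIAS_NAMES.index(name)
    except ValueError:
        raise ValueError(f"unknown bias name: {name!r}") from None


print(bias_int2str(2))
print(bias_str2int("left"))
```

Note that the ids run from right (0) to left (4), so a larger id means further left, which is the opposite of what a left-to-right scale might suggest.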

Data splits

byarticle

train: 645 examples

bypublisher

train: 600000 examples
validation: 150000 examples
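
The bypublisher counts above correspond to an 80/20 division between train and validation. A short sketch of the arithmetic, using only the figures listed here (the variable names are illustrative):

```python
# Example counts for the bypublisher configuration, as listed above.
SPLITS = {"train": 600_000, "validation": 150_000}

total = sum(SPLITS.values())
fractions = {name: count / total for name, count in SPLITS.items()}

for name, fraction in fractions.items():
    print(f"{name}: {SPLITS[name]} examples ({fraction:.0%})")
```

byarticle has a single train split of 645 examples, so any held-out set for it has to be carved out by the user.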

Licensing information

The dataset, including its labels, is licensed under the Creative Commons Attribution 4.0 International License.

Citation information

@inproceedings{kiesel-etal-2019-semeval,
    title = "{S}em{E}val-2019 Task 4: Hyperpartisan News Detection",
    author = "Kiesel, Johannes and Mestre, Maria and Shukla, Rishabh and Vincent, Emmanuel and Adineh, Payam and Corney, David and Stein, Benno and Potthast, Martin",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/S19-2145",
    doi = "10.18653/v1/S19-2145",
    pages = "829--839",
    abstract = "Hyperpartisan news is news that takes an extreme left-wing or right-wing standpoint. If one is able to reliably compute this meta information, news articles may be automatically tagged, this way encouraging or discouraging readers to consume the text. It is an open question how successfully hyperpartisan news detection can be automated, and the goal of this SemEval task was to shed light on the state of the art. We developed new resources for this purpose, including a manually labeled dataset with 1,273 articles, and a second dataset with 754,000 articles, labeled via distant supervision. The interest of the research community in our task exceeded all our expectations: The datasets were downloaded about 1,000 times, 322 teams registered, of which 184 configured a virtual machine on our shared task cloud service TIRA, of which in turn 42 teams submitted a valid run. The best team achieved an accuracy of 0.822 on a balanced sample (yes : no hyperpartisan) drawn from the manually tagged corpus; an ensemble of the submitted systems increased the accuracy by 0.048.",
}
