SemEvalWorkshop/hyperpartisan_news_detection
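Dataset Overview
Basic Information
- Dataset name: HyperpartisanNewsDetection
- Language: English
- License: CC-BY-4.0
- Multilinguality: monolingual
- Size category: 1M<n<10M
- Source data: original
- Task category: text classification
- Tags: bias classification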
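Dataset Configurations
byarticle
- Features:
  - text: string
  - title: string
  - hyperpartisan: bool
  - url: string
  - published_at: string
- Splits:
  - train: 645 examples, 2803943 bytes
- Download size: 1000352 bytes
- Dataset size: 2803943 bytes
bypublisher
- Features:
  - text: string
  - title: string
  - hyperpartisan: bool
  - url: string
  - published_at: string
  - bias: class label, with possible values right (0), right-center (1), least (2), left-center (3), left (4)
- Splits:
  - train: 600000 examples, 2805711609 bytes
  - validation: 150000 examples, 960356598 bytes
- Download size: 1003195420 bytes
- Dataset size: 5611423218 bytes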
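As an illustration of how the two configurations might be selected, here is a minimal sketch that assumes the Hugging Face `datasets` library is installed. The repository id is taken from the title of this card; `load_hyperpartisan` and `SPLITS` are hypothetical names introduced only for this example, and depending on the installed library version and how the repository is hosted, extra arguments may be required.

```python
from typing import Optional

# Repository id as given in the title of this card.
REPO_ID = "SemEvalWorkshop/hyperpartisan_news_detection"

# Configuration names and their splits, as listed in this card.
SPLITS = {
    "byarticle": ("train",),
    "bypublisher": ("train", "validation"),
}


def load_hyperpartisan(config: str, split: Optional[str] = None):
    """Load one configuration (hypothetical helper, not part of any library)."""
    if config not in SPLITS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(SPLITS)}")
    if split is not None and split not in SPLITS[config]:
        raise ValueError(f"config {config!r} has no split {split!r}")
    # Imported lazily so the argument checks above work without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

Calling `load_hyperpartisan("byarticle", split="train")` would fetch the small configuration (about 1 MB to download), which is a cheaper first test than `bypublisher` (about 1 GB).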
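Dataset Creation
Dataset Summary
The Hyperpartisan News Detection dataset was created for PAN @ SemEval 2019 Task 4. It consists of two parts:
- byarticle: labeled article by article through crowdsourcing. This part contains only articles on which the crowdworkers reached a consensus.
- bypublisher: labeled with the overall bias of the publisher, as provided by BuzzFeed journalists or MediaBiasFactCheck.com.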
Dataset Structure
Data Instances
byarticle
- Download size: 1.00 MB
- Generated dataset size: 2.80 MB
- Total disk usage: 3.81 MB
bypublisher
- Download size: 1.00 GB
- Generated dataset size: 5.61 GB
- Total disk usage: 6.61 GB
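The rounded figures above appear to be the byte counts listed for each configuration expressed in decimal (SI) units, with total disk usage being the sum of the download size and the generated dataset size. The sketch below works through the `bypublisher` numbers under that assumption; `format_size` is a hypothetical helper written for this example.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count in decimal (SI) units with two decimals."""
    for unit, factor in (("GB", 10**9), ("MB", 10**6), ("kB", 10**3)):
        if num_bytes >= factor:
            return f"{num_bytes / factor:.2f} {unit}"
    return f"{num_bytes} B"


# Byte counts for the bypublisher configuration, as listed in this card.
download_bytes = 1003195420
generated_bytes = 5611423218

print(format_size(download_bytes))                    # 1.00 GB
print(format_size(generated_bytes))                   # 5.61 GB
print(format_size(download_bytes + generated_bytes))  # 6.61 GB
```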
Data Fields
byarticle
- text: string
- title: string
- hyperpartisan: bool
- url: string
- published_at: string
bypublisher
- text: string
- title: string
- hyperpartisan: bool
- url: string
- published_at: string
- bias: class label, with possible values right (0), right-center (1), least (2), left-center (3), left (4)
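To make the field layout concrete, the following sketch models one record as a plain Python dataclass and maps the integer `bias` ids to their names. The `Article` class, `BIAS_NAMES`, and the sample values are hypothetical and exist only for illustration; they are not part of the dataset or of any library, and the sample record is made up rather than taken from the data.

```python
from dataclasses import dataclass
from typing import Optional

# Bias class names in id order, as listed in the field description.
BIAS_NAMES = ["right", "right-center", "least", "left-center", "left"]


@dataclass
class Article:
    """One record; `bias` is only present in the bypublisher configuration."""
    text: str
    title: str
    hyperpartisan: bool
    url: str
    published_at: str
    bias: Optional[int] = None

    def bias_name(self) -> Optional[str]:
        """Return the bias label name, or None when no bias id is set."""
        return None if self.bias is None else BIAS_NAMES[self.bias]


# Made-up placeholder record, not an actual entry from the dataset.
example = Article(
    text="Placeholder article body.",
    title="Placeholder title",
    hyperpartisan=True,
    url="https://example.com/article",
    published_at="2018-01-01",
    bias=4,
)
print(example.bias_name())  # left
```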
Data Splits
byarticle
- train: 645 examples
bypublisher
- train: 600000 examples
- validation: 150000 examples
Licensing Information
The dataset, including its labels, is licensed under the Creative Commons Attribution 4.0 International License.
Citation Information
@inproceedings{kiesel-etal-2019-semeval,
    title = "{S}em{E}val-2019 Task 4: Hyperpartisan News Detection",
    author = "Kiesel, Johannes and Mestre, Maria and Shukla, Rishabh and Vincent, Emmanuel and Adineh, Payam and Corney, David and Stein, Benno and Potthast, Martin",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/S19-2145",
    doi = "10.18653/v1/S19-2145",
    pages = "829--839",
    abstract = "Hyperpartisan news is news that takes an extreme left-wing or right-wing standpoint. If one is able to reliably compute this meta information, news articles may be automatically tagged, this way encouraging or discouraging readers to consume the text. It is an open question how successfully hyperpartisan news detection can be automated, and the goal of this SemEval task was to shed light on the state of the art. We developed new resources for this purpose, including a manually labeled dataset with 1,273 articles, and a second dataset with 754,000 articles, labeled via distant supervision. The interest of the research community in our task exceeded all our expectations: The datasets were downloaded about 1,000 times, 322 teams registered, of which 184 configured a virtual machine on our shared task cloud service TIRA, of which in turn 42 teams submitted a valid run. The best team achieved an accuracy of 0.822 on a balanced sample (yes : no hyperpartisan) drawn from the manually tagged corpus; an ensemble of the submitted systems increased the accuracy by 0.048.",
}



