community-datasets/id_clickbait

Name: community-datasets/id_clickbait
Creator: community-datasets
Published: 2024-01-18 11:06:03
License: Creative Commons Attribution 4.0 International

Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/community-datasets/id_clickbait

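Overview:

The CLICK-ID dataset is a collection of Indonesian news headlines gathered from 12 local online news publishers, including detikNews, Fimela, and Kapanlagi. It consists of two main parts: (i) 46,119 raw article records, and (ii) 15,000 sample headlines annotated for clickbait. Three annotators reviewed each headline, and the majority judgment was taken as the ground-truth label. The annotated sample contains 6,290 clickbait and 8,710 non-clickbait headlines. The supported task is text classification, specifically clickbait detection. The dataset was created to fill the gap in Indonesian-language resources for clickbait tasks.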
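The reported counts imply a moderately imbalanced label distribution. The short calculation below is only an illustration of that arithmetic: it checks that the two classes sum to the stated 15,000 and derives the accuracy of a trivial classifier that always predicts the larger class, which is a useful floor when judging any model. The variable names are illustrative and not part of the dataset.

```python
# Counts reported for the annotated sample of CLICK-ID.
CLICKBAIT = 6_290
NON_CLICKBAIT = 8_710

total = CLICKBAIT + NON_CLICKBAIT
clickbait_share = CLICKBAIT / total

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(CLICKBAIT, NON_CLICKBAIT) / total

print(f"total annotated headlines: {total}")
print(f"clickbait share: {clickbait_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

A classifier trained on the annotated sample should therefore be compared against roughly 58% accuracy rather than 50%.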

Provider:

community-datasets

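Summary of Original Information

Dataset Card - Indonesian Clickbait Headlines

Dataset Description

Dataset Summary

The CLICK-ID dataset is a collection of Indonesian news headlines gathered from 12 local online news publishers: detikNews, Fimela, Kapanlagi, Kompas, Liputan6, Okezone, Posmetro-Medan, Republika, Sindonews, Tempo, Tribunnews, and Wowkeren. It consists of two main parts: (i) 46,119 raw article records, and (ii) 15,000 sample headlines annotated for clickbait. Each headline was reviewed by 3 annotators, and the majority judgment was taken as the ground-truth label. The annotated sample contains 6,290 clickbait and 8,710 non-clickbait headlines.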
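The majority-vote rule described above can be sketched as follows. This is a minimal illustration, not the authors' annotation tooling: the function name `majority_label` is hypothetical, and the encoding 1 = clickbait, 0 = non-clickbait is an assumption made for the example.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most annotators.

    With an odd number of votes and only two possible labels,
    a strict majority always exists, so no tie-breaking is needed.
    """
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of votes to avoid ties")
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical votes from three annotators
# (assumed encoding: 1 = clickbait, 0 = non-clickbait).
print(majority_label([1, 1, 0]))
print(majority_label([0, 1, 0]))
```

Using three annotators is what makes this rule well defined for a binary label: two of the three must always agree.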

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Indonesian

Dataset Structure

Data Instances

An example of an annotated article:

{
  "id": "100",
  "label": 1,
  "title": "SAH! Ini Daftar Nama Menteri Kabinet Jokowi - Maruf Amin"
}
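A record of this shape can be read into a small typed structure. The sketch below assumes only the three keys shown in the example; the class name `AnnotatedHeadline` is hypothetical and not part of the dataset or any library.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedHeadline:
    """One annotated record, mirroring the keys of the example instance."""
    id: str
    label: int
    title: str


# The example instance from the dataset card, as a JSON string.
raw = (
    '{ "id": "100", "label": 1, '
    '"title": "SAH! Ini Daftar Nama Menteri Kabinet Jokowi - Maruf Amin" }'
)

record = AnnotatedHeadline(**json.loads(raw))
print(record.id, record.label, record.title)
```

Note that `id` is a string in the example, while `label` is an integer.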

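Data Fields

Annotated data

id: ID of the sample
title: title of the news article
label: label of the article, either non-clickbait or clickbait

Raw data

id: ID of the sample
title: title of the news article
source: name of the publisher/newspaper
date: date
category: category of the article
sub-category: sub-category of the article
content: content of the article
url: URL of the article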
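The two field lists above can be written down as simple schemas and used to check that a record is complete. This is a sketch under the assumption that records arrive as plain dictionaries keyed by the field names listed; `missing_fields` is a hypothetical helper, and the example record holds placeholder values rather than real data.

```python
# Field names as listed for each part of the dataset.
ANNOTATED_FIELDS = ("id", "title", "label")
RAW_FIELDS = (
    "id", "title", "source", "date",
    "category", "sub-category", "content", "url",
)


def missing_fields(record, expected):
    """Return the expected field names that are absent from the record."""
    return [name for name in expected if name not in record]


# Hypothetical raw record with placeholder values and one field removed.
example = {name: "" for name in RAW_FIELDS}
del example["url"]

print(missing_fields(example, RAW_FIELDS))
```

Keeping records as dictionaries avoids renaming `sub-category`, which contains a hyphen and so cannot be used directly as a Python attribute name.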

Data Splits

The dataset contains a train split.
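Since only a train split is provided, loading might look like the sketch below. It assumes the Hugging Face `datasets` library is installed and that the annotated part is exposed under a configuration named `"annotated"`; that configuration name is an assumption not stated in this card and should be checked against the dataset page. The helper names `load_annotated` and `label_counts` are hypothetical.

```python
from collections import Counter


def load_annotated():
    """Load the annotated headlines from the Hugging Face Hub.

    Assumption: the configuration is named "annotated". The only split
    mentioned in the card is "train".
    """
    # Imported here so label_counts can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "community-datasets/id_clickbait", "annotated", split="train"
    )


def label_counts(rows):
    """Count how many rows carry each label value."""
    return Counter(row["label"] for row in rows)


# Small in-memory stand-in so the helper can be tried without network access.
sample_rows = [{"label": 1}, {"label": 0}, {"label": 1}]
print(label_counts(sample_rows))
```

Calling `label_counts(load_annotated())` on the real data would be one way to compare against the class counts reported in the dataset summary. Because no validation or test split is mentioned, any held-out set has to be carved out of the train split by the user.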

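Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]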

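Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information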

Creative Commons Attribution 4.0 International license

Citation Information

@article{WILLIAM2020106231,
  title = "CLICK-ID: A novel dataset for Indonesian clickbait headlines",
  journal = "Data in Brief",
  volume = "32",
  pages = "106231",
  year = "2020",
  issn = "2352-3409",
  doi = "https://doi.org/10.1016/j.dib.2020.106231",
  url = "http://www.sciencedirect.com/science/article/pii/S2352340920311252",
  author = "Andika William and Yunita Sari",
  keywords = "Indonesian, Natural Language Processing, News articles, Clickbait, Text-classification",
  abstract = "News analysis is a popular task in Natural Language Processing (NLP). In particular, the problem of clickbait in news analysis has gained attention in recent years [1, 2]. However, the majority of the tasks has been focused on English news, in which there is already a rich representative resource. For other languages, such as Indonesian, there is still a lack of resource for clickbait tasks. Therefore, we introduce the CLICK-ID dataset of Indonesian news headlines extracted from 12 Indonesian online news publishers. It is comprised of 15,000 annotated headlines with clickbait and non-clickbait labels. Using the CLICK-ID dataset, we then developed an Indonesian clickbait classification model achieving favourable performance. We believe that this corpus will be useful for replicable experiments in clickbait detection or other experiments in NLP areas."
}

Contributions

Thanks to @cahya-wirawan for adding this dataset.