community-datasets/id_clickbait
Dataset Card for Indonesian Clickbait Headlines
Dataset Description
Dataset Summary
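The CLICK-ID dataset is a collection of Indonesian news headlines gathered from 12 local online news publishers: detikNews, Fimela, Kapanlagi, Kompas, Liputan6, Okezone, Posmetro-Medan, Republika, Sindonews, Tempo, Tribunnews, and Wowkeren. It has two main parts: (i) 46,119 raw article records, and (ii) 15,000 sample headlines annotated for clickbait. Each headline was reviewed by 3 annotators, and the majority judgment was taken as the ground-truth label. Of the annotated samples, 6,290 are labeled clickbait and 8,710 non-clickbait.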
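The majority-vote rule described in the summary can be sketched as follows. This is a minimal illustration, not the authors' annotation tooling: the label strings, the `majority_label` helper, and the example votes are all assumptions made for the sketch. Only the counts 6,290 and 8,710 come from the card.

```python
from collections import Counter

# Hypothetical label names; the card only states that a headline is
# either clickbait or non-clickbait.
CLICKBAIT = "clickbait"
NON_CLICKBAIT = "non-clickbait"


def majority_label(votes):
    """Return the label chosen by most annotators.

    With an odd number of annotators and two possible labels,
    a tie cannot occur, so the most common vote is well defined.
    """
    if len(votes) % 2 == 0:
        raise ValueError("an odd number of votes is needed to avoid ties")
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Made-up votes for a single headline, for illustration only.
example_votes = [CLICKBAIT, NON_CLICKBAIT, CLICKBAIT]
print(majority_label(example_votes))

# Class balance implied by the counts reported in the summary.
n_clickbait, n_non_clickbait = 6290, 8710
total = n_clickbait + n_non_clickbait
clickbait_share = n_clickbait / total
print(total, round(clickbait_share, 4))
```

The two reported class counts sum to the 15,000 annotated headlines, with clickbait making up roughly 42% of them, so the annotated part is moderately imbalanced.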
Supported Tasks and Leaderboards
[More Information Needed]
Languages
Indonesian
Dataset Structure
Data Instances
An example of an annotated article:

```json
{
  "id": "100",
  "label": 1,
  "title": "SAH! Ini Daftar Nama Menteri Kabinet Jokowi - Maruf Amin"
}
```
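A record like the one above can be read with the standard `json` module. The sketch below assumes that label `0` means non-clickbait and `1` means clickbait, following the order in which the card lists the two classes; the `LABEL_NAMES` list is a hypothetical helper, not part of the dataset.

```python
import json

# Assumed mapping (not confirmed by the card): index 0 = non-clickbait,
# index 1 = clickbait, following the order the card lists the classes in.
LABEL_NAMES = ["non-clickbait", "clickbait"]

raw = (
    '{"id": "100", "label": 1, '
    '"title": "SAH! Ini Daftar Nama Menteri Kabinet Jokowi - Maruf Amin"}'
)

record = json.loads(raw)

# Note that "id" is a string in the example, while "label" is an integer.
label_name = LABEL_NAMES[record["label"]]
print(record["id"], label_name, record["title"])
```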
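Data Fields
Annotated
id: ID of the sample
title: title of the news article
label: label of the article, either non-clickbait or clickbait
Raw
id: ID of the sample
title: title of the news article
source: name of the publisher/newspaper
date: date
category: category of the article
sub-category: sub-category of the article
content: content of the article
url: URL of the article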
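The two field lists above can be mirrored as typed records. This is one possible sketch: the class names, the choice of `str` for every raw field, and the integer label encoding are assumptions, since the card does not state field types. The `sub-category` key is renamed because a hyphen is not valid in a Python attribute name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedHeadline:
    id: str
    title: str
    label: int  # assumed encoding: 0 = non-clickbait, 1 = clickbait


@dataclass(frozen=True)
class RawArticle:
    id: str
    title: str
    source: str
    date: str
    category: str
    sub_category: str  # listed as "sub-category" in the card
    content: str
    url: str

    @classmethod
    def from_dict(cls, row):
        # Copy the row, then rename the hyphenated key so it fits
        # a Python attribute name.
        fields = dict(row)
        fields["sub_category"] = fields.pop("sub-category")
        return cls(**fields)


# Placeholder values for illustration; not taken from the dataset.
row = {
    "id": "0",
    "title": "Contoh judul",
    "source": "Kompas",
    "date": "2020-01-01",
    "category": "news",
    "sub-category": "nasional",
    "content": "...",
    "url": "https://example.com/0",
}
article = RawArticle.from_dict(row)
print(article.source, article.sub_category)
```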
Data Splits
The dataset contains a train split only.
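Because only a train split is provided, users who need a held-out set have to create one themselves. The following is a minimal sketch of a stratified split in plain Python; the function name `stratified_split`, the 20% validation fraction, and the toy rows are assumptions for illustration and are not prescribed by the dataset.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", valid_fraction=0.2, seed=0):
    """Split rows into train/validation lists, preserving label proportions."""
    rng = random.Random(seed)

    # Group rows by label so each class is split separately.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, valid = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_valid = round(len(group) * valid_fraction)
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])
    return train, valid


# Toy rows standing in for annotated headlines: 40 with label 1, 60 with label 0.
toy_rows = [{"id": str(i), "label": 1 if i < 40 else 0} for i in range(100)]
train_rows, valid_rows = stratified_split(toy_rows)
print(len(train_rows), len(valid_rows))
```

Splitting per class keeps the clickbait to non-clickbait ratio the same in both parts, which matters when the classes are not evenly balanced.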
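Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
[More Information Needed]
Licensing Information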
Creative Commons Attribution 4.0 International license
Citation Information
```bibtex
@article{WILLIAM2020106231,
  title = "CLICK-ID: A novel dataset for Indonesian clickbait headlines",
  journal = "Data in Brief",
  volume = "32",
  pages = "106231",
  year = "2020",
  issn = "2352-3409",
  doi = "https://doi.org/10.1016/j.dib.2020.106231",
  url = "http://www.sciencedirect.com/science/article/pii/S2352340920311252",
  author = "Andika William and Yunita Sari",
  keywords = "Indonesian, Natural Language Processing, News articles, Clickbait, Text-classification",
  abstract = "News analysis is a popular task in Natural Language Processing (NLP). In particular, the problem of clickbait in news analysis has gained attention in recent years [1, 2]. However, the majority of the tasks has been focused on English news, in which there is already a rich representative resource. For other languages, such as Indonesian, there is still a lack of resource for clickbait tasks. Therefore, we introduce the CLICK-ID dataset of Indonesian news headlines extracted from 12 Indonesian online news publishers. It is comprised of 15,000 annotated headlines with clickbait and non-clickbait labels. Using the CLICK-ID dataset, we then developed an Indonesian clickbait classification model achieving favourable performance. We believe that this corpus will be useful for replicable experiments in clickbait detection or other experiments in NLP areas."
}
```
Contributions
Thanks to @cahya-wirawan for adding this dataset.



