DankMemes Task C Dataset
Source: DataCite Commons | Updated: 2022-06-01 | Indexed: 2024-07-13
Download link:
https://live.european-language-grid.eu/catalogue/corpus/8093
Description:
The DANKMEMES Task C Dataset consists of 1,000 images, half memes and half not, automatically extracted from Instagram by a Python script targeting the hashtag related to the Italian government crisis (“#crisidigoverno”). It was created and used in the context of DankMemes (https://dankmemes2020.fileli.unipi.it), a shared task proposed for the 2020 EVALITA campaign (http://www.evalita.it/2020) that focuses on the automatic classification of Internet memes.

The dataset is split into training and test sets in an 80/20 proportion. The test set is distributed without gold labels; these are provided in a separate file.

The dataset consists of:
- a folder with images in .jpg format
- a .csv file with the associated image embeddings, computed with ResNet (He et al., 2016), a state-of-the-art model for image recognition based on Deep Residual Learning
- a .csv file with the associated variables.

The variables provided for this task are:
- File: the name of the image file associated with the variables;
- Engagement: the number of comments and likes of the image;
- Date: when the image was first posted on Instagram;
- Picture manipulation: the degree of visual modification of the image. Non-manipulated images and low-impact changes (e.g. addition of text or a logo) are labeled 0. Heavily manipulated images with impactful changes (e.g. images altered to include political actors) are labeled 1;
- Visual actors: the political actors (i.e. politicians, parties’ logos) portrayed visually, either edited into the picture or present in the original image;
- Text: the textual content of the image, extracted through optical character recognition (OCR) using Google’s Tesseract-OCR Engine and then manually corrected;
- Event: a feature available only for meme images, categorizing them according to 4 events related to the 2019 Italian government crisis.
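The variables listed above could be read with a few lines of Python. The following is only a minimal sketch: the header names are assumed to match the variable names in the description, the delimiter is assumed to be a comma, and the sample rows are invented for illustration, so the real file may need adjustments. The names `ImageRecord`, `load_records`, and `split_sizes` are hypothetical helpers, not part of the dataset or the shared task.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional, TextIO, Tuple

# Invented sample rows for illustration only. The real .csv file may use
# different header names, delimiters, date formats, or value encodings.
SAMPLE_CSV = (
    "File,Engagement,Date,Picture manipulation,Visual actors,Text,Event\n"
    "1.jpg,120,2019-08-20,1,Actor A,example text,2\n"
    "2.jpg,45,2019-08-21,0,,,\n"
)


@dataclass
class ImageRecord:
    file: str                  # image file name
    engagement: int            # comments plus likes
    date: str                  # kept as a string; the date format is not documented here
    heavily_manipulated: bool  # True when "Picture manipulation" is 1
    visual_actors: str         # political actors shown in the image
    text: str                  # OCR text, manually corrected
    event: Optional[str]       # assumed to be empty for images without an event label


def load_records(handle: TextIO) -> List[ImageRecord]:
    """Parse the variables file into typed records (assumed column names)."""
    records = []
    for row in csv.DictReader(handle):
        records.append(
            ImageRecord(
                file=row["File"],
                engagement=int(row["Engagement"]),
                date=row["Date"],
                heavily_manipulated=row["Picture manipulation"] == "1",
                visual_actors=row["Visual actors"],
                text=row["Text"],
                event=row["Event"] or None,
            )
        )
    return records


def split_sizes(total: int, train_percent: int = 80) -> Tuple[int, int]:
    """Return (train, test) item counts for a percentage split."""
    train = total * train_percent // 100
    return train, total - train


if __name__ == "__main__":
    for record in load_records(io.StringIO(SAMPLE_CSV)):
        print(record)
    print(split_sizes(1000))
```

With the real data, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle for the variables file. For 1,000 images, an 80/20 split corresponds to 800 training items and 200 test items.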
Provider:
ELG
Created:
2022-06-01



