DankMemes Task C Dataset

DataCite Commons | Updated 2022-06-01 | Indexed 2024-07-13

Download link:

https://live.european-language-grid.eu/catalogue/corpus/8093

Resource description:

The DANKMEMES Task C Dataset consists of 1,000 images, half memes and half not, automatically extracted from Instagram through a Python script targeting the hashtag related to the Italian government crisis (“#crisidigoverno”). It was created and used in the context of DankMemes (https://dankmemes2020.fileli.unipi.it), a shared task proposed for the 2020 EVALITA campaign (http://www.evalita.it/2020), focusing on the automatic classification of Internet memes.

The dataset is split into training and test sets in an 80/20 proportion. The test set is distributed without gold labels, which are provided in a separate file.

The dataset consists of:
- a folder with images in .jpg format;
- a .csv file with the associated image embeddings, computed with ResNet (He et al., 2016), a state-of-the-art model for image recognition based on Deep Residual Learning;
- a .csv file with the associated variables.

The variables provided for this task are:
- File: the name of the image file associated with the variables;
- Engagement: the number of comments and likes of the image;
- Date: when the image was first posted on Instagram;
- Picture manipulation: the degree of visual modification of the image. Non-manipulated images or low-impact changes (e.g. addition of text or a logo) are labeled 0. Heavily manipulated images with impactful changes (e.g. images altered to include political actors) are labeled 1;
- Visual actors: the political actors (i.e. politicians, parties’ logos) portrayed visually, either edited into the picture or present in the original image;
- Text: the textual content of the image, extracted through optical character recognition (OCR) using Google’s Tesseract-OCR Engine and then manually corrected;
- Event: a feature available only for meme images, categorizing them according to 4 events related to the 2019 Italian government crisis.
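
The variables listed above could be read into typed records along the following lines. This is a minimal sketch, not the task organizers' loader: the header names are assumed to match the variable names in the description, the sample rows are invented for illustration, and the comma-separated encoding of Visual actors and the numeric encoding of Event are guesses that should be checked against the released .csv file.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical sample: header names and row values are invented for
# illustration and may not match the released .csv file.
SAMPLE_CSV = """File,Engagement,Date,Picture manipulation,Visual actors,Text,Event
1.jpg,120,2019-08-20,1,"Actor A, Actor B",sample text,2
2.jpg,45,2019-08-21,0,,another text,
"""


@dataclass
class MemeRecord:
    file: str
    engagement: int
    date: str
    manipulated: bool
    visual_actors: list
    text: str
    event: str  # empty string when no event is given (e.g. non-meme images)


def parse_variables(handle):
    """Parse rows of a variables file into MemeRecord objects."""
    records = []
    for row in csv.DictReader(handle):
        # Assumption: multiple actors are separated by commas in one field.
        actors = [a.strip() for a in row["Visual actors"].split(",") if a.strip()]
        records.append(MemeRecord(
            file=row["File"],
            engagement=int(row["Engagement"]),
            date=row["Date"],
            manipulated=row["Picture manipulation"] == "1",
            visual_actors=actors,
            text=row["Text"],
            event=row["Event"],
        ))
    return records


records = parse_variables(io.StringIO(SAMPLE_CSV))

# Split sizes implied by the description: 1,000 images at 80/20.
TOTAL_IMAGES = 1000
train_size = TOTAL_IMAGES * 80 // 100
test_size = TOTAL_IMAGES - train_size
```

To read a real file, the same function can be given an open file handle instead of the in-memory sample, for instance `open(path, newline="", encoding="utf-8")` with `path` pointing at the variables file.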

Provider:

ELG

Created:

2022-06-01
