DankMemes Dataset
Source: DataCite Commons. Updated 2022-06-01; indexed 2024-07-13.
Download link:
https://live.european-language-grid.eu/catalogue/corpus/8094
Description:
The DANKMEMES Dataset is composed of 2,361 images, half memes and half not, automatically extracted from Instagram through a Python script targeting the hashtag related to the Italian government crisis (“#crisidigoverno”). It was created and used in the context of DankMemes (https://dankmemes2020.fileli.unipi.it), a shared task proposed for the 2020 EVALITA campaign (http://www.evalita.it/2020), focusing on the automatic classification of Internet memes. The task encompasses three subtasks: detecting memes (Task A), detecting hate speech in memes (Task B), and clustering memes according to events (Task C).

The dataset is split into training and test sets in an 80-20% proportion of items. The test set is provided without gold labels; these are supplied in a separate file for each subtask.

For each subtask, the dataset consists of:

- a folder with images in .jpg format
- a .csv file with the associated image embeddings, computed employing ResNet (He et al., 2016), a state-of-the-art model for image recognition based on Deep Residual Learning
- a .csv file with the associated variables

The variables provided are:

- File: the name of the image file associated with the variables;
- Engagement: the number of comments and likes of the image;
- Date: when the image was first posted on Instagram;
- Picture manipulation: the degree of visual modification of the image. Non-manipulated images or low-impact changes (e.g. addition of text or a logo) are labeled 0. Heavily manipulated images with impactful changes (e.g. images altered to include political actors) are labeled 1;
- Visual actors: the political actors (i.e. politicians, parties’ logos) portrayed visually, either edited into the picture or present in the original image;
- Text: the textual content of the image, extracted through optical character recognition (OCR) using Google’s Tesseract-OCR Engine and then manually corrected;
- (for Task A) Meme: binary feature, where 0 represents non-meme images and 1 represents meme images;
- (for Task B) Hate speech: binary feature, only for memes. It differentiates memes with offensive language (1) from non-offensive memes (0);
- (for Task C) Event: feature only for meme images, categorizing them according to 4 events related to the 2019 Italian government crisis.
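As an illustration of how the Task A variables file described above might be read, here is a minimal Python sketch using only the standard library. It rests on several assumptions that the description does not confirm: that the .csv header uses the variable names exactly as listed, that the file is comma-delimited, that multiple visual actors are comma-separated within one field, and that dates are ISO-formatted. The sample rows are made up for the example and are not taken from the dataset.

```python
import csv
import io

# Made-up sample rows for illustration only; not taken from the dataset.
# The header names and comma delimiter are assumptions based on the
# variable list in the description.
SAMPLE_CSV = """File,Engagement,Date,Picture manipulation,Visual actors,Text,Meme
1.jpg,120,2019-08-20,0,Actor A,example text,1
2.jpg,45,2019-08-21,1,"Actor B, Actor C",another example,1
3.jpg,300,2019-08-22,0,,,0
"""


def load_variables(handle):
    """Parse a Task A style variables file into a list of typed dicts."""
    rows = []
    for raw in csv.DictReader(handle):
        rows.append({
            "file": raw["File"],
            "engagement": int(raw["Engagement"]),
            "date": raw["Date"],
            # 1 = heavily manipulated, 0 = none or low-impact changes
            "manipulated": raw["Picture manipulation"] == "1",
            # Assumed: several actors are comma-separated in one field
            "visual_actors": [
                a.strip() for a in raw["Visual actors"].split(",") if a.strip()
            ],
            "text": raw["Text"],
            # 1 = meme, 0 = non-meme
            "is_meme": raw["Meme"] == "1",
        })
    return rows


rows = load_variables(io.StringIO(SAMPLE_CSV))
memes = [r for r in rows if r["is_meme"]]
print(len(rows), len(memes))
```

To read an actual file, `io.StringIO(SAMPLE_CSV)` would be replaced with an opened file handle, and the header names adjusted to whatever the downloaded .csv really contains.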
Provider:
ELG
Created:
2022-06-01



