DL Addressee Estimation Model for HRI - data

Source: NIAID Data Ecosystem. Indexed 2026-05-01.

Download link:

https://zenodo.org/record/10711587

Resource description:

This repository holds the data and models produced by training and testing a hybrid deep learning model. The results are published in the conference paper "To Whom are You Talking? A DL model to Endow Social Robots with Addressee Estimation Skills", presented at the International Joint Conference on Neural Networks (IJCNN) 2023: https://ieeexplore.ieee.org/document/10191452 (open-access version: http://doi.org/10.48550/arXiv.2308.10757/).

Addressee Estimation is the ability to understand to whom a person is directing an utterance. This ability is crucial for social robots engaging in multi-party interaction, because it lets them understand the basic dynamics of social communication. In this project, we trained a DL model composed of convolutional layers and LSTM cells that takes visual information about the speaker as input and estimates the placement of the addressee. We used a supervised learning approach. The training data were taken from the Vernissage Corpus, a dataset collected from the robot's sensors during multi-party Human-Robot Interaction. For the original HRI corpus, see http://vernissage.humavips.eu/

This repository contains the /labels used for training, the /models resulting from training, and the /results, i.e., the files containing the test results. The code used to obtain this data can be found at http://doi.org/10.5281/zenodo.10709858

The folder /labels contains the .csv files used for supervised training. The sheets list the frames extracted from the videos of the Vernissage Corpus, organized temporally in sequences of 10 frames (the minimum chunk of data the hybrid CNN-LSTM model is trained on). Labels are divided into two folders: 3CLASSES (to train models classifying the addressee as left/right/nao) and BINARY (to train models with a binary classification: the robot is involved in the conversation or not). The Vernissage Corpus was annotated with four possible addressees: left/right/nao/group.
In this work we used the first three for the 3-class classification. For the binary classification, we grouped "left" and "right" together into "not-involved", and "nao" and "group" together into "involved" (see the paper for more details).

The sheets connect each frame with:
- its ground truth: LABEL (int) and ADDRESSEE (string)
- the number of the SEQUENCE (int)
- the number of the SLOT of the Vernissage Dataset the frame is extracted from (int)
- the number of the SPEAKER (int), because each slot of the Vernissage Corpus has two speakers
- the number of the interval (N_INTERVAL), i.e. the number of the utterance, where an utterance is an interval of speech delimited by at least 0.8 s of silence (each interval/utterance can contain one or more sequences)
- the temporal information needed to capture the frame from the video of the slot, given as four values:
  - START_MSEC: start of the capture (for audio, not used in this project)
  - END_MSEC: stop of the capture (for audio, not used in this project)
  - FRAME_MSEC: exact time of the capture (for image)
  - DURATION_MSEC: the duration in msec the frame refers to (sequences are made of 10 frames, each captured every 80 msec)
- the filename of the frame (the image captured from the video): IMG (.jpg)
- the filename of the speaker's face image cropped from the frame: FACE_IMAGE (.jpg)
- the filename of the speaker's pose vector extracted from the frame: FACE_POSE (.npy)
- the filename of the face image of the second person cropped from the frame: FACE_OTHER (.jpg) (not used)
- the filename of the pose vector of the second person extracted from the frame: FACE_OTHER (.npy) (not used)
- the filename of the audio captured from START_MSEC to END_MSEC (not used)
- the filename of the audio captured from START_MSEC to END_MSEC (not used)
- the filename of the mel spectrogram extracted from the audio (not used)
- the filename of the MFCC features extracted from the audio (not used)

TIP: The label files contain temporal information about when frames, faces and poses were captured, cropped and extracted from the original naovideo.avi contained in each slot of the Vernissage Corpus. The dataset of this project can therefore be rebuilt from the Vernissage Corpus: use these label files to capture frames at the times given in the .csv files, then crop faces and extract poses as explained in the code connected to this dataset.

The folder /models contains three subfolders: /three_class, /binary and /complete.

The /three_class folder contains the models obtained from the 10-fold cross-validation explained in the paper, using 4 different architectures:
- hybrid CNN-LSTM taking as input the speaker's face and pose, with intermediate fusion before the LSTM
- hybrid CNN-LSTM taking as input the speaker's face and pose, with late fusion after the LSTM
- hybrid CNN-LSTM taking as input only the speaker's face
- hybrid CNN-LSTM taking as input only the speaker's pose

Each of them contains 10 folders (one per fold of the 10-fold cross-validation) with the .pth files obtained from training and the plots of the training loss and accuracy.

The /binary folder contains the same files, but only for the hybrid CNN-LSTM taking as input the speaker's face and pose with intermediate fusion before the LSTM.

The /complete folder contains the hybrid CNN-LSTM taking as input the speaker's face and pose with intermediate fusion before the LSTM, trained on the entire Vernissage Corpus.

The folder /results is divided into /three_class and /binary as well. /three_class is divided into the same subfolders as /models. Each of them contains .csv files with the results of the testing phase on the only slot left out of training.
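The binary grouping of addressees and the evaluation scheme in which each fold is tested on the one slot left out of training can be sketched as follows. This is a minimal illustration, not the released code: it assumes that ADDRESSEE and SLOT are the exact column headers of the label files, the helper names `to_binary` and `leave_one_slot_out` are hypothetical, and the sample rows are invented for the example. The integer values stored in LABEL are not stated in the description, so the sketch works with the class names only.

```python
from collections import defaultdict

# Grouping stated in the description: left/right -> not-involved,
# nao/group -> involved. The integer LABEL encoding is not assumed here.
BINARY_GROUP = {
    "left": "not-involved",
    "right": "not-involved",
    "nao": "involved",
    "group": "involved",
}


def to_binary(addressee):
    """Map one of the four annotated addressees to its binary class name."""
    return BINARY_GROUP[addressee.strip().lower()]


def leave_one_slot_out(rows):
    """Yield (held_out_slot, train_rows, test_rows), one fold per SLOT value."""
    by_slot = defaultdict(list)
    for row in rows:
        by_slot[int(row["SLOT"])].append(row)
    for held_out in sorted(by_slot):
        test = by_slot[held_out]
        train = [r for slot, group in by_slot.items()
                 if slot != held_out for r in group]
        yield held_out, train, test


# Invented rows standing in for lines of a label .csv file
# (for a real file, csv.DictReader would produce dicts of this shape).
rows = [
    {"SLOT": "1", "SEQUENCE": "0", "ADDRESSEE": "nao"},
    {"SLOT": "1", "SEQUENCE": "1", "ADDRESSEE": "left"},
    {"SLOT": "2", "SEQUENCE": "0", "ADDRESSEE": "right"},
    {"SLOT": "2", "SEQUENCE": "1", "ADDRESSEE": "group"},
    {"SLOT": "3", "SEQUENCE": "0", "ADDRESSEE": "nao"},
]

folds = list(leave_one_slot_out(rows))
for held_out, train, test in folds:
    classes = [to_binary(r["ADDRESSEE"]) for r in test]
    print(held_out, len(train), len(test), classes)
```

Splitting by slot rather than by random row keeps every frame of a given recording on the same side of the split, which matches the per-slot test files described for /results.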
In results.csv, every sequence of the slot left out for the test is listed with its ground truth (label), the prediction of the model after training (predictions) and the score/confidence of the prediction (score). errors.csv lists the sequences that the model misclassified, with label and prediction given for each of them. The .npy file provides information about the model.

The folder /dataset_slots contains 10 empty folders that you can use to organize the Vernissage Dataset according to the code released at http://doi.org/10.5281/zenodo.10709858
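As an illustration of how a results.csv file could be summarized, the sketch below computes the accuracy for one held-out slot and collects the misclassified sequences, which is the information errors.csv is described as holding. It assumes the columns are named label, predictions and score as in the description; the `sequence` column name, the `summarize` helper and the sample content are hypothetical.

```python
import csv
import io


def summarize(results_file):
    """Return (accuracy, misclassified_rows) for an open results.csv file."""
    total = 0
    correct = 0
    errors = []
    for row in csv.DictReader(results_file):
        total += 1
        # Values are compared as stripped strings, so "2" matches "2".
        if row["label"].strip() == row["predictions"].strip():
            correct += 1
        else:
            errors.append(row)
    accuracy = correct / total if total else 0.0
    return accuracy, errors


# Invented content standing in for one results.csv file.
sample = io.StringIO(
    "sequence,label,predictions,score\n"
    "0,2,2,0.91\n"
    "1,0,1,0.55\n"
    "2,1,1,0.78\n"
    "3,2,2,0.83\n"
)

accuracy, errors = summarize(sample)
print(accuracy, [row["sequence"] for row in errors])
```

For a real file, `open(path, newline="")` would replace the `io.StringIO` object.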

Created:

2024-02-28
