Code underlying the publication: "Long-term behaviour recognition in videos with actor-focused region attention"

Name: Code underlying the publication: "Long-term behaviour recognition in videos with actor-focused region attention"
Creator: Strafforello, Ombretta
Published: 2024-05-24 00:00:00
License: No description available

4TU.ResearchData | Updated: 2024-05-24 | Indexed: 2026-04-23

Download link:

https://data.4tu.nl/datasets/0dd08a4e-cab6-49e2-98e4-f00f7d3cfccb/1


Description:

Long-term activities involve humans performing complex, minutes-long actions. Unlike the short actions targeted by traditional action recognition, complex activities are normally composed of a set of sub-actions that can vary in order, duration, and quantity. These aspects introduce a large intra-class variability that can be hard to model. Our approach aims to adaptively capture and learn the importance of spatial and temporal video regions for minutes-long activity classification. Inspired by previous work on Region Attention, our architecture embeds the spatio-temporal features from multiple video regions into a compact, fixed-length representation. These features are extracted with a specially fine-tuned 3D convolutional backbone. Additionally, driven by the prior assumption that the most discriminative locations in the videos are centered around the human carrying out the activity, we introduce an Actor Focus mechanism to enhance feature extraction during both training and inference. Our experiments show that the Multi-Regional fine-tuned 3D-CNN, topped with Actor Focus and Region Attention, substantially improves on the performance of baseline 3D architectures, achieving state-of-the-art results on Breakfast, a well-known long-term activity recognition benchmark. This repository provides our code implementation.
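
The idea of embedding features from a variable number of video regions into one fixed-length representation can be illustrated with a generic attention-pooling sketch. This is not the authors' implementation: the function name `region_attention_pool`, the linear scoring vector `score_weights`, and the toy feature values are all assumptions made for illustration. In the actual architecture, the region features would come from the fine-tuned 3D convolutional backbone and the scoring parameters would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_attention_pool(region_features, score_weights):
    """Pool N region feature vectors (each of length D) into one length-D vector.

    Hypothetical sketch: each region gets a scalar score from a linear
    projection, the scores are normalised with a softmax, and the output
    is the attention-weighted sum of the region features.
    """
    if not region_features:
        raise ValueError("at least one region feature vector is required")
    dim = len(region_features[0])
    if len(score_weights) != dim:
        raise ValueError("score_weights must match the feature dimension")

    # One scalar relevance score per region.
    scores = [
        sum(w * x for w, x in zip(score_weights, feat))
        for feat in region_features
    ]
    attention = softmax(scores)

    # Weighted sum over regions: output length is D regardless of N.
    pooled = [
        sum(a * feat[d] for a, feat in zip(attention, region_features))
        for d in range(dim)
    ]
    return pooled, attention


# Toy example: three regions with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, attention = region_attention_pool(features, [0.0, 0.0])
```

With an all-zero scoring vector every region receives the same attention weight, so the pooled vector is simply the mean of the region features. A non-zero scoring vector shifts weight toward the regions that score higher, while the output length stays equal to the feature dimension however many regions are supplied.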

Provided by:

Strafforello, Ombretta

Created:

2024-05-24