Event detection and recounting from large-scale consumer videos

Mendeley Data. Updated 2024-01-31. Indexed 2024-06-28.

Download link:

https://digitallibrary.usc.edu/asset-management/2A3BF16H6E_V

Description:

Multimedia event detection and recounting are important computer vision problems that are closely related to machine learning, natural language processing, and other research areas in computer science. Given a query consumer video, multimedia event detection (MED) generates a high-level event label (e.g. birthday party, cleaning an appliance) for the entire video, while multimedia event recounting (MER) aims at selecting supporting evidence for the detected event. Typical forms of evidence include short video snippets and text descriptions. Event detection and recounting are challenging because of the high variation in video quality, the complex temporal structures, missing or unreliable concept detectors, and the large problem scale.

This thesis describes my solutions to event detection and recounting from large-scale consumer videos. The first part focuses on extracting robust features for event detection. The proposed pipeline uses both low-level motion features and mid-level semantic features. For low-level features, the pipeline extracts local motion descriptors from videos and aggregates them into video-level representations by applying Fisher vector techniques. For mid-level features, the pipeline encodes temporal transition information from noisy object and action concept detection scores. Both feature types are suitable for training linear event classifiers that can handle large numbers of query videos, and their performance is complementary.

The second part of the thesis addresses the event recounting problem, which comprises an evidence localization task and a description generation task. Evidence localization searches for video snippets that contain supporting evidence for an event. It is inherently weakly supervised, as most of the training videos have only video-level annotations rather than segment-level annotations. My proposed framework treats evidence locations as hidden variables and exploits activity co-occurrences and temporal transitions to model events. Model parameters are learned with the latent SVM framework. For text description generation, my proposed pipelines aim at connecting vision and language by considering both semantic similarity from text and visual similarity from videos and images. The pipelines are able to generate video transcriptions as subject-verb-object triplets or visual concept tags.

This thesis demonstrates the effectiveness of all the algorithms on a range of publicly available video and image datasets.
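
As a rough illustration of the mid-level feature idea mentioned in the abstract (encoding temporal transitions from concept detection scores), the sketch below turns per-segment concept scores into a fixed-length video-level vector. This is not the thesis's actual formulation, which the abstract does not specify. The function name `transition_feature`, the product-of-scores accumulation, and the L1 normalization are all assumptions made for illustration only.

```python
from typing import List


def transition_feature(scores: List[List[float]]) -> List[float]:
    """Flattened K x K concept-transition vector (hypothetical encoding).

    scores[t][k] is an assumed detector confidence for concept k in segment t.
    Entry (i, j) accumulates scores[t][i] * scores[t + 1][j] over consecutive
    segments. The result is L1-normalized so that videos of different lengths
    yield comparable vectors.
    """
    if len(scores) < 2:
        raise ValueError("need at least two segments to form a transition")
    k = len(scores[0])
    mat = [[0.0] * k for _ in range(k)]
    # Accumulate soft transition evidence between each pair of adjacent segments.
    for prev, nxt in zip(scores, scores[1:]):
        for i in range(k):
            for j in range(k):
                mat[i][j] += prev[i] * nxt[j]
    flat = [v for row in mat for v in row]
    total = sum(flat)
    return [v / total for v in flat] if total > 0 else flat


# Toy input: 3 segments, 2 concepts (values are made up).
toy_scores = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(transition_feature(toy_scores))
```

A fixed-length vector of this kind could be fed to a linear classifier, which is consistent with the abstract's remark that the features suit linear event classifiers. How the thesis handles detector noise is not described in the abstract and is not modeled here.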

Created:

2024-01-31
