
HMDB Human Action Video Dataset

Indexed by 帕依提提 on 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-1773.html
Resource description:

The HMDB dataset is one of the most important datasets in action recognition research. With nearly one billion online videos viewed every day, video recognition and search has become an emerging frontier of computer vision research. Although considerable effort has gone into collecting and annotating large, scalable static image datasets covering thousands of image categories, human action datasets lag far behind. To help close this gap, Brown University released the HMDB51 dataset in 2011. Most of its videos come from movies; the rest come from public databases and online video libraries such as YouTube. The database contains 6,849 clips divided into 51 action categories, each with at least 101 clips.

The action categories fall into five groups:

1) General facial actions: smile, laugh, chew, talk.
2) Facial actions with object manipulation: smoke, eat, drink.
3) General body movements: cartwheel, clap hands, climb, climb stairs, dive, fall on the floor, backhand flip, handstand, jump, pull up, push up, run, sit down, sit up, somersault, stand up, turn, walk, wave.
4) Body movements with object interaction: brush hair, catch, draw sword, dribble, golf, hit something, kick ball, pick, pour, push something, ride bike, ride horse, shoot ball, shoot bow, shoot gun, swing baseball bat, sword exercise, throw.
5) Body movements for human interaction: fencing, hug, kick someone, kiss, punch, shake hands, sword fight.

Figure: illustration of the 51 actions.

Dataset, meta tags, statistics:

In addition to the action category label, each clip is annotated with an action label and with meta tags that describe properties of the clip. Because the HMDB51 sequences are extracted from commercial movies and YouTube, they cover a wide variety of lighting conditions, situations, and surroundings in which an action can appear, captured with different camera types and recording techniques, including different viewpoints. Viewpoint is another criterion by which HMDB subdivides clips: for full coverage, movements are observed from the front, from the side (left and right), and from the back. Camera motion is labeled with two distinct categories, "no motion" and "camera motion"; the latter results from zooms, traveling shots, camera shake, and similar effects.

Video quality is rated on a three-level scale applied across the large number of clips. A clip is rated "good" only if its quality is sufficient to identify individual fingers during the motion. Clips that do not meet this requirement are rated "medium", or "bad" if body parts or limbs vanish while the action is performed.

Figure: quality grading examples.
Figure: statistics by action category, body part, camera motion, and viewpoint; clip quality, clip duration, and clip duration count.

A major challenge with clips extracted from real-world videos is the potential presence of significant camera and background motion, which is assumed to interfere with local motion computation and should be corrected. To remove camera motion, we align the frames of each clip with standard image stitching techniques. These techniques estimate a background plane by detecting and then matching salient features in two adjacent frames. Correspondences between the two frames are computed with a distance measure that includes both the absolute pixel differences and the Euclidean distance between the detected points. Points with the minimum distance are then matched, and the RANSAC algorithm is used to estimate the geometric transformation between all neighboring frames, independently for each pair of frames. Using this estimate, the individual frames are warped and combined to produce a stabilized clip.

Figure: original frames and stabilized frames.

Other action recognition benchmarks

The effort began at KTH: the KTH dataset contains six types of actions, with 100 clips per action category. It was followed by the Weizmann dataset, collected at the Weizmann Institute, which contains ten action categories and nine clips per category. Both of these sets were recorded under controlled and simplified settings. The first realistic action dataset, collected from movies and annotated from movie scripts, was then created at INRIA: the Hollywood Human Actions Set contains 8 types of actions, with between 60 and 140 clips per action class. Its extended version, the Hollywood2 Human Actions Set, provides a total of 3,669 videos distributed over ten classes of human actions under ten types of scenarios. The UCF group has also been collecting action datasets, mostly from YouTube. UCF Sports contains 9 types of sports and a total of 182 clips, UCF YouTube contains 11 action classes, and UCF50 contains 50 action classes. We will show in the paper that videos from YouTube can be strongly biased by low-level features, meaning that low-level features (i.e., color and gist) are more discriminative than mid-level features (i.e., motion and shape).

For questions about the datasets and benchmarks, please contact Hueihan Jhuang (hueihan.jhuang [at] tuebingen.mpg.de).

The benchmark and database are described in the following article. We request that authors cite this paper in publications describing work carried out with this system and/or the video database.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A Large Video Database for Human Motion Recognition. ICCV, 2011.

The first benchmark, STIP features, is described in the following paper, and we request that authors cite this paper if they use STIP features.

I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning Realistic Human Actions From Movies. CVPR, 2008.

The second benchmark, C2 features, is described in the following paper, and we request that authors cite this paper if they use the C2 code.

H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A Biologically Inspired System for Action Recognition. ICCV, 2007.
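The camera-motion compensation described in the resource description (match points between adjacent frames, then fit a transformation with RANSAC) can be illustrated with a small sketch. This is not the HMDB authors' implementation: it assumes the point matches are already available, and it models camera motion as a pure 2D translation, whereas the description only speaks of a "geometric transformation". The names `ransac_translation` and `stabilize_points`, the tolerance, and the synthetic data are all hypothetical.

```python
import random


def ransac_translation(matches, n_iters=200, inlier_tol=2.0, seed=0):
    """Robustly estimate the (dx, dy) shift between two adjacent frames.

    matches: list of ((x1, y1), (x2, y2)) point pairs, frame t -> frame t+1.
    Assumption of this sketch: camera motion is a pure translation.
    """
    if not matches:
        raise ValueError("need at least one match")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    best_inliers = []
    for _ in range(n_iters):
        # One match fully determines a translation (minimal sample).
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        # Consensus set: matches whose displacement agrees with (dx, dy).
        inliers = [
            (a, b) for a, b in matches
            if abs(b[0] - a[0] - dx) <= inlier_tol
            and abs(b[1] - a[1] - dy) <= inlier_tol
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the largest consensus set by averaging its displacements.
    n = len(best_inliers)
    dx = sum(b[0] - a[0] for a, b in best_inliers) / n
    dy = sum(b[1] - a[1] for a, b in best_inliers) / n
    return (dx, dy), best_inliers


def stabilize_points(points, shift):
    """Undo the estimated camera shift for points in the later frame."""
    dx, dy = shift
    return [(x - dx, y - dy) for x, y in points]


# Synthetic example: 20 background points move with the camera by (5, -3);
# 5 foreground points (the actor) move differently and act as outliers.
background = [(x, y) for x in range(0, 100, 20) for y in range(0, 80, 20)]
matches = [((x, y), (x + 5, y - 3)) for x, y in background]
actor_moves = [(20, 10), (-15, 7), (12, -14), (-9, -11), (30, 2)]
matches += [((50, 40), (50 + mx, 40 + my)) for mx, my in actor_moves]

shift, inliers = ransac_translation(matches)
print(shift, len(inliers))
```

Because the actor's points never agree with the background displacement, they are excluded from the consensus set, which is the property that makes RANSAC suitable for separating camera motion from the motion of interest. A fuller version would fit a homography or affine model to each frame pair and warp the pixels, not just the matched points.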
Provided by:
帕依提提
Collected summary
Dataset introduction
Background and challenges
Background overview
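The HMDB human action video dataset is an important action recognition dataset released in 2011. It contains 6,849 video clips in 51 action categories, with at least 101 clips per category, organized into five groups that include general facial actions and general body movements. The data comes from movies and online video libraries, carries varied video quality ratings and meta tags, and is suitable for research on action detection and classification.
The above content was collected and summarized by 遇见数据集.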
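As a quick consistency check on the figures quoted in this overview (51 categories in five groups, 6,849 clips, at least 101 clips per category), the grouping can be written down as data. This is a minimal sketch: the English group and action names are assumed renderings of the 51 categories, and `HMDB51_GROUPS` and `summarize_groups` are hypothetical names, not part of any HMDB tooling.

```python
# Five groups of action categories; names are assumed English renderings.
HMDB51_GROUPS = {
    "general facial actions": ["smile", "laugh", "chew", "talk"],
    "facial actions with object manipulation": ["smoke", "eat", "drink"],
    "general body movements": [
        "cartwheel", "clap hands", "climb", "climb stairs", "dive",
        "fall on the floor", "backhand flip", "handstand", "jump",
        "pull up", "push up", "run", "sit down", "sit up", "somersault",
        "stand up", "turn", "walk", "wave",
    ],
    "body movements with object interaction": [
        "brush hair", "catch", "draw sword", "dribble", "golf",
        "hit something", "kick ball", "pick", "pour", "push something",
        "ride bike", "ride horse", "shoot ball", "shoot bow", "shoot gun",
        "swing baseball bat", "sword exercise", "throw",
    ],
    "body movements for human interaction": [
        "fencing", "hug", "kick someone", "kiss", "punch", "shake hands",
        "sword fight",
    ],
}

TOTAL_CLIPS = 6849          # total number of clips stated in the overview
MIN_CLIPS_PER_CLASS = 101   # minimum clips per category stated in the overview


def summarize_groups(groups):
    """Return the number of categories per group and the overall total."""
    sizes = {name: len(actions) for name, actions in groups.items()}
    return sizes, sum(sizes.values())


sizes, n_classes = summarize_groups(HMDB51_GROUPS)
floor_total = n_classes * MIN_CLIPS_PER_CLASS   # fewest clips the minimum allows
mean_per_class = TOTAL_CLIPS / n_classes        # average clips per category

for name, size in sizes.items():
    print(f"{name}: {size}")
print(n_classes, floor_total, round(mean_per_class, 1))
```

The group sizes add up to 51, and the per-category minimum implies at least 51 × 101 = 5,151 clips, which is consistent with the stated total of 6,849 (about 134 clips per category on average).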