
hbfreed/Picklebot-50K

Hugging Face, updated 2024-02-22, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/hbfreed/Picklebot-50K
Resource description:
---
license: mit
task_categories:
- video-classification
tags:
- baseball
- sports
- video-classification
- computer-vision
size_categories:
- 10K<n<100K
---

# Dataset Card for Picklebot50k

50 thousand video clips of balls and strikes from MLB games from the 2016 season through the 2022 season.

![Example Clip](example.gif)

## Dataset Details

### Dataset Description

The dataset consists of roughly 50 thousand video clips of balls and strikes in .mp4 format, resized to 224x224 resolution. The calculated standard deviation and mean for the dataset are std: (0.2104, 0.1986, 0.1829) mean: (0.3939, 0.3817, 0.3314).

- **Curated by:** Henry Freed
- **License:** MIT

### Dataset Sources

- **Repository:** The original project that this dataset was compiled for can be found here on [github](https://github.com/hbfreed/Picklebot).
- **Demo:** The demo for a neural net trained on this dataset can be found here on [huggingface spaces](https://huggingface.co/spaces/hbfreed/picklebot_demo).

## Uses

The dataset was originally collected to call balls and strikes using neural networks. There are many other potential use cases, but they would almost certainly require relabeling. For more videos and more complete information about each pitch, see [Picklebot-2M](https://huggingface.co/datasets/hbfreed/Picklebot-2M).

## Dataset Structure

The dataset is structured as .tar files of the train, val, and test splits. The labels are contained in .csv files.
The .csvs are structured as follows: `"filename.mp4",label`, where the label is 0 for balls and 1 for strikes.

### Source Data

The source data were scraped from Baseball Savant's [Statcast Search](https://baseballsavant.mlb.com/statcast_search). It's a pretty powerful search page, and a lot of fun to play around with.

#### Data Collection and Processing

After downloading the videos, they were cropped from 1280x720 at 60fps to the middle 600x600 pixels at 60fps. Finally, they were downsampled to 224x224 resolution at 15 fps (this can all be done using one ffmpeg command). Some of the longer clips where there was a lot of noise (shots of the crowd, instant replays, etc.) were trimmed (mostly by hand) down to a more manageable length.

#### Who are the source data producers?

[Baseball Savant](https://baseballsavant.mlb.com/) and MLB/the broadcasters (whoever it is) originally created the videos.

## Bias, Risks, and Limitations

It's important to note that only balls and called strikes were collected. No swinging strikes, foul balls, hit by pitches, or anything else are included in the dataset. Additionally, most pitchers and batters are right handed, and nothing was done to try and balance that in this dataset.
Provider:
hbfreed
Summary of original information

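Dataset Card for Picklebot50k

Dataset Overview

50 thousand video clips of balls and strikes from MLB games (2016 season through 2022 season).

Dataset Details

Dataset Description

The dataset contains roughly 50 thousand video clips of balls and strikes in .mp4 format, resized to 224x224 resolution.

  • Standard deviation and mean:

    • Standard deviation: (0.2104, 0.1986, 0.1829)
    • Mean: (0.3939, 0.3817, 0.3314)
  • Curated by: Henry Freed

  • License: MIT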
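
Per-channel statistics like these are typically used to normalize frames before they are fed to a network. The following is a minimal sketch of that arithmetic in plain Python. It assumes the statistics were computed on RGB pixel values scaled to [0, 1], which the dataset card does not state explicitly. The names `MEAN`, `STD`, and `normalize_pixel` are illustrative and not part of the dataset.

```python
# Per-channel statistics quoted in the dataset card.
# Assumption: channel order is (R, G, B) and values refer to pixels in [0, 1].
MEAN = (0.3939, 0.3817, 0.3314)
STD = (0.2104, 0.1986, 0.1829)


def normalize_pixel(rgb):
    """Map one (R, G, B) pixel with 0-255 values to normalized floats.

    Each channel is scaled to [0, 1], then shifted by the channel mean
    and divided by the channel standard deviation.
    """
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, MEAN, STD))
```

In practice the same formula would usually be applied to whole frames at once, for example with an array library or with `torchvision.transforms.Normalize(mean=MEAN, std=STD)` on tensors already scaled to [0, 1].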

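Dataset Sources

The data were sourced from Baseball Savant's Statcast Search.

Data Collection and Processing

The videos were cropped from 1280x720 at 60 fps to the middle 600x600 pixels at 60 fps, then downsampled to 224x224 resolution at 15 fps. Some of the longer clips containing a lot of noise (crowd shots, instant replays, etc.) were trimmed, mostly by hand, to a more manageable length.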
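
The dataset card notes that the crop, resize, and frame-rate change can be done with one ffmpeg command but does not give it. The sketch below builds one plausible command. The filter chain is an assumption and not the author's actual invocation, and `build_ffmpeg_command` and the file paths are hypothetical. It relies on ffmpeg's `crop` filter centering the crop window when no offsets are given, which matches "the middle 600x600 pixels".

```python
def build_ffmpeg_command(src, dst):
    """Return an ffmpeg argument list for the preprocessing described above.

    Assumed filter chain (not taken from the dataset card):
      crop=600:600   -> centered 600x600 window (crop centers by default)
      fps=15         -> resample from 60 fps to 15 fps
      scale=224:224  -> resize to 224x224
    """
    filters = ",".join(["crop=600:600", "fps=15", "scale=224:224"])
    return ["ffmpeg", "-i", src, "-vf", filters, dst]
```

The returned list can be passed to `subprocess.run(build_ffmpeg_command("in.mp4", "out.mp4"), check=True)` on a machine with ffmpeg installed. Building the command as a list avoids shell quoting issues with file names.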

Source Data Producers

Baseball Savant and MLB/the broadcasters (exactly which is not specified).

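Uses

The dataset was originally collected to call balls and strikes using neural networks. Other potential uses would likely require relabeling.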

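Dataset Structure

The dataset is split into train, validation, and test sets stored as .tar files, with the labels in .csv files. Each .csv row has the following structure:

"filename.mp4",label

where a label of 0 means a ball and 1 means a strike.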
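
The row format above can be read with Python's standard `csv` module. The following is a minimal sketch that assumes the files have no header row, which the dataset card does not specify. `parse_labels` and `LABEL_NAMES` are illustrative names.

```python
import csv

# Label encoding stated in the dataset card.
LABEL_NAMES = {0: "ball", 1: "strike"}


def parse_labels(lines):
    """Parse rows of the form "filename.mp4",label into {filename: label}.

    `lines` is any iterable of text lines, such as an open file.
    Assumption: there is no header row. Blank lines are skipped.
    """
    labels = {}
    for row in csv.reader(lines):
        if not row:
            continue
        filename, label = row[0], int(row[1])
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label {label!r} for {filename!r}")
        labels[filename] = label
    return labels
```

For a file on disk, open it with `newline=""` as the `csv` documentation recommends and pass the file object directly, for example `parse_labels(open(path, newline=""))`, where `path` points at one of the extracted label files.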

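Bias, Risks, and Limitations

The dataset contains only balls and called strikes; swinging strikes, foul balls, hit-by-pitches, and other outcomes are not included. In addition, most pitchers and batters are right-handed, and no attempt was made to balance this.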
