
VLM-Video-Understanding

Source: ModelScope community; updated 2025-12-03; indexed 2025-06-07
Download link:
https://modelscope.cn/datasets/prithivMLmods/VLM-Video-Understanding
Resource description:
# **VLM-Video-Understanding**

> A minimalistic demo for image inference and video understanding using OpenCV, built on top of several popular open-source Vision-Language Models (VLMs). This repository provides Colab notebooks demonstrating how to apply these VLMs to video and image tasks using Python and Gradio.

## Overview

This project showcases lightweight inference pipelines for the following:

- Video frame extraction and preprocessing
- Image-level inference with VLMs
- Real-time or pre-recorded video understanding
- OCR-based text extraction from video frames

## Models Included

The repository supports a variety of open-source models and configurations, including:

- Aya-Vision-8B
- Florence-2-Base
- Gemma3-VL
- MiMo-VL-7B-RL
- MiMo-VL-7B-SFT
- Qwen2-VL
- Qwen2.5-VL
- Qwen-2VL-MessyOCR
- RolmOCR-Qwen2.5-VL
- olmOCR-Qwen2-VL
- typhoon-ocr-7b-Qwen2.5VL

Each model has a dedicated Colab notebook to help users understand how to use it with video inputs.

## Technologies Used

- **Python**
- **OpenCV** – for video and image processing
- **Gradio** – for interactive UI
- **Jupyter Notebooks** – for easy experimentation
- **Hugging Face Transformers** – for loading VLMs

## Folder Structure

```
├── Aya-Vision-8B/
├── Florence-2-Base/
├── Gemma3-VL/
├── MiMo-VL-7B-RL/
├── MiMo-VL-7B-SFT/
├── Qwen2-VL/
├── Qwen2.5-VL/
├── Qwen-2VL-MessyOCR/
├── RolmOCR-Qwen2.5-VL/
├── olmOCR-Qwen2-VL/
├── typhoon-ocr-7b-Qwen2.5VL/
├── LICENSE
└── README.md
```

## Getting Started

1. Clone the repository:

   ```bash
   git clone https://github.com/PRITHIVSAKTHIUR/VLM-Video-Understanding.git
   cd VLM-Video-Understanding
   ```

2. Open any of the Colab notebooks and follow the instructions to run image or video inference.

3. Optionally, install dependencies locally:

   ```bash
   pip install opencv-python gradio transformers
   ```

## Hugging Face Dataset

The models and examples are supported by a dataset on Hugging Face: [VLM-Video-Understanding](https://huggingface.co/datasets/prithivMLmods/VLM-Video-Understanding)

## License

This project is licensed under the Apache-2.0 License.
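
The "video frame extraction and preprocessing" step listed in the Overview could look roughly like the sketch below. The notebooks' actual preprocessing is not reproduced here; this is a minimal illustration that assumes uniform sampling of a fixed number of frames and conversion from OpenCV's BGR order to RGB. The helper names `sample_frame_indices` and `extract_frames`, and the default of 8 frames, are hypothetical and not taken from the repository.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return up to `num_samples` evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the midpoint of each equal-width segment of the video.
    return [int(step * i + step / 2) for i in range(num_samples)]


def extract_frames(video_path: str, num_samples: int = 8) -> list:
    """Read uniformly sampled frames from a video as RGB arrays (assumed workflow)."""
    import cv2  # imported lazily so the index helper works without OpenCV

    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Could not open video: {video_path}")
    try:
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in sample_frame_indices(total, num_samples):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the target frame
            ok, frame = cap.read()
            if not ok:
                continue  # skip frames the decoder could not read
            # OpenCV decodes to BGR; most VLM processors expect RGB.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        return frames
    finally:
        cap.release()
```

The returned list of RGB arrays could then be passed to whichever model processor a given notebook uses. Keeping the index computation separate from the decoding makes the sampling policy easy to change, for example to sample by timestamp instead of by frame count.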

Provider: maas
Created: 2025-05-31
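Background overview:
VLM-Video-Understanding is a demo project that shows how to use several open-source vision-language models for video understanding and image inference. It supports real-time or pre-recorded video processing and includes OCR text extraction.
The summary above was collected and generated by 遇见数据集.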