VLM-Video-Understanding

Name: VLM-Video-Understanding
Creator: maas
Published: 2025-12-03 17:17:25
License: No description available

ModelScope Community. Updated 2025-12-03; indexed 2025-06-07.

Download link:

https://modelscope.cn/datasets/prithivMLmods/VLM-Video-Understanding

Resource description:

# **VLM-Video-Understanding**

> A minimalistic demo for image inference and video understanding using OpenCV, built on top of several popular open-source Vision-Language Models (VLMs). This repository provides Colab notebooks demonstrating how to apply these VLMs to video and image tasks using Python and Gradio.

## Overview

This project showcases lightweight inference pipelines for the following:

- Video frame extraction and preprocessing
- Image-level inference with VLMs
- Real-time or pre-recorded video understanding
- OCR-based text extraction from video frames

## Models Included

The repository supports a variety of open-source models and configurations, including:

- Aya-Vision-8B
- Florence-2-Base
- Gemma3-VL
- MiMo-VL-7B-RL
- MiMo-VL-7B-SFT
- Qwen2-VL
- Qwen2.5-VL
- Qwen-2VL-MessyOCR
- RolmOCR-Qwen2.5-VL
- olmOCR-Qwen2-VL
- typhoon-ocr-7b-Qwen2.5VL

Each model has a dedicated Colab notebook to help users understand how to use it with video inputs.

## Technologies Used

- **Python**
- **OpenCV** – for video and image processing
- **Gradio** – for interactive UI
- **Jupyter Notebooks** – for easy experimentation
- **Hugging Face Transformers** – for loading VLMs

## Folder Structure

```
├── Aya-Vision-8B/
├── Florence-2-Base/
├── Gemma3-VL/
├── MiMo-VL-7B-RL/
├── MiMo-VL-7B-SFT/
├── Qwen2-VL/
├── Qwen2.5-VL/
├── Qwen-2VL-MessyOCR/
├── RolmOCR-Qwen2.5-VL/
├── olmOCR-Qwen2-VL/
├── typhoon-ocr-7b-Qwen2.5VL/
├── LICENSE
└── README.md
```

## Getting Started

1. Clone the repository:

   ```bash
   git clone https://github.com/PRITHIVSAKTHIUR/VLM-Video-Understanding.git
   cd VLM-Video-Understanding
   ```

2. Open any of the Colab notebooks and follow the instructions to run image or video inference.

3. Optionally, install dependencies locally:

   ```bash
   pip install opencv-python gradio transformers
   ```

## Hugging Face Dataset

The models and examples are supported by a dataset on Hugging Face: [VLM-Video-Understanding](https://huggingface.co/datasets/prithivMLmods/VLM-Video-Understanding)

## License

This project is licensed under the Apache-2.0 License.
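
## Example: Sampling Frames with OpenCV

As an illustration of the "Video frame extraction and preprocessing" item in the Overview, the following is a minimal sketch of one common approach: picking a fixed number of evenly spaced frames from a video and converting them to RGB before handing them to a model. It is not taken from the repository's notebooks. The function names `sample_frame_indices` and `extract_frames`, the default of 8 frames, and the uniform sampling strategy are all assumptions made for this example; the individual notebooks may sample frames differently.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return up to `num_samples` evenly spaced frame indices in [0, total_frames).

    Hypothetical helper for illustration; not part of the repository.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Split the video into `num_samples` equal segments and take each midpoint.
    return [int(step * i + step / 2) for i in range(num_samples)]


def extract_frames(video_path: str, num_samples: int = 8):
    """Read evenly spaced frames from a video file as RGB arrays.

    Hypothetical helper for illustration; not part of the repository.
    """
    import cv2  # imported lazily so the index helper works without OpenCV

    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Could not open video: {video_path}")
    try:
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for idx in sample_frame_indices(total, num_samples):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the target frame
            ok, frame = cap.read()
            if not ok:
                continue  # reported frame counts can be inaccurate; skip misses
            # OpenCV decodes to BGR; most vision models expect RGB.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        return frames
    finally:
        cap.release()
```

Calling `extract_frames("clip.mp4", num_samples=8)` on a local file (the file name is a placeholder) would return a list of NumPy arrays that can be wrapped as PIL images or passed to a model's processor. Sampling segment midpoints rather than segment starts avoids always selecting the very first frame, which is often a black or title frame.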

Provider:

maas

Created:

2025-05-31
