VLM-Video-Understanding
收藏魔搭社区2025-12-03 更新2025-06-07 收录
下载链接:
https://modelscope.cn/datasets/prithivMLmods/VLM-Video-Understanding
下载链接
链接失效反馈官方服务:
资源简介:
# **VLM-Video-Understanding**
> A minimalistic demo for image inference and video understanding using OpenCV, built on top of several popular open-source Vision-Language Models (VLMs). This repository provides Colab notebooks demonstrating how to apply these VLMs to video and image tasks using Python and Gradio.
## Overview
This project showcases lightweight inference pipelines for the following:
- Video frame extraction and preprocessing
- Image-level inference with VLMs
- Real-time or pre-recorded video understanding
- OCR-based text extraction from video frames
## Models Included
The repository supports a variety of open-source models and configurations, including:
- Aya-Vision-8B
- Florence-2-Base
- Gemma3-VL
- MiMo-VL-7B-RL
- MiMo-VL-7B-SFT
- Qwen2-VL
- Qwen2.5-VL
- Qwen-2VL-MessyOCR
- RolmOCR-Qwen2.5-VL
- olmOCR-Qwen2-VL
- typhoon-ocr-7b-Qwen2.5VL
Each model has a dedicated Colab notebook to help users understand how to use it with video inputs.
## Technologies Used
- **Python**
- **OpenCV** – for video and image processing
- **Gradio** – for interactive UI
- **Jupyter Notebooks** – for easy experimentation
- **Hugging Face Transformers** – for loading VLMs
## Folder Structure
```
├── Aya-Vision-8B/
├── Florence-2-Base/
├── Gemma3-VL/
├── MiMo-VL-7B-RL/
├── MiMo-VL-7B-SFT/
├── Qwen2-VL/
├── Qwen2.5-VL/
├── Qwen-2VL-MessyOCR/
├── RolmOCR-Qwen2.5-VL/
├── olmOCR-Qwen2-VL/
├── typhoon-ocr-7b-Qwen2.5VL/
├── LICENSE
└── README.md
````
## Getting Started
1. Clone the repository:
```bash
git clone https://github.com/PRITHIVSAKTHIUR/VLM-Video-Understanding.git
cd VLM-Video-Understanding
````
2. Open any of the Colab notebooks and follow the instructions to run image or video inference.
3. Optionally, install dependencies locally:
```bash
pip install opencv-python gradio transformers
```
## Hugging Face Dataset
The models and examples are supported by a dataset on Hugging Face:
[VLM-Video-Understanding](https://huggingface.co/datasets/prithivMLmods/VLM-Video-Understanding)
## License
This project is licensed under the Apache-2.0 License.
# **VLM视频理解(VLM-Video-Understanding)**
> 本项目是一款轻量化演示工具,依托多款主流开源视觉语言模型(Vision-Language Models,VLMs),结合OpenCV实现图像推理与视频理解任务。本仓库提供Colab笔记本教程,演示如何通过Python与Gradio将这些VLMs应用于视频与图像相关任务。
## 项目概览
本项目展示了轻量化推理流水线,可实现以下功能:
- 视频帧提取与预处理
- 基于视觉语言模型的图像级推理
- 实时或预录视频理解
- 从视频帧中基于光学字符识别(Optical Character Recognition,OCR)提取文本
## 支持模型
本仓库支持多款开源模型及其配置方案,具体包括:
- Aya-Vision-8B
- Florence-2-Base
- Gemma3-VL
- MiMo-VL-7B-RL
- MiMo-VL-7B-SFT
- Qwen2-VL
- Qwen2.5-VL
- Qwen-2VL-MessyOCR
- RolmOCR-Qwen2.5-VL
- olmOCR-Qwen2-VL
- typhoon-ocr-7b-Qwen2.5VL
每个模型均配有专属Colab笔记本,帮助用户掌握如何将其应用于视频输入场景。
## 技术栈
- **Python**
- **OpenCV**:用于视频与图像处理
- **Gradio**:用于构建交互式用户界面
- **Jupyter Notebooks**:便于快速实验验证
- **Hugging Face Transformers**:用于加载视觉语言模型
## 目录结构
├── Aya-Vision-8B/
├── Florence-2-Base/
├── Gemma3-VL/
├── MiMo-VL-7B-RL/
├── MiMo-VL-7B-SFT/
├── Qwen2-VL/
├── Qwen2.5-VL/
├── Qwen-2VL-MessyOCR/
├── RolmOCR-Qwen2.5-VL/
├── olmOCR-Qwen2-VL/
├── typhoon-ocr-7b-Qwen2.5VL/
├── LICENSE
└── README.md
## 快速开始
1. 克隆本仓库:
bash
git clone https://github.com/PRITHIVSAKTHIUR/VLM-Video-Understanding.git
cd VLM-Video-Understanding
2. 打开任意Colab笔记本,按照指引运行图像或视频推理任务。
3. (可选)本地安装依赖项:
bash
pip install opencv-python gradio transformers
## Hugging Face 数据集
本项目的模型与示例依托Hugging Face平台上的数据集:
[VLM视频理解(VLM-Video-Understanding)](https://huggingface.co/datasets/prithivMLmods/VLM-Video-Understanding)
## 许可证
本项目采用Apache-2.0许可证进行授权。
提供机构:
maas
创建时间:
2025-05-31
搜集汇总
数据集介绍

背景与挑战
背景概述
VLM-Video-Understanding是一个演示项目,展示了如何使用多种开源视觉语言模型进行视频理解和图像推理,支持实时或预录视频处理,并包含OCR文本提取功能。
以上内容由遇见数据集搜集并总结生成



