
MMPR-Tiny

ModelScope Community | Updated: 2026-01-06 | Indexed: 2025-08-30
Download link:
https://modelscope.cn/datasets/OpenGVLab/MMPR-Tiny
Resource overview:
# MMPR-Tiny

***This is the training data used during the online RL stage of InternVL3.5, which greatly improves the overall performance of [InternVL3.5](https://huggingface.co/papers/2508.18265) across all scales. Our [training code](https://github.com/Weiyun1025/verl-internvl) is also open-sourced.***

Based on [MMPR-v1.2](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.2), we compute the accuracy of each query using the provided rollouts and select the queries whose accuracy falls between 0.2 and 0.8 for online RL. We further extend the dataset with recent multimodal datasets to enhance diversity. Please refer to [our paper](https://huggingface.co/papers/2508.18265) for more details about this dataset.

Using this training data, the reasoning abilities of InternVL3.5 are significantly enhanced across all model scales. Notably, [InternVL3.5-MPO](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B-MPO) is initialized from [InternVL3.5-Instruct](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B-Instruct) and fine-tuned with [MPO](https://arxiv.org/abs/2411.10442) on [MMPR-v1.2](https://huggingface.co/datasets/OpenGVLab/MMPR-v1.2), whereas [InternVL3.5-CascadeRL](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B) is initialized from InternVL3.5-MPO and further fine-tuned with [GSPO](https://arxiv.org/abs/2507.18071) on [MMPR-Tiny](https://huggingface.co/datasets/OpenGVLab/MMPR-Tiny).
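The accuracy-based selection described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the rollout representation (a list of per-rollout correctness flags keyed by a query id), the function names, and the choice of inclusive bounds are all assumptions made for illustration. Only the 0.2 and 0.8 thresholds come from the text.

```python
from typing import Dict, List

# Thresholds quoted in the dataset description. Whether the bounds are
# inclusive is an assumption of this sketch.
MIN_ACC = 0.2
MAX_ACC = 0.8


def query_accuracy(rollout_correct: List[bool]) -> float:
    """Fraction of rollouts judged correct for a single query."""
    if not rollout_correct:
        raise ValueError("a query needs at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


def select_for_online_rl(rollouts: Dict[str, List[bool]],
                         min_acc: float = MIN_ACC,
                         max_acc: float = MAX_ACC) -> List[str]:
    """Return the query ids whose accuracy lies in [min_acc, max_acc]."""
    return [qid for qid, flags in rollouts.items()
            if min_acc <= query_accuracy(flags) <= max_acc]


if __name__ == "__main__":
    # Hypothetical toy data: one always-solved, one mixed, one never-solved query.
    toy = {
        "q_easy": [True, True, True, True],
        "q_medium": [True, False, True, False],
        "q_hard": [False, False, False, False],
    }
    print(select_for_online_rl(toy))
```

A filter of this shape drops queries the model always solves or never solves, so the retained queries are those on which rollouts disagree.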

![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/ablation_cascade_rl.jpg)

![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/ablation_cascade_rl_table.jpg)

## Resources

* **Paper:** [InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency](https://huggingface.co/papers/2508.18265)
* **Main Project GitHub:** [OpenGVLab/InternVL](https://github.com/OpenGVLab/InternVL)
* **Training Code GitHub (for MMPR-Tiny):** [Weiyun1025/verl-internvl](https://github.com/Weiyun1025/verl-internvl)
* **Project Page / Chat Demo:** [https://chat.intern-ai.org.cn/](https://chat.intern-ai.org.cn/)
* **InternVL Blog:** [https://internvl.github.io/blog/](https://internvl.github.io/blog/)
* **MPO Paper:** [Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization](https://arxiv.org/abs/2411.10442)
* **Documents:** [InternVL Documentation](https://internvl.readthedocs.io/en/latest/internvl3.0/preference_optimization.html)

## Sample Usage

The MMPR-Tiny dataset is designed for training advanced multimodal models. The following Python snippet, adapted from the [InternVL GitHub repository](https://github.com/OpenGVLab/InternVL), demonstrates how to perform a single-image, single-round conversation using an `InternVL` model (such as `InternVL3_5-8B`), which benefits from training with datasets like MMPR-Tiny.

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1) for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


# Load model and tokenizer (example model from InternVL family)
path = 'OpenGVLab/InternVL3_5-8B'  # Replace with a model trained with this data
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Prepare image for demonstration
# You will need an image file, e.g., 'examples/image1.jpg'.
# For a quick test, you can create a dummy image:
# `from PIL import Image; Image.new('RGB', (1024, 1024), color = 'red').save('examples/image1.jpg')`
# Or download an example:
# `!mkdir -p examples && wget -O examples/image1.jpg https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/ablation_cascade_rl.jpg`

try:
    pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
    generation_config = dict(max_new_tokens=1024, do_sample=False)

    # Single-image, single-round conversation
    question = '<image> Please describe the image shortly.'
    response = model.chat(tokenizer, pixel_values, question, generation_config)
    print(f'User: {question} Assistant: {response}')
except FileNotFoundError:
    print("Example image not found. Please ensure 'examples/image1.jpg' exists or replace with your image path.")
    print("You can create a dummy image or download one as suggested in the comments above.")
except Exception as e:
    print(f"An error occurred during sample usage: {e}")
```

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{wang2025internvl3_5,
  title={InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency},
  author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
  journal={arXiv preprint arXiv:2508.18265},
  year={2025}
}
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
}
```

## License

This project is released under the [MIT license](LICENSE). Parts of this project contain code and models from other sources, which are subject to their respective licenses.

Provider: maas
Created: 2025-08-29