multimodal-open-r1-8192-filtered-mid-ic
收藏魔搭社区2025-08-15 更新2025-08-02 收录
下载链接:
https://modelscope.cn/datasets/oumi-ai/multimodal-open-r1-8192-filtered-mid-ic
下载链接
链接失效反馈官方服务:
资源简介:
# multimodal-open-r1-8192-filtered-mid-ic
Original dataset structure preserved, filtered by token length and image quality
## Dataset Description
This dataset was processed using the [data-preproc](https://github.com/oumi-ai/ml-preproc) package for vision-language model training.
### Processing Configuration
- **Base Model**: Qwen/Qwen2.5-7B-Instruct
- **Tokenizer**: Qwen/Qwen2.5-7B-Instruct
- **Sequence Length**: 16384
- **Processing Type**: Vision Language (VL)
### Dataset Features
- **input_ids**: Tokenized input sequences
- **attention_mask**: Attention masks for the sequences
- **labels**: Labels for language modeling
- **images**: PIL Image objects
- **messages**: Original conversation messages
- **metadata**: Processing metadata
### Processing Statistics
- **Original Samples**: 2085
- **Processed Samples**: 2085
- **Success Rate**: 100.0%
- **Average Token Length**: N/A
- **Max Token Length**: N/A
- **Truncation Rate**: N/A
### Usage
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("your-org/your-dataset-name")
# Access samples
sample = dataset["train"][0]
print(f"Input tokens: {len(sample['input_ids'])}")
print(f"Images: {len(sample['images'])}")
print(f"Messages: {sample['messages']}")
```
## License
This dataset is released under the specified license. Please check the license field for details.
# multimodal-open-r1-8192-filtered-mid-ic
保留原始数据集结构,基于Token长度与图像质量完成筛选。
## 数据集说明
本数据集使用[data-preproc](https://github.com/oumi-ai/ml-preproc)工具包进行处理,用于视觉语言模型(Vision-Language Model, VLM)训练。
### 处理配置
- **基础模型**:Qwen/Qwen2.5-7B-Instruct
- **分词器**:Qwen/Qwen2.5-7B-Instruct
- **序列长度**:16384
- **处理类型**:视觉语言(Vision Language, VL)
### 数据集特征
- **input_ids**:经过分词的输入序列
- **attention_mask**:序列对应的注意力掩码
- **labels**:语言建模任务的标签
- **images**:PIL图像对象
- **messages**:原始对话消息内容
- **metadata**:处理过程元数据
### 处理统计数据
- **原始样本量**:2085
- **处理后样本量**:2085
- **处理成功率**:100.0%
- **平均Token长度**:无(N/A)
- **最大Token长度**:无(N/A)
- **截断率**:无(N/A)
### 使用方法
python
from datasets import load_dataset
# 加载数据集
dataset = load_dataset("your-org/your-dataset-name")
# 访问样本
sample = dataset["train"][0]
print(f"输入Token数量: {len(sample['input_ids'])}")
print(f"图像数量: {len(sample['images'])}")
print(f"对话消息: {sample['messages']}")
## 授权协议
本数据集已按指定许可协议发布,详情请查阅数据集的license字段。
提供机构:
maas
创建时间:
2025-07-31



