Rexverse-2M
Source: ModelScope community. Updated 2025-12-05; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/IDEA-Research/Rexverse-2M
# 2. Parse the Rexverse-2M Dataset
We release 500K samples with detailed region captions and referring-style region captions. After cloning the repository, first merge the image parts into a single TSV file with the following command:
```bash
cat 0_500000_image_part_0.tsv 0_500000_image_part_1.tsv 0_500000_image_part_2.tsv 0_500000_image_part_3.tsv 0_500000_image_part_4.tsv > 0_500000_image_merged.tsv
```
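Where `cat` is unavailable (for example on Windows), the same merge can be sketched in Python. This is a minimal sketch, assuming the part files are plain byte-wise splits that must be concatenated in index order; `merge_parts` is a hypothetical helper name and is not part of the repository.

```python
import shutil
from pathlib import Path


def merge_parts(part_paths, merged_path):
    """Concatenate files byte-for-byte, in the order given (like `cat a b > c`)."""
    with open(merged_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                # Stream in chunks so large parts never need to fit in memory.
                shutil.copyfileobj(src, out, length=16 * 1024 * 1024)
    return Path(merged_path)
```

For the files above, the call would be `merge_parts([f"0_500000_image_part_{i}.tsv" for i in range(5)], "0_500000_image_merged.tsv")`. The order of the parts matters, because the annotation rows refer to positions inside the merged file.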
## 2.1 Dataset Structure
The dataset is stored in TSV format and split into two parts: images and annotations. The following example code parses it:
```python
import json
from base64 import b64decode
from io import BytesIO

from PIL import Image
from torch.utils.data import Dataset
from tqdm import tqdm


class TSVBase(Dataset):
    """Base class for TSV datasets. Loads images and annotations from TSV files.

    Args:
        img_tsv_file (str): The path to the image TSV file.
        ann_tsv_file (str): The path to the annotation TSV file.
        ann_lineidx_file (str): The path to the annotation lineidx file.
    """

    def __init__(
        self,
        img_tsv_file: str,
        ann_tsv_file: str,
        ann_lineidx_file: str,
    ):
        # Each line of the lineidx file is the file offset of one annotation row.
        self.data = []
        with open(ann_lineidx_file) as f:
            for line in tqdm(f):
                self.data.append(int(line.strip()))

        # File handles are opened lazily on first access in __getitem__.
        self.img_handle = None
        self.ann_handle = None
        self.img_tsv_file = img_tsv_file
        self.ann_tsv_file = ann_tsv_file

        self.preparer = None
        self.captionbuilder = None
        self._transforms = None

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        ann_line_idx = self.data[idx]
        if self.ann_handle is None:
            self.ann_handle = open(self.ann_tsv_file)
        self.ann_handle.seek(ann_line_idx)
        # Annotation row: <offset of the image row>\t<JSON annotation>
        img_line_idx, ann = self.ann_handle.readline().strip().split("\t")

        img_line_idx = int(img_line_idx)
        if self.img_handle is None:
            self.img_handle = open(self.img_tsv_file)
        self.img_handle.seek(img_line_idx)
        # Image row: the second column holds the base64-encoded image.
        img = self.img_handle.readline().strip().split("\t")[1]
        if img.startswith("b'"):
            # Strip the b'...' wrapper left by str(bytes).
            img = img[2:-1]
        img = BytesIO(b64decode(img))
        img = Image.open(img).convert("RGB")
        target = json.loads(ann)
        return img, target
```
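The `.lineidx` file is what makes random access cheap: each entry is a file offset that `seek` jumps to directly, so no earlier row has to be read. The mechanism can be sketched with the standard library alone. `build_lineidx` and `read_row` are hypothetical helpers for illustration and are not part of the released code; the shipped `.lineidx` files should be used as they are rather than regenerated.

```python
def build_lineidx(tsv_path):
    """Return the byte offset at which each row of a TSV file starts."""
    offsets = []
    with open(tsv_path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break  # End of file: no more rows.
            offsets.append(pos)
    return offsets


def read_row(tsv_path, offset):
    """Seek to a row's offset and return its tab-separated columns."""
    with open(tsv_path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8").rstrip("\n").split("\t")
```

Opening the file in binary mode keeps the offsets unambiguous byte positions, independent of platform newline handling.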
## 2.2 Visualize the Dataset
We provide a script, `visualize_dataset.py`, to visualize the dataset. Run the following commands to render 200 images each for the referring-style and one-sentence annotations:
```bash
python visualize_dataset.py --img_tsv_file 0_500000_image_merged.tsv --ann_tsv_file 0_500000_referring.annotations.tsv --ann_lineidx_file 0_500000_referring.annotations.tsv.lineidx --vis_path visualize_referring --num_images 200
python visualize_dataset.py --img_tsv_file 0_500000_image_merged.tsv --ann_tsv_file 0_500000_one_sentence.annotations.tsv --ann_lineidx_file 0_500000_one_sentence.annotations.tsv.lineidx --vis_path visualize_one_sentence --num_images 200
```
# 3. LICENSE
Rexverse-2M is licensed under the IDEA License 1.0, Copyright (c) IDEA. All Rights Reserved. Note that this project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to:
- The [OpenAI Terms of Use](https://openai.com/policies/terms-of-use) for the dataset.
- The LLM used in this project is [lmsys/vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main), which is licensed under the [Llama 2 Community License Agreement](https://huggingface.co/lmsys/vicuna-7b-v1.5).
- The high-resolution vision encoder is [laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg](https://huggingface.co/laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg), which is licensed under the [MIT LICENSE](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md).
- The low-resolution vision encoder is [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14), which is licensed under the [MIT LICENSE](https://github.com/openai/CLIP/blob/main/LICENSE).
# BibTeX 📚
```
@misc{jiang2024chatrextamingmultimodalllm,
title={ChatRex: Taming Multimodal LLM for Joint Perception and Understanding},
author={Qing Jiang and Gen Luo and Yuqin Yang and Yuda Xiong and Yihao Chen and Zhaoyang Zeng and Tianhe Ren and Lei Zhang},
year={2024},
eprint={2411.18363},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.18363},
}
```
Provider: maas
Created: 2025-10-20



