cosmos-imagenet

ModelScope Community · Updated 2025-12-05 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/fal/cosmos-imagenet
Description:
# Tiny Cosmos-Tokenized Imagenet

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6311151c64939fabc00c8436/2Wrz6bzvwIHVATbtYAujs.png" alt="small" width="800">
</p>

In a similar fashion to [Simo's Imagenet.int8](https://github.com/cloneofsimo/imagenet.int8), here we provide [Cosmos-tokenized](https://github.com/NVIDIA/Cosmos-Tokenizer) ImageNet for rapid prototyping. Notably, the discrete tokenizer is able to compress the entire ImageNet into a **shocking 2.45 GB of data!**

# How to use

This time, we dumped everything into the simple PyTorch safetensors format.

```python
import torch
import torch.nn as nn
from safetensors.torch import safe_open

# for continuous tokenizer
with safe_open("tokenize_dataset/imagenet_ci8x8.safetensors", framework="pt") as f:
    # Stored latents are int8; multiply by 16.0 / 255.0 to undo the normalization
    data = f.get_tensor("latents") * 16.0 / 255.0
    labels = f.get_tensor("labels")

print(data.shape)  # 1281167, 16, 32, 32
print(labels.shape)  # 1281167
```

To decode, you need to install the Cosmos tokenizer.

```bash
git clone https://github.com/NVIDIA/Cosmos-Tokenizer.git
cd Cosmos-Tokenizer
apt-get install -y ffmpeg
pip install -e .
```

Then decode using either `"Cosmos-Tokenizer-CI8x8"` (continuous) or `"Cosmos-Tokenizer-DI8x8"` (discrete).

**IMPORTANT**

* For continuous tokens, the latents are quantized and normalized to int8 format. Thus, you need to multiply them by 16.0 / 255.0.
* For discrete tokens, the saved format is int16. To use them properly, convert them to uint16.
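Example below:

```python
import torch
from PIL import Image
from safetensors.torch import safe_open
from cosmos_tokenizer.image_lib import ImageTokenizer  # from the Cosmos-Tokenizer repo

# Assumed setup (not in the original snippet): choose the tokenizer type and device.
is_continuous = True  # set to False for the discrete tokenizer
device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "Cosmos-Tokenizer-CI8x8" if is_continuous else "Cosmos-Tokenizer-DI8x8"
decoder = ImageTokenizer(
    checkpoint_dec=f"pretrained_ckpts/{model_name}/decoder.jit"
).to(device)

# Path as given in the original; the file name for the discrete tokens is not stated.
with safe_open("imagenet_ci8x8.safetensors", framework="pt") as f:
    if is_continuous:
        data = f.get_tensor("latents").to(torch.bfloat16) * 16.0 / 255.0
    else:
        data = f.get_tensor("indices").to(torch.uint16)
    labels = f.get_tensor("labels")

# Keep only the first sample
data = data[:1]

if is_continuous:
    data = data.reshape(1, 16, 32, 32).to(device)
else:
    # For discrete tokenizer, reshape to [1, 32, 32]
    data = data.reshape(1, 32, 32).to(device).long()

# Decode the image
with torch.no_grad():
    reconstructed = decoder.decode(data)

# Map the decoder output from [-1, 1] to 8-bit pixel values, then to HWC layout
img = ((reconstructed[0].cpu().float() + 1) * 127.5).clamp(0, 255).to(torch.uint8)
img = img.permute(1, 2, 0).numpy()
img = Image.fromarray(img)
```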
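The two conversions in the IMPORTANT notes, and the storage figure quoted at the top, can be sanity-checked with plain arithmetic. The sketch below uses only the Python standard library so it runs without PyTorch or the dataset. The helper names (`dequantize_latent`, `int16_to_uint16`, `storage_bytes`) are illustrative and not part of any library, and the per-image layouts (a 32x32 grid of int16 indices, a 16x32x32 block of int8 latents) are assumptions read off the shapes and reshapes shown in the examples.

```python
# Plain-Python check of the storage conventions described in this README.
# All helper names are hypothetical; nothing here touches the real files.

NUM_IMAGES = 1_281_167        # dataset size printed in the loading example
LATENT_SCALE = 16.0 / 255.0   # factor applied to the stored int8 latents


def dequantize_latent(q: int) -> float:
    """Map one stored int8 latent value back to its float value."""
    if not -128 <= q <= 127:
        raise ValueError("int8 values lie in [-128, 127]")
    return q * LATENT_SCALE


def int16_to_uint16(v: int) -> int:
    """Reinterpret a signed 16-bit value as unsigned (two's complement)."""
    if not -32768 <= v <= 32767:
        raise ValueError("int16 values lie in [-32768, 32767]")
    return v & 0xFFFF


def storage_bytes(elements_per_image: int, bytes_per_element: int) -> int:
    """Raw tensor size for the whole dataset, ignoring labels and file headers."""
    return NUM_IMAGES * elements_per_image * bytes_per_element


discrete_bytes = storage_bytes(32 * 32, 2)          # assumed int16 index grid
continuous_bytes = storage_bytes(16 * 32 * 32, 1)   # assumed int8 latents

low, high = dequantize_latent(-128), dequantize_latent(127)
print(f"latent range after scaling: [{low:.3f}, {high:.3f}]")
print(f"int16 -1 read as uint16:    {int16_to_uint16(-1)}")
print(f"discrete tokens:   {discrete_bytes / 2**30:.2f} GiB")
print(f"continuous latents: {continuous_bytes / 2**30:.2f} GiB")
```

Under these assumptions the discrete indices come to about 2.44 GiB, which is close to the 2.45 GB quoted above once labels and file metadata are added. The reinterpretation step shows why the cast matters: any index above 32767 is stored as a negative int16 and would be an invalid token until read back as unsigned.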

Provider:
maas
Created:
2025-07-07