siglip-adaptive-size
Source: ModelScope community (updated 2025-12-03, indexed 2025-12-06)
Download link: https://modelscope.cn/datasets/DaDing777/siglip-adaptive-size
Resource description:
# SigLIP (shape-optimized model)
SigLIP model pre-trained on WebLI at resolution 384x384. It was introduced in the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Zhai et al. and first released in [this repository](https://github.com/google-research/big_vision).
This model has the SoViT-400m architecture, which is the shape-optimized version as presented in [Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design](https://arxiv.org/abs/2305.13035) by Alabdulmohsin et al.
Disclaimer: The team releasing SigLIP did not write a model card for this model, so this model card has been written by the Hugging Face team.
## Model description
SigLIP is a multimodal model in the style of [CLIP](https://huggingface.co/docs/transformers/model_doc/clip), trained with a better loss function. The sigmoid loss operates solely on individual image-text pairs and does not require a global view of the pairwise similarities for normalization. This allows the batch size to be scaled up further, while also performing better at smaller batch sizes.
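To make this concrete, here is a minimal plain-Python sketch of a pairwise sigmoid loss, following the formulation in the SigLIP paper: every image-text pair in the batch is treated as an independent binary classification (label +1 for a matching pair, -1 otherwise). The function name `sigmoid_pair_loss`, the toy embeddings, and the temperature and bias values are illustrative assumptions, not the training code from big_vision.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign of x so that exp() never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pair_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss, averaged over the batch size.

    Pair (i, j) is a positive when i == j and a negative otherwise.
    Each pair contributes independently; no softmax over the batch is needed.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total += -log_sigmoid(label * logit)
    return total / n

# Toy, hand-made unit vectors (hypothetical data, not model outputs).
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = sigmoid_pair_loss(images, matched_texts)
loss_swapped = sigmoid_pair_loss(images, swapped_texts)
print(f"matched: {loss_matched:.4f}, swapped: {loss_swapped:.4f}")
```

Correctly paired embeddings give a much lower loss than swapped ones, and each term depends only on its own pair, which is why no batch-wide normalization is involved.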
A TLDR of SigLIP by one of the authors can be found [here](https://twitter.com/giffmana/status/1692641733459267713).
## Intended uses & limitations
You can use the raw model for tasks like zero-shot image classification and image-text retrieval. See the [model hub](https://huggingface.co/models?search=google/siglip) to look for other versions on a task that interests you.
### How to use
Here is how to use this model to perform zero-shot image classification:
```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

# Load the pretrained model and its matching processor
model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# Download a sample image from COCO val2017
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
# padding="max_length" pads every text to the same fixed length
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)  # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
```
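Because the example applies a sigmoid rather than a softmax, each text gets its own independent probability, so the values need not sum to 1 across the candidate texts. The sketch below contrasts the two on made-up logit values (they are hypothetical, not output from this model).

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function, applied to one logit at a time."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    """Softmax over a list of logits, shifted by the max for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for one image against three candidate texts.
logits = [-1.0, -6.0, -9.0]

sigmoid_probs = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

print("sigmoid:", [round(p, 4) for p in sigmoid_probs])
print("softmax:", [round(p, 4) for p in softmax_probs])
```

With the sigmoid, all scores can be low at once, which is a reasonable signal that none of the candidate texts matches the image; a softmax would always force the scores to sum to 1.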
Alternatively, one can leverage the pipeline API which abstracts away the complexity for the user:
```python
from transformers import pipeline
from PIL import Image
import requests
# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")
# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)
```
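The same similarity scores can also drive image-text retrieval, the other task mentioned above. The sketch below ranks candidate captions for one image by cosine similarity. The embeddings are small hand-written vectors standing in for model outputs, and `rank_texts` is a hypothetical helper, not part of transformers.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def rank_texts(image_emb, text_embs):
    """Return (index, cosine similarity) pairs sorted from best to worst match."""
    img = l2_normalize(image_emb)
    scores = []
    for idx, text_emb in enumerate(text_embs):
        txt = l2_normalize(text_emb)
        scores.append((idx, sum(a * b for a, b in zip(img, txt))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings; real ones would come from the model's image and text encoders.
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = [
    [0.0, 1.0, 0.0],  # caption 0
    [1.0, 0.2, 0.0],  # caption 1
    [0.0, 0.0, 1.0],  # caption 2
]

ranking = rank_texts(image_embedding, caption_embeddings)
print(ranking)
```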
For more code examples, see the [documentation](https://huggingface.co/transformers/main/model_doc/siglip.html#).
## Training procedure
### Training data
SigLIP is pre-trained on the WebLI dataset [(Chen et al., 2023)](https://arxiv.org/abs/2209.06794).
### Preprocessing
Images are resized/rescaled to the same resolution (384x384) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
Texts are tokenized and padded to the same length (64 tokens).
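As a rough illustration of these two steps, the following plain-Python sketch rescales 8-bit pixel values to [0, 1], normalizes them with mean 0.5 and standard deviation 0.5 (which maps them into [-1, 1]), and pads a list of token IDs to 64 entries. The helper names, the pad ID of 0, and the truncation behavior are assumptions for illustration; in practice the `AutoProcessor` handles this.

```python
MEAN = 0.5
STD = 0.5
MAX_TEXT_LENGTH = 64

def normalize_pixel(value: int) -> float:
    """Rescale an 8-bit channel value to [0, 1], then normalize to [-1, 1]."""
    rescaled = value / 255.0
    return (rescaled - MEAN) / STD

def pad_token_ids(token_ids, pad_id=0, max_length=MAX_TEXT_LENGTH):
    """Truncate or right-pad token IDs to a fixed length.

    pad_id=0 is a placeholder; the real tokenizer defines its own pad ID.
    """
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))

print(normalize_pixel(0), normalize_pixel(255))

padded = pad_token_ids([101, 102, 103])  # hypothetical token IDs
print(len(padded))
```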
### Compute
The model was trained on 16 TPU-v4 chips for three days.
## Evaluation results
Evaluation of SigLIP compared to CLIP is shown below (taken from the paper).
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/siglip_table.jpeg" alt="SigLIP vs. CLIP evaluation table" width="600"/>
### BibTeX entry and citation info
```bibtex
@misc{zhai2023sigmoid,
title={Sigmoid Loss for Language Image Pre-Training},
author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
year={2023},
eprint={2303.15343},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
Provider: maas
Created: 2025-12-03



