filtered-wit
Source: ModelScope community. Updated 2025-12-04; indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/laion/filtered-wit
# Filtered WIT, an Image-Text Dataset.
A reliable dataset for running image-text models.
The original WIT (Wikipedia Image Text) dataset is available [here](https://github.com/google-research-datasets/wit).
The data was taken from [dalle-mini/wit](https://huggingface.co/datasets/dalle-mini/wit).
## Author
- [Aarush Katta](https://github.com/ARKseal)
## Data Structure
The data is stored as tars, with 10,000 samples per tar.
The parquets contain the metadata of each tar, which was created using [this script](https://huggingface.co/datasets/laion/filtered-wit/blob/main/wit_create_meta.py).
Each tar contains `.jpg`, `.txt`, and `.json` files.
The image is stored in the `.jpg`, the caption in the `.txt`, and the metadata in the `.json`.
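To make the layout concrete, here is a minimal standard-library sketch that groups the files of one tar into samples. It assumes that the files of a sample share a base name and differ only in extension (the usual WebDataset convention); the helper names `group_tar_members` and `load_shard` are hypothetical and not part of the dataset's own scripts.

```python
import os
import tarfile
from collections import defaultdict


def group_tar_members(tar):
    """Group the members of an open tarfile.TarFile by sample key.

    Assumption: files that belong to one sample share a base name and
    differ only in extension, e.g. 0001.jpg, 0001.txt, 0001.json.
    """
    samples = defaultdict(dict)
    for member in tar.getmembers():
        if not member.isfile():
            continue  # skip directories and other non-file entries
        key, ext = os.path.splitext(member.name)
        samples[key][ext.lstrip('.')] = tar.extractfile(member).read()
    return dict(samples)


def load_shard(path):
    """Open one tar (for example 'data/00000.tar') and return its samples."""
    with tarfile.open(path) as tar:
        return group_tar_members(tar)
```

This reads a whole tar into memory, which is fine for inspection but not for training.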
The preferred way to read the data is [WebDataset](https://github.com/webdataset/webdataset).
Here's an example:
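```python
import webdataset as wds

# Without a decoder, each field is yielded as the raw bytes of the file.
dataset = wds.WebDataset('data/00000.tar').to_tuple('txt', 'jpg', 'json')

for text, image, meta in dataset:
    # Print the first 50 bytes of the caption, image, and metadata.
    print(
        text[:50],
        image[:50],
        meta[:50]
    )
```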
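Because the fields arrive as raw bytes, a small decoding step is usually needed before use. The sketch below shows one way to do that, assuming the caption and metadata are UTF-8 encoded (the card does not state the encoding); `decode_sample` is a hypothetical helper, and the image is only checked for the JPEG start marker rather than decoded.

```python
import json


def decode_sample(text_bytes, image_bytes, meta_bytes):
    """Turn the raw bytes of one sample into Python objects.

    Assumption: the caption and the metadata are UTF-8 encoded.
    """
    caption = text_bytes.decode('utf-8')
    metadata = json.loads(meta_bytes.decode('utf-8'))
    # JPEG files start with the two-byte marker FF D8.
    is_jpeg = image_bytes[:2] == b'\xff\xd8'
    return caption, metadata, is_jpeg
```

It could be applied to each `(text, image, meta)` tuple produced by the loop above.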
## Filtering
Each sample has 8 possible captions, which were compared to the image using [CLIP ViT-B/32](https://arxiv.org/abs/2103.00020).
The text was encoded using a [multilingual CLIP text encoder](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1).
Each possible caption was compared to the encoded image using cosine similarity and kept if the similarity was greater than `0.26`.
The new caption is the concatenation of the captions that passed the filter; samples with no passing caption were dropped.
The script used is [filter_wit.py](https://huggingface.co/datasets/laion/filtered-wit/blob/main/filter_wit.py)
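The thresholding step can be sketched in plain Python as follows. This is an illustration of the rule described above, not the contents of `filter_wit.py`: the embeddings are assumed to be precomputed lists of floats, the function names are hypothetical, and the space separator used for concatenation is an assumption, since the card does not say how captions are joined.

```python
import math

THRESHOLD = 0.26  # similarity cutoff stated in the text


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_captions(image_emb, captions, caption_embs,
                    threshold=THRESHOLD, sep=' '):
    """Keep captions whose similarity to the image exceeds the threshold.

    Returns the kept captions joined by `sep` (an assumed separator),
    or None when no caption passes, meaning the sample is dropped.
    """
    kept = [
        caption
        for caption, emb in zip(captions, caption_embs)
        if cosine_similarity(image_emb, emb) > threshold
    ]
    return sep.join(kept) if kept else None
```

For example, with an image embedding of `[1, 0]`, a caption embedded as `[0, 1]` has similarity 0 and is discarded, while one embedded as `[1, 1]` has similarity of about 0.71 and is kept.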
Provider: maas
Created: 2025-10-04



