
filtered-wit

ModelScope community, updated 2025-12-04, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/laion/filtered-wit
Description:
# Filtered WIT, an Image-Text Dataset

A reliable dataset for running image-text models.

The original WIT (Wikipedia Image Text) dataset is available [here](https://github.com/google-research-datasets/wit). The data was taken from [dalle-mini/wit](https://huggingface.co/datasets/dalle-mini/wit).

## Author

- [Aarush Katta](https://github.com/ARKseal)

## Data Structure

The data is stored as tar archives, each containing 10,000 samples. The parquet files contain the metadata of each tar and were created using [this script](https://huggingface.co/datasets/laion/filtered-wit/blob/main/wit_create_meta.py).

Each sample in a tar consists of a `.jpg`, a `.txt`, and a `.json` file. The image is stored in the `.jpg`, the caption in the `.txt`, and the metadata in the `.json`.

The preferred method to read the data is [WebDataset](https://github.com/webdataset/webdataset). Here's an example:

```python
import webdataset as wds

# Yield (caption, image, metadata) tuples; with no decode step, each field is raw bytes
dataset = wds.WebDataset('data/00000.tar').to_tuple('txt', 'jpg', 'json')

for text, image, meta in dataset:
    # Print the first 50 bytes of each field
    print(
        text[:50],
        image[:50],
        meta[:50]
    )
```

## Filtering

Each sample has 8 possible captions, which were compared to the image using [CLIP ViT-B32](https://arxiv.org/abs/2103.00020). The text was encoded using the [multilingual CLIP text encoder](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1).

Each candidate caption was compared to the encoded image using cosine similarity and kept if the similarity was greater than `0.26`. The new caption is the concatenation of the kept captions, and samples with no kept caption were dropped.

The script used is [filter_wit.py](https://huggingface.co/datasets/laion/filtered-wit/blob/main/filter_wit.py).
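For readers who want to inspect a shard without installing WebDataset, the layout described under Data Structure (one `.jpg`, `.txt`, and `.json` per sample) can be read with Python's standard `tarfile` module. The following is a minimal sketch, not the dataset author's tooling. It assumes that the files of one sample share a base name and differ only in extension, which is the usual WebDataset convention; the member names and contents in `make_demo_shard` are made up for illustration.

```python
import io
import json
import posixpath
import tarfile
from collections import defaultdict


def read_samples(fileobj):
    """Group tar members by base name into {key: {extension: raw bytes}}."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Assumption: everything before the first dot of the file name is the sample key
            stem, _, ext = posixpath.basename(member.name).partition(".")
            key = posixpath.join(posixpath.dirname(member.name), stem)
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


def decode_sample(fields):
    """Decode caption and metadata; the image is left as raw JPEG bytes."""
    return {
        "caption": fields["txt"].decode("utf-8"),
        "meta": json.loads(fields["json"]),
        "image_bytes": fields["jpg"],
    }


def make_demo_shard():
    """Build a tiny in-memory tar holding one hypothetical sample."""
    files = {
        "000000000.jpg": b"\xff\xd8 placeholder, not a real image",
        "000000000.txt": "A demo caption".encode("utf-8"),
        "000000000.json": json.dumps({"lang": "en"}).encode("utf-8"),
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


demo = {key: decode_sample(fields) for key, fields in read_samples(make_demo_shard()).items()}
print(demo["000000000"]["caption"])
```

To try this on a real shard, pass `open('data/00000.tar', 'rb')` to `read_samples` instead of the demo buffer. The actual keys and metadata fields in the dataset may differ from the ones shown here.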

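The filtering rule described in the dataset card (keep each candidate caption whose cosine similarity to the image embedding exceeds `0.26`, concatenate the kept captions, drop the sample if none remain) can be sketched as follows. This is an illustration of the rule only, not a reproduction of `filter_wit.py`: the embeddings are toy 3-dimensional vectors rather than CLIP outputs, and the single-space separator used when joining captions is an assumption, since the card does not say how captions were joined.

```python
import math

THRESHOLD = 0.26  # similarity cutoff stated in the dataset card


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_captions(image_emb, captions, caption_embs, threshold=THRESHOLD, sep=" "):
    """Join the captions that pass the threshold, or return None to drop the sample.

    `sep` is a hypothetical choice; the original script may join captions differently.
    """
    kept = [
        caption
        for caption, emb in zip(captions, caption_embs)
        if cosine_similarity(image_emb, emb) > threshold
    ]
    return sep.join(kept) if kept else None


# Toy data standing in for CLIP image and text embeddings
image_emb = [1.0, 0.0, 0.0]
captions = ["a red bridge", "unrelated text", "bridge at dusk"]
caption_embs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]

result = filter_captions(image_emb, captions, caption_embs)
dropped = filter_captions(image_emb, ["unrelated text"], [[0.0, 1.0, 0.0]])
print(result, dropped)
```

In the toy data the second caption is orthogonal to the image vector (similarity 0) and is discarded, while the other two pass the cutoff. A sample whose captions all fall at or below the threshold yields `None`, which corresponds to the sample being dropped.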
Provider:
maas
Created:
2025-10-04