nvidia/heb-clip

Name: nvidia/heb-clip
Creator: nvidia
Published: 2024-08-27 11:44:44
License: No description available

Hugging Face | Updated 2024-08-27 | Indexed 2024-12-14

Download link:

https://hf-mirror.com/datasets/nvidia/heb-clip

Resource overview:

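The Hebrew-CLIP dataset is a collection of Hebrew image captions intended to support training vision-language models such as CLIP for Hebrew. It provides captions but not the images themselves; instead it references pre-computed image embeddings. The dataset has two parts: 4 million Hebrew captions translated from Recap-DataComp-1B and 3.78 million original Hebrew captions extracted from the multilingual subset of LAION-5B. Each record contains a unique identifier, the Hebrew caption, the name of the corresponding image embedding file, and an index into that file. Using the dataset requires pairing the captions with CLIP ViT-L/14 image embeddings, which are not included but can be accessed through the links provided. The dataset has limitations: it offers only captions and embedding references, the quality of the translated captions may vary, and the original captions may contain web-scraped content with possible bias or quality problems.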

The Hebrew-CLIP dataset is a collection of Hebrew image captions designed to facilitate training of vision-language models like CLIP for the Hebrew language. This dataset provides captions without actual images, instead offering references to pre-computed image embeddings. The dataset consists of two parquet files: one containing 4 million translated captions from the Recap-DataComp-1B dataset, and another with 3.78 million original Hebrew captions from the LAION-5B dataset. Each parquet file contains four columns: key, heb_caption, file_name, and file_index. To use this dataset for training CLIP or similar models, you'll need to pair the captions with their corresponding CLIP ViT-L/14 image embeddings.
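The pairing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dataset authors' loading code: it assumes file_index is a zero-based position inside the embedding file named by file_name, and the loader callback, the toy file names, and the toy two-dimensional vectors are all hypothetical stand-ins. In a real pipeline the rows might come from the parquet files (for example via pandas.read_parquet) and the embedding arrays from the separately hosted CLIP ViT-L/14 files.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, object]
Vector = Sequence[float]


def pair_captions_with_embeddings(
    rows: Iterable[Row],
    load_embedding_file: Callable[[str], Sequence[Vector]],
) -> List[Tuple[str, str, Vector]]:
    """Return (key, heb_caption, embedding) triples.

    Assumption: file_index is a zero-based offset into the embedding
    array stored in file_name. Verify this against the dataset card.
    """
    # Group rows by embedding file so each file is loaded only once.
    by_file: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        by_file[str(row["file_name"])].append(row)

    pairs: List[Tuple[str, str, Vector]] = []
    for file_name, group in by_file.items():
        embeddings = load_embedding_file(file_name)
        for row in group:
            idx = int(row["file_index"])
            pairs.append((str(row["key"]), str(row["heb_caption"]), embeddings[idx]))
    return pairs


# Toy stand-ins for parquet rows and embedding files (hypothetical values).
toy_rows = [
    {"key": "a1", "heb_caption": "כלב רץ בפארק", "file_name": "img_emb_0.npy", "file_index": 1},
    {"key": "b2", "heb_caption": "חתול על ספה", "file_name": "img_emb_1.npy", "file_index": 0},
    {"key": "c3", "heb_caption": "שקיעה בים", "file_name": "img_emb_0.npy", "file_index": 0},
]
toy_files = {
    "img_emb_0.npy": [[0.1, 0.2], [0.3, 0.4]],
    "img_emb_1.npy": [[0.5, 0.6]],
}

# A dict lookup plays the role of the file loader in this sketch.
toy_pairs = pair_captions_with_embeddings(toy_rows, toy_files.__getitem__)
```

Grouping by file_name before indexing keeps the number of file loads equal to the number of distinct embedding files, which matters when each file holds many vectors.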

Provided by:

nvidia
