
heb-clip

ModelScope community | Updated: 2025-11-27 | Indexed: 2025-01-25
Download link:
https://modelscope.cn/datasets/nv-community/heb-clip
Resource description:
# Hebrew-CLIP Dataset

The Hebrew-CLIP dataset is a collection of Hebrew image captions designed to facilitate training of vision-language models like CLIP (Contrastive Language-Image Pre-training) for the Hebrew language. This dataset provides captions without actual images, instead offering references to pre-computed image embeddings.

## Dataset Composition

The dataset consists of two parquet files:

1. **Translated Captions**: 4 million captions from the [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) dataset, translated to Hebrew using the [opus-mt-en-he](https://huggingface.co/Helsinki-NLP/opus-mt-en-he) machine translation model.
2. **Original Hebrew Captions**: 3.78 million Hebrew captions extracted from the multilingual subset of [LAION-5B](https://laion.ai/blog/laion-5b/).

## Data Format

Each parquet file contains 4 columns:

- `key`: Unique identifier for the caption
- `heb_caption`: The Hebrew caption
- `file_name`: Name of the corresponding image embedding file
- `file_index`: Index of the embedding within the file

## Usage with Image Embeddings

To use this dataset for training CLIP or similar models, you'll need to pair the captions with their corresponding CLIP ViT-L/14 image embeddings. These embeddings are not included in this dataset but can be accessed as follows:

1. For the translated DataComp captions:
   - Embeddings are available at: https://huggingface.co/datasets/mlfoundations/datacomp_1b
   - Use the `file_name` to locate the correct npz file
   - Use the `file_index` to find the specific embedding within that file
2. For the original LAION-2B Hebrew captions:
   - Embeddings are available at: https://the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-multi/img_emb/
   - Follow the same process using `file_name` and `file_index`

## Limitations and Biases

- This dataset provides only captions and references to image embeddings, not the actual images.
- The quality of the translated captions may vary and could introduce biases or inaccuracies.
- The original Hebrew captions from LAION-2B may contain web-scraped content with potential biases or quality issues.

## Acknowledgments

- [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) for the original English captions
- [LAION-5B](https://laion.ai/blog/laion-5b/) for the multilingual dataset
- [Helsinki-NLP](https://huggingface.co/Helsinki-NLP) for the opus-mt-en-he translation model
- [DataComp](https://huggingface.co/datasets/mlfoundations/datacomp_1b) for providing the image embeddings for the translated captions

## License

The use of this dataset is governed by the [NVIDIA License](LICENSE) which permits commercial usage.
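
The lookup described under "Usage with Image Embeddings" (find the file via `file_name`, then the row via `file_index`) can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' code. It assumes the embedding files have already been downloaded into a local folder, that `file_name` resolves relative to that folder, and that a `.npz` archive holds the embeddings in a single array (the array name inside the archive is not documented here, so the sketch falls back to the first one). The names `load_embedding_file`, `pair_captions_with_embeddings`, `embedding_dir`, and `array_key` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def load_embedding_file(path, array_key=None):
    """Load one embedding file as an array (numpy is imported lazily)."""
    import numpy as np  # third-party; only needed when real files are read

    loaded = np.load(path)
    if isinstance(loaded, np.ndarray):  # plain .npy file
        return loaded
    # .npz archive: the array name is an assumption, so default to the first entry
    name = array_key if array_key is not None else loaded.files[0]
    return loaded[name]


def pair_captions_with_embeddings(rows, embedding_dir, load_array=load_embedding_file):
    """Yield (key, heb_caption, embedding) for each caption row.

    rows: iterable of dicts carrying the four documented columns.
    embedding_dir: hypothetical local folder with the downloaded embedding files.
    load_array: callable that turns a file path into an indexable array.
    """
    # Group rows by embedding file so that each file is loaded only once.
    by_file = defaultdict(list)
    for row in rows:
        by_file[row["file_name"]].append(row)

    for file_name, file_rows in by_file.items():
        array = load_array(Path(embedding_dir) / file_name)
        for row in file_rows:
            # file_index selects the embedding inside the file
            yield row["key"], row["heb_caption"], array[row["file_index"]]
```

The rows could come, for example, from `pandas.read_parquet(...)` followed by `.to_dict("records")`. Grouping by `file_name` before loading avoids reopening the same embedding file for every caption, which matters when millions of captions point into a comparatively small number of files. Results are yielded file by file, so they do not follow the original row order.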

Provider: maas
Created: 2025-01-20