heb-clip
Source: ModelScope community. Updated 2025-11-27; indexed 2025-01-25.
Download link:
https://modelscope.cn/datasets/nv-community/heb-clip
Resource description:
# Hebrew-CLIP Dataset
The Hebrew-CLIP dataset is a collection of Hebrew image captions designed to facilitate training of vision-language models like CLIP (Contrastive Language-Image Pre-training) for the Hebrew language. This dataset provides captions without actual images, instead offering references to pre-computed image embeddings.
## Dataset Composition
The dataset consists of two parquet files:
1. **Translated Captions**: 4 million captions from the [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) dataset, translated to Hebrew using the [opus-mt-en-he](https://huggingface.co/Helsinki-NLP/opus-mt-en-he) machine translation model.
2. **Original Hebrew Captions**: 3.78 million Hebrew captions extracted from the multilingual subset of [LAION-5B](https://laion.ai/blog/laion-5b/).
## Data Format
Each parquet file contains 4 columns:
- `key`: Unique identifier for the caption
- `heb_caption`: The Hebrew caption
- `file_name`: Name of the corresponding image embedding file
- `file_index`: Index of the embedding within the file
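As a minimal sketch of how these four columns might be handled in code, the example below models one row and groups rows by `file_name`, so that each embedding file only needs to be opened once. The `CaptionRow` class, the `group_by_file` helper, the column types, and the sample rows are all illustrative assumptions, not part of the dataset; real rows would come from reading the parquet files with a parquet reader.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionRow:
    """One row of a caption parquet file (column types are assumed)."""
    key: str          # unique identifier for the caption
    heb_caption: str  # the Hebrew caption text
    file_name: str    # name of the file holding the image embedding
    file_index: int   # position of the embedding inside that file


def group_by_file(rows):
    """Map each file_name to its rows, sorted by file_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.file_name].append(row)
    return {name: sorted(batch, key=lambda r: r.file_index)
            for name, batch in groups.items()}


# Made-up rows for illustration only; file names and keys are hypothetical.
rows = [
    CaptionRow("a1", "כלב רץ על החוף", "shard_0000.npz", 7),
    CaptionRow("a2", "חתול ישן על ספה", "shard_0001.npz", 2),
    CaptionRow("a3", "ילד משחק בכדור", "shard_0000.npz", 3),
]
grouped = group_by_file(rows)
for name, batch in grouped.items():
    print(name, [r.file_index for r in batch])
```

Grouping first is a design choice rather than a requirement: it avoids reopening the same embedding file for every caption.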
## Usage with Image Embeddings
To use this dataset for training CLIP or similar models, you'll need to pair the captions with their corresponding CLIP ViT-L/14 image embeddings. These embeddings are not included in this dataset but can be accessed as follows:
1. For the translated DataComp captions:
   - Embeddings are available at: https://huggingface.co/datasets/mlfoundations/datacomp_1b
   - Use `file_name` to locate the correct npz file
   - Use `file_index` to find the specific embedding within that file
2. For the original Hebrew captions (the `laion2B-multi` subset of LAION-5B):
   - Embeddings are available at: https://the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-multi/img_emb/
   - Follow the same process using `file_name` and `file_index`
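The lookup described above can be sketched as follows, assuming an embedding file has already been downloaded locally. The helper `lookup_embedding` is hypothetical, and the name of the array inside each npz archive is not stated in this card, so the sketch falls back to the first array in the archive unless `array_key` is given. The demo writes a synthetic file with a made-up `embeddings` key instead of touching the real data; 768 is the CLIP ViT-L/14 embedding width.

```python
import os
import tempfile

import numpy as np


def lookup_embedding(npz_path, file_index, array_key=None):
    """Return row `file_index` of an array stored in an npz archive.

    `array_key` is an assumption: if it is None, the first array in the
    archive is used. Check the real files to confirm the correct key.
    """
    with np.load(npz_path) as archive:
        key = array_key if array_key is not None else archive.files[0]
        return archive[key][file_index]


# Synthetic stand-in for a downloaded embedding file (not real data).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shard_0000.npz")
    fake = np.arange(4 * 768, dtype=np.float32).reshape(4, 768)
    np.savez(path, embeddings=fake)  # "embeddings" is a hypothetical key

    # In practice, file_name and file_index come from a parquet row.
    vec = lookup_embedding(path, file_index=2)
    print(vec.shape)
```

When many captions point at the same file, it is cheaper to load the array once and index it repeatedly than to call a helper like this per caption.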
## Limitations and Biases
- This dataset provides only captions and references to image embeddings, not the actual images.
- The quality of the translated captions may vary and could introduce biases or inaccuracies.
- The original Hebrew captions, taken from the `laion2B-multi` subset of LAION-5B, are web-scraped and may carry biases or quality issues.
## Acknowledgments
- [Recap-DataComp-1B](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) for the original English captions
- [LAION-5B](https://laion.ai/blog/laion-5b/) for the multilingual dataset
- [Helsinki-NLP](https://huggingface.co/Helsinki-NLP) for the opus-mt-en-he translation model
- [DataComp](https://huggingface.co/datasets/mlfoundations/datacomp_1b) for providing the image embeddings for the translated captions
## License
Use of this dataset is governed by the [NVIDIA License](LICENSE), which permits commercial usage.
Provider: maas
Created: 2025-01-20



