
ABC-Pretraining-Data

Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-03-01.
Download link:
https://modelscope.cn/datasets/TIGER-Lab/ABC-Pretraining-Data
Description:
## ABC Pretraining Data

This dataset contains the pretraining data for ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions, advancing the state of visual embeddings with natural language control.

This dataset is derived from Google's [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/) dataset. Each item contains a URL from which the corresponding image can be downloaded, along with the mined negatives for that item. The full dataset is ~300 GB of images. For a detailed description of how we mined the negatives, please check out our paper.

**Update**: The images have been added to this repository. For an example of how to use and download this dataset, see our [repository](https://github.com/TIGER-AI-Lab/ABC).

## Paper, Project Page, and Code

- Paper: [ABC: Achieving Better Control of Multimodal Embeddings using VLMs](https://huggingface.co/papers/2503.00329)
- Project Page: [https://tiger-ai-lab.github.io/ABC/](https://tiger-ai-lab.github.io/ABC/)
- Code: [https://github.com/TIGER-AI-Lab/ABC](https://github.com/TIGER-AI-Lab/ABC)

## Sample Usage

### Quick Start

First, clone the repository and install its dependencies:

```bash
git clone https://github.com/TIGER-AI-Lab/ABC
cd ABC
pip install -r requirements.txt
```

Then, you can start making multimodal embeddings:

```bash
python -i ./quick_start.py
```

### Fetching Datasets from 🤗 Hub

Our datasets are hosted on HuggingFace Hub. The text data and dataset metadata can be fetched using HF's `load_dataset` utility. To fetch the images from our datasets, we provide scripts in the `fetch_datasets` directory. These scripts pull the pretraining/finetuning image data off the hub and unpack it in your HuggingFace datasets cache (under a directory called `tigerlab`).
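As a rough illustration of the two mechanisms described above, the sketch below resolves where the `tigerlab` directory would be expected and streams the text metadata with `load_dataset`. It rests on several assumptions that the description does not confirm: that the fetch scripts use the standard HuggingFace datasets cache location (governed by `HF_DATASETS_CACHE`, `HF_HOME`, and `XDG_CACHE_HOME`), that the Hub repository id matches the one in the download URL, and that a split named `train` exists. The helper names are hypothetical.

```python
import os
from pathlib import Path

# Assumption: the Hub repository id is the same as in the download URL.
REPO_ID = "TIGER-Lab/ABC-Pretraining-Data"


def hf_datasets_cache(env=None) -> Path:
    """Resolve the HuggingFace datasets cache directory.

    Mirrors the documented precedence of the `datasets` library:
    HF_DATASETS_CACHE, then HF_HOME/datasets, then
    XDG_CACHE_HOME/huggingface/datasets (default ~/.cache/...).
    """
    env = os.environ if env is None else env
    if env.get("HF_DATASETS_CACHE"):
        return Path(env["HF_DATASETS_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "datasets"
    xdg_cache = env.get("XDG_CACHE_HOME", "~/.cache")
    return Path(xdg_cache).expanduser() / "huggingface" / "datasets"


def tigerlab_dir(env=None) -> Path:
    """Directory where the fetch scripts are assumed to unpack images."""
    return hf_datasets_cache(env) / "tigerlab"


def load_metadata(split="train", streaming=True):
    """Stream text data and metadata without downloading the images.

    The split name is an assumption; check the dataset card for the
    actual split and column names before relying on them.
    """
    from datasets import load_dataset  # imported lazily; needs `pip install datasets`

    return load_dataset(REPO_ID, split=split, streaming=streaming)


if __name__ == "__main__":
    # Print the expected image location and whether it exists yet.
    target = tigerlab_dir()
    print(f"Expected image directory: {target} (exists: {target.exists()})")
```

Streaming is used here so that inspecting a few records does not require downloading the full dataset up front; pass `streaming=False` to materialize the metadata locally instead.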
Run `python ./fetch_datasets/pretrain.py` to get the pretraining dataset and `python ./fetch_datasets/instruct.py` to get the finetuning dataset, respectively.

## Citation

If you find any of our work helpful, please consider citing:

```bibtex
@misc{schneider2025abcachievingbettercontrol,
      title={ABC: Achieving Better Control of Multimodal Embeddings using VLMs},
      author={Benjamin Schneider and Florian Kerschbaum and Wenhu Chen},
      year={2025},
      eprint={2503.00329},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.00329},
}
```

Provider:
maas
Created:
2025-02-27