
86Cao/MegaPairs-Standard

Source: Hugging Face. Updated 2025-11-26; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/86Cao/MegaPairs-Standard
Resource description:
---
license: apache-2.0
task_categories:
- image-to-text
- text-to-image
- image-feature-extraction
language:
- en
size_categories:
- 10M<n<100M
source_datasets:
- JUNJIE99/MegaPairs
tags:
- multimodal
- retrieval
- synthetic-data
- massive-scale
- arrow
pretty_name: MegaPairs Standard
---

# MegaPairs-Standard (Standardized Version)

## Dataset Summary

This is a standardized, high-efficiency version of the **[JUNJIE99/MegaPairs](https://huggingface.co/datasets/JUNJIE99/MegaPairs)** dataset.

**Why use this version?**

The original dataset is distributed as a **massive Tar archive** containing millions of images, accompanied by a separate JSONL annotation file.

* **The Problem:** Using the original format requires extracting terabytes of small files (which can exhaust disk inodes) or writing complex logic to read from archives. It also requires manually mapping JSONL metadata to image paths.
* **The Solution (This Dataset):** We have processed the data into **Native Arrow format**. Images are decoded and embedded directly alongside their text metadata.

**Key Features:**

* 🚀 **Zero-Extraction:** No need to unzip or untar anything. You can start training immediately after downloading.
* ⚡ **Fastest Loading:** Data is stored in the raw Arrow format (generated by `datasets.save_to_disk`). This allows for zero-copy memory mapping, offering the fastest possible local loading speed.
* 📦 **Self-Contained:** Metadata (texts) and Images (PIL Objects) are merged into a single row.
* 🧩 **Optimized Sharding:** Data is saved in ~1GB shards for optimal network transfer and parallel processing.

> **Note on Preview:** Since this dataset uses the native Arrow directory structure for performance, the Hugging Face "Dataset Viewer" on the website might not render the images directly. This is expected. Please follow the usage instructions below to load the data.

## Dataset Structure

Each row in the dataset represents a **Universal Retrieval Pair** (Query -> Target).
### Data Fields

| Field Name | Type | Description |
| :--- | :--- | :--- |
| `query_texts` | `Sequence(String)` | A list of query texts describing the target image. |
| `query_image` | `Image` | The query image (PIL object). |
| `target_image` | `Image` | The ground-truth positive target image (PIL object). |
| `negatives_paths` | `Sequence(String)` | A list of relative paths for hard negative images. <br>⚠️ Note: To keep the dataset size from exploding (700GB -> 4TB+), negatives are stored as paths/metadata only. For training, an in-batch negatives strategy, which uses the other samples in the batch as negatives, is highly recommended. |

### Data Statistics

* **Total Pairs:** ~15.2M
* **Original Source:** [JUNJIE99/MegaPairs](https://huggingface.co/datasets/JUNJIE99/MegaPairs)

## Usage

You can load this dataset directly using the `datasets` library.

### Method 1: Using `load_dataset` (Recommended)

This is the easiest way. The library handles the Arrow files automatically.

```python
from datasets import load_dataset

# Load the dataset (this will download the files to your local cache)
dataset = load_dataset("86Cao/MegaPairs-Standard", split="train")

print(f"Total samples: {len(dataset)}")

# Accessing data
sample = dataset[0]
print(f"Text: {sample['query_texts'][0]}")
sample['query_image'].show()  # Displays the query image
```
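The note on `negatives_paths` recommends in-batch negatives for training. The sketch below illustrates the idea in plain Python: for each query embedding, the target at the same batch index is treated as the positive and every other target in the batch as a negative, scored with a softmax cross-entropy (InfoNCE-style) loss. The function name `in_batch_contrastive_loss`, the `temperature` default, and the toy embeddings are assumptions made for illustration, not part of this dataset or its original training recipe. A real pipeline would compute the same quantity with a tensor library on encoder outputs.

```python
import math


def in_batch_contrastive_loss(query_embs, target_embs, temperature=0.05):
    """Mean InfoNCE-style loss using in-batch negatives.

    query_embs[i] is paired with target_embs[i] (the positive); all other
    targets in the batch act as negatives. Embeddings are plain lists of
    floats and are assumed to be L2-normalized already.
    """
    if len(query_embs) != len(target_embs):
        raise ValueError("query and target batches must have the same size")

    total = 0.0
    for i, query in enumerate(query_embs):
        # Scaled dot-product similarity of query i against every target.
        logits = [
            sum(q * t for q, t in zip(query, target)) / temperature
            for target in target_embs
        ]
        # Numerically stable log-sum-exp over the whole batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the positive at index i.
        total += log_denominator - logits[i]
    return total / len(query_embs)


# Toy batch (hypothetical embeddings): each query matches its own target.
queries = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(queries, targets)
# Swapping the targets breaks the pairing, so the loss should increase.
shuffled_loss = in_batch_contrastive_loss(queries, targets[::-1])
print(aligned_loss < shuffled_loss)
```

Because the negatives come from the batch itself, the number of negatives per query grows with the batch size and no extra images need to be loaded from `negatives_paths`.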
Provider: 86Cao