
LAYEK-143/Open-Pixel-1T

Hugging Face | Updated 2026-04-07 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/LAYEK-143/Open-Pixel-1T
Description:
---
readme: "1.0.0"
dataset_info:
  features:
    - name: image
      dtype: image
    - name: url
      dtype: string
    - name: seed
      dtype: string
  splits:
    - name: train
      num_bytes: 1099511627776
      num_examples: 500000000
configs:
  - config_name: default
    data_files:
      - split: train
        path: "data/*.parquet"
license: mit
task_categories:
  - text-to-image
  - image-classification
  - unconditional-image-generation
tags:
  - synthetic
  - random-seeds
  - 1TB
  - 100TB-roadmap
  - high-resolution
  - open-source
  - vision-core
pretty_name: Open Pixel 1T (Visual Atlas)
size_categories:
  - 1B<n<10B
---

# 🌌 Open-Pixel-1T (Visual Atlas)

<div align="center">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers_logo_name.png" width="200"/>
  <br>
  <b>A Large-Scale, High-Entropy Synthetic Image Dataset for Foundational Pre-Training</b>
  <br>
  <br>
  <a href="https://huggingface.co/datasets/LAYEK-143/Open-Pixel-1T/viewer/default/train"><img src="https://img.shields.io/badge/Dataset-Viewer-green?style=for-the-badge"></a>
  <a href="https://huggingface.co/datasets/LAYEK-143/Open-Pixel-1T"><img src="https://img.shields.io/badge/Task-Vision-blue?style=for-the-badge"></a>
  <a href="https://huggingface.co/datasets/LAYEK-143/Open-Pixel-1T"><img src="https://img.shields.io/badge/License-MIT-red?style=for-the-badge"></a>
  <a href="https://huggingface.co/datasets/LAYEK-143/Open-Pixel-1T"><img src="https://hitscounter.dev/api/hit?url=https%3A%2F%2Fhuggingface.co%2Fdatasets%2FLAYEK-143%2FOpen-Pixel-1T&label=VIEWS&icon=eye&color=%23ffc107&message=&style=for-the-badge&tz=UTC" alt="VIEWS"></a>
</div>

---

## 📑 Dataset Summary

**Open-Pixel-1T** is a monumental open-source initiative designed to create a "Visual Atlas" of stochastic imagery. Unlike traditional datasets scraped from social media, which contain inherent human bias, Open-Pixel-1T is constructed using high-entropy random seeds to generate unique, diverse visual signals.

This dataset serves as a **foundational layer** for computer vision research, specifically targeting self-supervised learning (SSL), variational autoencoders (VAEs), and large-scale generative pre-training where data volume and variance are critical.

### 🚀 Roadmap & Scale

The project follows an aggressive expansion roadmap:

* **Phase 1 (Current):** 2 Terabytes (2TB) of high-resolution data.
* **Phase 2:** Expansion to 10 Terabytes (10TB).
* **Phase 3:** Long-term goal of **100 Terabytes (100TB)** of open visual data.

### 🎯 Key Specifications

* **Resolution:** Standardized **1024x1024** px.
* **Format:** Optimized **Apache Parquet** (Snappy compression).
* **Source:** Synthetic randomness via UUIDv4 seeding (Picsum source).
* **Entropy:** Maximized randomness to prevent overfitting on specific visual domains.

---

## 💾 Dataset Structure

The dataset is sharded into ~1GB Parquet files to facilitate distributed training and streaming. Each row represents a unique image sample generated from a unique seed.

### Data Fields

| Field | Type | Description |
| :--- | :--- | :--- |
| **`image`** | `image` | The raw image binary (PIL compatible). |
| **`url`** | `string` | The source URL containing the unique seed used for generation. |
| **`seed`** | `string` | The UUIDv4 seed key responsible for the image's visual output. |

### Sample Data

```json
{
  "image": "<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1024x1024>",
  "url": "https://picsum.photos/seed/a1b2-c3d4-e5f6/1024/1024",
  "seed": "a1b2-c3d4-e5f6"
}
```

---

## 🛠️ Usage

### 1. Streaming (Recommended)

Because the dataset is very large (1TB+), streaming it is recommended over downloading it in full.

```python
from datasets import load_dataset

# Stream the dataset (no disk space required)
dataset = load_dataset("LAYEK-143/Open-Pixel-1T", split="train", streaming=True)

# Iterate through the first six images (i = 0..5), then stop
for i, sample in enumerate(dataset):
    print(f"Processing image {i}: {sample['seed']}")
    image = sample['image']  # decoded as a PIL image
    image.show()
    if i == 5:
        break
```

### 2. PyTorch DataLoader Integration

The dataset is optimized for high-throughput training pipelines.

```python
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from torchvision import transforms

# Define transforms: downscale to 256x256 and convert to a float tensor
transform_pipeline = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

def process_batch(examples):
    # Convert each PIL image to RGB, then apply the transform pipeline
    examples["pixel_values"] = [transform_pipeline(img.convert("RGB")) for img in examples["image"]]
    return examples

# Load in streaming mode
dataset = load_dataset("LAYEK-143/Open-Pixel-1T", split="train", streaming=True)
# Keep only the tensor column so batches collate cleanly
dataset = dataset.map(process_batch, batched=True, remove_columns=["image", "url", "seed"])

# Create the loader
dataloader = DataLoader(dataset, batch_size=64)
```

---

## ⚖️ Citation & License

### License

This dataset is released under the **MIT License**. You are free to use it for research, commercial, and open-source projects.

### Citation

If you use this dataset in your research or project, please cite it as:

```bibtex
@dataset{open_pixel_1t,
  author = {Ryan Shelby},
  title = {Open-Pixel-1T: A Large-Scale Synthetic Visual Atlas},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/LAYEK-143/Open-Pixel-1T}},
  note = {Targeting 100TB of open visual data}
}
```

---

<div align="center">
  Created with ❤️ by <b>Ryan Shelby</b> | 2026
</div>
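
The relationship between the `seed` and `url` fields listed under Data Fields can be sketched in a few lines. This is a minimal illustration, not the author's generation pipeline: the URL template is an assumption inferred from the single sample record in the README, and `make_seed`, `seed_to_url`, and `url_matches_seed` are hypothetical helper names.

```python
import uuid

# Assumed URL pattern, inferred from the README's sample record.
PICSUM_TEMPLATE = "https://picsum.photos/seed/{seed}/{width}/{height}"


def make_seed() -> str:
    """Return a fresh UUIDv4 string, the seed format the README describes."""
    return str(uuid.uuid4())


def seed_to_url(seed: str, width: int = 1024, height: int = 1024) -> str:
    """Build the source URL for a seed at the dataset's stated resolution."""
    return PICSUM_TEMPLATE.format(seed=seed, width=width, height=height)


def url_matches_seed(sample: dict) -> bool:
    """Check that a row's `url` is the one its `seed` would produce."""
    return sample["url"] == seed_to_url(sample["seed"])


if __name__ == "__main__":
    seed = make_seed()
    print(seed_to_url(seed))
```

A check like `url_matches_seed` could act as a cheap sanity filter while streaming, assuming every row follows the same URL pattern as the sample.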

Provider:
LAYEK-143