LAYEK-143/Open-Pixel-1T
Hugging Face | Updated 2026-04-07 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/LAYEK-143/Open-Pixel-1T
Description:
---
readme: "1.0.0"
dataset_info:
  features:
  - name: image
    dtype: image
  - name: url
    dtype: string
  - name: seed
    dtype: string
  splits:
  - name: train
    num_bytes: 1099511627776
    num_examples: 500000000
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*.parquet"
license: mit
task_categories:
- text-to-image
- image-classification
- unconditional-image-generation
tags:
- synthetic
- random-seeds
- 1TB
- 100TB-roadmap
- high-resolution
- open-source
- vision-core
pretty_name: Open Pixel 1T (Visual Atlas)
size_categories:
- 100M<n<1B
---
# 🌌 Open-Pixel-1T (Visual Atlas)
<div align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers_logo_name.png" width="200"/>
<br>
<b>A Large-Scale, High-Entropy Synthetic Image Dataset for Foundational Pre-Training</b>
<br>
<br>
<a href="https://huggingface.co/datasets/LAYEK-143/Open-Pixel-1T/viewer/default/train"><img src="https://img.shields.io/badge/Dataset-Viewer-green?style=for-the-badge"></a>
<a href="https://huggingface.co/datasets/LAYEK-143/Open-Pixel-1T"><img src="https://img.shields.io/badge/Task-Vision-blue?style=for-the-badge"></a>
<a href="https://huggingface.co/datasets/LAYEK-143/Open-Pixel-1T"><img src="https://img.shields.io/badge/License-MIT-red?style=for-the-badge"></a>
<a href="https://huggingface.co/datasets/LAYEK-143/Open-Pixel-1T"><img src="https://hitscounter.dev/api/hit?url=https%3A%2F%2Fhuggingface.co%2Fdatasets%2FLAYEK-143%2FOpen-Pixel-1T&label=VIEWS&icon=eye&color=%23ffc107&message=&style=for-the-badge&tz=UTC" alt="VIEWS"></a>
</div>
---
## 📑 Dataset Summary
**Open-Pixel-1T** is an open-source initiative to build a "Visual Atlas" of stochastic imagery. Unlike traditional datasets scraped from social media, which carry inherent human bias, Open-Pixel-1T is built from high-entropy random seeds, each of which yields a unique visual signal.
This dataset serves as a **foundational layer** for computer vision research, specifically targeting self-supervised learning (SSL), variational autoencoders (VAEs), and large-scale generative pre-training, where data volume and variance are critical.
### 🚀 Roadmap & Scale
The project follows an aggressive expansion roadmap:
* **Phase 1 (Current):** 2 Terabytes (2TB) of high-resolution data.
* **Phase 2:** Expansion to 10 Terabytes (10TB).
* **Phase 3:** Long-term goal of **100 Terabytes (100TB)** of open visual data.
### 🎯 Key Specifications
* **Resolution:** Standardized **1024x1024** px.
* **Format:** Optimized **Apache Parquet** (Snappy Compression).
* **Source:** Synthetic randomness via UUIDv4 seeding (Picsum Source).
* **Entropy:** Maximized randomness to prevent overfitting on specific visual domains.
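The seeding scheme listed above can be sketched in a few lines. This is a minimal illustration, not the author's generation script: the URL template is inferred from the sample record's `url` field, and the helper names `make_seed` and `seed_to_url` are hypothetical.

```python
import uuid

# Assumption: URL template inferred from the sample record shown in this card.
PICSUM_TEMPLATE = "https://picsum.photos/seed/{seed}/{width}/{height}"


def make_seed() -> str:
    """Return a fresh random UUIDv4 string to use as a seed."""
    return str(uuid.uuid4())


def seed_to_url(seed: str, width: int = 1024, height: int = 1024) -> str:
    """Build the source URL for a seed at the card's stated 1024x1024 resolution."""
    return PICSUM_TEMPLATE.format(seed=seed, width=width, height=height)


seed = make_seed()
print(seed, seed_to_url(seed))
```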
---
## 💾 Dataset Structure
The dataset is sharded into ~1GB Parquet files to facilitate distributed training and streaming. Each row represents a unique image sample generated from a unique seed.
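For planning purposes, the declared metadata allows a rough estimate of shard count and rows per shard. The sketch below takes `num_bytes` and `num_examples` from the YAML header at face value and assumes that "~1GB" means 1 GiB; actual shard counts and row sizes may differ.

```python
# Values copied from the card's YAML metadata.
NUM_BYTES = 1_099_511_627_776  # num_bytes for the train split (exactly 1 TiB)
NUM_EXAMPLES = 500_000_000     # num_examples for the train split

# Assumption: "~1GB" shards are treated as 1 GiB (2**30 bytes) for round numbers.
SHARD_BYTES = 2**30

num_shards = NUM_BYTES // SHARD_BYTES
rows_per_shard = NUM_EXAMPLES / num_shards
bytes_per_example = NUM_BYTES / NUM_EXAMPLES

print(f"shards: {num_shards}")
print(f"rows per shard: {rows_per_shard:,.0f}")
print(f"average bytes per example: {bytes_per_example:,.0f}")
```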
### Data Fields
| Field | Type | Description |
| :--- | :--- | :--- |
| **`image`** | `image` | The image, decoded to a PIL image object when loaded with `datasets`. |
| **`url`** | `string` | The source URL containing the unique seed used for generation. |
| **`seed`** | `string` | The UUIDv4 seed key responsible for the image's visual output. |
### Sample Data
```json
{
  "image": "<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1024x1024>",
  "url": "https://picsum.photos/seed/a1b2-c3d4-e5f6/1024/1024",
  "seed": "a1b2-c3d4-e5f6"
}
```
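Because `url` embeds the same value stored in `seed`, a record can be sanity-checked by parsing the URL. The following is a minimal sketch, assuming every URL follows the `/seed/<seed>/<width>/<height>` layout of the sample; `seed_from_url` and `is_consistent` are hypothetical helper names.

```python
from urllib.parse import urlparse


def seed_from_url(url: str) -> str:
    """Extract the seed from a URL shaped like /seed/<seed>/<width>/<height>."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "seed":
        raise ValueError(f"unexpected URL layout: {url!r}")
    return parts[1]


def is_consistent(record: dict) -> bool:
    """Return True when the record's `seed` matches the seed embedded in `url`."""
    return seed_from_url(record["url"]) == record["seed"]


# Record values taken from the sample above (image field omitted).
sample = {
    "url": "https://picsum.photos/seed/a1b2-c3d4-e5f6/1024/1024",
    "seed": "a1b2-c3d4-e5f6",
}
print(is_consistent(sample))
```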
---
## 🛠️ Usage
### 1. Streaming (Recommended)
Because the dataset is very large (1TB+), streaming it is recommended over downloading it in full.
```python
from datasets import load_dataset

# Stream the dataset (No disk space required)
dataset = load_dataset("LAYEK-143/Open-Pixel-1T", split="train", streaming=True)

# Iterate through images
for i, sample in enumerate(dataset):
    print(f"Processing image {i}: {sample['seed']}")
    image = sample["image"]
    image.show()
    if i == 5:
        break
```
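The early `break` in the loop above can also be written with `itertools.islice`, which works on any iterable, including a streamed dataset. The sketch below is an illustration only: `head` is a hypothetical helper, and `fake_stream` is a stand-in generator so the example runs without network access.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def head(stream: Iterable[Dict[str, Any]], n: int) -> Iterator[Dict[str, Any]]:
    """Yield at most the first n samples; the rest of the stream is never read."""
    return islice(stream, n)


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for the streamed dataset: endless seed-only records."""
    i = 0
    while True:
        yield {"seed": f"seed-{i}"}
        i += 1


for sample in head(fake_stream(), 3):
    print(sample["seed"])
```

With the real dataset, `head(dataset, 6)` would cover the same six samples as the `if i == 5: break` check.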
### 2. PyTorch DataLoader Integration
The dataset is optimized for high-throughput training pipelines.
```python
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from torchvision import transforms

# Define transforms
transform_pipeline = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

def process_batch(examples):
    examples["pixel_values"] = [transform_pipeline(img.convert("RGB")) for img in examples["image"]]
    return examples

# Load in streaming mode
dataset = load_dataset("LAYEK-143/Open-Pixel-1T", split="train", streaming=True)
dataset = dataset.map(process_batch, batched=True, remove_columns=["image", "url", "seed"])

# Create Loader
dataloader = DataLoader(dataset, batch_size=64)
```
---
## ⚖️ Citation & License
### License
This dataset is released under the **MIT License**. You are free to use it for research, commercial, and open-source projects.
### Citation
If you use this dataset in your research or project, please cite it as:
```bibtex
@dataset{open_pixel_1t,
  author = {Ryan Shelby},
  title = {Open-Pixel-1T: A Large-Scale Synthetic Visual Atlas},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/LAYEK-143/Open-Pixel-1T}},
  note = {Targeting 100TB of open visual data}
}
```
---
<div align="center">
Created with ❤️ by <b>Ryan Shelby</b> | 2026
</div>
Provider:
LAYEK-143