ZeroOneCreative/amara-spatial-10k

Name: ZeroOneCreative/amara-spatial-10k
Creator: ZeroOneCreative
Published: 2026-04-17 15:23:56
License: not listed on this page (the dataset card below declares CC BY 4.0)

Source: Hugging Face · Updated 2026-04-17 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/ZeroOneCreative/amara-spatial-10k


Description:

---
license: cc-by-4.0
task_categories:
- text-to-3d
- image-to-3d
size_categories:
- 10K<n<100K
tags:
- 3d
- mesh
- glb
- synthetic
- spatial
- pbr
- webdataset
- embodied-ai
pretty_name: AmaraSpatial-10K
configs:
- config_name: default
  data_files:
  - split: train
    path: "metadata/*.parquet"
---

# AmaraSpatial-10K

### A Semantically Anchored, Metric-Scale 3D Dataset for Embodied AI and Spatial Computing

![AmaraSpatial-10K Hero](figures/Amara_Huggingface_Hero2.png)

**10,071 AI-generated 3D meshes across 65 categories**, from basilisks to bassoons and cottages to cosmic stations, curated by **Zero One Creative** to close the *spatial alignment gap* that makes most generative 3D repositories unusable for zero-shot deployment in game engines, robotics simulators, and AR/VR pipelines.

Every asset is simultaneously **metric-scaled**, **semantically anchored**, **PBR-ready**, and **richly described**: four properties that, to our knowledge, do not co-occur in any other public 3D dataset at this scale.

---

## Why this dataset exists

Recent image-to-3D models can produce plausible meshes, but their outputs are spatially *ungrounded*: a generated chair may be 40 m tall, oriented sideways, with its pivot point floating at the centroid. Large repositories inherit and compound this problem: ShapeNet has no PBR, Objaverse has severe quality variance and arbitrary scale, and GSO is metric-accurate but has only ~1K assets.

The next evolution of 3D datasets is not pure volume, but **spatial and semantic alignment**. AmaraSpatial-10K is curated to be that.

### The four properties, all at once

- 🟡 **Real-world metric scaling.** Assets are scaled to true physical dimensions in metres and validated by a novel **Scale Plausibility Score (SPS)** using an independent LLM-as-judge.
- 🟡 **Semantic origin anchoring.** Origins are placed by functional context: bottom-centre for ground-resting items (chairs, tables), centre for suspended objects (chandeliers, drones), top-centre for ceiling-mounted items.
- 🟡 **Production-ready PBR & physics.** Main meshes are decimated to ~50K triangles with separated Normal/Roughness maps (no baked lighting), and ship with a paired convex collision hull (<500 triangles).
- 🟡 **Rich multi-modal metadata.** Every asset includes multi-sentence descriptions, a 2D seed image, and five camera renders, yielding ~18× the descriptive concept density of Objaverse tags.

![Category Distribution](figures/category_donut.png)

![AmaraSpatial-Size](figures/Amara_Spatial_Size.png)

![Demo](figures/seed_to_mesh.gif)

---

## Key results at a glance

Averages across 9 evaluated categories (5,247 assets in AmaraSpatial-10K, 2,856 matched in Objaverse):

| Metric | AmaraSpatial-10K | Objaverse (matched) |
|---|---|---|
| Mean bounding-box height across 9 categories | **3.89 m** | 1,723 m |
| Intra-category scale **CV** (9-category mean) ↓ | **3.40** | 9.92 |
| Seating assets in plausible range [0.6, 1.1] m ↑ | **40.7 %** | 7.7 % |
| Mean **SPS** ↑ | **0.68** | n/a |
| Assets within plausible size range (aggregate) ↑ | **29.5 %** | n/a |
| Anchor within 1 cm of semantic target ↑ | **79.7 %** | 4.2 % |
| Anchors outside object bounding box ↓ | **5.2 %** | 35.2 % |
| CLIP Text ↔ 3D coherence ↑ | **0.238** | 0.203 |
| LLM Concept Density (0–5) ↑ | **2.62** | 0.14 |
| UV-mapped ↑ | **100 %** | 94 % |

Where SPS and CV stand for:

- **Scale Plausibility Score (SPS)**: a continuous score in [0, 1]. An asset whose measured primary dimension falls inside an LLM-judged plausible interval `[ℓ, u]` scores 1.0; outside, SPS decays as a Gaussian normalised by the interval half-width `h = (u − ℓ) / 2`. The normalisation means narrow-range categories (tea cup: 7–12 cm) and wide-range ones (building: 3–100 m) are penalised on the same *relative* scale. The interval itself comes from an *independent* LLM instance that never sees our dataset.
- **Coefficient of Variation (CV)**: `σ / x̄` of a category's bounding-box heights. Low CV means every chair is roughly chair-sized; high CV means the category contains objects spanning orders of magnitude.

### What the numbers actually say

- **Scale is physical, not arbitrary.** Across nine evaluated categories, AmaraSpatial-10K's 5,247 assets have a mean bounding-box height of **3.89 m**. The matched 2,856 Objaverse assets average **1,723 m**, roughly 440× larger, driven by outliers spanning from 2 cm to over 100 km within a single category.
- **2.9× tighter intra-category distributions.** Mean CV of **3.40** across nine categories vs. **9.92** for Objaverse. Individual categories improve further: Seating drops from CV 11.75 to 1.03, Tableware from 10.13 to 2.17.
- **Scale plausibility, directly measured.** **40.7 %** of our seating assets fall in the physically plausible height range [0.6, 1.1] m, vs. only **7.7 %** in Objaverse. On our own dataset, the aggregate mean SPS across 5,247 assets is **0.68**, with **29.5 %** scoring a perfect 1.0.
- **Anchors you can actually build on.** **79.7 %** of assets land within 1 cm of their semantically correct anchor (bottom-centre, centre, or top-centre), vs. **4.2 %** in Objaverse. Only **5.2 %** of our anchors fall outside the object's own bounding box, vs. **35.2 %** in Objaverse.
- **18× richer descriptions.** Each description covers, on average, **2.62** of the 5 core visual constraint axes (Color, Material, Style, Shape, Component) used by text-to-3D models, vs. **0.14** for Objaverse tags.

See **"Generation and QC methodology"** below for how every metric is computed.

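The SPS and CV definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the card says the decay is Gaussian and normalised by the half-width `h`, but it does not give the exact constant, so the `exp(-0.5 * (d / h)^2)` form, the function names, and the use of the population standard deviation are assumptions rather than the authors' implementation.

```python
import math
import statistics


def scale_plausibility_score(measured_m, lower_m, upper_m):
    """Hypothetical SPS: 1.0 inside [lower_m, upper_m], Gaussian decay outside."""
    if lower_m <= measured_m <= upper_m:
        return 1.0
    half_width = (upper_m - lower_m) / 2.0  # h = (u - l) / 2
    # Distance from the measurement to the nearest interval bound.
    distance = lower_m - measured_m if measured_m < lower_m else measured_m - upper_m
    # Assumed decay constant: one half-width outside the interval gives exp(-0.5).
    return math.exp(-0.5 * (distance / half_width) ** 2)


def coefficient_of_variation(heights_m):
    """CV = sigma / mean of a category's bounding-box heights (population sigma assumed)."""
    return statistics.pstdev(heights_m) / statistics.fmean(heights_m)


if __name__ == "__main__":
    # The seating interval [0.6, 1.1] m is taken from the results table above.
    print(scale_plausibility_score(0.9, 0.6, 1.1))   # inside the interval
    print(scale_plausibility_score(40.0, 0.6, 1.1))  # the "40 m chair" case
    print(coefficient_of_variation([0.8, 0.9, 1.0]))
```

Dividing the distance by `h` is what puts a tea cup and a building on the same relative scale: being one half-width outside the interval costs the same score in both categories.
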
---

## At a glance

| | |
|---|---|
| **Assets** | 10,071 |
| **Total size** | >130 GB |
| **Top categories** | 11 core themes, 65 top-level classes (`ClassLabel`) |
| **Sub-categories** | 476 (`ClassLabel`) |
| **Metadata format** | Parquet (with HF `Image` features inline) |
| **Mesh format** | WebDataset `.tar` shards containing GLB binaries |
| **Texture size** | 2048 × 2048 |
| **Mean face count** | ~47,000 (main mesh), <500 (collision hull) |
| **Licence** | CC BY 4.0 |

---

## What's in the box

Every asset ships with:

- **A seed image**: the text-conditioned synthesis image used to generate the mesh.
- **A main GLB mesh**: metric-scaled, semantically anchored, UV-unwrapped, ~10 MB typical, 2K PBR textures.
- **A collision GLB**: simplified convex hull for physics and raycasting.
- **Five camera renders**: one perspective "doll-house" view plus four cardinal orthographic views (front, back, left, right).
- **Rich metadata**: 28 geometric and quality metrics, multi-sentence descriptions, structured category labels, and spatial orientation data.

Every column is filterable. Query "all animals with >80 % watertightness and <50K vertices" with a single Parquet predicate.

---

## Repository layout

```text
metadata/
  train-00000-of-00006.parquet   ~2.5 GB each, 6 shards
  train-00001-of-00006.parquet
  …
meshes/
  shard-00000.tar                ~5 GB each, 21 shards
  shard-00001.tar                each tar contains <asset_id>.glb + <asset_id>.collision.glb
  …
manifest.parquet                 asset_id → mesh_shard + category labels (small index)
top_categories.json              65 sorted ClassLabel names
sub_categories.json              476 sorted ClassLabel names
figures/                         README figures (hero, category donut, etc.)
```

You don't need to download 130 GB to poke around. The metadata parquet (~15 GB) has everything (descriptions, renders, quality scores) and downloads in minutes. The mesh tars (~115 GB) only matter when you actually want the 3D files.

---

## Schema

Every row in `metadata/*.parquet` has:

- **Identity**: `asset_id` (primary key), `top_category`, `sub_category`, `asset_basename`
- **Prompt**: `brief_description`, `full_description`
- **Visual** (HF `Image` features): `seed_image`, `render_perspective`, `render_front`, `render_back`, `render_left`, `render_right`
- **Mesh pointers**: `mesh_shard`, `mesh_path`, `collision_path` (join into the matching tar)
- **Geometry**: `vertices`, `decimation_faces`, `approx_islands`, `texture_size`, `aabb[3]`, `anchor_origin[3]`, `forward_axis`
- **Quality**: `watertight_percent`, `manifold_edge_ratio`, `degenerate_triangle_count`, `non_manifold_vertices`, `has_uv_coordinates`, `euler_number`, `unique_edges`
- **Collision mesh**: `collision_volume_ratio`, `collision_vertices`, `collision_faces`
- **Derived geometry**: `surface_area`, `mesh_volume`, `bounding_box_volume`, `average_edge_length`, `aspect_ratio`

---

## Quickstart

### Browse and filter metadata (~15 GB)

```python
from datasets import load_dataset

# Repo id as given in this listing's title and download link.
ds = load_dataset("ZeroOneCreative/amara-spatial-10k", split="train")
print(ds)

# High-quality animals only.
# If `top_category` loads as a ClassLabel (integer ids), compare against
# ds.features["top_category"].str2int("Animals") instead of the string.
animals = ds.filter(
    lambda r: r["top_category"] == "Animals" and r["watertight_percent"] > 80
)
print(f"{len(animals)} clean animal meshes")
animals[0]["render_perspective"].show()
```

### Stream meshes for training

```python
import webdataset as wds

url = "https://huggingface.co/datasets/ZeroOneCreative/amara-spatial-10k/resolve/main/meshes/shard-{00000..00020}.tar"
pipeline = wds.WebDataset(url, shardshuffle=True).shuffle(1000)

for sample in pipeline:
    asset_id = sample["__key__"]          # e.g. "Animals_Dragon_SM_MeshGen_FireDragon"
    glb_bytes = sample["glb"]             # main mesh
    coll_bytes = sample["collision.glb"]  # collision mesh
    # Join with metadata by asset_id for prompts + geometry fields
```

### Fetch a single asset by ID

```python
from huggingface_hub import hf_hub_download
import tarfile

# `ds` is the metadata dataset loaded in the first snippet.
row = next(r for r in ds if r["asset_id"] == "Animals_Dragon_SM_MeshGen_FireDragon")
shard = hf_hub_download(
    "ZeroOneCreative/amara-spatial-10k",
    f"meshes/shard-{row['mesh_shard']:05d}.tar",
    repo_type="dataset",
)
with tarfile.open(shard) as t:
    glb_bytes = t.extractfile(row["mesh_path"]).read()
```

### Download the whole dataset (~130 GB)

```bash
hf download ZeroOneCreative/amara-spatial-10k --repo-type dataset --local-dir ./spatial-10k
```

Resumable and parallel. Use `--include "metadata/*"` to grab only the metadata side.

---

## Generation and QC methodology

Every asset was produced through Zero One Creative's synthesis pipeline:

```
text-to-image seed → image-to-3D mesh → spatial alignment & scaling → UV unwrap → mesh decimation → collision-hull simplification → multi-view render
```

### Spatial alignment

Each raw mesh is transformed by a semantically determined rigid transform plus metric scale:

- **Metric scale**: an LLM-estimated physical dimension (in metres) for the asset's subcategory sets the scale factor.
- **Rotation**: PCA combined with semantic heuristics orients each mesh so its functional front faces +X and its vertical axis aligns to +Z.
- **Anchor translation**: origin placed at bottom-centre for ground-resting objects, centre for suspended objects, top-centre for ceiling-mounted objects.

### Quality checks

Every output was checked on both the main mesh and the collision mesh:

| Check | Metric | Column |
|---|---|---|
| Closed-surface completeness | % watertight triangulation | `watertight_percent` |
| Manifold geometry | Fraction of edges shared by exactly 2 faces | `manifold_edge_ratio` |
| Degenerate triangles | Zero-area / collinear triangle count | `degenerate_triangle_count` |
| Non-manifold vertices | Vertices where the surface self-intersects | `non_manifold_vertices` |
| Topology | Euler characteristic | `euler_number` |
| Collision fit | Collision-hull volume / main-mesh volume | `collision_volume_ratio` |
| UV coverage | Whether UV coordinates are present | `has_uv_coordinates` |

Every metric is a top-level column rather than a buried JSON blob, so you can **filter for your own quality bar rather than accepting ours.** We deliberately kept borderline-watertight meshes because the optimal threshold depends heavily on downstream use (rendering vs. physics simulation).

---

## Intended uses

AmaraSpatial-10K is designed to drop into:

- **LLM-driven scene composition**: correct scale and anchors reduce floating objects and interpenetrations without algorithmic changes.
- **Embodied AI and robotics simulators**: metric scale and PBR materials shrink the sim-to-real gap.
- **Text-to-3D / image-to-3D training & evaluation**: aligned text ↔ image ↔ mesh triplets enable cross-modal objectives.
- **Retrieval systems**: multi-sentence descriptions score higher than sparse Objaverse tags on CLIP text ↔ 3D coherence (0.238 vs. 0.203) and on LLM concept density (2.62 vs. 0.14).
- **Game-engine prototyping**: production-ready GLB with collision hulls, usable zero-shot in Unreal, Unity, or Godot.

---

## Licence

Released under **Creative Commons Attribution 4.0 International (CC BY 4.0)**. You are free to use, remix, redistribute, and build upon the assets for any purpose including commercial, provided you give appropriate credit.

---

*Built by [Zero One Creative](https://01c.ai).*
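
As a companion to the quality-check table under "Quality checks", the sketch below computes three of the listed quantities for an indexed triangle mesh using only the standard library. The definitions follow the table (edges shared by exactly two faces; Euler characteristic V − E + F). The function name, the returned dictionary keys, and the choice to treat a repeated vertex index as the only degeneracy test are assumptions, not the dataset's QC code.

```python
from collections import Counter


def topology_metrics(faces):
    """faces: list of (i, j, k) vertex-index triples for a triangle mesh."""
    edge_use = Counter()
    vertices = set()
    degenerate = 0
    for i, j, k in faces:
        vertices.update((i, j, k))
        if len({i, j, k}) < 3:
            # Repeated index: the triangle collapses to an edge or a point.
            degenerate += 1
            continue
        for a, b in ((i, j), (j, k), (k, i)):
            edge_use[(min(a, b), max(a, b))] += 1  # undirected edge key

    unique_edges = len(edge_use)
    manifold_edges = sum(1 for count in edge_use.values() if count == 2)
    return {
        "unique_edges": unique_edges,
        "manifold_edge_ratio": manifold_edges / unique_edges if unique_edges else 0.0,
        "degenerate_triangle_count": degenerate,
        # Euler characteristic V - E + F, counting only non-degenerate faces.
        "euler_number": len(vertices) - unique_edges + (len(faces) - degenerate),
    }


if __name__ == "__main__":
    tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
    print(topology_metrics(tetrahedron))
```

A closed, sphere-like surface such as the tetrahedron has every edge shared by exactly two faces and an Euler characteristic of 2, which is why these two columns are useful as a quick filter before physics simulation.
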

Provider:

ZeroOneCreative
