A nani-inc venture
Every egocentric frame aligned with a full 3D Gaussian splat reconstruction. View-independent training data for world models and VLAs. Something that doesn't exist yet.
The Problem
World models learn from fixed camera perspectives. But robots have different cameras in different positions. Training on 2D ego video creates a domain gap at deployment.
2D video can't tell you how far the cup is, the 3D trajectory of the hand, or spatial relationships between objects. You're forcing models to infer what should be ground truth.
In egocentric video, hands constantly occlude objects. The model never learns object permanence because it can't see what's behind the hand during a grasp.
Simulators don't look like reality. Training in sim and deploying in the real world means visual domain shift: the geometry is right, but the pixels are wrong.
V-JEPA 2 achieved zero-shot robot planning with just 62 hours of manipulation data. The models aren't compute-starved. They're data-starved. And they're getting the wrong kind of data.
— The thesis

What We're Building
We capture manipulation tasks from a first-person view while simultaneously reconstructing the entire scene as a 3D Gaussian splat. Every frame of egocentric video is aligned with a view-independent 3D representation.
This means you can render the scene from ANY camera pose—the ego view, a robot wrist cam, overhead, or arbitrary novel views. Same scene, infinite perspectives.
The model learns not just "what the human saw" but "what the 3D world looked like, and here's one view of it."
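Because every ego pose lives in the splat's coordinate frame, deriving a new camera to render from is just composing rigid transforms. A minimal sketch: compute a hypothetical "robot wrist cam" pose from a recorded ego pose by applying a fixed rigid offset. The offset values and the simple z-yaw parameterization are illustrative only; a real pipeline would use calibrated extrinsics and full SO(3) rotations.

```python
import numpy as np

def pose_matrix(translation, yaw_rad):
    """4x4 rigid transform: rotation about the z axis, then translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

# Recorded ego camera pose in splat coordinates (illustrative numbers).
ego_pose = pose_matrix([1.0, 0.5, 1.6], yaw_rad=0.0)

# Hypothetical wrist-cam offset relative to the ego camera: 0.8 m closer
# to the table, facing back toward the scene.
offset = pose_matrix([0.0, 0.0, -0.8], yaw_rad=np.pi)

# Novel camera pose, ready to hand to a 3DGS renderer.
wrist_pose = ego_pose @ offset
print(wrist_pose[:3, 3])  # -> [1.  0.5 0.8]
```

Any number of such poses can be generated per frame; each one is a valid query against the same reconstructed scene.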
Figure 1: Data Structure
┌────────────────────────────────────────────────────────────────────────┐
│ CAPTURE │
│ │
│ Egocentric Camera ──────┬─────── Multi-view Rig ──────┐ │
│ (Vision Pro / GoPro) │ (for 3DGS recon) │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Ego Video │ │ 3D Splat │ │
│ │ + Poses │ │ Scene │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ └──────────┬─────────────────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ ALIGNMENT │ │
│ │ Ego poses in │ │
│ │ splat coords │ │
│ └───────┬────────┘ │
└──────────────────────────────────────┼─────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────────────────┐
│ OUTPUT: Per-frame aligned data │
│ │
│ • ego_rgb[t] - what the human saw │
│ • ego_pose[t] - where the camera was (6DoF in splat) │
│ • rendered_rgb[t] - ego view rendered FROM the splat │
│ • depth[t] - dense depth from splat │
│ • hand_pose[t] - 3D MANO mesh │
│ • novel_views[t][k] - arbitrary camera renders │
│ • splat.ply - full scene geometry │
└────────────────────────────────────────────────────────────────────────┘
Each trajectory includes both raw capture and aligned derivatives. The splat enables unlimited novel view synthesis.
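The per-frame fields above might map onto a record like the following. This is a sketch, not a finalized spec: the class names, path layout, and 6-DoF tuple encoding are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class AlignedFrame:
    t: float               # timestamp in seconds
    ego_rgb: str           # path to the egocentric RGB frame
    ego_pose: tuple        # 6-DoF camera pose in splat coordinates
    rendered_rgb: str      # the same ego view re-rendered from the splat
    depth: str             # path to dense depth rendered from the splat
    hand_pose: str         # path to 3D MANO hand mesh parameters
    novel_views: list = field(default_factory=list)  # extra camera renders

@dataclass
class Trajectory:
    splat_ply: str         # full scene geometry (splat.ply)
    frames: list = field(default_factory=list)

traj = Trajectory(splat_ply="scene_0001/splat.ply")
traj.frames.append(AlignedFrame(
    t=0.0,
    ego_rgb="scene_0001/ego/000000.png",
    ego_pose=(0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    rendered_rgb="scene_0001/render/000000.png",
    depth="scene_0001/depth/000000.npy",
    hand_pose="scene_0001/mano/000000.json",
))
```

The key property is that every field is expressed in, or rendered from, one shared coordinate frame: the splat's.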
Why This Matters
• Cross-embodiment views: train on human ego video, then render synthetic robot-camera views through the same scene. The visual domain matches because it's the same 3D reconstruction.
• Ground-truth geometry: no depth estimation, no stereo matching, no noise. The splat gives you metrically accurate 3D: true spatial relationships, not 2D proxies.
• Object permanence: the full 3D scene exists even when occluded in the ego view. Render "what's behind the hand" as auxiliary supervision.
• Sim-to-real without the gap: import the splat into your simulator, train policies in the captured scene, deploy to the real scene. Zero visual domain shift.
• Data multiplication: one real trajectory → thousands of synthetic trajectories rendered through the same scene. Real pixels, synthetic actions.
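The "one real trajectory → thousands of synthetic trajectories" idea can be sketched as sampling extra camera poses per frame, each of which would then be passed to a 3DGS renderer (not shown). The orbit radius, height, and look-at construction below are illustrative assumptions, not part of the dataset spec.

```python
import numpy as np

def look_at(eye, target, up=(0.0, 0.0, 1.0)):
    """Camera-to-world rotation whose -z axis points from eye to target."""
    fwd = np.asarray(target, float) - np.asarray(eye, float)
    fwd /= np.linalg.norm(fwd)
    right = np.cross(fwd, up)
    right /= np.linalg.norm(right)
    new_up = np.cross(right, fwd)
    # Columns are the camera's x, y, z axes in world coordinates.
    return np.stack([right, new_up, -fwd], axis=1)

center = np.array([0.0, 0.0, 0.9])   # approximate tabletop scene center
radius, k = 1.5, 8                   # orbit radius (m), views per frame

poses = []
for theta in np.linspace(0.0, 2.0 * np.pi, k, endpoint=False):
    eye = center + [radius * np.cos(theta), radius * np.sin(theta), 0.6]
    poses.append((eye, look_at(eye, center)))

print(len(poses))  # -> 8 candidate novel views for one captured frame
```

Eight views per frame of a 30 fps trajectory already multiplies the data by nearly an order of magnitude, and the sampling distribution can be biased toward plausible robot camera placements.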
Roadmap
• 5-10 tabletop manipulation tasks. Ego video + 3DGS + hand pose + alignment. Release as a small benchmark. Technical blog post. Validate with research groups.
• Portable capture rig. Annotation automation. 100+ scenes, 500+ trajectories. Diverse environments. Format compatibility with major frameworks.
Every volumetric portrait activation is also a data capture opportunity. B2C portrait revenue + B2B dataset licensing from the same sessions.
If you're working on world models, VLAs, or embodied AI and want early access or to shape the dataset spec—let's talk.
ventures@nani-inc.com