A nani-inc venture

The missing data layer for embodied AI: volumetric scenes + egocentric video

Every egocentric frame aligned with a full 3D Gaussian splat reconstruction. View-independent training data for world models and VLAs. Something that doesn't exist yet.

Talk to Us · See What We're Building

Current egocentric datasets are 2D video. That's fundamentally limiting.

View Synthesis Bottleneck

World models learn from fixed camera perspectives. But robots have different cameras in different positions. Training on 2D ego video creates a domain gap at deployment.

Depth Ambiguity

2D video can't tell you how far the cup is, the 3D trajectory of the hand, or spatial relationships between objects. You're forcing models to infer what should be ground truth.

Occlusion Hell

In egocentric video, hands constantly occlude objects. The model never learns object permanence because it can't see what's behind the hand during a grasp.

Sim2Real Gap

Simulators don't look like reality. Training in sim and deploying in the real world means a visual domain shift. The geometry is right but the pixels are wrong.

V-JEPA 2 achieved zero-shot robot planning with just 62 hours of manipulation data. The models aren't compute-starved. They're data-starved. And they're getting the wrong kind of data.

— The thesis

What We're Building

Egocentric manipulation video with full volumetric ground truth.

We capture manipulation tasks from a first-person view while simultaneously reconstructing the entire scene as a 3D Gaussian splat. Every frame of egocentric video is aligned with a view-independent 3D representation.

This means you can render the scene from ANY camera pose—the ego view, a robot wrist cam, overhead, or arbitrary novel views. Same scene, infinite perspectives.

The model learns not just "what the human saw" but "what the 3D world looked like, and here's one view of it."
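The alignment step in Figure 1 amounts to one rigid transform per capture: express every egocentric camera pose in the splat's coordinate frame. A minimal sketch of that bookkeeping, assuming 4x4 homogeneous camera-to-world matrices; the function name and the example transform are illustrative, not part of any published spec:

```python
import numpy as np

def align_ego_pose(T_world_ego: np.ndarray, T_splat_world: np.ndarray) -> np.ndarray:
    """Express an egocentric camera pose in the splat's coordinate frame.

    T_world_ego:   4x4 camera-to-world pose from the ego device's SLAM.
    T_splat_world: 4x4 rigid transform mapping SLAM world coordinates into
                   the splat reconstruction's coordinates (one per capture).
    """
    return T_splat_world @ T_world_ego

# Example: the SLAM frame and the splat frame differ by a 90-degree yaw
# plus a translation of (1, 0, 0.5) meters.
theta = np.pi / 2
T_splat_world = np.array([
    [np.cos(theta), -np.sin(theta), 0.0, 1.0],
    [np.sin(theta),  np.cos(theta), 0.0, 0.0],
    [0.0,            0.0,           1.0, 0.5],
    [0.0,            0.0,           0.0, 1.0],
])
T_world_ego = np.eye(4)
T_world_ego[:3, 3] = [2.0, 0.0, 0.0]  # camera 2 m along the SLAM x-axis

T_splat_ego = align_ego_pose(T_world_ego, T_splat_world)
print(T_splat_ego[:3, 3])  # camera position in splat coords: [1. 2. 0.5]
```

Once every `ego_pose[t]` lives in splat coordinates, the same frame can be re-rendered from the splat and compared pixel-for-pixel against the real ego video.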

Figure 1: Data Structure

┌────────────────────────────────────────────────────────────────────────┐
│  CAPTURE                                                               │
│                                                                        │
│    Egocentric Camera ──────┬─────── Multi-view Rig ──────┐            │
│    (Vision Pro / GoPro)    │        (for 3DGS recon)     │            │
│                            ▼                              ▼            │
│                     ┌─────────────┐              ┌─────────────┐       │
│                     │  Ego Video  │              │  3D Splat   │       │
│                     │  + Poses    │              │  Scene      │       │
│                     └──────┬──────┘              └──────┬──────┘       │
│                            └──────────┬─────────────────┘              │
│                                       ▼                                │
│                              ┌────────────────┐                        │
│                              │   ALIGNMENT    │                        │
│                              │  Ego poses in  │                        │
│                              │  splat coords  │                        │
│                              └───────┬────────┘                        │
└──────────────────────────────────────┼─────────────────────────────────┘
                                       ▼
┌────────────────────────────────────────────────────────────────────────┐
│  OUTPUT: Per-frame aligned data                                        │
│                                                                        │
│    • ego_rgb[t]           - what the human saw                        │
│    • ego_pose[t]          - where the camera was (6DoF in splat)      │
│    • rendered_rgb[t]      - ego view rendered FROM the splat          │
│    • depth[t]             - dense depth from splat                    │
│    • hand_pose[t]         - 3D MANO mesh                              │
│    • novel_views[t][k]    - arbitrary camera renders                  │
│    • splat.ply            - full scene geometry                       │
└────────────────────────────────────────────────────────────────────────┘

Each trajectory includes both raw capture and aligned derivatives. The splat enables unlimited novel view synthesis.
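Concretely, the per-frame record in Figure 1 could map to a structure like the following. Field names follow the figure; the array shapes are illustrative assumptions rather than a published spec (MANO's 778-vertex mesh is the one fixed number):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FrameRecord:
    """One time step of an aligned trajectory (fields per Figure 1)."""
    ego_rgb: np.ndarray        # (H, W, 3) uint8: what the human saw
    ego_pose: np.ndarray       # (4, 4) camera-to-splat transform (6DoF)
    rendered_rgb: np.ndarray   # (H, W, 3) same view re-rendered from the splat
    depth: np.ndarray          # (H, W) float32 metric depth from the splat
    hand_pose: np.ndarray      # (778, 3) MANO mesh vertices in splat coords
    novel_views: list = field(default_factory=list)  # k extra renders per frame

H, W = 480, 640
frame = FrameRecord(
    ego_rgb=np.zeros((H, W, 3), np.uint8),
    ego_pose=np.eye(4),
    rendered_rgb=np.zeros((H, W, 3), np.uint8),
    depth=np.zeros((H, W), np.float32),
    hand_pose=np.zeros((778, 3), np.float32),
)
print(frame.depth.shape)  # (480, 640)
```

The scene-level `splat.ply` sits alongside the trajectory, so `novel_views` never needs to be exhaustive: any view not stored can be rendered on demand.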

Five capabilities that 2D egocentric data can't provide.

01

Cross-embodiment transfer

Train on human ego video, render synthetic robot-camera views through the same scene. The visual domain matches because it's the same 3D reconstruction.
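One way this could look in practice: derive a synthetic wrist-camera pose from the tracked hand pose via a fixed mount offset, then render the splat from that pose. The function name and the 5 cm offset are hypothetical choices about a target robot, not part of the capture:

```python
import numpy as np

def wrist_cam_pose(T_splat_hand: np.ndarray, T_hand_cam: np.ndarray) -> np.ndarray:
    """Synthetic robot wrist-camera pose from the tracked human hand pose.

    T_splat_hand: 4x4 hand pose in splat coordinates (from the MANO fit).
    T_hand_cam:   fixed 4x4 offset of the camera relative to the wrist mount
                  (an assumption about the target robot embodiment).
    """
    return T_splat_hand @ T_hand_cam

# Hypothetical mount: camera 5 cm behind the wrist along the hand's z-axis.
T_hand_cam = np.eye(4)
T_hand_cam[2, 3] = -0.05

T_splat_hand = np.eye(4)
T_splat_hand[:3, 3] = [0.3, 0.1, 0.9]  # hand position in splat coords

T_cam = wrist_cam_pose(T_splat_hand, T_hand_cam)
print(T_cam[:3, 3])  # [0.3  0.1  0.85]
```

Rendering the splat from `T_cam` yields photoreal "robot's-eye" frames of a manipulation no robot ever performed.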

02

Ground truth depth + geometry

No depth estimation, no stereo matching, no noise. The splat gives you metrically accurate 3D. True spatial relationships, not 2D proxies.

03

Occlusion supervision

The full 3D scene exists even when occluded in ego view. Render "what's behind the hand" as auxiliary supervision. Teach object permanence.
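As a sketch of how that supervision could be wired up: render the splat without the hand, mask the hand's pixels in the ego frame, and penalize the model only where the scene was occluded. Names and the plain L2 objective are illustrative assumptions:

```python
import numpy as np

def occlusion_loss(pred: np.ndarray, splat_render: np.ndarray,
                   hand_mask: np.ndarray) -> float:
    """Mean squared error on hand-occluded pixels only.

    pred:         (H, W, 3) model's prediction of the full scene
    splat_render: (H, W, 3) splat rendered *without* the hand, i.e.
                  ground truth for what sits behind it
    hand_mask:    (H, W) bool, True where the hand occludes the scene
    """
    diff = (pred.astype(np.float64) - splat_render.astype(np.float64)) ** 2
    return float(diff[hand_mask].mean())

# Toy 2x2 frame: model predicts black, the splat says the occluded
# region is white, and two of the four pixels are behind the hand.
pred = np.zeros((2, 2, 3))
behind = np.ones((2, 2, 3))
mask = np.array([[True, False], [False, True]])
print(occlusion_loss(pred, behind, mask))  # 1.0
```

Unoccluded pixels keep their ordinary reconstruction loss; the masked term is pure auxiliary signal the 2D video alone could never provide.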

04

Real-to-sim with photorealism

Import the splat into your simulator. Train policies in the captured scene, deploy to the real scene. Zero visual domain gap.

05

Unlimited data augmentation

One real trajectory → thousands of synthetic trajectories rendered through the same scene. Real pixels, synthetic actions.
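The augmentation step above can be sketched as pure pose bookkeeping: perturb the real camera trajectory many times, then re-render each perturbed pose through the splat (rendering itself is out of scope here). Function name and noise scale are illustrative:

```python
import numpy as np

def jitter_poses(ego_poses: np.ndarray, n_aug: int,
                 trans_std: float = 0.02, seed: int = 0) -> np.ndarray:
    """Generate n_aug perturbed camera trajectories from one real one.

    ego_poses: (T, 4, 4) per-frame camera-to-splat poses.
    Each copy shifts the camera position by small Gaussian translations;
    rotations are left untouched in this minimal version.
    """
    rng = np.random.default_rng(seed)
    out = np.repeat(ego_poses[None], n_aug, axis=0)        # (n_aug, T, 4, 4)
    out[..., :3, 3] += rng.normal(0.0, trans_std, out[..., :3, 3].shape)
    return out

traj = np.repeat(np.eye(4)[None], 30, axis=0)  # 30-frame identity trajectory
aug = jitter_poses(traj, n_aug=1000)
print(aug.shape)  # (1000, 30, 4, 4)
```

Every jittered pose is still grounded in the same captured scene, which is what keeps the rendered pixels real rather than simulated.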

Research Context

Building on the foundation. Filling the gap.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Assran et al. — Meta AI, 2025

arXiv

3D Gaussian Splatting for Real-Time Radiance Field Rendering

Kerbl et al. — SIGGRAPH, 2023

Project

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

Hoque et al., 2025

arXiv

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Khazatsky et al. — RSS, 2024

Dataset

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Google DeepMind + 21 Institutions, 2023

Project

OpenVLA: An Open-Source Vision-Language-Action Model

Kim et al. — Stanford, 2024

Project

Phase 1

Proof of concept

5-10 tabletop manipulation tasks. Ego video + 3DGS + hand pose + alignment. Release as a small benchmark. Technical blog post. Validate with research groups.

Phase 2

Pipeline + Scale

Portable capture rig. Annotation automation. 100+ scenes, 500+ trajectories. Diverse environments. Format compatibility with major frameworks.

Phase 3

Dual revenue model

Every volumetric portrait activation is also a data capture opportunity. B2C portrait revenue + B2B dataset licensing from the same sessions.

This data doesn't exist yet. We're building it.

If you're working on world models, VLAs, or embodied AI and want early access or to shape the dataset spec—let's talk.

ventures@nani-inc.com