A nani-inc venture
Every egocentric frame aligned with a full 3D Gaussian splat reconstruction. View-independent training data for world models and VLAs. Something that doesn't exist yet.
The Problem
World models learn from fixed camera perspectives. But robots have different cameras in different positions. Training on 2D ego video creates a domain gap at deployment.
2D video can't tell you how far the cup is, the 3D trajectory of the hand, or spatial relationships between objects. You're forcing models to infer what should be ground truth.
In egocentric video, hands constantly occlude objects. The model never learns object permanence because it can't see what's behind the hand during a grasp.
Simulators don't look like reality. Training in sim and deploying in the real world means visual domain shift: the geometry is right, but the pixels are wrong.
V-JEPA 2 achieved zero-shot robot planning with just 62 hours of manipulation data. The models aren't compute-starved. They're data-starved. And they're getting the wrong kind of data.
— The thesis

What We're Building
We capture manipulation tasks from a first-person view while simultaneously reconstructing the entire scene as a 3D Gaussian splat. Every frame of egocentric video is aligned with a view-independent 3D representation.
This means you can render the scene from ANY camera pose—the ego view, a robot wrist cam, overhead, or arbitrary novel views. Same scene, infinite perspectives.
The model learns not just "what the human saw" but "what the 3D world looked like, and here's one view of it."
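Because every ego pose lives in the splat's coordinate frame, deriving a new camera to render from is just composing rigid transforms. A minimal sketch: compute a hypothetical "robot wrist cam" pose from a recorded ego pose by applying a fixed rigid offset. The offset values and the simple z-yaw parameterization are illustrative only; a real pipeline would use calibrated extrinsics and full SO(3) rotations.

```python
import numpy as np

def pose_matrix(translation, yaw_rad):
    """4x4 rigid transform: rotation about the z axis, then translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

# Recorded ego camera pose in splat coordinates (illustrative numbers).
ego_pose = pose_matrix([1.0, 0.5, 1.6], yaw_rad=0.0)

# Hypothetical wrist-cam offset relative to the ego camera: 0.8 m closer
# to the table, facing back toward the scene.
offset = pose_matrix([0.0, 0.0, -0.8], yaw_rad=np.pi)

# Novel camera pose, ready to hand to a 3DGS renderer.
wrist_pose = ego_pose @ offset
print(wrist_pose[:3, 3])  # -> [1.  0.5 0.8]
```

Any number of such poses can be generated per frame; each one is a valid query against the same reconstructed scene.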
Figure 1: Data Structure
┌────────────────────────────────────────────────────────────────────────┐
│ CAPTURE │
│ │
│ Egocentric Camera ──────┬─────── Multi-view Rig ──────┐ │
│ (Vision Pro / GoPro) │ (for 3DGS recon) │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Ego Video │ │ 3D Splat │ │
│ │ + Poses │ │ Scene │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ └──────────┬─────────────────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ ALIGNMENT │ │
│ │ Ego poses in │ │
│ │ splat coords │ │
│ └───────┬────────┘ │
└──────────────────────────────────────┼─────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────────────────┐
│ OUTPUT: Per-frame aligned data │
│ │
│ • ego_rgb[t] - what the human saw │
│ • ego_pose[t] - where the camera was (6DoF in splat) │
│ • rendered_rgb[t] - ego view rendered FROM the splat │
│ • depth[t] - dense depth from splat │
│ • hand_pose[t] - 3D MANO mesh │
│ • novel_views[t][k] - arbitrary camera renders │
│ • splat.ply - full scene geometry │
└────────────────────────────────────────────────────────────────────────┘
Each trajectory includes both raw capture and aligned derivatives. The splat enables unlimited novel view synthesis.
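The per-frame fields above might map onto a record like the following. This is a sketch, not a finalized spec: the class names, path layout, and 6-DoF tuple encoding are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class AlignedFrame:
    t: float               # timestamp in seconds
    ego_rgb: str           # path to the egocentric RGB frame
    ego_pose: tuple        # 6-DoF camera pose in splat coordinates
    rendered_rgb: str      # the same ego view re-rendered from the splat
    depth: str             # path to dense depth rendered from the splat
    hand_pose: str         # path to 3D MANO hand mesh parameters
    novel_views: list = field(default_factory=list)  # extra camera renders

@dataclass
class Trajectory:
    splat_ply: str         # full scene geometry (splat.ply)
    frames: list = field(default_factory=list)

traj = Trajectory(splat_ply="scene_0001/splat.ply")
traj.frames.append(AlignedFrame(
    t=0.0,
    ego_rgb="scene_0001/ego/000000.png",
    ego_pose=(0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    rendered_rgb="scene_0001/render/000000.png",
    depth="scene_0001/depth/000000.npy",
    hand_pose="scene_0001/mano/000000.json",
))
```

The key property is that every field is expressed in, or rendered from, one shared coordinate frame: the splat's.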
Why This Matters
• Cross-embodiment views: train on human ego video, then render synthetic robot-camera views through the same scene. The visual domain matches because it's the same 3D reconstruction.
• Ground-truth geometry: no depth estimation, no stereo matching, no noise. The splat gives you metrically accurate 3D: true spatial relationships, not 2D proxies.
• Object permanence: the full 3D scene exists even when occluded in the ego view. Render "what's behind the hand" as auxiliary supervision.
• Sim-to-real without the gap: import the splat into your simulator, train policies in the captured scene, deploy to the real scene. Zero visual domain shift.
• Data multiplication: one real trajectory → thousands of synthetic trajectories rendered through the same scene. Real pixels, synthetic actions.
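The "one real trajectory → thousands of synthetic trajectories" idea can be sketched as sampling extra camera poses per frame, each of which would then be passed to a 3DGS renderer (not shown). The orbit radius, height, and look-at construction below are illustrative assumptions, not part of the dataset spec.

```python
import numpy as np

def look_at(eye, target, up=(0.0, 0.0, 1.0)):
    """Camera-to-world rotation whose -z axis points from eye to target."""
    fwd = np.asarray(target, float) - np.asarray(eye, float)
    fwd /= np.linalg.norm(fwd)
    right = np.cross(fwd, up)
    right /= np.linalg.norm(right)
    new_up = np.cross(right, fwd)
    # Columns are the camera's x, y, z axes in world coordinates.
    return np.stack([right, new_up, -fwd], axis=1)

center = np.array([0.0, 0.0, 0.9])   # approximate tabletop scene center
radius, k = 1.5, 8                   # orbit radius (m), views per frame

poses = []
for theta in np.linspace(0.0, 2.0 * np.pi, k, endpoint=False):
    eye = center + [radius * np.cos(theta), radius * np.sin(theta), 0.6]
    poses.append((eye, look_at(eye, center)))

print(len(poses))  # -> 8 candidate novel views for one captured frame
```

Eight views per frame of a 30 fps trajectory already multiplies the data by nearly an order of magnitude, and the sampling distribution can be biased toward plausible robot camera placements.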
Roadmap
• 5-10 tabletop manipulation tasks. Ego video + 3DGS + hand pose + alignment. Release as a small benchmark. Technical blog post. Validate with research groups.
• Portable capture rig. Annotation automation. 100+ scenes, 500+ trajectories. Diverse environments. Format compatibility with major frameworks.
Every volumetric portrait activation is also a data capture opportunity. B2C portrait revenue + B2B dataset licensing from the same sessions.
If you're working on world models, VLAs, or embodied AI and want early access or to shape the dataset spec—let's talk.
ventures@nani-inc.com