Overview: Large 2D vision and multimodal models have shown how to learn from both supervised and unannotated data and to transfer across tasks, and these lessons point to a clear path for 3D spatial tasks. Yet many 3D-related systems still rely on long, brittle pipelines (e.g., COLMAP) that are hard to scale and slow to adapt. This workshop focuses on end-to-end 3D learning (E2E3D): a single trainable system that maps raw images or video to complete 3D representations and supports downstream tasks that run in real time and scale with large datasets. Our goal is practical impact in robotics, extended reality, and scientific imaging. Topics include architectures that map from pixels to 3D without hand-tuned steps; cross-modal training; data engines that mine in-the-wild video at scale; tight integration with end-to-end planning and control; efficient deployment on edge devices and robots; and methods for scientific imaging, from protein structure to cellular microscopy. By unifying modeling, inference, and optimization in a single data-driven approach, the E2E3D workshop aims to chart the course for next-generation spatial intelligence systems.
Modeling and learning. E2E3D studies unified architectures that map pixels to 3D with minimal postprocessing; pretraining that embeds geometric priors such as scale, viewpoint, and occlusion; world models and vision-language-action models that use spatial memory to handle spatio-temporal dynamics; and differentiable rendering and physics that provide gradients for shape, appearance, and motion.
Data and pretraining. E2E3D focuses on open, large-scale data sources, including long video, multi-view, and multi-sensor logs for robust 3D pretraining; cross-modal alignment that uses 2D and image-text corpora to ground language and action; auto-annotation with quality control and reproducible protocols; and data governance covering licensing, privacy, and safety for 3D assets.
Systems, evaluation, and impact. E2E3D emphasizes real-time and edge deployment on robots and mobile devices; holistic metrics that report accuracy, latency, memory, and energy; robustness and safety for open-world generalization and failure analysis; and applications in autonomous driving, XR, industrial and scientific imaging, and mapping.
| Speaker | Affiliation | Topic |
|---|---|---|
| Georgios Pavlakos | UT Austin | End-to-end view synthesis and 3D humans |
| Jiajun Wu | Stanford University | Physical scene generation and understanding |
| Marco Pavone | Stanford / NVIDIA | End-to-end vision-language-action (VLA) models |
| Paul-Edouard Sarlin | | Geometric learning and mapping |
| Luca Carlone | MIT | SLAM and robotic perception |
| Time | Event |
|---|---|
| 00:00-00:05 | Opening Remarks |
| 00:05-00:50 | Keynote: Georgios Pavlakos |
| 00:50-01:35 | Keynote: Jiajun Wu |
| 01:35-02:20 | Keynote: Paul-Edouard Sarlin |
| 02:20-03:05 | Poster and Awards Session |
| 03:05-03:50 | Keynote: Marco Pavone |
| 03:50-04:35 | Keynote: Luca Carlone |
| 04:35-04:45 | Closing Remarks |
We invite both short (up to 4 pages) and long (up to 8 pages) paper submissions, with references and supplementary materials excluded from the page limits. Short papers may present original but unfinished research or serve as technical reports describing implementations built on open-source frameworks. Authors can choose archival or non-archival submission; non-archival submissions may be concurrently under review elsewhere where external policies permit.
All accepted papers will be presented as posters, with three selected for oral presentations. A single best paper will be chosen from among the long papers, accompanied by a cash prize from our sponsors.