Overview: Large 2D vision and multimodal models have shown how to learn from both supervised and unannotated data and to transfer across tasks, and these lessons point to a clear path for 3D spatial tasks. Yet many 3D-related systems still rely on long, brittle pipelines (e.g., COLMAP) that are hard to scale and slow to adapt. This workshop focuses on end-to-end 3D learning (E2E3D): a single trainable system that maps raw images or video to complete 3D representations and supports downstream tasks that run in real time and scale with large datasets. Our goal is practical impact in robotics, extended reality, and scientific imaging. Topics include architectures that map from pixels to 3D without hand-tuned steps; cross-modal training; data engines that mine in-the-wild video at scale; tight integration with end-to-end planning and control; efficient deployment on edge devices and robots; and methods for scientific imaging, from protein structure to cellular microscopy. By unifying modeling, inference, and optimization in a single data-driven approach, the E2E3D workshop aims to chart the course for next-generation spatial intelligence systems.
Modeling and learning. E2E3D studies unified architectures that map pixels to 3D with minimal postprocessing; pretraining that embeds geometric priors such as scale, viewpoint, and occlusion; world models and vision-language-action models that use spatial memory to handle spatio-temporal dynamics; and differentiable rendering and physics that provide gradients for shape, appearance, and motion.
Data and pretraining. E2E3D focuses on open, large-scale data sources, including long video, multi-view, and multi-sensor logs for robust 3D pretraining; cross-modal alignment that uses 2D and image-text corpora to ground language and action; auto-annotation with quality control and reproducible protocols; and data governance covering licensing, privacy, and safety for 3D assets.
Systems, evaluation, and impact. E2E3D emphasizes real-time and edge deployment on robots and mobile devices; holistic metrics that report accuracy, latency, memory, and energy; robustness and safety for open-world generalization and failure analysis; and applications in autonomous driving, XR, industrial and scientific imaging, and mapping.
| Speaker | Affiliation | Topic |
|---|---|---|
| Georgios Pavlakos | UT Austin | End-to-end view synthesis and 3D humans |
| Jiajun Wu | Stanford University | Physical scene generation and understanding |
| Marco Pavone | Stanford / NVIDIA | End-to-end vision-language-action (VLA) models |
| Paul-Edouard Sarlin | | Geometric learning and mapping |
| Luca Carlone | MIT | SLAM and robotic perception |
| Time | Event |
|---|---|
| 00:00-00:05 | Opening Remarks |
| 00:05-00:50 | Keynote: Georgios Pavlakos |
| 00:50-01:35 | Keynote: Jiajun Wu |
| 01:35-02:20 | Keynote: Paul-Edouard Sarlin |
| 02:20-03:05 | Poster and Awards Session |
| 03:05-03:50 | Keynote: Marco Pavone |
| 03:50-04:35 | Keynote: Luca Carlone |
| 04:35-04:45 | Closing Remarks |
We invite both short (up to 4 pages) and long (up to 8 pages) paper submissions, with references and supplementary materials excluded from the page limits. Short papers may present original but unfinished research or serve as technical reports describing implementations built on open-source frameworks. Authors can choose archival or non-archival submission; non-archival submissions may be concurrently under review elsewhere where external policies permit.
All accepted papers will be presented as posters, with three selected for oral presentations. A single best paper will be chosen from among the long papers, accompanied by a cash prize from our sponsors.