End-to-End 3D Learning

in conjunction with CVPR 2026, Denver, CO

June 3, 2026 · 13:00–18:00 · Room 501


Abstract

Overview: Large 2D vision and multimodal models have shown how to learn from both supervised and unannotated data and to transfer across tasks. These lessons point to a clear path for 3D spatial tasks. Yet many 3D-related systems still rely on long, brittle pipelines (e.g., COLMAP) that are hard to scale and slow to adapt. This workshop focuses on end-to-end 3D learning (E2E3D): a single trainable system that maps raw images or video to complete 3D representations and then supports downstream tasks that run in real time and scale with large datasets. Our goal is practical impact in robotics, extended reality, and scientific imaging. Topics include architectures that map from pixels to 3D without hand-tuned steps; cross-modal training; data engines that mine in-the-wild video at scale; tight integration with end-to-end planning and control; efficient deployment on edge devices and robots; and methods for scientific imaging, from protein structure to cellular microscopy. By unifying modeling, inference, and optimization in one data-driven approach, the E2E3D workshop aims to chart a clear path toward next-generation spatial intelligence systems.

Focus of the Workshop:

    Modeling and learning. E2E3D studies unified architectures that map pixels to 3D with minimal postprocessing; pretraining that embeds geometric priors such as scale, viewpoint, and occlusion; world models and vision-language-action models that use spatial memory to handle spatiotemporal dynamics; and differentiable rendering and physics that provide gradients for shape, appearance, and motion.

    Data and pretraining. E2E3D focuses on open, large-scale data sources, including long video, multi-view, and multi-sensor logs for robust 3D pretraining; cross-modal alignment that uses 2D and image-text corpora to ground language and action; auto-annotation with quality control and reproducible protocols; and data governance covering licensing, privacy, and safety for 3D assets.

    Systems, evaluation, and impact. E2E3D emphasizes real-time and edge deployment on robots and mobile devices; holistic metrics that report accuracy, latency, memory, and energy; robustness and safety for open-world generalization and failure analysis; and applications in autonomous driving, XR, industrial and scientific imaging, and mapping.

News

  • One Best Paper Award and one Best Demo Award will be announced during the event, each with a $500 cash prize. Gifts will be given to on-site attendees.
  • All submissions must follow the CVPR 2026 LaTeX template format. Please refer to the official CVPR 2026 Author Guidelines for detailed formatting instructions.

Invited Speakers

Georgios Pavlakos

UT Austin

End-to-end view synthesis and 3D humans

Jiajun Wu

Stanford University

Physical scene generation and understanding

Marco Pavone

Stanford / NVIDIA

End-to-End VLA

Paul-Edouard Sarlin

Google

Geometric learning and mapping

Luca Carlone

MIT

SLAM, robotic perception

Schedule

June 3, 2026 · 13:00–18:00 · Room 501

13:00–13:05 Opening Remarks
13:05–13:50 Keynote: Georgios Pavlakos
13:50–14:35 Keynote: Jiajun Wu
14:35–15:20 Keynote: Marco Pavone
15:20–15:35 Break
15:35–16:20 Keynote: Paul-Edouard Sarlin
16:20–17:05 Keynote: Luca Carlone
17:05–17:35 Poster Session
17:35–17:50 Awards
17:50–18:00 Closing Remarks

Call for Papers

We invite both short (up to 4 pages) and long (up to 8 pages) paper submissions; page limits exclude references and supplementary materials. Short papers may present original work in progress or serve as technical reports describing implementations built on open-source frameworks.

All submissions are non-archival. Dual submission is allowed where outside venue rules permit. Accepted papers will be hosted on OpenReview and/or the workshop website.

All accepted papers will be presented as posters.



Awards


Presentation Format

Organizers