End-to-End 3D Learning

in conjunction with CVPR 2026, Denver, CO

Time: TBD


Overview

Large 2D vision and multimodal models have shown how to learn from both supervised and unannotated data and to transfer across tasks. These lessons suggest a similar path for 3D spatial tasks. Yet many 3D systems still rely on long, brittle pipelines (e.g., COLMAP) that are hard to scale and slow to adapt. This workshop focuses on end-to-end 3D learning (E2E3D): a single trainable system that maps raw images or video to complete 3D representations and supports downstream tasks that run in real time and scale with large datasets. Our goal is practical impact in robotics, extended reality, and scientific imaging. Topics include architectures that map from pixels to 3D without hand-tuned steps; cross-modal training; data engines that mine in-the-wild video at scale; tight integration with end-to-end planning and control; efficient deployment on edge devices and robots; and methods for scientific imaging, from protein structure to cellular microscopy. By unifying modeling, inference, and optimization in one data-driven approach, the E2E3D workshop aims to chart a clear path toward next-generation spatial intelligence systems.

Focus of the Workshop:

  • Technology Development: A central question in end-to-end 3D learning is how to replace traditional multi-stage 3D pipelines with a single, differentiable model. This includes:
    • Designing unified network architectures that transform raw inputs (e.g., multi-view or single-view images) into final 3D outputs without extensive hand-engineered steps.
    • Leveraging self-supervised learning for pretraining large-scale 3D foundation models on vast, unannotated datasets.
    • Developing real-time inference techniques and efficient deployment methods for resource-limited platforms.
  • Data Challenges: Progress in end-to-end 3D learning also depends on collecting diverse data sources and on large-scale pretraining. Key topics include:
    • Methods for incorporating massive, unannotated sources (e.g., multi-view imagery from benchmarks or videos from YouTube) into robust pretraining pipelines.
    • Strategies to leverage existing 2D and multimodal image-text datasets to mitigate the shortage of high-quality 3D data.
    • Techniques for building automatic or semi-automatic tools that facilitate reliable 3D annotation for supervised finetuning.
  • Real-World Impact: Another key objective is to create end-to-end 3D learning systems that have transformative applications. Discussions will include:
    • Case studies in autonomous driving, where fast and accurate 3D perception enhances safety and decision-making.
    • Real-time 3D scene understanding for AR/VR, robotics, and digital twins, demonstrating the advantages of integrated pipelines over traditional methods.
    • Scalable 3D modeling in scientific imaging and other fields requiring precise spatial analysis.

This workshop brings together researchers from computer vision, robotics, extended reality (XR), autonomous driving, scientific imaging, and related fields to foster interdisciplinary discussions on next-generation 3D systems. By spotlighting recent breakthroughs and identifying key challenges, we aim to inspire innovative research and practical applications across these domains.

Invited Speakers

Jiajun Wu

Stanford University

Physical scene understanding

Marco Pavone

Stanford University / NVIDIA

Luca Carlone

MIT

Call for Papers

We invite both short (up to 4 pages) and long (up to 8 pages) paper submissions, with page limits excluding references and supplementary materials. Short papers may present original but preliminary research or serve as technical reports describing implementations built on open-source frameworks. Authors can opt for archival or non-archival submission; non-archival submissions may be under concurrent review elsewhere where external policies permit.

All accepted papers will be presented as posters, with three selected for oral presentations. A single best paper will be chosen from among the long papers, accompanied by a cash prize from our sponsors.


Organizers