Robo3R enables manipulation-ready 3D reconstruction from RGB frames in real time.

By recovering accurate metric-scale 3D geometry directly in the canonical robot frame, Robo3R eliminates the need for depth sensors and camera calibration, while improving accuracy and robustness in challenging manipulation scenarios.

These features lead to notable improvements in downstream applications such as imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning.

Robo3R Demo

Method


Model

RGB images and robot states are encoded and fused. The transformer backbone processes the resulting features through alternating global and frame-wise attention. The masked point head decodes scale-invariant local geometry, while the relative pose head outputs relative poses for registering points across multiple views. Similarity-transformation (S.T.) tokens read out the global similarity transformation, which maps the points into metric-scale 3D geometry in the canonical robot frame.
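To make the backbone and readout concrete, here is a minimal PyTorch sketch of one alternating-attention block and of applying a global similarity transform. This is illustrative only, not the released implementation: token shapes, module names, and the Sim(3) readout interface are assumptions chosen to mirror the description above.

import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One backbone block: frame-wise self-attention within each view,
    followed by global self-attention across all views (illustrative sketch)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (B, V, N, D) -- batch, views, tokens per view, feature dim
        B, V, N, D = tokens.shape
        # Frame-wise attention: each view attends only to its own tokens.
        x = tokens.reshape(B * V, N, D)
        h = self.norm1(x)
        x = x + self.frame_attn(h, h, h)[0]
        # Global attention: tokens from all views attend to each other.
        x = x.reshape(B, V * N, D)
        h = self.norm2(x)
        x = x + self.global_attn(h, h, h)[0]
        return x.reshape(B, V, N, D)

def apply_sim3(points, scale, R, t):
    """Map scale-invariant local points into the metric-scale robot frame:
    p_robot = s * R @ p + t, with (s, R, t) read out from the S.T. tokens."""
    return scale * points @ R.T + t

# Example: fuse tokens from 4 views, then lift per-view points to the robot frame.
tokens = torch.randn(1, 4, 196, 256)          # encoded RGB + robot-state features
fused = AlternatingAttentionBlock(256)(tokens)
points = torch.randn(4, 196, 3)               # scale-invariant local points per view
R, t, s = torch.eye(3), torch.zeros(3), 1.2   # placeholder global similarity transform
points_robot = apply_sim3(points, s, R, t)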

Module

(a) To address the over-smoothing problem in dense point prediction, we propose a masked point head that decomposes point prediction into depth, normalized image coordinate, and mask predictions. Through unprojection, masking, and combination, we obtain sharp points with fine-grained geometric details. (b) The extrinsic estimation module extracts robot keypoints and accurately estimates the camera extrinsics by solving the Perspective-n-Point (PnP) problem; the extrinsics are used to refine the global similarity transformation.
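As a rough illustration of (a) and (b), the sketch below shows how per-pixel depth, normalized image coordinates, and a mask could be combined into sharp 3D points, and how camera extrinsics could be recovered from robot keypoints with OpenCV's PnP solver. Function names, tensor layouts, and the [0, 1] coordinate convention are assumptions made for this example, not the paper's exact formulation.

import numpy as np
import cv2

def unproject_masked_points(depth, norm_uv, mask, K):
    """Combine depth, normalized image coordinates, and a mask into sharp points.
    depth: (H, W) predicted depth; norm_uv: (H, W, 2) predicted coords in [0, 1];
    mask: (H, W) validity mask; K: (3, 3) camera intrinsics."""
    H, W = depth.shape
    # Convert normalized coordinates to pixel coordinates.
    u = norm_uv[..., 0] * (W - 1)
    v = norm_uv[..., 1] * (H - 1)
    # Unproject through the intrinsics: X = d * K^{-1} [u, v, 1]^T.
    ones = np.ones_like(u)
    rays = np.stack([u, v, ones], axis=-1) @ np.linalg.inv(K).T   # (H, W, 3)
    points = depth[..., None] * rays
    # Masking drops over-smoothed or invalid pixels, keeping fine-grained detail.
    return points[mask > 0.5]

def estimate_extrinsics(keypoints_2d, keypoints_3d, K):
    """Estimate camera extrinsics from detected robot keypoints via PnP.
    keypoints_2d: (N, 2) pixel detections; keypoints_3d: (N, 3) robot-frame keypoints."""
    ok, rvec, tvec = cv2.solvePnP(
        keypoints_3d.astype(np.float64),
        keypoints_2d.astype(np.float64),
        K.astype(np.float64), None,
        flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)
    # The recovered camera pose can then refine the global similarity transformation.
    return R, tvec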

Reconstruction Results


Qualitative Comparison

Robo3R robustly handles challenging scenarios where other reconstruction models and depth cameras fail. Specifically, Robo3R can reconstruct objects as narrow as 1.5 mm (spanning only 1 to 2 pixels in the image), whereas other methods, including depth cameras, fail to capture such fine geometry (row 1). It also handles reflective and transparent objects that blind depth sensors (row 2). Even in cluttered scenes that include bimanual robots with dexterous hands, Robo3R consistently produces accurate and clean point clouds (row 3).


Interactive Demo

Downstream Robotic Manipulation


Application


Imitation Learning


Grasp Synthesis


Collision-Free Motion Planning

Our Team

Sizhe Yang1,2
Linning Xu1,2
Hao Li1,3
Juncheng Mu1,4
Jia Zeng1
Dahua Lin1,2
Jiangmiao Pang1
1Shanghai AI Laboratory, 2The Chinese University of Hong Kong, 3University of Science and Technology of China, 4Tsinghua University

If you have any questions, please contact Sizhe Yang.

BibTeX


@article{robo3r,
  title={Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction},
  author={Yang, Sizhe and Xu, Linning and Li, Hao and Mu, Juncheng and Zeng, Jia and Lin, Dahua and Pang, Jiangmiao},
  journal={arXiv preprint arXiv:2602.10101},
  year={2026}
}