RGB images and robot states are encoded and fused, and the transformer backbone processes the resulting features through alternating global and frame-wise attention. The masked point head decodes scale-invariant local geometry, while the relative pose head estimates the relative poses used to register points across multiple views. Similarity-transformation (S.T.) tokens read out the global similarity transformation, which maps the points into metric-scale 3D geometry in the canonical robot frame.
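To make the alternating attention pattern concrete, below is a minimal PyTorch sketch of one backbone block, assuming a (batch, views, tokens, channels) feature layout; the dimensions, pre-norm residual structure, and use of nn.MultiheadAttention are illustrative assumptions rather than the released Robo3R implementation.

import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Sketch of one alternating global / frame-wise attention block."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_frame = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, V, N, C) -- B scenes, V camera views, N tokens per view.
        B, V, N, C = tokens.shape

        # Frame-wise attention: each view attends only to its own tokens.
        x = tokens.reshape(B * V, N, C)
        h = self.norm_frame(x)
        x = x + self.frame_attn(h, h, h)[0]

        # Global attention: tokens from all views attend to one another,
        # fusing geometry and robot-state information across frames.
        x = x.reshape(B, V * N, C)
        h = self.norm_global(x)
        x = x + self.global_attn(h, h, h)[0]
        return x.reshape(B, V, N, C)

# Example: 2 scenes, 4 views, 196 patch tokens of width 768.
feats = torch.randn(2, 4, 196, 768)
fused = AlternatingAttentionBlock()(feats)   # same shape, now fused across views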
(a) To address the over-smoothing problem in dense prediction, we propose a masked point head that decomposes point prediction into depth, normalized image-coordinate, and mask predictions. Through unprojection, masking, and combination, we obtain sharp points with fine-grained geometric details. (b) The extrinsic estimation module extracts robot keypoints and estimates the camera extrinsics by solving the Perspective-n-Point (PnP) problem; the resulting extrinsics are used to refine the global similarity transformation.
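A minimal numerical sketch of how these two modules could operate, assuming known camera intrinsics K, normalized image coordinates in [-1, 1], a 0.5 mask threshold, and OpenCV's solvePnP; the helper names and the exact parameterization are illustrative and not taken from the paper.

import numpy as np
import cv2

def compose_masked_points(depth, norm_xy, mask_logits, K, thresh=0.5):
    """Combine per-pixel depth, normalized image-coordinate, and mask
    predictions into a sparse set of sharp camera-frame points."""
    H, W = depth.shape
    # Map normalized coordinates (assumed in [-1, 1]) back to pixel coordinates.
    u = (norm_xy[..., 0] * 0.5 + 0.5) * (W - 1)
    v = (norm_xy[..., 1] * 0.5 + 0.5) * (H - 1)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)    # homogeneous pixels, (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                     # unproject pixels to camera rays
    points = rays * depth[..., None]                    # scale rays by predicted depth
    keep = 1.0 / (1.0 + np.exp(-mask_logits)) > thresh  # sigmoid + threshold the mask
    return points[keep]                                 # masked, fine-grained points

def estimate_extrinsics(kpts_3d_robot, kpts_2d_image, K):
    """Recover camera extrinsics from detected robot keypoints by solving PnP;
    the result can then be used to refine the global similarity transformation."""
    ok, rvec, tvec = cv2.solvePnP(
        kpts_3d_robot.astype(np.float64),   # (N, 3) keypoints in the robot frame
        kpts_2d_image.astype(np.float64),   # (N, 2) detected pixel locations
        K.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    R, _ = cv2.Rodrigues(rvec)              # rotation vector -> rotation matrix
    return R, tvec                          # robot-frame -> camera-frame transform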
Robo3R robustly handles challenging scenarios where other reconstruction models and depth cameras fail. Specifically, Robo3R can reconstruct objects as narrow as 1.5 mm (spanning only 1 to 2 pixels in the image), whereas other methods, including depth cameras, fail to capture such fine geometry (row 1). Robo3R also handles reflective and transparent objects that blind depth sensors (row 2). Even in cluttered scenes that include bimanual robots with dexterous hands, Robo3R consistently produces accurate and clean point clouds (row 3).
@article{robo3r,
  title={Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction},
  author={Yang, Sizhe and Xu, Linning and Li, Hao and Mu, Juncheng and Zeng, Jia and Lin, Dahua and Pang, Jiangmiao},
  journal={arXiv preprint arXiv:2602.10101},
  year={2026}
}