RGB images and robot states are encoded and fused, and the transformer backbone processes the resulting features through alternating global and frame-wise attention. The masked point head decodes scale-invariant local geometry, while the relative pose head estimates the relative poses used to register points across multiple views. Similarity-transformation (S.T.) tokens read out the global similarity transformation, which maps the points into metric-scale 3D geometry in the canonical robot frame.
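To make the alternating attention pattern concrete, below is a minimal PyTorch sketch of one backbone block, assuming a (batch, views, tokens, channels) feature layout; the dimensions, pre-norm residual structure, and use of nn.MultiheadAttention are illustrative assumptions rather than the released Robo3R implementation.

import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Sketch of one alternating global / frame-wise attention block."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_frame = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, V, N, C) -- B scenes, V camera views, N tokens per view.
        B, V, N, C = tokens.shape

        # Frame-wise attention: each view attends only to its own tokens.
        x = tokens.reshape(B * V, N, C)
        h = self.norm_frame(x)
        x = x + self.frame_attn(h, h, h)[0]

        # Global attention: tokens from all views attend to one another,
        # fusing geometry and robot-state information across frames.
        x = x.reshape(B, V * N, C)
        h = self.norm_global(x)
        x = x + self.global_attn(h, h, h)[0]
        return x.reshape(B, V, N, C)

# Example: 2 scenes, 4 views, 196 patch tokens of width 768.
feats = torch.randn(2, 4, 196, 768)
fused = AlternatingAttentionBlock()(feats)   # same shape, now fused across views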
(a) To address the over-smoothing problem in dense prediction, we propose a masked point head that decomposes point prediction into depth, normalized image-coordinate, and mask predictions. Through unprojection, masking, and combination, we obtain sharp points with fine-grained geometric details. (b) The extrinsic estimation module extracts robot keypoints and estimates the camera extrinsics by solving the Perspective-n-Point (PnP) problem; the resulting extrinsics are used to refine the global similarity transformation.
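A minimal numerical sketch of how these two modules could operate, assuming known camera intrinsics K, normalized image coordinates in [-1, 1], a 0.5 mask threshold, and OpenCV's solvePnP; the helper names and the exact parameterization are illustrative and not taken from the paper.

import numpy as np
import cv2

def compose_masked_points(depth, norm_xy, mask_logits, K, thresh=0.5):
    """Combine per-pixel depth, normalized image-coordinate, and mask
    predictions into a sparse set of sharp camera-frame points."""
    H, W = depth.shape
    # Map normalized coordinates (assumed in [-1, 1]) back to pixel coordinates.
    u = (norm_xy[..., 0] * 0.5 + 0.5) * (W - 1)
    v = (norm_xy[..., 1] * 0.5 + 0.5) * (H - 1)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)    # homogeneous pixels, (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                     # unproject pixels to camera rays
    points = rays * depth[..., None]                    # scale rays by predicted depth
    keep = 1.0 / (1.0 + np.exp(-mask_logits)) > thresh  # sigmoid + threshold the mask
    return points[keep]                                 # masked, fine-grained points

def estimate_extrinsics(kpts_3d_robot, kpts_2d_image, K):
    """Recover camera extrinsics from detected robot keypoints by solving PnP;
    the result can then be used to refine the global similarity transformation."""
    ok, rvec, tvec = cv2.solvePnP(
        kpts_3d_robot.astype(np.float64),   # (N, 3) keypoints in the robot frame
        kpts_2d_image.astype(np.float64),   # (N, 2) detected pixel locations
        K.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    R, _ = cv2.Rodrigues(rvec)              # rotation vector -> rotation matrix
    return R, tvec                          # robot-frame -> camera-frame transform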
Robo3R robustly handles challenging scenarios where other reconstruction models and depth cameras fail. Specifically, Robo3R can reconstruct objects as narrow as 1.5 mm (spanning only 1 to 2 pixels in the image), whereas other methods, including depth cameras, fail to capture such fine geometry (row 1). Robo3R also handles reflective and transparent objects that blind depth sensors (row 2). Even in cluttered scenes that include bimanual robots with dexterous hands, Robo3R consistently produces accurate and clean point clouds (row 3).
@article{robo3r,
  title={Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction},
  author={Yang, Sizhe and Xu, Linning and Li, Hao and Mu, Juncheng and Zeng, Jia and Lin, Dahua and Pang, Jiangmiao},
  journal={arXiv preprint arXiv:2602.10101},
  year={2026}
}