Grounding Image Matching in 3D with MASt3R

Problem statement

Given two images $I^1$ and $I^2$, captured by cameras $C^1$ and $C^2$ with unknown parameters, recover a set of pixel correspondences $\{(i,j)\}$.

Method

[Figure: MASt3R1]

Because the ground-truth pointmaps are metric, the original normalization factor is set to $z := \hat{z}$ so that the predictions are metric as well.
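The effect of setting $z := \hat{z}$ can be illustrated with a small NumPy sketch. It assumes that the normalization factor is the mean distance of the points to the origin and that the regression error compares the two normalized pointmaps, as in DUSt3R; these notes do not spell that out, and the names `norm_factor` and `regression_error` are hypothetical.

```python
import numpy as np

def norm_factor(points):
    """Mean distance of the points to the origin (assumed normalization)."""
    return np.linalg.norm(points, axis=-1).mean()

def regression_error(pred, gt, gt_is_metric):
    """Per-point error between normalized predicted and ground-truth points."""
    z_gt = norm_factor(gt)
    # Metric ground truth: reuse the ground-truth factor for the prediction
    # (z := z_hat), so a wrong predicted scale is no longer normalized away.
    z_pred = z_gt if gt_is_metric else norm_factor(pred)
    return np.linalg.norm(pred / z_pred - gt / z_gt, axis=-1)

rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 3))
pred = 2.0 * gt  # correct shape, wrong scale

err_relative = regression_error(pred, gt, gt_is_metric=False)
err_metric = regression_error(pred, gt, gt_is_metric=True)
```

In this toy case `err_relative` vanishes because each pointmap is normalized by its own factor, while `err_metric` does not, so the scale error is penalized.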

Matching prediction head and loss

Regression is inherently affected by noise, and DUSt3R was never explicitly trained for the matching task.

Matching head

$D^{1}=\mathrm{Head}_{\mathrm{desc}}^{1}([H^{1}, H^{\prime 1}]), \tag{8}$

$D^{2}=\mathrm{Head}_{\mathrm{desc}}^{2}([H^{2}, H^{\prime 2}]). \tag{9}$

Each head is a simple two-layer MLP with a GELU non-linearity. Finally, each local feature is normalized to unit norm.
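A minimal NumPy sketch of such a head, under stated assumptions: the two feature maps are flattened to shape `(N, C)` and concatenated along the channel axis, GELU uses its tanh approximation, and the layer sizes and random weights are arbitrary placeholders rather than the values used by the authors.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def descriptor_head(h, h_prime, w1, b1, w2, b2):
    """Two-layer MLP on [H, H'], followed by per-feature L2 normalization."""
    x = np.concatenate([h, h_prime], axis=-1)  # (N, 2C)
    x = gelu(x @ w1 + b1)                      # (N, hidden)
    x = x @ w2 + b2                            # (N, d)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, 1e-12)         # unit-norm local descriptors

rng = np.random.default_rng(0)
n, c, hidden, d = 6, 8, 16, 4                  # hypothetical sizes
h, h_prime = rng.normal(size=(n, c)), rng.normal(size=(n, c))
w1, b1 = rng.normal(size=(2 * c, hidden)), np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, d)), np.zeros(d)

desc = descriptor_head(h, h_prime, w1, b1, w2, b2)
```

Because the descriptors have unit norm, the dot product between two of them is their cosine similarity.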

Matching objective

Each local descriptor in one image should match at most one descriptor in the other image, namely the one that represents the same 3D point in the scene.

InfoNCE loss

$\mathcal{L}_{\mathrm{match}}=-\sum_{(i,j)\in\hat{\mathcal{M}}} \log\frac{s_{\tau}(i,j)}{\sum_{k\in\mathcal{P}^{1}} s_{\tau}(k,j)}+\log\frac{s_{\tau}(i,j)}{\sum_{k\in\mathcal{P}^{2}} s_{\tau}(i,k)}, \tag{10}$

$\text{with}\; s_{\tau}(i,j)=\exp\left[-\tau D_{i}^{1\top} D_{j}^{2}\right]. \tag{11}$
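Eq. (10) can be sketched in NumPy as two softmax cross-entropies, one normalized over image 1 and one over image 2. The sketch assumes that $\mathcal{P}^1$ and $\mathcal{P}^2$ are the pixels that appear in the ground-truth matches and that those matches are one-to-one. The multiplier inside the exponent is passed as a hypothetical `scale` argument (it plays the role of $-\tau$ in Eq. (11)), so the sign convention stays explicit.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce_match_loss(d1, d2, matches, scale):
    """d1: (N1, d), d2: (N2, d) descriptors; matches: (M, 2) pairs (i, j)."""
    i, j = matches[:, 0], matches[:, 1]
    # logits[a, b] = log s(i_a, j_b), restricted to the matched pixels
    logits = scale * (d1[i] @ d2[j].T)
    over_p1 = np.diag(log_softmax(logits, axis=0))  # normalize over k in P^1
    over_p2 = np.diag(log_softmax(logits, axis=1))  # normalize over k in P^2
    return -(over_p1 + over_p2).sum()

d = np.eye(3)                                   # three orthogonal descriptors
matches = np.array([[0, 0], [1, 1], [2, 2]])
loss_uniform = info_nce_match_loss(d, d, matches, scale=0.0)
```

With `scale=0.0` every candidate is equally likely, so each of the six log terms equals $\log(1/3)$ and the loss is $6\log 3$, a convenient baseline for checking an implementation.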

Total loss

$\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{conf}}+\beta \mathcal{L}_{\mathrm{match}} \tag{12}$

Fast reciprocal matching

First, a sparse set of $k$ pixels is initialized on a regular grid in image $I^1$:

$U^0=\{U^0_n\}^k_{n=1}$

Each pixel is then mapped to its nearest neighbor (NN) in $I^2$, and the resulting pixels are mapped back to $I^1$ in the same way:

$U^{t} \longmapsto[ \mathrm{N N}_{2} ( D_{u}^{1} ) ]_{u \in U^{t}} \equiv V^{t} \longmapsto[ \mathrm{N N}_{1} ( D_{v}^{2} ) ]_{v \in V^{t}} \equiv U^{t+1}\tag{15}$

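In the next iteration, pixels that have already been matched are filtered out. The procedure runs for a fixed number of iterations, until most correspondences have converged.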

[Figure: MASt3R2]

The resulting set of reciprocal matches is

$\mathcal{M}_{k}^{t}=\{( U_{n}^{t}, V_{n}^{t} ) \mid\; U_{n}^{t}=U_{n}^{t+1} \}$

The output set of correspondences is the union of the match sets from all iterations.

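Fast reciprocal matching not only speeds up matching significantly, it also filters out outliers, so the final accuracy is higher than when the full correspondence set is used.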
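The iteration can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: descriptors are flattened to `(N, d)` arrays so that a pixel is a row index, the initial set is a regular subsample of those indices instead of a 2D grid, the nearest neighbor is taken by brute-force maximum dot product, and duplicate pixels are merged between iterations. The name `fast_reciprocal_nn` and the `max_iter` default are assumptions.

```python
import numpy as np

def nearest_neighbor(queries, database):
    """Index of the database descriptor with the largest dot product."""
    return np.argmax(queries @ database.T, axis=1)

def fast_reciprocal_nn(d1, d2, k, max_iter=10):
    """Return an (M, 2) array of reciprocal matches (u, v)."""
    n1 = d1.shape[0]
    # U^0: k pixels sampled regularly over the flattened image 1
    u = np.unique(np.linspace(0, n1 - 1, num=min(k, n1)).astype(int))
    matches = []
    for _ in range(max_iter):
        if u.size == 0:
            break
        v = nearest_neighbor(d1[u], d2)        # U^t -> V^t
        u_next = nearest_neighbor(d2[v], d1)   # V^t -> U^{t+1}
        converged = u_next == u                # reciprocal pairs M_k^t
        matches.append(np.stack([u[converged], v[converged]], axis=1))
        u = np.unique(u_next[~converged])      # keep only unconverged pixels
    if not matches:
        return np.empty((0, 2), dtype=int)
    # Union of the match sets from all iterations
    return np.unique(np.concatenate(matches, axis=0), axis=0)

perm = np.array([2, 0, 1, 4, 3])
d1 = np.eye(5)
d2 = np.eye(5)[perm]                           # image 2 = shuffled image 1
pairs = fast_reciprocal_nn(d1, d2, k=5)
```

Since converged pixels are dropped after every round, the active set shrinks quickly, which is where the speed-up over a full reciprocal search comes from.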

Coarse-to-fine matching

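Multiple overlapping windows are generated on the high-resolution images, and the subset of window pairs that covers most of the coarse correspondences is selected. Specifically, window pairs are added greedily until 90% of the correspondences are covered. Finally, matching is run on each window pair.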

$D^{w_{1}}, D^{w_{2}}=\mathrm{MASt3R}(I_{w_{1}}^{1}, I_{w_{2}}^{2}) \tag{16}$

$\mathcal{M}_{k}^{w_{1}, w_{2}}=\mathrm{fast\_reciprocal\_NN}(D^{w_{1}}, D^{w_{2}}) \tag{17}$
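The greedy window-pair selection can be sketched as a set-cover loop. The sketch assumes windows are axis-aligned boxes `(x0, y0, x1, y1)` and that a window pair covers a coarse correspondence when both of its endpoints fall inside the respective windows; the function names and the toy data are hypothetical.

```python
import numpy as np

def inside(points, box):
    """Boolean mask of the (x, y) points that lie inside box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((points[:, 0] >= x0) & (points[:, 0] < x1) &
            (points[:, 1] >= y0) & (points[:, 1] < y1))

def select_window_pairs(pts1, pts2, windows1, windows2, coverage=0.9):
    """Greedily pick window pairs until `coverage` of the matches is covered."""
    if len(pts1) == 0:
        return []
    in1 = np.stack([inside(pts1, b) for b in windows1])   # (W1, M)
    in2 = np.stack([inside(pts2, b) for b in windows2])   # (W2, M)
    covers = in1[:, None, :] & in2[None, :, :]            # (W1, W2, M)
    covered = np.zeros(len(pts1), dtype=bool)
    selected = []
    while covered.mean() < coverage:
        gain = (covers & ~covered).sum(axis=-1)           # newly covered matches
        w1, w2 = np.unravel_index(np.argmax(gain), gain.shape)
        if gain[w1, w2] == 0:                             # nothing left to gain
            break
        selected.append((int(w1), int(w2)))
        covered |= covers[w1, w2]
    return selected

windows = [(0, 0, 10, 10), (10, 0, 20, 10)]
left, right = [5.0, 5.0], [15.0, 5.0]
pts1 = np.array([left] * 12 + [right] * 7 + [left])       # coarse matches, image 1
pts2 = np.array([left] * 12 + [right] * 7 + [right])      # coarse matches, image 2
chosen = select_window_pairs(pts1, pts2, windows, windows)
```

Each selected pair `(w1, w2)` would then be cropped from the two images and matched as in Eqs. (16) and (17).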