Thursday, November 14, 2013

Examples of elegant variable definitions in tracking-related papers

PWP3D

Prisacariu, Victor and Reid, Ian: "PWP3D: Real-Time Segmentation and Tracking of 3D Objects," International Journal of Computer Vision, vol. 98, no. 3, pp. 1-20, 2012.
Let the image be denoted by $I$ and the image domain by $\Omega\subset\mathbb{R}^2$. An image pixel $x = [x,y]$ has a corresponding image value $I(x) = y$ (in our experiments an RGB value), a 3D point $X = [X,Y,Z]^T = RX_0 + T \in \mathbb{R}^3$ in the camera coordinate frame and a point $X_0 = [X_0,Y_0,Z_0]^T \in \mathbb{R}^3$ in the object coordinate frame. $R$ and $T$ are the rotation matrix and translation vector respectively (representing the unknown pose) and are parameterised by 7 parameters (4 for rotation and 3 for translation), denoted by $\lambda_i$. We use quaternions to represent rotation, so we use 4 parameters to capture the 3 degrees of freedom of the rotation.
We assume the camera calibration parameters (for each camera) to be known. Let $(f_u, f_v)$ be the focal distance expressed in horizontal and vertical pixels and $(u_o, v_o)$ the principal point of the camera. In the case of multiple cameras, the extrinsics were obtained using the Matlab Calibration Toolbox.
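For reference, here is a minimal NumPy sketch of the PWP3D definitions above: the standard quaternion-to-rotation-matrix formula, the rigid transform $X = RX_0 + T$, and pinhole projection with $(f_u, f_v, u_o, v_o)$. The function names and the numeric intrinsics are placeholders of my own, not from the paper:

```python
import numpy as np

def quat_to_rot(q):
    """R from a quaternion q = (w, x, y, z); 4 parameters for 3 rotational DOF."""
    w, x, y, z = q / np.linalg.norm(q)  # normalise so R is a valid rotation
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def object_to_camera(X0, q, T):
    """X = R X0 + T: object coordinate frame -> camera coordinate frame."""
    return quat_to_rot(q) @ X0 + T

def project(X, fu, fv, u0, v0):
    """Pinhole projection of a camera-frame point X = (X, Y, Z) to a pixel."""
    return np.array([fu * X[0] / X[2] + u0, fv * X[1] / X[2] + v0])

# Identity rotation, object 2 m in front of the camera (intrinsics are made up)
q, T = np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 0.0, 2.0])
x = project(object_to_camera(np.array([0.1, 0.0, 0.0]), q, T),
            fu=500.0, fv=500.0, u0=320.0, v0=240.0)
print(x)  # -> [345. 240.]
```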

DTAM

Newcombe, Richard A. and Lovegrove, Steven J. and Davison, Andrew J.: "DTAM: Dense tracking and mapping in real-time", Proc. Int. Conf. on Computer Vision (ICCV2011), pp. 2320-2327, 2011.
We refer to the pose of a camera $c$ with respect to the world frame of reference $w$ as \[ T_{wc} = \left( \begin{array}{cc} R_{wc} & c_{w}\\ 0^T & 1 \end{array} \right) \] where $T_{wc} \in \mathbb{SE}(3)$ is the matrix describing point transfer between the camera's frame of reference and that of the world, such that $x_w = T_{wc}x_c$. $R_{wc} \in \mathbb{SO}(3)$ is the rotation matrix describing directional transfer, and $c_w$ is the location of the optic center of camera $c$ in the frame of reference $w$. Our camera has fixed and pre-calibrated intrinsic matrix $K$ and all images are pre-warped to remove radial distortion. We describe perspective projection of a 3D point $x_c = (x, y, z)^T$ including dehomogenisation by $\pi(x_c) = (x/z, y/z)^T$. Our dense model is composed of overlapping keyframes. Illustrated in Figure 1, a keyframe $r$ with world-camera frame transform $T_{rw}$ contains an inverse depth map $\xi_r : \Omega \rightarrow \mathbb{R}$ and RGB reference image $I_r : \Omega \rightarrow \mathbb{R}^3$, where $\Omega \subset \mathbb{R}^2$ is the image domain. For a pixel $u := (u, v)^T \in \Omega$, we can back-project an inverse depth value $d = \xi(u)$ to a 3D point $x = \pi^{-1}(u, d)$, where $\pi^{-1}(u, d) = \frac{1}{d} K^{-1}\dot u$. The dot notation is used to define the homogeneous vector $\dot u := (u, v, 1)^T$.
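Again for reference, a NumPy sketch of the DTAM conventions: point transfer $x_w = T_{wc}x_c$, dehomogenisation $\pi$, and inverse-depth back-projection $\pi^{-1}(u, d) = \frac{1}{d}K^{-1}\dot u$. The helper names and the values in $K$ are my own assumptions:

```python
import numpy as np

def make_T(R_wc, c_w):
    """Build T_wc = [[R_wc, c_w], [0^T, 1]] in SE(3)."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R_wc, c_w
    return T

def pi(x_c):
    """Perspective projection with dehomogenisation: (x, y, z) -> (x/z, y/z)."""
    return x_c[:2] / x_c[2]

def pi_inv(u, d, K):
    """Back-project pixel u at inverse depth d: x = (1/d) K^{-1} u_dot."""
    u_dot = np.array([u[0], u[1], 1.0])          # homogeneous pixel u_dot
    return np.linalg.solve(K, u_dot) / d

K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])            # placeholder intrinsics
x_c = pi_inv((100.0, 80.0), d=0.5, K=K)          # 3D point at depth 1/d = 2 m
T_wc = make_T(np.eye(3), np.array([0.0, 0.0, -1.0]))
x_w = (T_wc @ np.append(x_c, 1.0))[:3]           # x_w = T_wc x_c
assert np.allclose(pi(x_c), (100.0 - 319.5) / 525.0 * np.array([1, 0])
                   + (80.0 - 239.5) / 525.0 * np.array([0, 1]))
```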

KinectFusion

Newcombe, Richard A. and Davison, Andrew J. and Izadi, S. and Kohli, P. and Hilliges, Otmar and Shotton, J. and Molyneaux, David and Hodges, Steve and Kim, David and Fitzgibbon, A.: "KinectFusion: Real-time dense surface mapping and tracking," Proc. 10th IEEE Int. Symp. on Mixed and Augmented Reality (ISMAR2011), pp. 127-136, 2011.
We represent the live 6DOF camera pose estimated for a frame at time $k$ by a rigid body transformation matrix: \[ T_{g,k} = \left[ \begin{array}{cc} R_{g,k} & t_{g,k}\\ 0^T & 1 \end{array} \right] \in \mathbb{SE}^3 \] where the Euclidean group $\mathbb{SE}^3 := \{R, t \mid R \in \mathbb{SO}^3, t \in \mathbb{R}^3\}$. This maps the camera coordinate frame at time $k$ into the global frame $g$, such that a point $p_k \in \mathbb{R}^3$ in the camera frame is transferred into the global coordinate frame via $p_g = T_{g,k} p_k$. We will also use a single constant camera calibration matrix $K$ that transforms points on the sensor plane into image pixels. The function $q = \pi(p)$ performs perspective projection of $p = (x,y,z)^T \in \mathbb{R}^3$ including dehomogenisation to obtain $q = (x/z, y/z)^T \in \mathbb{R}^2$. We will also use a dot notation to denote homogeneous vectors $\dot u := (u^T|1)^T$.
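The same pattern in NumPy for the KinectFusion notation; composing $K$ with $\pi$ to get pixel coordinates is my reading of the definitions, and the names and numbers below are placeholders:

```python
import numpy as np

def transform(T_gk, p_k):
    """p_g = T_gk p_k: camera frame at time k -> global frame g."""
    return T_gk[:3, :3] @ p_k + T_gk[:3, 3]

def pi(p):
    """Perspective projection including dehomogenisation: (x, y, z) -> (x/z, y/z)."""
    return p[:2] / p[2]

# Pixel coordinates via u = pi(K p_k); K is a placeholder calibration matrix
K = np.array([[570.0,   0.0, 320.0],
              [  0.0, 570.0, 240.0],
              [  0.0,   0.0,   1.0]])
p_k = np.array([0.1, 0.0, 1.0])
p_g = transform(np.eye(4), p_k)   # identity pose: p_g == p_k
u = pi(K @ p_k)                   # -> [377. 240.]
```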