Nouns related to VO

Here I summarize some of them, drawing mainly on Wikipedia.

Visual odometry

Visual odometry is the process of determining the position and orientation of a robot by analyzing the associated camera images.

In navigation, odometry is the use of data from the movement of actuators to estimate change in position over time through devices such as rotary encoders to measure wheel rotations. Visual odometry is the process of determining equivalent odometry information using sequential camera images to estimate the distance traveled.
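To make the wheel-encoder case concrete, here is a minimal dead-reckoning sketch for a differential-drive robot. The function name and the parameters `ticks_per_rev`, `wheel_radius` and `wheel_base` are illustrative assumptions, not values from any particular robot. Visual odometry aims to produce the same kind of incremental pose update, but from images instead of encoder ticks.

```python
import math

def integrate_odometry(pose, left_ticks, right_ticks,
                       ticks_per_rev=360, wheel_radius=0.05, wheel_base=0.30):
    """Update (x, y, theta) of a differential-drive robot from encoder ticks.

    All default parameters are made-up example values (meters, ticks).
    """
    x, y, theta = pose
    # Distance rolled by each wheel since the last update.
    per_tick = 2.0 * math.pi * wheel_radius / ticks_per_rev
    d_left = left_ticks * per_tick
    d_right = right_ticks * per_tick
    # Forward distance and heading change of the robot centre.
    d_center = 0.5 * (d_left + d_right)
    d_theta = (d_right - d_left) / wheel_base
    # Midpoint integration: assume the heading changes linearly over the step.
    x += d_center * math.cos(theta + 0.5 * d_theta)
    y += d_center * math.sin(theta + 0.5 * d_theta)
    return (x, y, theta + d_theta)

# Drive straight: both wheels turn 1.5 revolutions in total.
pose = (0.0, 0.0, 0.0)
for left, right in [(360, 360), (180, 180)]:
    pose = integrate_odometry(pose, left, right)
```

Because each update is added to the previous pose, small per-step errors accumulate over time, which is also true of visual odometry.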

Types of VO:

  • Monocular and Stereo
  • Feature Based and Direct Method
  • Visual Inertial Odometry

Monocular cues

Monocular cues provide depth information when viewing a scene with one eye.

  • Motion parallax: When an observer moves, the apparent relative motion of several stationary objects against a background gives hints about their relative distance. If information about the direction and velocity of movement is known, motion parallax can provide absolute depth information.

  • Depth from motion: One form of depth from motion, kinetic depth perception, is determined by dynamically changing object size. As objects in motion become smaller, they appear to recede into the distance; objects in motion that appear to be getting larger seem to be coming closer.

  • Relative size: If two objects are known to be the same size (e.g. two trees) but their absolute size is unknown, relative size cues can provide information about the relative depth of the two objects. If one subtends a larger visual angle on the retina than the other, the object which subtends the larger visual angle appears closer.

  • Texture gradient: Suppose you are standing on a gravel road. The gravel near you can be clearly seen in terms of shape, size and colour. As your vision shifts towards the more distant part of the road it becomes progressively less easy to distinguish the texture.
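The motion parallax cue can be sketched numerically. Assuming a pinhole camera that translates sideways (parallel to the image plane) at a known speed, a stationary point at depth Z shifts in the image by roughly f * b / Z pixels, where f is the focal length in pixels and b is the distance moved. The function and parameter names below are hypothetical.

```python
def depth_from_parallax(shift_px, focal_px, speed_mps, dt_s):
    """Depth (m) of a stationary point seen by a camera translating sideways.

    Assumes a pinhole camera and motion parallel to the image plane.
    """
    baseline_m = speed_mps * dt_s  # distance the observer moved
    if shift_px == 0:
        return float("inf")        # no parallax: the point is effectively at infinity
    return focal_px * baseline_m / abs(shift_px)

# Two points observed while moving at 1 m/s for 0.1 s with an 800 px focal length.
near = depth_from_parallax(shift_px=40.0, focal_px=800.0, speed_mps=1.0, dt_s=0.1)
far = depth_from_parallax(shift_px=4.0, focal_px=800.0, speed_mps=1.0, dt_s=0.1)
```

Here the point that shifts by 40 px comes out at about 2 m and the one that shifts by 4 px at about 20 m. Without the known speed, only the ratio of the two depths would be available.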


Voxel

A voxel represents a value on a regular grid in three-dimensional space. As with pixels in a bitmap, voxels themselves do not typically have their position explicitly encoded along with their values. Instead, rendering systems infer the position of a voxel based upon its position relative to other voxels.

In contrast to pixels and voxels, points and polygons are often explicitly represented by the coordinates of their vertices. A direct consequence of this difference is that polygons can efficiently represent simple 3D structures with lots of empty or homogeneously filled space, while voxels excel at representing regularly sampled spaces that are non-homogeneously filled.
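The implicit-position idea can be illustrated with a small dense grid. This is a minimal sketch; the class name `VoxelGrid` and its fields are assumptions made for the example. Only values are stored, and a voxel's world position is recomputed from its index, the grid origin and the voxel size.

```python
class VoxelGrid:
    """Dense voxel grid: values live in a flat list, positions are implicit."""

    def __init__(self, dims, origin=(0.0, 0.0, 0.0), voxel_size=1.0):
        self.dims = dims
        self.origin = origin
        self.voxel_size = voxel_size
        nx, ny, nz = dims
        self.values = [0.0] * (nx * ny * nz)  # one value per voxel, no coordinates

    def flat_index(self, i, j, k):
        """Map a 3D index to the position of its value in the flat list."""
        nx, ny, _ = self.dims
        return i + nx * (j + ny * k)

    def center(self, i, j, k):
        """World position of the voxel centre, inferred from its index."""
        return tuple(o + (idx + 0.5) * self.voxel_size
                     for o, idx in zip(self.origin, (i, j, k)))

grid = VoxelGrid(dims=(4, 4, 4), voxel_size=0.5)
grid.values[grid.flat_index(1, 2, 3)] = 1.0  # mark one voxel as occupied
```

The grid always stores 4 x 4 x 4 = 64 values regardless of how many voxels are occupied, which is why mostly empty scenes are cheaper to describe with explicit vertices.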

Photo Consistency

In computer vision, photo-consistency is a criterion for deciding whether a given voxel is occupied. A voxel is considered photo-consistent when its color appears similar in all the cameras that can see it.

Photo-consistency is a global method for estimating the depth variation in a scene. Essentially, each potential depth is considered by projecting the point into every camera view and comparing the pixel colours with a photo-consistency metric. It is commonly used in space carving algorithms to create a model of the object that can be used to reconstruct novel viewpoints.
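One simple photo-consistency metric is the variance of the colors that a voxel projects to across views. The sketch below assumes the projection step has already been done, so each voxel comes with a list of RGB samples, one per camera that can see it. The variance metric and the threshold value are illustrative choices, not the metric of any specific space carving implementation.

```python
def photo_consistency(colors):
    """Mean per-channel variance of the RGB samples a voxel projects to."""
    n = len(colors)
    if n < 2:
        return 0.0  # a single view cannot contradict itself
    total = 0.0
    for channel in range(3):
        samples = [c[channel] for c in colors]
        mean = sum(samples) / n
        total += sum((s - mean) ** 2 for s in samples) / n
    return total / 3.0

def is_photo_consistent(colors, threshold=100.0):
    """Low colour variance across views suggests the voxel lies on a surface."""
    return photo_consistency(colors) <= threshold

# A voxel on a red surface looks similar from three cameras ...
surface = [(200, 30, 30), (198, 33, 29), (203, 28, 32)]
# ... while a voxel in empty space projects to unrelated background colours.
empty = [(200, 30, 30), (20, 180, 40), (90, 90, 250)]
```

With these samples, `is_photo_consistent(surface)` is true and `is_photo_consistent(empty)` is false, so a space carving pass would keep the first voxel and remove the second.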

Optical flow

Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.

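The optical flow methods try to calculate the motion between two image frames which are taken at times $t$ and $t + \Delta t$ at every voxel position.

For a 2D+$t$ dimensional case, a voxel at location $(x, y, t)$ with intensity $I(x, y, t)$ will have moved by $\Delta x$, $\Delta y$ and $\Delta t$ between the two image frames, and the following brightness constancy constraint can be given:

$$I(x, y, t) = I(x + \Delta x, y + \Delta y, t + \Delta t)$$

Assuming the movement to be small, then

$$I(x + \Delta x, y + \Delta y, t + \Delta t) \approx I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t$$

which, after substituting into the constraint and dividing by $\Delta t$, results in

$$\frac{\partial I}{\partial x} V_x + \frac{\partial I}{\partial y} V_y + \frac{\partial I}{\partial t} = 0$$

where $V_x, V_y$ are the $x$ and $y$ components of the velocity (optical flow) of $I(x, y, t)$. Writing $I_x$, $I_y$ and $I_t$ for the partial derivatives, this can be written as

$$I_x V_x + I_y V_y = -I_t$$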

This is a single equation in two unknowns and cannot be solved as such. This is known as the aperture problem of optical flow algorithms.
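A common way around the missing equation, and the idea behind the Lucas-Kanade method, is to assume that all pixels in a small patch share the same flow and to solve the stacked constraints in the least-squares sense. The sketch below is a simplified illustration, not a full implementation: it takes precomputed derivative samples (spatial derivatives I_x, I_y and temporal derivative I_t per pixel), and the function name is hypothetical.

```python
def solve_patch_flow(gradients):
    """Least-squares (V_x, V_y) from a list of (I_x, I_y, I_t) samples.

    Each sample contributes one constraint I_x * V_x + I_y * V_y = -I_t.
    Returns None when the constraints do not pin down a unique flow.
    """
    # Entries of the 2x2 normal equations.
    sxx = sum(ix * ix for ix, iy, it in gradients)
    sxy = sum(ix * iy for ix, iy, it in gradients)
    syy = sum(iy * iy for ix, iy, it in gradients)
    sxt = sum(ix * it for ix, iy, it in gradients)
    syt = sum(iy * it for ix, iy, it in gradients)
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return None  # all gradients are parallel: the aperture problem
    vx = (-syy * sxt + sxy * syt) / det
    vy = (sxy * sxt - sxx * syt) / det
    return vx, vy

# Synthetic patch whose true flow is (1.0, -0.5); I_t = -(I_x*V_x + I_y*V_y).
textured = [(1.0, 0.0, -1.0), (0.0, 1.0, 0.5), (1.0, 1.0, -0.5)]
flow = solve_patch_flow(textured)

# A patch containing only one edge orientation gives no unique answer.
single_edge = [(1.0, 0.0, -1.0), (2.0, 0.0, -2.0)]
ambiguous = solve_patch_flow(single_edge)
```

The textured patch recovers the flow (1.0, -0.5), while the single-edge patch returns `None` because every constraint points in the same direction.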

Aperture problem

The motion direction of a contour is ambiguous, because the motion component parallel to the line cannot be inferred based on the visual input. This means that a variety of contours of different orientations moving at different speeds can cause identical responses in a motion sensitive neuron in the visual system.

In this figure, the grating appears to be moving down and to the right, perpendicular to the orientation of the bars. But it could be moving in many other directions, such as only down or only to the right. The true direction cannot be determined unless the ends of the bars become visible in the aperture.
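The grating example can be checked numerically. Assuming image coordinates with x to the right and y downward, and an intensity gradient perpendicular to the bars, the three motions mentioned above produce exactly the same temporal change in brightness, so only the component of motion along the gradient (the normal flow) is recoverable. The function names are illustrative.

```python
def temporal_derivative(grad, velocity):
    """I_t predicted by the constraint I_x * V_x + I_y * V_y + I_t = 0."""
    return -(grad[0] * velocity[0] + grad[1] * velocity[1])

def normal_flow(grad, i_t):
    """The recoverable part of the flow: its component along the gradient."""
    norm_sq = grad[0] ** 2 + grad[1] ** 2
    scale = -i_t / norm_sq
    return (scale * grad[0], scale * grad[1])

# Gradient perpendicular to the bars (pointing down and to the right).
grad = (1.0, 1.0)
# Down-right, only right, and only down (x to the right, y downward).
candidates = [(1.0, 1.0), (2.0, 0.0), (0.0, 2.0)]
responses = [temporal_derivative(grad, v) for v in candidates]
recovered = normal_flow(grad, responses[0])
```

All three entries of `responses` are identical, and `recovered` is the down-right motion perpendicular to the bars, which matches what an observer perceives.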


Egomotion

Egomotion is defined as the 3D motion of a camera within an environment. In the field of computer vision, egomotion refers to estimating a camera’s motion relative to a rigid scene. An example of egomotion estimation would be estimating a car’s moving position relative to lines on the road or street signs being observed from the car itself.

The process of estimating a camera’s motion within an environment involves the use of visual odometry techniques on a sequence of images captured by the moving camera. This is typically done using feature detection to construct an optical flow from two image frames in a sequence generated from either single cameras or stereo cameras. Using stereo image pairs for each frame helps reduce error and provides additional depth and scale information.

Features are detected in the first frame, and then matched in the second frame. This information is then used to make the optical flow field for the detected features in those two images. The optical flow field illustrates how features diverge from a single point, the focus of expansion. The focus of expansion can be detected from the optical flow field, indicating the direction of the motion of the camera, and thus providing an estimate of the camera motion.
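Locating the focus of expansion from a sparse flow field can be sketched as a small least-squares problem. Under the assumption of pure camera translation in a static scene, every flow vector lies on a line through the focus of expansion, so the point closest to all of those lines is a reasonable estimate. The function name and the synthetic data are assumptions made for the example.

```python
def focus_of_expansion(points, flows):
    """Least-squares intersection of the lines defined by the flow vectors.

    For a feature at (x, y) with flow (u, v), a focus of expansion (ex, ey)
    on the same line satisfies v * ex - u * ey = v * x - u * y.
    """
    saa = sab = sbb = sac = sbc = 0.0
    for (x, y), (u, v) in zip(points, flows):
        a, b, c = v, -u, v * x - u * y
        saa += a * a
        sab += a * b
        sbb += b * b
        sac += a * c
        sbc += b * c
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        return None  # flow vectors are all parallel (or zero): no unique point
    ex = (sbb * sac - sab * sbc) / det
    ey = (saa * sbc - sab * sac) / det
    return ex, ey

# Synthetic matches: features move radially away from pixel (320, 240).
points = [(100.0, 100.0), (500.0, 120.0), (300.0, 400.0), (420.0, 300.0)]
flows = [(0.1 * (x - 320.0), 0.1 * (y - 240.0)) for x, y in points]
foe = focus_of_expansion(points, flows)
```

With real matches the flow vectors are noisy and may include outliers or camera rotation, so a practical system would need a more robust estimator than this plain least-squares fit.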

Scale ambiguity

Monocular visual odometry inevitably suffers from scale ambiguity. From feature correspondences alone, the monocular configuration cannot determine the length of the translation (a.k.a. the scale factor).

In the figure on the right, translations with two different scale factors produce the same feature correspondences.

  • (a). The first camera moves 2 meters forward in the 2-meter-wide tunnel.
  • (b). The second camera moves 1 meter forward in the 1-meter-wide tunnel.
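The two tunnel cases can be reproduced with a pinhole projection. In this minimal sketch the camera looks along +Z, the focal length and the 3D points are made-up values, and case (b) is simply case (a) with every length halved.

```python
def project(point, camera_z, focal_px=500.0):
    """Pinhole projection for a camera at (0, 0, camera_z) looking along +Z."""
    x, y, z = point
    depth = z - camera_z
    return (focal_px * x / depth, focal_px * y / depth)

def observe(points, forward_m):
    """Pixel coordinates before and after the camera moves forward."""
    before = [project(p, 0.0) for p in points]
    after = [project(p, forward_m) for p in points]
    return before, after

# (a) Points on the walls of a 2-meter-wide tunnel (x = -1 or +1).
wide = [(-1.0, 0.5, 6.0), (1.0, -0.5, 8.0), (1.0, 0.5, 10.0)]
# (b) The same scene with every length halved: a 1-meter-wide tunnel.
narrow = [(0.5 * x, 0.5 * y, 0.5 * z) for x, y, z in wide]

obs_a = observe(wide, forward_m=2.0)    # camera moves 2 m
obs_b = observe(narrow, forward_m=1.0)  # camera moves 1 m
```

`obs_a` and `obs_b` contain the same pixel coordinates before and after the motion, so no algorithm that sees only these correspondences can tell the two cases apart.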

Extra sensors can fundamentally resolve the ambiguity, but this approach sacrifices one of the strongest advantages of the monocular configuration: compact size and low cost.