I plan to divide this series into three episodes. This time I will introduce basic notions related to image processing; the high-level algorithms will be covered in later articles. (Actually, these articles are written mainly as a review for the final exam ::>_<::)

## Pixel Distance Measure

Consider pixels $p\rightarrow (x,y)$, $q\rightarrow (s,t)$ and $r\rightarrow (u,v)$.

The pixel distance has some properties:

• $D(p,q) \geqslant 0 \quad (D(p,q) = 0 \Leftrightarrow p=q)$

• $D(p,q) = D(q,p)$

• $D(p,r) \leqslant D(p,q) + D(q,r)$

Basic distance functions:

• Euclidean: $D_E(p,q) = \sqrt{(x-s)^2 + (y-t)^2}$
• City-Block: $D_4(p,q) = \vert x-s\vert + \vert y-t\vert$
• Chessboard: $D_8(p,q) = \mathrm{max}\big(\vert x-s\vert,\vert y-t\vert\big)$

Pixel Norm and general distance:

• Norm: $\displaystyle\|f\|_w = \bigg[\int\vert f(x)\vert^w\mathrm{d}x\bigg]^{1/w}$
• (*) Distance: $D_{w}(p,q) = \bigg[\vert x-s\vert^w + \vert y-t\vert^w\bigg]^{1/w}$

Actually, we can find that

• $\left. D_w(p,q) \right|_{w=2} = D_{E}(p,q)$

• $\left. D_w(p,q) \right|_{w=1} = D_4(p,q)$

• $\left. D_w(p,q) \right|_{w=\infty} = D_8(p,q)$

Define neighborhood by distance:

• $N_4(p) = \Big\{r \,\Big\vert \,D_4(p,r) = 1\Big\}$

• $N_8(p) = \Big\{r \,\Big\vert \,D_8(p,r) = 1\Big\}$
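
The distance functions and neighborhoods above can be sketched in plain Python (a minimal illustration; the function names are mine):

```python
import math

# Pixel distances between p = (x, y) and q = (s, t).
# The Minkowski form D_w gives D_4 at w = 1, D_E at w = 2,
# and approaches the chessboard distance D_8 as w -> infinity.
def d_euclid(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def d_city_block(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def d_chessboard(p, q):
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def d_minkowski(p, q, w):
    return (abs(p[0] - q[0]) ** w + abs(p[1] - q[1]) ** w) ** (1.0 / w)

p, q = (0, 0), (3, 4)
assert d_minkowski(p, q, 1) == d_city_block(p, q) == 7
assert d_minkowski(p, q, 2) == d_euclid(p, q) == 5.0
# Large w approaches the chessboard distance max(3, 4) = 4.
assert abs(d_minkowski(p, q, 100) - d_chessboard(p, q)) < 0.1

# N4 / N8 neighborhoods of a pixel, defined by D_4 = 1 and D_8 = 1.
def n4(p):
    x, y = p
    return {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}

def n8(p):
    x, y = p
    return {(x + i, y + j) for i in (-1, 0, 1) for j in (-1, 0, 1)
            if (i, j) != (0, 0)}
```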

## 2-D Discrete Fourier Transform

Given an $N\times N$ matrix (i.e. a 2-D image) $f(x,y)$, the 2-D DFT can be defined as

$$F(u,v) = \frac{1}{N}\sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f(x,y)\, e^{-2\pi j (ux+vy)/N}$$

The 2-D IDFT (Inverse DFT) is defined as

$$f(x,y) = \frac{1}{N}\sum_{u=0}^{N-1}\sum_{v=0}^{N-1} F(u,v)\, e^{2\pi j (ux+vy)/N}$$

We can also find that the 2-D DFT of an $N\times N$ image can be written in matrix form as $F = WfW$, where $W$ is the $N\times N$ unitary DFT matrix with entries $W_{xu} = \frac{1}{\sqrt{N}}\, e^{-2\pi j xu/N}$.

There are many 2-D DFT properties:

• (*) Linearity: $af_1(x,y) + bf_2(x,y) \leftrightarrow aF_1(u,v) + bF_2(u,v)$
• Scale: $f(ax, by) \leftrightarrow \frac{1}{\vert ab\vert} F\big(\frac{u}{a},\frac{v}{b}\big)$
• (*) Translation:
• $f(x-a, y-b) \leftrightarrow e^{-2\pi j (au+bv)/N} F(u,v)$

• $e^{2\pi j (cx+dy)/N}f(x,y) \leftrightarrow F(u-c,v-d)$

• (*) Convolution:
• $f_1(x,y)*f_2(x,y) \leftrightarrow F_1(u,v)F_2(u,v)$

• $f_1(x,y)f_2(x,y) \leftrightarrow F_1(u,v)*F_2(u,v)$

• Rotation: $f(x\cos \theta+y\sin\theta, -x\sin\theta + y\cos\theta) \leftrightarrow F(u\cos \theta + v\sin\theta, -u\sin\theta + v\cos\theta)$
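
The translation and convolution properties can be verified numerically with `numpy.fft` (a sketch; note that numpy puts no $1/N$ on the forward transform, which does not affect these properties):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
f = rng.random((N, N))
F = np.fft.fft2(f)          # numpy's convention: no 1/N on the forward transform

# Translation: shifting f by (a, b) multiplies F by exp(-2*pi*j*(a*u + b*v)/N).
a, b = 2, 3
shifted = np.roll(f, shift=(a, b), axis=(0, 1))
u, v = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
phase = np.exp(-2j * np.pi * (a * u + b * v) / N)
assert np.allclose(np.fft.fft2(shifted), phase * F)

# Convolution: circular convolution in space <-> product of spectra.
g = rng.random((N, N))
conv = np.real(np.fft.ifft2(F * np.fft.fft2(g)))
# direct circular convolution for comparison
direct = np.zeros((N, N))
for x in range(N):
    for y in range(N):
        for s in range(N):
            for t in range(N):
                direct[x, y] += f[s, t] * g[(x - s) % N, (y - t) % N]
assert np.allclose(conv, direct)
```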

## Classical Image Enhancement

Some basic gray-level transformation:

• Negative
• Log
• Power-law: $s = cr^{\gamma}$
• Gamma Correction (power-law transform with $c=1$)
• Contrast Stretching
• Bit-Plane Representation

Simple enhancement methods:

• Histogram Equalization: contrast adjustment
• Histogram Specification (or Histogram Matching)
• Image Sharpening
• Frequency domain: some high-pass filters (Butterworth, Gaussian … )
• Spatial domain: spatial convolution by using gradient operator (Sobel, Laplace … )
• Image Smoothing
• Frequency domain: some low-pass filters (Butterworth, Gaussian …)
• Spatial domain: spatial convolution (average filter, median filter, Multi-frame average … )
• Pseudo Color: A mapping from gray to color
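
As an example of these methods, histogram equalization maps each gray level $r_k$ to $s_k=(L-1)\cdot\mathrm{CDF}(r_k)$. A minimal numpy sketch (the function name is mine):

```python
import numpy as np

def equalize(img):
    """Histogram equalization for an 8-bit grayscale image (a sketch).

    Maps gray level r to s = (L - 1) * CDF(r), with L = 256 levels.
    """
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum() / img.size
    lut = np.round(255 * cdf).astype(np.uint8)   # lookup table r -> s
    return lut[img]

# A low-contrast image occupying only levels 100..120 spreads over the full range.
img = np.random.default_rng(1).integers(100, 121, size=(64, 64), dtype=np.uint8)
out = equalize(img)
assert out.min() < 30 and out.max() > 220
```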

## Classical Image Restoration

Image restoration is different from image enhancement. Though both aim to improve image quality, the goal of restoration is to recover the original image from an existing low-quality image. To restore the original image, we usually need to find a model of the image degradation.

A simple model of image degradation can be defined as a linear system

$$g(x,y) = H\big[f(x,y)\big] + n(x,y)$$

In this system, the degradation function $H$ and the additive noise $n$ act on the original image $f$ to produce the degraded image $g$.

Some common noise:

• Gaussian noise
• Uniform noise
• Exponential noise
• Impulse (salt-and-pepper) noise
• Rayleigh noise
• Periodic noise

Spatial domain filtering:

• only suitable for additive noise
• some common filter:
• mean filters: arithmetic mean filter, geometric mean filter, harmonic mean filter, etc.
• order statistic filters: median filter, max-and-min filters, midpoint filter, etc.

Frequency domain filtering for periodic noise reduction:

• Band-reject filters
• Band-pass filters
• Notch filters

Estimating the degradation function:

It is obvious that to restore the original image, we need the degradation model. In practice, the degradation function can be estimated by image observation, experimentation, modeling, etc. Among them, the modeling method utilizes prior knowledge to construct a mathematical model that describes the degradation process.

Here we give an example of the modeling method. We know that if there exists relative motion between the camera and the target objects, the produced image will be blurred, a phenomenon called “Motion Blur”. Now we can deduce the corresponding degradation function.

Suppose we use a camera with exposure time $T$. The relative motion can be decomposed into two orthogonal directions, described by $x_0(t)$ and $y_0(t)$. Then the blurry photo $g(x,y)$ we take is

$$g(x,y) = \int_{0}^{T} f\big(x - x_0(t),\, y - y_0(t)\big)\,\mathrm{d}t$$

Using the Fourier Transform, we can get the spectrum of $g$ as

$$G(u,v) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} g(x,y)\, e^{-2\pi j(ux+vy)}\,\mathrm{d}x\,\mathrm{d}y = \int_{0}^{T} F(u,v)\, e^{-2\pi j[ux_0(t)+vy_0(t)]}\,\mathrm{d}t$$

Since the spectrum $F(u,v)$ does not depend on time $t$, it can be moved outside the integral, and we can deduce that

$$G(u,v) = F(u,v)\int_{0}^{T} e^{-2\pi j[ux_0(t)+vy_0(t)]}\,\mathrm{d}t = F(u,v)\,H(u,v)$$

If we define the motion functions as $x_0(t) = at/T$ and $y_0(t) = bt/T$, then the system function $H(u,v)$ becomes

$$H(u,v) = \int_{0}^{T} e^{-2\pi j(ua+vb)t/T}\,\mathrm{d}t = \frac{T}{\pi(ua+vb)}\sin\big[\pi(ua+vb)\big]\, e^{-j\pi(ua+vb)}$$

Hence we can get the final model

$$G(u,v) = H(u,v)\,F(u,v)$$
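
This motion-blur transfer function can be built numerically and applied in the frequency domain. A sketch (function name is mine; `np.sinc(x)` is $\sin(\pi x)/(\pi x)$, which handles the $0/0$ limit at $ua+vb=0$):

```python
import numpy as np

def motion_blur_otf(N, a, b, T=1.0):
    """H(u, v) = T/(pi*(u*a + v*b)) * sin(pi*(u*a + v*b)) * exp(-j*pi*(u*a + v*b)).

    Frequencies follow numpy's fft2 layout; at u*a + v*b = 0 the limit is T.
    """
    u = np.fft.fftfreq(N) * N
    v = np.fft.fftfreq(N) * N
    U, V = np.meshgrid(u, v, indexing="ij")
    s = U * a + V * b
    return T * np.sinc(s) * np.exp(-1j * np.pi * s)

H = motion_blur_otf(64, a=0.1, b=0.0)
assert np.isclose(H[0, 0], 1.0)          # no attenuation of the DC component
assert np.all(np.abs(H) <= 1.0 + 1e-12)  # |H| <= T

# Apply the degradation model G = H * F to an image f:
f = np.random.default_rng(2).random((64, 64))
g = np.real(np.fft.ifft2(H * np.fft.fft2(f)))
```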

Image Geometric Correction:

In practice, the given photo often tends to be distorted. The aim of geometric correction is to eliminate this kind of distortion. The correction usually contains two steps:

• coordinate correction (spatial transformation)
• intensity (gray value) estimation (gray value interpolation)

The first step, coordinate correction, transforms the input image $f(x,y)$ into the output $g(x,y) = f(x',y')$, which means that we should construct a proper relationship between the true grid coordinates and the distorted input coordinates. The relationship can be represented as a polynomial equation. In this step, we usually choose a set of image control points to guarantee the correction quality.

Since the calculated correction coordinates tend to be non-integer values in the target grid, we execute the second step to get the final output image. Bilinear interpolation is often used in this step, and the warp is usually backward rather than forward. A detailed description can be found in this image processing tutorial. Here we give some equations related to this kind of interpolation.

Let $P=(x,y)$ be the point that the backward warp produced, and let $Q_{ij}=(x_i,y_j)$ be its four neighboring grid points. Viewing the image as a surface $f(x,y)$, we first use 1-D interpolation in the $x$-direction to calculate the values at $R_1=(x,y_1)$ and $R_2=(x,y_2)$:

$$f(R_1) \approx \frac{x_2-x}{x_2-x_1}\, f(Q_{11}) + \frac{x-x_1}{x_2-x_1}\, f(Q_{21}), \qquad f(R_2) \approx \frac{x_2-x}{x_2-x_1}\, f(Q_{12}) + \frac{x-x_1}{x_2-x_1}\, f(Q_{22})$$

The factor $(x-x_1)\,/\,(x_2-x_1)$ weights how much each pair of grid values contributes along the $x$-direction. Now we can find the value of $P$ by interpolating in the $y$-direction using $R_1$ and $R_2$:

$$f(P) \approx \frac{y_2-y}{y_2-y_1}\, f(R_1) + \frac{y-y_1}{y_2-y_1}\, f(R_2)$$

Actually, you can find another expression of bilinear interpolation on Wikipedia; the two are completely equivalent.
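
The two interpolation steps above translate directly into code (a minimal sketch; the function name is mine and the image is indexed as `f[x][y]`):

```python
def bilinear(f, x, y):
    """Bilinear interpolation of image f (a 2-D list/array, f[x][y]) at a
    real-valued point (x, y), using the four surrounding grid points."""
    x1, y1 = int(x), int(y)
    x2, y2 = x1 + 1, y1 + 1
    dx, dy = x - x1, y - y1
    # interpolate along x at rows y1 and y2 ...
    r1 = (1 - dx) * f[x1][y1] + dx * f[x2][y1]
    r2 = (1 - dx) * f[x1][y2] + dx * f[x2][y2]
    # ... then along y between R1 and R2
    return (1 - dy) * r1 + dy * r2

f = [[0, 0], [10, 10]]        # value depends on x only
assert bilinear(f, 0.5, 0.5) == 5.0
assert abs(bilinear(f, 0.25, 0.9) - 2.5) < 1e-9
```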

## Digitization in Image Analysis

To analyze digital images, tools from digital topology and digital geometry are often used.

There are some kinds of sampling grids:

• (*) Rectangular grid: widely used and intuitive, but suffers from the connectivity paradox
• Triangular grid: $N_3(p)$, $N_{12}(p)$
• Hexagonal grid: $N_6(p)$

The sampling points are located at the center of grid cells.

We can define the sampling efficiency of a grid as the fraction of a cell’s area covered by its inscribed unit circle.

• Rectangular grid’s efficiency: $\pi/4\approx 0.79$
• Triangle grid’s efficiency: $\pi/(3\sqrt{3})\approx0.60$
• Hexagon grid’s efficiency: $\pi/(2\sqrt{3})\approx0.91$

The preimages and domain of the digitized set P are defined as:

• Given a set of discrete points P, a continuous set of points S that is digitized as P is called a preimage of P
• The region defined by the union of all possible preimages S is called the domain of P

A suitable digitizing model should have the following characteristics:

• The digitized result for a nonempty continuous set should be nonempty
• The digitizing model should be as translation-invariant as possible
• Given a set P, each of its preimages should be similar under certain criteria

Some common digitizing models:

• The Square Box Quantization (SBQ)
• Grid-Intersect Quantization (GIQ)

The Square Box Quantization:

In SBQ, there is a corresponding digitization box $B_i=(x_i-1/2,x_i+1/2)\times(y_i-1/2,y_i+1/2)$ for each pixel $p_i=(x_i,y_i)$. A pixel $p_i$ is in the digitized set $P$ of a preimage $S$ if and only if $B_i \cap S \neq \varnothing$, that is, its corresponding digitizing cell $B_i$ intersects $S$.

The Grid-Intersect Quantization:

Given a thin object $C$ formed by continuous points, its intersection with a grid line is a real point $t=(x_t,y_t)$. This point satisfies $x_t\in\mathbb{Z}$ or $y_t\in\mathbb{Z}$, depending on whether $C$ intersects a vertical or a horizontal grid line. The point $t\in C$ is mapped to the grid point $p_i=(x_i,y_i)$ for which $t\in (x_i-1/2,x_i+1/2)\times(y_i-1/2,y_i+1/2)$.

Comparison between GIQ and SBQ:

• GIQ reduces the number of pixels in a digitized set
• the area of the SBQ domain is smaller than the area of the GIQ domain
• the shape of the GIQ domain appears to be more appropriate for the description of a curve

Digital Arcs:

In Euclidean geometry, an arc is a part of a curve between two points, while a chord is the straight line segment connecting two points on the curve. A continuous straight line segment can often be viewed as a special case of an arc.

Given a set of discrete points and the adjacency between them, the digital arc $P_{pq}$ from point p to point q is defined as the arc $P_{pq}=\{p_i,i=0,1,\ldots,n\}$ that satisfies the following conditions:

• $p_0=p$, $p_n=q$

• $\forall\, i \in \{1,2,\ldots,n-1\}$, point $p_i$ has exactly two adjacent points in the arc $P_{pq}$, namely $p_{i-1}$ and $p_{i+1}$

• The endpoint $p_0$ (or $p_n$) has exactly one adjacent point in the arc $P_{pq}$, it is $p_1$ (or $p_{n-1}$)

4-digital arcs or 8-digital arcs can be defined as above, depending on the adjacency used.

Digitized set:

Consider the continuous straight line segment $[\alpha,\beta]$ in a square grid. Using GIQ, points intersecting the grid lines between $[\alpha, \beta]$ are mapped to a set of discrete points $\{p_i\},i=0,1,\ldots,n$. This set is called a digitized set of $[\alpha, \beta]$.

Digital Chords:

A digital arc can be a digital chord. The determination of the digital chord is based on the following principle: given a digital arc $P_{pq}$, the distance between the continuous line segment $[p_i,p_j]$ and the union of segments $\bigcup_k[p_k,p_{k+1}]$ can be measured using a discrete distance function and should not exceed a certain threshold.

It can be proved that given any two discrete points $p_i$ and $p_j$ in an 8-digital arc $P_{pq}=\{p_i\}$ and any real point $\rho$ on the continuous line segment $[p_i,p_j]$, if there exists a point $p_k\in P_{pq}$ such that $D_8(\rho,p_k) < 1$, then $P_{pq}$ satisfies the chord property. Similarly, if there exists a point $p_k\in P_{pq}$ such that $D_8(\rho,p_k) \leqslant 1$, then $P_{pq}$ satisfies the compact chord property.

It can be proved that in the 8-digital space, the result of line digitization is a digital arc that satisfies the chord property. Conversely, if a digital arc satisfies the chord property, it is the result of digitizing a line segment.

Distance Transform:

Distance Transform (DT) is a special transform that maps a binary image into a gray-scale image. DT is a global operation, but it can be obtained by local distance computation.

DT computes the distance between a point inside a region and the closest point outside the region. In other words, the DT of a point in a region computes the nearest distance between this point and the boundary of the region. Given a set of points $P$, a subset $B\subset P$, and a distance function $D$ that satisfies the distance measure properties (e.g. $D_E$, $D_4$ and $D_8$), then $\mathrm{DT}(p) = \min_{q\in B} \big\{D(p,q)\big\}$.

Given a set of points $P$ and one of its subsets $B$, the distance transformation of $P$ satisfies some properties:

• $\mathrm{DT}(p)=0 \Leftrightarrow p\in B$

• $\forall\,p\in P-B, \;\exists \, q\in N_{D}(p)$ such that the discrete distance transform value at position $p$ satisfies the relationship $\mathrm{DT}(p)=\mathrm{DT}(q) + D(p,q)$.

• (Also some other properties $\cdots$)

For the second property, since $p$ and $q$ are neighbors, $\ell(p,q)=D(p,q)$ is the length of the move between them. Therefore, for any point $p\notin B$, the distance transform can be computed recursively as $\mathrm{DT}(p) = \min_{q'\in N_{D}(p)}\big\{\mathrm{DT}(q') + \ell(p,q')\big\}$.

To calculate the discrete distance map using masks, we consider a binary image of size $W\times H$ and assume the set of its border points $B$ is known. The discrete distance map is a matrix, denoted as $\big[\mathrm{DT}(p)\big]_{W\times H}$, whose values are calculated iteratively until a stable state is reached.

Before the iteration begins ($t=0$), the distance values are initialized as

$$\mathrm{DT}^{0}(p)=\begin{cases}0, & p\in B\\ \infty, & \text{otherwise}\end{cases}$$

Then, for iterations $t > 0$, the mask $M(k,l)$ is positioned at a pixel $p=(x_p,y_p)$, and the distance values are propagated from the pixels $q=(x_p+k,y_p+l)$ onto $p$ by using the updating formula:

$$\mathrm{DT}^{t}(p)=\min_{(k,l)\in M}\big\{\mathrm{DT}^{t-1}(q) + M(k,l)\big\}, \qquad q=(x_p+k,\,y_p+l)$$

Note that the implementation of this process can be sequential or parallel.
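
A sequential implementation of this propagation for the $D_4$ metric needs only two passes (a sketch; the function name is mine, and the mask values $M(k,l)=1$ correspond to the 4-neighbor moves):

```python
INF = float("inf")

def distance_transform_d4(is_border):
    """City-block (D_4) distance transform of a binary grid.

    is_border[i][j] is True for points of B; one forward and one backward
    sequential pass reach the stable state of the iteration."""
    h, w = len(is_border), len(is_border[0])
    dt = [[0 if is_border[i][j] else INF for j in range(w)] for i in range(h)]
    # forward pass: propagate from the neighbors above and to the left
    for i in range(h):
        for j in range(w):
            if i > 0:
                dt[i][j] = min(dt[i][j], dt[i - 1][j] + 1)
            if j > 0:
                dt[i][j] = min(dt[i][j], dt[i][j - 1] + 1)
    # backward pass: propagate from the neighbors below and to the right
    for i in range(h - 1, -1, -1):
        for j in range(w - 1, -1, -1):
            if i < h - 1:
                dt[i][j] = min(dt[i][j], dt[i + 1][j] + 1)
            if j < w - 1:
                dt[i][j] = min(dt[i][j], dt[i][j + 1] + 1)
    return dt

# B is the single point in the middle of a 5x5 grid.
b = [[i == 2 and j == 2 for j in range(5)] for i in range(5)]
dt = distance_transform_d4(b)
assert dt[2][2] == 0
assert dt[0][0] == 4          # D_4((0,0), (2,2)) = |0-2| + |0-2|
assert dt[2][0] == 2
```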

## Classical Image Edge Detection

This sub-section will introduce five kinds of Differential Edge Detector (DED).

Since the edge in an image corresponds to the discontinuity in an image, the differential edge detector is popularly employed to detect edges. The detailed classical edge detection methods can be found here.

Some parameters to describe an edge:

• location (center): where the maximum gray-level discontinuity occurs
• direction: the direction across the discontinuity
• strength (magnitude): the difference along the edge direction
• mean value: the mean of gray value on edge
• slope: the degree of tilt on the edge direction

Points which lie on an edge can be detected by:

• detecting local maxima or minima of the first derivative
• detecting the zero-crossing of the second derivative

the 1st derivative:

$$\frac{\partial f}{\partial x} \approx f(x+1) - f(x)$$

which can be described as a mask [-1 1], or [-1 0 1] (centered about $x$).

the 2nd derivative:

$$\frac{\partial^2 f}{\partial x^2} \approx f(x+1) - 2f(x) + f(x-1)$$

which can be described as a mask [1 -2 1].

Actually, we usually use the gradient to do image edge detection.

The gradient is a vector which has certain magnitude and direction, which is defined as: $\nabla f(x,y) = [G_x\;\, G_y]^{\mathrm{T}} = \begin{bmatrix}\frac{\partial f}{\partial x}\\ \frac{\partial f}{\partial y} \end{bmatrix}$

Its magnitude and direction:

$$\vert\nabla f\vert = \sqrt{G_x^2+G_y^2}, \qquad \theta = \arctan\big(G_y/G_x\big)$$

We can estimate the gradient with finite differences (the 1st difference):

$$G_x \approx \frac{f(x+h_x,y)-f(x,y)}{h_x}, \qquad G_y \approx \frac{f(x,y+h_y)-f(x,y)}{h_y}$$

If we let $h_x = h_y = 1$, then the gradient can be written as:

$$G_x \approx f(x+1,y)-f(x,y), \qquad G_y \approx f(x,y+1)-f(x,y)$$

Moreover, the 2nd derivative is also commonly used for edge detection.

There are two operators in 2D that correspond to the second derivative: Laplacian and Second directional derivative. In this kind of method, we can find the zero-crossing of the second derivative to detect edge points. Similar to gradient method, the 2nd derivative can be estimated by using 2nd differences.

In practice, various kinds of edge detection operator are used to do edge detection:

• gradient operator (the 1st difference): Roberts, Prewitt, Sobel
• 2nd derivative (the 2nd difference): Laplacian, Marr
• optimal operator: Canny
• non-differential edge operator: SUSAN

In pixel-coordinate notation, $j$ corresponds to the $x$ direction and $i$ to the negative $y$ direction. In this way, the 1st differences can be written as:

$$G_x \approx f(i,j+1)-f(i,j), \qquad G_y \approx f(i,j)-f(i+1,j)$$

The mask of them can be $[-1\;\,1]$ and $\begin{bmatrix}1\\ -1\end{bmatrix}$ respectively.

Now let us look at some gradient operators; consider a $3\times 3$ grid.

Roberts operator:

$$G_x = f(i,j) - f(i+1,j+1), \qquad G_y = f(i+1,j) - f(i,j+1)$$

This approximation can be implemented by the following masks:

$$G_x:\ \begin{bmatrix}1&0\\0&-1\end{bmatrix} \qquad G_y:\ \begin{bmatrix}0&1\\-1&0\end{bmatrix}$$

Note that the Roberts edge detection operator approximates the gradient at the interpixel point $(i+1/2,\,j+1/2)$ rather than at $(i,j)$.

Prewitt and Sobel operator:

Now consider the arrangement of pixels in the $3\times 3$ neighborhood of the pixel $(i,j)$.

The partial derivatives can be computed as

$$G_x = \big[f(i-1,j+1) + c\,f(i,j+1) + f(i+1,j+1)\big] - \big[f(i-1,j-1) + c\,f(i,j-1) + f(i+1,j-1)\big]$$

$$G_y = \big[f(i-1,j-1) + c\,f(i-1,j) + f(i-1,j+1)\big] - \big[f(i+1,j-1) + c\,f(i+1,j) + f(i+1,j+1)\big]$$

where the constant $c$ emphasizes the pixels closer to the center of the mask.

From this form, we get the Prewitt and Sobel operator:

• Prewitt $\rightarrow$ c = 1
• Sobel $\rightarrow$ c = 2

Operators of this type can be implemented as the following masks:

$$\text{Prewitt: } \begin{bmatrix}-1&0&1\\-1&0&1\\-1&0&1\end{bmatrix},\ \begin{bmatrix}1&1&1\\0&0&0\\-1&-1&-1\end{bmatrix} \qquad \text{Sobel: } \begin{bmatrix}-1&0&1\\-2&0&2\\-1&0&1\end{bmatrix},\ \begin{bmatrix}1&2&1\\0&0&0\\-1&-2&-1\end{bmatrix}$$
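
A small sketch of applying these masks (function names are mine; `convolve3x3` is a plain "valid" correlation, and `c = 1` / `c = 2` select Prewitt / Sobel as in the text):

```python
import numpy as np

def gradient_masks(c):
    """3x3 gradient masks; c = 1 gives Prewitt, c = 2 gives Sobel."""
    gx = np.array([[-1, 0, 1], [-c, 0, c], [-1, 0, 1]])
    gy = np.array([[1, c, 1], [0, 0, 0], [-1, -c, -1]])
    return gx, gy

def convolve3x3(img, mask):
    """'valid' correlation of img with a 3x3 mask (a minimal sketch)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * mask)
    return out

# A vertical step edge: columns 0..2 are 0, columns 3..5 are 10.
img = np.zeros((6, 6))
img[:, 3:] = 10.0
gx, gy = gradient_masks(c=2)                 # Sobel
mag = np.hypot(convolve3x3(img, gx), convolve3x3(img, gy))
assert mag[:, 1].max() > 0                   # strong response across the edge
assert np.all(convolve3x3(img, gy) == 0)     # no horizontal-edge response
```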

Now let us take a look at the 2nd derivative operator.

The Laplacian:

The Laplacian operator is very famous. It is defined as $\nabla^2f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$. Using the 2nd difference to approximate it, we get

$$\nabla^2 f \approx f(x+1,y) + f(x-1,y) + f(x,y+1) + f(x,y-1) - 4f(x,y)$$

This corresponds to a mask with 4-connectivity. To extend it, we can rotate it by $45^{\circ}$ and add the result to the original, which produces another template covering the 8-neighborhood. These two types can be implemented as the following masks:

$$\begin{bmatrix}0&1&0\\1&-4&1\\0&1&0\end{bmatrix} \qquad \begin{bmatrix}1&1&1\\1&-8&1\\1&1&1\end{bmatrix}$$

Some properties of the Laplacian:

• (*) It is an isotropic operator (rotation has no effect on Laplacian)
• It is cheaper to implement (one mask only)
• (*) It does not provide information about edge direction
• (*) It is more sensitive to noise (differentiates twice)
• It produces edges that are two pixels wide
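
The rotate-and-add construction and the isotropy property can be checked directly (a small sketch; variable names are mine):

```python
import numpy as np

# 4-neighbor Laplacian mask and its 8-neighbor extension
# (the 45-degree rotated mask added to the original).
lap4 = np.array([[0, 1, 0],
                 [1, -4, 1],
                 [0, 1, 0]])
lap4_rot45 = np.array([[1, 0, 1],
                       [0, -4, 0],
                       [1, 0, 1]])
lap8 = lap4 + lap4_rot45
assert (lap8 == np.array([[1, 1, 1],
                          [1, -8, 1],
                          [1, 1, 1]])).all()

# Both masks sum to zero, so a constant region produces no response.
assert lap4.sum() == 0 and lap8.sum() == 0
# Rotating either mask by 90 degrees leaves it unchanged (isotropy on the grid).
assert (np.rot90(lap4) == lap4).all() and (np.rot90(lap8) == lap8).all()
```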

Marr operator:

The process of the Marr operator is simple and is known as Laplacian-of-Gaussian (LoG). First, we apply Gaussian smoothing (a low-pass filter) to remove noise from the input image. Then, we compute the Laplacian of the result using the Laplacian operator. Note that these two steps can be combined to produce a single LoG mask.

The low-pass Gaussian filter is

$$G(x,y) = e^{-\frac{x^2+y^2}{2\sigma^2}}$$

It can be shown that

$$\nabla^2\big[G(x,y) * f(x,y)\big] = \big[\nabla^2 G(x,y)\big] * f(x,y)$$

hence we deduce the total filter as

$$\nabla^2 G(x,y) = \frac{x^2+y^2-2\sigma^2}{\sigma^4}\, e^{-\frac{x^2+y^2}{2\sigma^2}}$$

The complete derivation of this final LoG form can be found here, though we can also obtain it directly with Mathematica via $\texttt{LoG = Simplify[Laplacian[G, {x, y}]]}$.

Then we can plot it using Mesh (set $\sigma=0.3$)
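
A sampled LoG kernel can also be built numerically. A sketch (the function name is mine; re-centering the sampled kernel so it sums to zero is a practical tweak, not part of the continuous formula):

```python
import numpy as np

def log_kernel(size, sigma):
    """Sampled Laplacian-of-Gaussian:
    ((x^2 + y^2 - 2*sigma^2) / sigma^4) * exp(-(x^2 + y^2) / (2*sigma^2))."""
    r = size // 2
    x, y = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    s2 = x ** 2 + y ** 2
    k = (s2 - 2 * sigma ** 2) / sigma ** 4 * np.exp(-s2 / (2 * sigma ** 2))
    return k - k.mean()          # re-center so the kernel sums to zero

k = log_kernel(9, sigma=1.4)
assert abs(k.sum()) < 1e-12          # zero response on constant regions
assert k[4, 4] < 0                   # the "Mexican hat" dips at the center
assert k[4, 4] == k.min()
```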

Now let us consider the optimal operator. There are some criteria for optimal edge detection:

• the optimal detector must minimize the probability of false positives, as well as that of false negatives
• the edges detected must be as close as possible to the true edges
• the detector must return one point only for each true edge point

Canny is probably the most widely used edge detector in computer vision. It has been shown that the 1st derivative of the Gaussian closely approximates the operator that optimizes the product of signal-to-noise ratio and localization. The process of Canny:

• Gaussian smoothing
• Compute the gradient magnitude and direction
• Apply non-maxima suppression
• Apply hysteresis thresholding/edge linking

Now we introduce Canny step by step; a detailed introduction can be found in this BLOG.

After Gaussian smoothing, we compute the gradient of the smoothed image $s = G * f$:

$$s_x = \frac{\partial s}{\partial x}, \qquad s_y = \frac{\partial s}{\partial y}$$

Then we can compute the magnitude image by $\sqrt{s_x^2+s_y^2}$ and direction $\arctan(s_y/s_x)$.

Non-maxima suppression:

• Broad ridges must be thinned so that only the magnitudes at the points of greatest local change remain
• All values along the direction of the gradient that are not peak values of a ridge are suppressed
• Non maximum suppression requires us to divide the $3 \times 3$ grid of pixels into 8 sections

Here we give the explanation of the grid division. We know that the 8-connected grid has 4 directions. The edge direction angle is rounded to one of four angles representing vertical, horizontal and the two diagonals ($0^\circ$, $45^\circ$, $90^\circ$ and $135^\circ$). An edge direction falling in each region is set to that specific angle value.

Double thresholding:

The output of non-maxima suppression still contains local maxima created by noise. Consider the two single-threshold options:

• low threshold: some noisy maxima will be accepted too
• high threshold: true maxima might be missed

For this reason, a more effective scheme is to use two thresholds:

• a low threshold $t_l$
• a high threshold $t_h$
• usually $t_h\approx 2t_l$

Using these two thresholds, we get two thresholded images $I_l$ and $I_h$, and then link them into contours:

• Now we want to link the edges in $I_h$ into contours because there are gaps in this map

• Look in $I_l$ when a gap is found
• By judging the 8 neighbors in $I_l$, gather edge points from $I_l$ until the gap has been bridged to an edge in $I_h$

After the thresholding, the edge linking is also done!
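
The double-thresholding and linking steps can be sketched as follows (a simplified version: instead of explicitly walking $I_l$ gap by gap, it grows the strong-edge set of $I_h$ through the weak pixels of $I_l$ via 8-neighbors, which yields the same kind of result; the function name is mine):

```python
import numpy as np

def hysteresis(mag, t_low, t_high):
    """Double thresholding with edge linking (a sketch): pixels above t_high
    are strong edges; pixels above t_low are kept only if connected (via
    8-neighbors) to a strong edge."""
    strong = mag >= t_high
    weak = mag >= t_low
    edges = strong.copy()
    h, w = mag.shape
    # grow the strong set through the weak set until it stops changing
    changed = True
    while changed:
        changed = False
        for i in range(h):
            for j in range(w):
                if weak[i, j] and not edges[i, j]:
                    nb = edges[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
                    if nb.any():
                        edges[i, j] = True
                        changed = True
    return edges

# A broken ridge: the middle pixel is only a weak response, the ends are strong.
mag = np.array([[0, 0, 0, 0, 0],
                [0, 9, 4, 9, 0],
                [0, 0, 0, 0, 0]], float)
edges = hysteresis(mag, t_low=3, t_high=8)
assert edges[1, 1] and edges[1, 3]
assert edges[1, 2]                  # the weak pixel bridges the gap
assert not edges[0, 0]
```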

SUSAN:

The idea of the SUSAN edge detector is simple; a detailed introduction can be found here. Note that this method does not need to calculate the gradient, so it is easy to implement. The SUSAN edge finder is implemented using circular masks to give isotropic responses. The mask is placed at each point in the image and, for each point, the brightness of each pixel within the mask is compared with that of the nucleus; the comparison is determined by:

$$c(\mathbf{r},\mathbf{r}_0)=\begin{cases}1, & \vert I(\mathbf{r})-I(\mathbf{r}_0)\vert \leqslant t\\ 0, & \vert I(\mathbf{r})-I(\mathbf{r}_0)\vert > t\end{cases}$$

where $\mathbf{r}_0=(x_0,y_0)$ is the position of the nucleus, $\mathbf{r}=(x,y)$ is the position of any other point within the mask, $t$ is the brightness difference threshold, and $c$ is the output of the comparison.

This comparison is done for each pixel $\mathbf{r}$ within the mask, and a running total of $c$ is made by

$$s(\mathbf{r}_0) = \sum_{\mathbf{r}} c(\mathbf{r},\mathbf{r}_0)$$

This total $s$ is just the number of pixels in the USAN, which gives the USAN’s area.

Then, $s$ is compared with a fixed threshold $g$ (the geometric threshold), which is set to $3s_{\max}/4$. The edge response is then created by using the following rule:

$$R(\mathbf{r}_0)=\begin{cases}g - s(\mathbf{r}_0), & s(\mathbf{r}_0) < g\\ 0, & \text{otherwise}\end{cases}$$

In practice, a more stable and sensible equation to calculate $c$ is

$$c(\mathbf{r},\mathbf{r}_0) = e^{-\left(\frac{I(\mathbf{r})-I(\mathbf{r}_0)}{t}\right)^6}$$
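
The USAN computation can be sketched in a few lines (my simplification: a square $3\times 3$ mask instead of SUSAN's usual circular mask; the function name is mine):

```python
import math

def susan_edge_response(img, x0, y0, t=27, radius=1):
    """USAN area and edge response at the nucleus (x0, y0), a minimal sketch.

    Uses the smooth comparison c = exp(-((I(r) - I(r0))/t)^6) over a square
    mask (instead of the usual circular one); response R = g - s when s < g.
    """
    i0 = img[x0][y0]
    s = 0.0
    n = 0
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if dx == 0 and dy == 0:
                continue
            s += math.exp(-((img[x0 + dx][y0 + dy] - i0) / t) ** 6)
            n += 1
    g = 3 * n / 4          # geometric threshold g = 3 * s_max / 4
    return max(g - s, 0.0)

# A step edge: left half dark, right half bright.
img = [[0, 0, 100, 100] for _ in range(4)]
assert susan_edge_response(img, 1, 1) > 0     # nucleus next to the edge responds
flat = [[50] * 4 for _ in range(4)]
assert susan_edge_response(flat, 1, 1) == 0   # uniform region: s = s_max > g
```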

## Classical Image Segmentation

Formal definition of image segmentation:

A segmentation of the grid $S$ for a uniformity predicate $P$ is a partition of $S$ into disjoint non-empty subsets $S_1,S_2,\cdots,S_N$ such that:

• The union of all $S_i$ is the entire $S$: $\bigcup_{i=1}^{N}S_i = S$
• For all $i$ and $j$ ($i\neq j$), it has $S_i\cap S_j = \varnothing$
• Every region $S_i$ is a connected component
• Given the uniformity predicate $P$, it holds that $P(S_i)=\mathrm{True}$ for every $S_i$, and $P(S_i\cup S_j)=\mathrm{False}$ for adjacent regions $S_i$ and $S_j$

We mainly concentrate on the following segmentation methods:

• (*) Thresholding segmentation
• Region growing
• Region splitting and merging
• (*) Watershed transformation
• Clustering methods
• (*) Partial Differential Equation (PDE) methods: snake, level-set
• Graph Cut

The detailed segmentation tutorial can be found here and here.

Thresholding segmentation:

Thresholding is the simplest and most commonly used method of segmentation. Given a single threshold $t$:

$$g(x,y)=\begin{cases}1, & f(x,y) > t\\ 0, & f(x,y)\leqslant t\end{cases}$$

This equation describes the simple mechanism of 2-class segmentation. The multi-class version is similar to this.

Actually, thresholding segmentation can be classified into three types:

• pixel dependent method (global thresholding)
• region dependent method (local thresholding)
• coordinate dependent method (dynamic thresholding)

The discussion of thresholding segmentation is mainly about the selection of the threshold $t$. I’m not going to show the details of these three types of thresholding methods; one can find lots of material about them using Google.
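
As one concrete example of global threshold selection, the classical iterative midpoint scheme starts from the image mean and repeatedly sets $t$ to the midpoint of the two class means (a sketch; function names are mine):

```python
import numpy as np

def threshold(img, t):
    """2-class segmentation with a single global threshold t."""
    return (img > t).astype(np.uint8)

def iterative_threshold(img, eps=0.5):
    """Classical iterative global threshold selection (a sketch):
    start from the mean, then repeat t <- midpoint of the two class means."""
    t = img.mean()
    while True:
        low, high = img[img <= t], img[img > t]
        t_new = (low.mean() + high.mean()) / 2
        if abs(t_new - t) < eps:
            return t_new
        t = t_new

# Two well-separated gray-level populations around 50 and 200.
rng = np.random.default_rng(3)
img = np.concatenate([rng.normal(50, 5, 500), rng.normal(200, 5, 500)])
t = iterative_threshold(img)
assert 100 < t < 160
seg = threshold(img, t)
assert seg.sum() == 500        # exactly the bright population is labeled 1
```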

Watershed:

The main goal of the watershed segmentation algorithm is to find the “watershed lines” in an image in order to separate the distinct regions. Imagine the pixel values of an image as a 3-D topographic surface, where $x$ and $y$ denote the coordinates of the plane, and $f(x,y)$ denotes the height at the point $(x,y)$.

The algorithm pours water into this topography from the lowest basin up to the highest peak. In the process, wherever the water from distinct catchment basins is about to merge, we build a “dam” on the separating ridge.

Before the watershed algorithm, we first calculate the gradient image $g(x,y)$ of the original image $f(x,y)$. Let $M_i$, $i\in[1,R]$, denote the coordinates of the regional minima of $g(x,y)$, and let $C(M_i)$ be the set of coordinates in the catchment basin associated with $M_i$. Let $T[n]$ denote the set of coordinates whose gradient value is below $n$:

$$T[n]=\big\{(s,t)\,\big\vert\, g(s,t) < n\big\}$$

The value $n$ is the threshold on the gray value.

Now we start the watershed algorithm:

• Find the minimum and maximum values of $g(x,y)$, denoted min and max
• Assign the coordinates of min to $M_i$. The topography will be flooded in integer flood increments from $n = \text{min}+1$
• Compute $C_n(M_i)=C(M_i)\cap T[n]$, which produces a binary image
• Let $C[n]=\bigcup_{i=1}^{R}C_n(M_i)$, which is the union of the flooded catchment basins at stage $n$
• Set $n = n+1$
• Derive the set of connected components (CC) in $T[n]$, denoted as $Q$
• For each connected component $q\in Q$, there are three conditions
• If $q\cap C[n-1]$ is empty, $q$ is incorporated into $C[n-1]$ to form $C[n]$
• If $q\cap C[n-1]$ contains only one CC, $q$ is incorporated into $C[n-1]$ to form $C[n]$
• If $q\cap C[n-1]$ contains more than one CC, find the points of the ridge(s) and set them as “dams”
• Construct $C[n]$, and then set $n = n+1$
• Repeat steps 6 to 8 until $n$ reaches max$+1$

The details of watershed can be found here, which is the slides given by Desok Kim.
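
The flooding idea can be sketched with a marker-based priority flood, a much-simplified stand-in for the full $C[n]$/$T[n]$ iteration above (the function name, the marker dictionary, and the convention of label 0 for dams are all mine):

```python
import heapq

def watershed(g, markers):
    """Marker-based flooding sketch of the watershed transform.

    g: 2-D list of gradient values; markers: dict {(x, y): label} for the
    regional minima M_i.  Pixels are flooded in increasing order of g;
    a pixel adjacent to two different labels becomes a dam (label 0).
    """
    h, w = len(g), len(g[0])
    label = [[None] * w for _ in range(h)]
    heap = []
    for (x, y), lab in markers.items():
        label[x][y] = lab
        heapq.heappush(heap, (g[x][y], x, y))
    while heap:
        _, x, y = heapq.heappop(heap)
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < h and 0 <= ny < w and label[nx][ny] is None:
                # which catchment basins does this pixel touch?
                labs = {label[px][py]
                        for px, py in ((nx + 1, ny), (nx - 1, ny),
                                       (nx, ny + 1), (nx, ny - 1))
                        if 0 <= px < h and 0 <= py < w and label[px][py]}
                label[nx][ny] = 0 if len(labs) > 1 else labs.pop()
                if label[nx][ny] != 0:
                    heapq.heappush(heap, (g[nx][ny], nx, ny))
    return label

# Two basins separated by a ridge in the middle column.
g = [[0, 5, 0],
     [0, 5, 0],
     [0, 5, 0]]
lab = watershed(g, {(0, 0): 1, (0, 2): 2})
assert lab[2][0] == 1 and lab[2][2] == 2
assert all(row[1] == 0 for row in lab)      # the ridge becomes the dam
```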

Active Contour Based Segmentation:

This kind of method is very famous; it utilizes partial differential equations to find the contours of objects. I’m not going to explain the detailed algorithms in this article, but have written another one containing explanations of some classical active contour methods (e.g. Snake and Level-Set). Here and here are some good materials.

The advantages of active contour based methods:

• easier to describe objects
• sub-pixel precision
• easier to combine various kinds of information (e.g. prior, motion information)
• utilize strong mathematical tools: PDE, calculus of variations, etc.

Graph Cuts:

Segmentation with Graph Cuts is another interesting method. The goal is to segment the main objects out of an image. A graph-based approach makes use of efficient solutions to the max-flow/min-cut problem between source and sink nodes in directed graphs. Here is a tutorial about this method.