Convolutional Neural Networks

Chun-Hao Yang

Outline

  • Convolution
    • Gaussian Filter
    • Cross-Correlation and Auto-Correlation
    • Edge Detection
  • Basic CNN Architecture
    • Padding and Stride
    • Pooling
    • AlexNet
  • Applications
    • Object Detection
    • Image Segmentation

Convolutional Networks

  • Convolutional networks, also known as convolutional neural networks, or CNNs, are a specialized kind of neural network for processing data that has a known grid-like topology.
  • Examples:
    • time-series data: 1-D grid taking samples at regular time intervals
    • images: 2-D grid of pixels
    • 3-D images: 3-D grid of voxels
    • video: 3-D grid = 2-D of pixels + 1-D of time
  • Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

Convolution Operation

  • Given two functions \(f(t)\) and \(g(t)\), the convolution of \(f\) and \(g\) is defined as: \[ (f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t-\tau)d\tau. \]
  • One way to think about convolution is as a weighted average of the values of \(f\) near time \(t\), where the weights are given by the function \(g\).
  • In practice, we work with discrete data. That is, we only observe \(f(t)\) for a finite set of values of \(t\), e.g., \(t \in \mathbb{Z}\).
  • In this case, the convolution is defined as: \[ (f * g)(t) = \sum_{\tau=-\infty}^{\infty} f(\tau)g(t-\tau). \]
  • The function \(f\) is often called the input and the function \(g\) is called the kernel or filter.

Example

  • Let \(f(t) = [1, 2, 3, 4]\) and \(g(t) = [1, 1, 1]\). That is,
    • \(f(0) = 1, f(1) = 2, f(2) = 3, f(3) = 4\), and \(f(t) = 0\) for all other values of \(t\)
    • \(g(0) = 1, g(1) = 1, g(2) = 1\), and \(g(t) = 0\) for all other values of \(t\).
  • The convolution of \(f\) and \(g\) is: \[\begin{align*} (f * g)(t) & = \sum_{\tau=-\infty}^{\infty} f(\tau)g(t-\tau) = f(0)g(t) + f(1)g(t-1) + f(2)g(t-2) + f(3)g(t-3). \end{align*}\]
  • Therefore \[\begin{align*} (f * g)(0) & = f(0)g(0) + f(1)g(-1) + f(2)g(-2) + f(3)g(-3) = 1 \times 1 + 2 \times 0 + 3 \times 0 + 4 \times 0 = 1\\ (f * g)(1) & = f(0)g(1) + f(1)g(0) + f(2)g(-1) + f(3)g(-2) = 1 \times 1 + 2 \times 1 + 3 \times 0 + 4 \times 0 = 3\\ (f * g)(2) & = f(0)g(2) + f(1)g(1) + f(2)g(0) + f(3)g(-1) = 1 \times 1 + 2 \times 1 + 3 \times 1 + 4 \times 0 = 6 \end{align*}\]
  • That is, \((f * g)(t) = [1, 3, 6, 9, 7, 4]\).
  • If we convolve again with \(g\), we get \(((f * g) * g)(t) = [1, 4, 10, 18, 22, 20, 11, 4]\).
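This computation can be checked with NumPy (a quick sketch; np.convolve with the default mode='full' implements the discrete convolution above):

import numpy as np

f = np.array([1, 2, 3, 4])
g = np.array([1, 1, 1])

print(np.convolve(f, g))                    # [1 3 6 9 7 4]
print(np.convolve(np.convolve(f, g), g))    # [ 1  4 10 18 22 20 11  4]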

Gaussian Filter

  • A Gaussian filter is the convolution with \(g(t) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{t^2}{2\sigma^2}}\).
  • Roughly speaking, the Gaussian filter smooths the function \(f\) by computing a weighted average of the values of \(f\) in a neighborhood of \(t\).
  • The parameter \(\sigma\) controls the width of the Gaussian filter and hence the smoothness of the output.
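A minimal 1-D sketch using scipy.ndimage.gaussian_filter1d (the noisy signal here is made up for illustration):

import numpy as np
from scipy.ndimage import gaussian_filter1d

t = np.linspace(0, 4 * np.pi, 200)
f = np.sin(t) + 0.3 * np.random.randn(200)   # noisy signal

# Larger sigma -> wider Gaussian kernel -> smoother output
smooth_narrow = gaussian_filter1d(f, sigma=2)
smooth_wide = gaussian_filter1d(f, sigma=10)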

Cross-Correlation and Auto-Correlation

  • The cross-correlation of \(f\) and \(g\) is similar to convolution. It is defined as: \[ (f \star g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t+\tau)d\tau. \]
  • It is used to measure the similarity between two signals.
  • The autocorrelation of a function \(f\) is the cross-correlation of \(f\) with itself: \[ (f \star f)(t) = \int_{-\infty}^{\infty} f(\tau)f(t+\tau)d\tau. \]
  • The autocorrelation of a function is a measure of how similar the function is to a shifted version of itself.
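A small NumPy sketch contrasting convolution with cross-correlation (np.correlate slides the kernel without flipping it), and showing that the autocorrelation peaks at zero lag:

import numpy as np

f = np.array([1, 2, 3, 4], dtype=float)
g = np.array([0, 1, 2], dtype=float)

print(np.convolve(f, g))                  # convolution: g is flipped before sliding
print(np.correlate(f, g, mode='full'))    # cross-correlation: no flip, different result

# Autocorrelation of f: largest value at zero lag (perfect overlap with itself)
print(np.correlate(f, f, mode='full'))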

Applications of convolution

  • In signal processing, convolution is used to filter signals, extract features, and analyze time or frequency characteristics.
  • Examples include:
    • Noise Reduction: Removes unwanted components from signals.
    • Image Sharpening: Enhances edges and detail.
    • Feature Detection: Identifies patterns like edges in images or specific frequencies in audio.
  • In statistics, convolution is used to model probability distributions and smooth data.
  • Examples:
    • The density of the sum of two independent random variables is the convolution of the two density functions.
    • Kernel Density Estimation: Smooths data to estimate probability densities.
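As a quick illustration of the first statistical example (a sketch): the distribution of the sum of two independent dice is the convolution of their probability mass functions.

import numpy as np

die = np.ones(6) / 6              # P(X = 1), ..., P(X = 6) for a fair die
pmf_sum = np.convolve(die, die)   # P(X1 + X2 = 2), ..., P(X1 + X2 = 12)
print(pmf_sum)                    # triangular shape with peak 6/36 at a sum of 7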

Multi-dimensional Convolution

  • In higher dimensions, the convolution of two functions \(f: \mathbb{R}^d \to \mathbb{R}\) and \(g: \mathbb{R}^d \to \mathbb{R}\) is defined as: \[ (f * g)(\boldsymbol{t}) = \int_{\mathbb{R}^d} f(\boldsymbol{\tau})g(\boldsymbol{t}-\boldsymbol{\tau})d\boldsymbol{\tau}. \]
  • For discrete observations (e.g., images), the convolution is defined as: \[ (f * g)(\boldsymbol{t}) = \sum_{\boldsymbol{\tau} \in \mathbb{Z}^d} f(\boldsymbol{\tau})g(\boldsymbol{t}-\boldsymbol{\tau}). \]
  • For \(d = 2\), \[ (f * g)(t_1, t_2) = \sum_{\tau_1=-\infty}^{\infty}\sum_{\tau_2=-\infty}^{\infty} f(\tau_1, \tau_2)g(t_1-\tau_1, t_2-\tau_2). \]
  • The cross-correlation and autocorrelation are defined similarly.
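A minimal 2-D sketch of the discrete formula (using scipy.signal.convolve2d on a made-up \(4 \times 4\) "image"):

import numpy as np
from scipy.signal import convolve2d

f = np.arange(16, dtype=float).reshape(4, 4)   # a tiny "image"
g = np.ones((3, 3)) / 9.0                      # 3x3 averaging kernel

# mode='valid' keeps only positions where the kernel fully overlaps the image
out = convolve2d(f, g, mode='valid')
print(out.shape)    # (2, 2)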

2-D Convolution

  • For images, the input \(I\) is a 2D grid of pixel values.
  • The convolutional kernel \(K\) is a 2D grid of weights.
  • The kernel size is typically \(3 \times 3\) or \(5\times 5\).
    • Smaller kernels are computationally efficient and can capture fine details.

Image size after convolution

  • Suppose the size of the input image is \(H_{\text{in}} \times W_{\text{in}}\) and the size of the kernel is \(k_h \times k_w\).
  • The stride (i.e., the step size) is \((s_h, s_w)\).
  • The size of the output image is: \[ H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} - k_h}{s_h} + 1 \right\rfloor, \quad W_{\text{out}} = \left\lfloor \frac{W_{\text{in}} - k_w}{s_w} + 1 \right\rfloor. \]
  • In the previous example, the input image is \(7 \times 7\) and the kernel size is \(3 \times 3\) with stride \((1, 1)\). Therefore the output image is \(5 \times 5\).
  • The image size can be preserved by using padding, i.e., adding zeros around the input image.
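The formula can be checked with torch.nn.Conv2d (a sketch, assuming PyTorch, which is used later in these notes):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)   # batch of one single-channel 7x7 image

# 3x3 kernel, stride (1, 1), no padding: (7 - 3)/1 + 1 = 5
conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)
print(conv(x).shape)          # torch.Size([1, 1, 5, 5])

# padding=1 preserves the 7x7 size
conv_same = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)
print(conv_same(x).shape)     # torch.Size([1, 1, 7, 7])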

2-D Gaussian Filter

Apply a 2D Gaussian filter to an image to smooth it. In two dimensions, the filter is the convolution with \(g(x, y) = \frac{1}{2\pi\sigma^2}e^{-\frac{x^2 + y^2}{2\sigma^2}}\):

from scipy import datasets
from scipy.ndimage import gaussian_filter

orig = datasets.ascent()

# Apply Gaussian filters
result1 = gaussian_filter(orig, sigma=3)
result2 = gaussian_filter(orig, sigma=10)

Filter for edge detection

  • We can also use convolution for edge detection: \(k_h\) is for detecting horizontal edges and \(k_v\) is for detecting vertical edges \[ k_{h} = \begin{bmatrix} -1 \\ 0\\ 1 \end{bmatrix}, \quad k_{v} = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix} \]
  • In fact, these filters compute the approximate gradient of the image in the horizontal and vertical directions: \[\begin{align*} \frac{\partial}{\partial x} I(x, y) &\approx \frac{1}{2}\left[I(x+1, y) - I(x-1, y)\right] = \frac{1}{2}[-1, 0, 1] \cdot [I(x-1, y), I(x, y), I(x+1, y)] \\ \frac{\partial}{\partial y} I(x, y) &\approx \frac{1}{2}\left[I(x, y+1) - I(x, y-1)\right] = \frac{1}{2}\begin{bmatrix} -1 \\ 0\\ 1 \end{bmatrix} \cdot \begin{bmatrix} I(x, y-1) \\ I(x, y) \\ I(x, y+1) \end{bmatrix}. \end{align*}\]
from scipy import ndimage, datasets
import numpy as np

# Define the horizontal and vertical edge detection filters
k_v = np.array([[-1, 0, 1]])
k_h = np.array([[-1], [0], [1]])

orig = datasets.ascent().astype('int32')

# Apply the filters to the image
edge_h = ndimage.convolve(orig, k_h)
edge_v = ndimage.convolve(orig, k_v)

# Compute the magnitude of the gradient
magnitude = np.sqrt(edge_h**2 + edge_v**2)
magnitude *= 255.0 / np.max(magnitude)

Sobel Filter

  • The Sobel filter is a combination of gradient filters and smoothing filters, \[\begin{align*} k_{v} & = \begin{bmatrix} 1 \\ 2\\ 1 \end{bmatrix} \begin{bmatrix} -1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}\\ k_h & = \begin{bmatrix} -1 \\ 0\\ 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 1 \end{bmatrix} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}\\ \end{align*}\]
  • That is, to find vertical edges we combine the horizontal gradient filter with a vertical smoothing filter, and vice versa.
from scipy import ndimage, datasets
import numpy as np
ascent = datasets.ascent().astype('int32')
sobel_h = ndimage.sobel(ascent, 0)  # horizontal gradient
sobel_v = ndimage.sobel(ascent, 1)  # vertical gradient
magnitude = np.sqrt(sobel_h**2 + sobel_v**2)
magnitude *= 255.0 / np.max(magnitude)  # normalization

High-pass filter and low-pass filter

  • We can roughly classify filters into two categories:
    • High-pass filters: remove low-frequency components and preserve high-frequency components.
    • Low-pass filters: remove high-frequency components and preserve low-frequency components.
  • High-pass filters keep fine details and sharpen images, e.g., the Sobel filter.
  • Low-pass filters smooth images and remove noise, e.g., the Gaussian filter.
  • Filter size (kernel size) also affects the filter’s behavior:
    • Small filters capture fine details but are sensitive to noise.
    • Large filters smooth images but may blur fine details.

Quick Summary

  • In image processing, convolution is the operation of applying a filter to an image.
  • Filters are used to:
    • Blur: Smooth images by averaging pixel values.
    • Sharpen: Enhance edges and detail.
    • Edge Detection: Identify edges in images.
    • Noise Reduction: Remove unwanted noise from images.
    • Feature Extraction: Detect specific patterns or features in images.
  • Therefore, we can choose different filters to achieve different effects.

Motivation of convolution

There are several reasons to use convolution operations in neural networks:

  • Ease of Interpretation: convolution has long been used in signal processing and image processing.
  • Taking account of spatial structure: Convolutional networks are designed to take advantage of the spatial structure of images.
  • Parameter Sharing: A feature detector (e.g., edge detector) that’s useful in one part of the image is likely useful in another part.
  • Sparsity of Connections: In each layer, each output value depends only on a small number of inputs.
  • Translation Invariance: A feature learned in one part of the image can be applied to other parts.
  • Efficient Computation: Convolution can be implemented efficiently using matrix multiplication.

Convolution layer

In PyTorch, the convolutional layer is defined as torch.nn.Conv2d and it requires the following parameters:

  • in_channels: number of input channels (e.g., 3 for RGB images).
  • out_channels: number of output channels (i.e., number of filters).
  • kernel_size: size of the convolutional kernel (e.g., 3 for a \(3 \times 3\) kernel).
  • stride: step size for the kernel.
  • padding: number of zeros to add around the input image.
  • bias: whether to include a bias term.
  • dilation: spacing between kernel elements (\(\text{default} = 1\) means no dilation).

import torch
import torch.nn as nn
m = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=3, stride=1, padding=1)

print("Size of the weight parameter", m.weight.shape)
print("Size of the bias parameter", m.bias.shape)
Size of the weight parameter torch.Size([5, 3, 3, 3])
Size of the bias parameter torch.Size([5])

The number of parameters of a convolutional layer (with a square kernel and a bias term) is: \[ \text{\# parameters} = \text{in\_channels} \times \text{out\_channels} \times \text{kernel\_size}^2 + \text{out\_channels}. \]
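This can be verified by counting the parameters of the layer m defined above: \(3 \times 5 \times 3^2 + 5 = 140\).

n_params = sum(p.numel() for p in m.parameters())
print(n_params)   # 140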

Pooling

  • The pooling layer is another important layer in CNNs.
  • It has several benefits:
    • Downsampling: Reduces the spatial dimensions of the feature map.
    • Reduces Overfitting: By summarizing the features, it reduces the number of parameters.
    • Translation Invariance: Makes the network more robust to small translations in the input.
  • The most common pooling operations are max pooling and average pooling.

Max Pooling and Average Pooling

  • Max pooling takes the maximum value in a patch.
m = nn.MaxPool2d(kernel_size = 2, stride = 1)
input = torch.randn(1, 4, 4)
print(input)
output = m(input)
print(output)
tensor([[[-1.1195, -2.3749, -0.4088,  0.7293],
         [ 0.2177,  2.2577,  1.3515, -2.0341],
         [ 0.2819,  0.7313,  0.3734, -0.6688],
         [-0.3654,  0.5156, -2.4215,  0.4078]]])
tensor([[[2.2577, 2.2577, 1.3515],
         [2.2577, 2.2577, 1.3515],
         [0.7313, 0.7313, 0.4078]]])
  • Similarly, average pooling takes the average value in a patch.
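For example (a sketch reusing the \(4 \times 4\) input above):

m = nn.AvgPool2d(kernel_size=2, stride=1)
output = m(input)        # each entry is the mean of a 2x2 patch
print(output.shape)      # torch.Size([1, 3, 3])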

Adaptive Pooling

In PyTorch, adaptive pooling (AdaptiveAvgPool or AdaptiveMaxPool) allows you to specify the output size instead of the kernel size.

  • AdaptiveAvgPool1d
m = nn.AdaptiveAvgPool1d(output_size=8)
x = torch.tensor([[1,2,3]], dtype = torch.float32)
m(x)
tensor([[1.0000, 1.0000, 1.5000, 2.0000, 2.0000, 2.5000, 3.0000, 3.0000]])
  • AdaptiveMaxPool1d
m = nn.AdaptiveMaxPool1d(output_size=10)
x = torch.tensor([[1,2,3]], dtype = torch.float32)
m(x)
tensor([[1., 1., 1., 2., 2., 2., 3., 3., 3., 3.]])
  • This layer prevents us from having to hard-code the feature map sizes.

How does it work? (for 1-D case)

  • The output size for a convolution/pooling layer is: \[ L_{\text{out}} = \left\lfloor \frac{L_{\text{in}} + 2\times\text{padding} - \text{dilation} \times (\text{kernel size} - 1) - 1}{\text{stride}} + 1\right\rfloor. \]

  • If \(L_{\text{out}} \leq L_{\text{in}}\), define

    • \(\text{padding} = 0\), \(\text{dilation} = 1\)
    • \(\text{stride} = \left\lfloor \frac{L_{\text{in}} - L_{\text{out}}}{L_{\text{out}}} + 1\right\rfloor\)
    • \(\text{kernel size} = L_{\text{in}} - (L_{\text{out}} - 1)\times\text{stride}\)
  • If \(L_{\text{out}} > L_{\text{in}}\), upsample the input by the factor \(\left\lceil \frac{L_{\text{out}}}{L_{\text{in}}}\right\rceil\), e.g., if \(L_{\text{in}} = 3\) and \(L_{\text{out}} = 5\), upsample the input by a factor of \(2\): \[ [1,2,3] \Rightarrow [1,1,2,2,3,3] \] and then apply the previous case.
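As a sanity check of these formulas (a sketch): when \(L_{\text{out}}\) divides \(L_{\text{in}}\), the derived kernel size and stride reproduce the adaptive pooling output exactly.

import torch
import torch.nn as nn

x = torch.arange(8, dtype=torch.float32).reshape(1, 8)   # L_in = 8

# For L_out = 4: stride = (8 - 4) // 4 + 1 = 2, kernel size = 8 - (4 - 1) * 2 = 2
adaptive = nn.AdaptiveMaxPool1d(output_size=4)
fixed = nn.MaxPool1d(kernel_size=2, stride=2)

print(adaptive(x))   # tensor([[1., 3., 5., 7.]])
print(fixed(x))      # tensor([[1., 3., 5., 7.]])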

CNN Architecture

A typical CNN architecture consists of two parts:

  • Feature Extraction: Consists of convolutional layers and activation functions.
    • Convolutional Layers: Extract features from the input image.
    • Activation Function: Introduce non-linearity into the model.
    • Pooling Layers: Downsample the feature maps.
  • Classification (or regression): Consists of fully connected layers.
    • Fully Connected Layers: Perform classification based on the features.
    • Activation Function
    • Dropout: Regularize the model to prevent overfitting.
    • Output Layer: softmax layer for classification or linear layer for regression.

Example

class MyConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.feature = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool2d(output_size=(4, 4)),
            nn.Flatten()
        )
        self.classifier = nn.Sequential(
            nn.Linear(32 * 4 * 4, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )   
    def forward(self, x):
        x = self.feature(x)
        x = self.classifier(x)
        return x
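
A quick usage check of this model (a sketch; the \(32 \times 32\) input size is arbitrary since the adaptive pooling layer handles any input size):

model = MyConvNet()
x = torch.randn(4, 3, 32, 32)    # a dummy batch of four RGB images
print(model(x).shape)            # torch.Size([4, 10])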

AlexNet

  • AlexNet is a popular CNN architecture that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012.
  • It consists of 5 convolutional layers and 3 fully connected layers.
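The structure below is printed from the torchvision implementation (assuming a recent torchvision; older versions use the pretrained argument instead of weights):

from torchvision import models

alexnet = models.alexnet(weights=None)   # use weights="IMAGENET1K_V1" for pretrained weights
print(alexnet)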

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
    (2): ReLU(inplace=True)
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=4096, out_features=4096, bias=True)
    (5): ReLU(inplace=True)
    (6): Linear(in_features=4096, out_features=1000, bias=True)
  )
)

Applications

Computer Vision

  • Computer Vision is a field of computer science that aims to enable computers to see, identify, and process images in a way similar to human vision, and then produce appropriate output.
  • Typical tasks in computer vision include:
    • Image/video Classification
    • Object Detection
    • Segmentation
    • Reconstruction
    • Pose estimation, etc.

Classical CV vs. DL

  • Classical CV requires hand-crafted features:

flowchart LR
  A(Image) --->|Feature Extraction| B(Feature Vector) --->|Classifier| C(Output)

  • For example, we can use the Scale-invariant feature transform (SIFT) algorithm to extract features from an image and use a classifier (e.g., SVM or random forest) to classify the image.
  • Usually, the feature extraction step depends on the problem.
  • In contrast, DL solves problems in an end-to-end manner, that is, it uses one model to simultaneously learn the feature extraction and classification steps.

flowchart LR
  A(Image) --->|DL Models| B(Output)

  • Of course, the model architecture still depends on the problem, and compared to classical CV, DL approaches typically require more data.

ImageNet Database

  • ImageNet is a large-scale hierarchical image database containing over 14 million images hand-annotated with 20,000 categories.
  • The dataset is used in ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

ImageNet Database

  • It was developed to support visual recognition tasks in computer vision.
    • Model Development: Most models in torchvision are pretrained on ImageNet.
    • Transfer Learning: Use the pretrained models to solve other tasks, e.g., object detection, segmentation, etc.

Other public datasets

  • Image classification:
    • CIFAR-10 and CIFAR-100: Small-scale datasets for image classification with 10 or 100 categories; popular for educational purposes.
    • MNIST: Handwritten digits (0-9) dataset; ideal for simple classification and neural network testing.
  • Object detection/segmentation:
    • COCO (Common Objects in Context): Over 330,000 images with 80 object categories; supports object detection, segmentation, and keypoint detection.
    • Open Images v7: Large dataset with 9 million images and bounding boxes for 600 categories; supports object detection and visual relationship detection.

Other public datasets

  • For autonomous driving:
    • Cityscapes: Urban scene dataset with high-quality pixel annotations for 30 classes, used for autonomous driving.
    • CamVid: Road scene dataset with pixel-level annotations, commonly used in autonomous vehicle research.
    • KITTI: Dataset for autonomous driving with images, lidar, and GPS data; supports object detection, tracking, and stereo vision.

Object Localization

  • The goal is to identify and classify objects in an image by drawing bounding boxes around them.
  • Define bounding box with four values:
    • Top-left corner coordinates \((x, y)\)
    • Box width and height \((w, h)\)
  • It can be treated as a regression task:
    • Input: Image
    • Output: Bounding box coordinates \((x, y, w, h)\).
  • If we have multiple objects in an image, we can predict multiple bounding boxes with class labels.
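One way to set this up (a hypothetical sketch; the layer sizes and names are made up for illustration) is a shared convolutional feature extractor with a classification head and a 4-dimensional regression head for the bounding box:

import torch.nn as nn

class LocalizationNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.feature = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten()
        )
        self.class_head = nn.Linear(16 * 4 * 4, num_classes)   # class scores
        self.box_head = nn.Linear(16 * 4 * 4, 4)                # (x, y, w, h)

    def forward(self, x):
        feat = self.feature(x)
        return self.class_head(feat), self.box_head(feat)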

Region-based CNNs

  • The region-based CNN (R-CNN) family of models is used for object detection and localization.
  • The key idea is to use a region proposal algorithm to generate candidate regions in an image and then use a CNN to classify the regions.
  • The Faster R-CNN model improves the speed of R-CNN by sharing computation between the region proposal and classification.
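A pretrained Faster R-CNN detector is available in torchvision (a sketch assuming a recent torchvision; the pretrained weights are downloaded on first use):

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")   # pretrained on COCO
model.eval()

with torch.no_grad():
    preds = model([torch.rand(3, 480, 640)])   # input is a list of images
print(preds[0].keys())   # dict_keys(['boxes', 'labels', 'scores'])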

Image Segmentation

  • Image segmentation is the process of partitioning an image into multiple segments (sets of pixels) to simplify the representation of an image.
  • It can be achieved using both unsupervised and supervised methods.
  • Unsupervised Segmentation: cluster pixels based on color, intensity, or texture.
  • Supervised Segmentation: use labeled data to train a model to segment images.
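For supervised segmentation, torchvision also provides pretrained models (a sketch using DeepLabV3, not U-Net, assuming a recent torchvision):

import torch
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(weights="DEFAULT")   # pretrained weights
model.eval()

with torch.no_grad():
    out = model(torch.rand(1, 3, 256, 256))["out"]   # per-pixel class scores
print(out.shape)   # torch.Size([1, 21, 256, 256]) -- 21 classes (Pascal VOC labels)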

U-Net

  • U-Net is an encoder-decoder CNN with skip connections between the downsampling and upsampling paths; it is widely used for image segmentation, particularly in biomedical imaging.

Model zoos

  • You can find a huge collection of model architectures and pretrained models, for example,
    • torchvision: PyTorch’s official model zoo with pretrained models for image classification, object detection, segmentation, etc.
    • timm: PyTorch Image Models (timm) library with various model architectures and pretrained models.
  • Network backbone vs. head:
    • Backbone: Convolutional layers for feature extraction.
    • Head: Final few layers for classification, detection, or segmentation.
  • Strategies:
    • use a pretrained backbone and add your custom head for specific tasks
    • fine-tune the entire model (backbone and head)
  • Depending on the size of your new dataset, you can choose the strategy that best fits your needs.
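A minimal sketch of the first strategy (the choice of ResNet-18 and the 5-class task are arbitrary): keep a pretrained backbone from torchvision, freeze it, and replace the head.

import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="DEFAULT")      # pretrained backbone + head
for p in model.parameters():                    # freeze all layers
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 5)   # new head for a 5-class task

# For the second strategy, leave requires_grad=True (or unfreeze later) and
# fine-tune the whole model, typically with a smaller learning rate.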