Introduction to PyTorch

Chun-Hao Yang

Outline

  • Basic concepts of PyTorch
    • Tensor
    • Autograd
  • Main loop for training models
    • Loading Data and data preparation
    • Building a Neural Network
    • Training and Validation Loop
  • Tensorboard Visualization
    • Tracking the training process
    • Visualizing the network architecture

PyTorch Basics

Tensor

  • Tensors are multi-dimensional arrays.
    • 0-dim array is called scalar.
    • 1-dim array is called vector.
    • 2-dim array is called matrix.
  • In PyTorch, everything is a tensor: data, parameters, gradients, etc. A tensor can be created from a Python list or sequence with the torch.tensor() function.
import torch

a = torch.tensor(1) # scalar (0-dim tensor)
b = torch.tensor([1, 2, 3]) # vector
c = torch.tensor([[1, 2], [3, 4]]) # matrix
d = torch.tensor([[[1, 2], [3, 4]], 
                  [[5, 6], [7, 8]]]) # 3D tensor
  • The shape attribute is used to get the shape of a tensor.
print(d.shape)
torch.Size([2, 2, 2])

Tensor vs. Numpy array

  • PyTorch tensors are very similar to Numpy arrays.
  • The two main differences are
    • PyTorch tensors can run on GPUs
    • PyTorch tensors are better integrated with PyTorch’s autograd.
  • The requires_grad attribute is used to track the gradient of a tensor.
    • If requires_grad=True, the gradient of the tensor will be computed during backpropagation.
  • We can easily convert a PyTorch tensor to a Numpy array and vice versa (see the example below) using
    • Tensor.numpy()
    • torch.from_numpy()
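
For example, a minimal illustration of the conversion (the array values here are arbitrary, and the last line assumes a CUDA GPU is available):

import numpy as np

arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr)   # Numpy array -> tensor (shares the same memory)
arr_back = t.numpy()        # tensor -> Numpy array (shares the same memory)

if torch.cuda.is_available():
    t_gpu = t.to("cuda")    # move the tensor to the GPU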

Autograd

  • When creating tensors with requires_grad=True, it signals to autograd that every operation on them should be tracked.
  • We call the function .backward() on the final tensor to compute the gradient.
x = torch.tensor([3.0], requires_grad=True)
y = x**2
y.backward()
print(x.grad.item()) # dy/dx = 2x and when x = 3, dy/dx = 6
6.0
  • It also works for general multi-dimensional tensors and matrix operations.
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = torch.trace(torch.matmul(x.t(), x))
y.backward()
print(x.grad) 
tensor([[2., 4.],
        [6., 8.]])

\[ y = \operatorname{tr}(X^TX) \Rightarrow \frac{\partial y}{\partial X} = 2X \]

Object-oriented programming (OOP) in Python

  • A class is a prototype of objects.
class person:
    def __init__(self, name, age): # instance constructor
        self.name = name # attribute
        self.age = age
    def get_name(self): # method
        return self.name

john = person("John", 20) # john is an instance of the class 'person'
  • A class contains attributes (name and age) and methods (get_name).
  • We can define a subclass that inherits from a parent class.
class student(person):
    def __init__(self, name, age, major): 
        super().__init__(name, age) # call the parent class constructor
        self.major = major
    def get_major(self):
        return self.major

john = student("John", 20, "Stat")
print(f"{john.get_name()} is majoring in {john.get_major()}.")
John is majoring in Stat.

Data Preparation

Data Preparation

  • In the data preparation stage, we need to do the following:

    • Split the data into training, validation, and test sets
    • Split the data into mini-batches
  • A dataset is stored in a Dataset class. Inside the dataset class, we can

    • download/load the data
    • preprocess the data (standardization, transformation, etc.)
  • To feed the dataset to our model, we need the DataLoader class, which

    • shuffles the data indices
    • splits the data into mini-batches

Dataset

from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, x, y):
        self.x = torch.tensor(x).float()
        self.y = torch.tensor(y).float()
        
    def __len__(self):
        return len(self.x)
    
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

You need to implement three methods in the Dataset class:

  • __init__: Read data and preprocess
  • __getitem__: Return one sample at a time.
  • __len__: Return the size of the dataset.

Example

import numpy as np

rng = np.random.default_rng(20241029)
n = 30
p = 2

# Generate x_1, x_2, ..., x_30 from a normal distribution
x = rng.normal(loc=0.0, scale=1.0, size=(n, p))

# Generate y from the linear model y = 2 - x_1 + 3*x_2
beta = np.array([-1, 3])
y = 2 + x.dot(beta).reshape(-1,1) + rng.normal(loc=0.0, scale=0.1, size=(n, 1))

training_data = MyDataset(x, y)
print("The number fo samples is", training_data.__len__())
print("The first sample is (x_1, x_2, y) =", training_data.__getitem__(0))
The number of samples is 30
The first sample is (x_1, x_2, y) = (tensor([-0.7578,  1.2519]), tensor([6.6215]))

DataLoader

  • The DataLoader class will:
    • shuffle the data indices (if shuffle=True)
    • split the data into mini-batches (if batch_size is specified)
from torch.utils.data import DataLoader

train_loader = DataLoader(training_data, batch_size=10, shuffle=True)
  • The DataLoader is an iterable. After we iterate over all batches, the data will be shuffled again.
x_batch, y_batch = next(iter(train_loader))
print(f"Batch size is {len(x_batch)}")
print(f"Batch mean of x is {x_batch.mean(axis = 0)}")
print(f"Batch mean of y is {y_batch.mean(axis = 0)}")
Batch size is 10
Batch mean of x is tensor([-0.1499, -0.1101])
Batch mean of y is tensor([1.7837])

Model building and training

Layers

  • A neural network model is built by stacking layers.
  • PyTorch provides many predefined layers that can be used to build a neural network.
  • For example, torch.nn.Linear is a fully connected layer.
lin = torch.nn.Linear(2, 3) # 2 input features and 3 output features
  • The parameters of the layer can be accessed using the state_dict() method or by accessing the attributes directly.
print(lin.state_dict())
OrderedDict([('weight', tensor([[-0.5396,  0.4724],
        [ 0.4363,  0.3894],
        [-0.6281, -0.3871]])), ('bias', tensor([0.2805, 0.6208, 0.5337]))])
print(lin.weight)
Parameter containing:
tensor([[-0.5396,  0.4724],
        [ 0.4363,  0.3894],
        [-0.6281, -0.3871]], requires_grad=True)

Available Layers

See https://pytorch.org/docs/stable/nn.html
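
For illustration, a few commonly used layer constructors (the parameter values here are arbitrary):

conv = torch.nn.Conv2d(3, 16, kernel_size=3)  # 2D convolution: 3 input channels, 16 output channels
pool = torch.nn.MaxPool2d(2)                  # 2x2 max pooling
drop = torch.nn.Dropout(p=0.5)                # dropout with probability 0.5
emb = torch.nn.Embedding(1000, 64)            # lookup table: 1000 tokens, 64-dim embeddings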

Building a Neural Network

  • The torch.nn.Module class is the base class for all neural network modules.
  • The torch.nn.Sequential class is a subclass of torch.nn.Module that is used to sequentially stack layers.
  • The following example defines a simple neural network with one hidden layer and a ReLU activation function.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(input_dim, hidden_dim),
    nn.ReLU(), # activation function
    nn.Linear(hidden_dim, output_dim)
)
  • However, not all neural networks can be defined using torch.nn.Sequential, for example, ResNet, recurrent networks, etc.
  • We can also define a custom neural network by subclassing torch.nn.Module.

nn.Module

A basic nn.Module subclass is as follows:

class model(nn.Module):
    def __init__(self):
        super().__init__()
        # Define the layers
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Define the forward pass, i.e., how to compute the output from the input
        x = self.linear1(x)
        x = self.relu(x)
        y = self.linear2(x)
        return y
  • This model is exactly the same as the previous one.
  • In the __init__ method, we define the layers that will be used by the model.
  • In the forward method, we define how the output \(y\) is obtained from the input \(x\).

Example: Residual Layer

  • Recall that a residual layer is defined as \(y = x + f(x)\).
  • The function \(f(x)\) can be built from any number and type of layers, activation functions, etc.
  • So you won’t find a predefined residual layer in PyTorch.

The following code defines a residual block with three fully connected layers and ReLU activations.

import torch.nn as nn

class MyResBlock(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 2*input_dim),
            nn.ReLU(),
            nn.Linear(2*input_dim, 2*input_dim),
            nn.ReLU(),
            nn.Linear(2*input_dim, input_dim)
        )
        
    def forward(self, x):
        y = x + self.model(x)
        return y

We can use MyResBlock to build a deep neural network.

input_dim = 2
output_dim = 1

model = nn.Sequential(
    MyResBlock(input_dim),
    nn.ReLU(),
    MyResBlock(input_dim),
    nn.ReLU(),
    nn.Linear(input_dim, output_dim)
)

Loss Function

  • After defining the model and the data, we need to define the loss function.
  • There are many loss functions available in PyTorch:
    • torch.nn.MSELoss: Mean Squared Error
    • torch.nn.CrossEntropyLoss: Cross Entropy
    • torch.nn.L1Loss: L1 Loss
    • torch.nn.PoissonNLLLoss: Poisson Negative Log Likelihood
  • See https://pytorch.org/docs/stable/nn.html#loss-functions
  • You can also define your own loss function. For example,
def my_MSE(output, target):
    loss = torch.mean((output - target)**2)
    return loss
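
As a quick sanity check (with made-up tensors), a custom loss is called exactly like a built-in one:

output = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.5, 2.0, 2.0])

print(my_MSE(output, target))              # tensor(0.4167)
print(torch.nn.MSELoss()(output, target))  # same value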

Optimizer

  • The optimizer is used to update the parameters of the model.
  • There are many optimizers available in PyTorch:
    • torch.optim.SGD: Stochastic Gradient Descent
    • torch.optim.Adam: Adam
  • See https://pytorch.org/docs/stable/optim.html
  • There are some important methods in the optimizer:
    • zero_grad(): Clear the gradient stored in the optimizer
    • step(): Update the parameters
optim = torch.optim.SGD(model.parameters(), lr=0.05)
# Compute the gradient 
...
# Update the parameters
optim.step()
# Clear the gradient
optim.zero_grad()

Training Loop

Putting these components together, a standard training loop looks like this:

def training_loop(dataloader, model, loss_fn, optimizer, n_epochs):
    for epoch in range(n_epochs):
        for x_batch, y_batch in dataloader:

            # Compute prediction and loss
            y_pred = model(x_batch)
            loss = loss_fn(y_pred, y_batch)

            # Backpropagation
            loss.backward() # compute the gradient
            optimizer.step() # update the parameters
            optimizer.zero_grad() # clear the gradient stored in the optimizer

        # print the training progress
        if epoch % 10 == 0:
            print(f"Epoch {epoch+1}, Loss = {loss.item():.3f}") 

Example

We now have all the components (data, model, loss, and optimizer) to train the model.

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

training_loop(train_loader, model, loss_fn, optimizer, 100)
Epoch 1, Loss = 13.321
Epoch 11, Loss = 1.536
Epoch 21, Loss = 0.365
Epoch 31, Loss = 0.067
Epoch 41, Loss = 0.045
Epoch 51, Loss = 0.025
Epoch 61, Loss = 0.853
Epoch 71, Loss = 0.028
Epoch 81, Loss = 0.066
Epoch 91, Loss = 0.050

Quick Summary

  • With PyTorch, we train a model using the following steps:
    • Define the model using torch.nn.Module
    • Define the loss function (choose from torch.nn or define your own)
    • Define the optimizer (choose from torch.optim)
    • Write a training loop
  • We can also write a validation loop to evaluate the model on the validation set.
  • The validation loop is similar to the training loop, but without gradient computation or parameter updates; a minimal sketch follows.
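
A minimal sketch of such a loop (the names validation_loop and the validation dataloader are illustrative; they are not defined earlier in these slides):

def validation_loop(dataloader, model, loss_fn):
    model.eval()  # switch layers such as dropout/batch norm to evaluation mode
    total_loss = 0.0
    with torch.no_grad():  # disable gradient tracking during evaluation
        for x_batch, y_batch in dataloader:
            y_pred = model(x_batch)
            total_loss += loss_fn(y_pred, y_batch).item()
    model.train()  # switch back to training mode
    return total_loss / len(dataloader)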

Monitor the training process

TensorBoard

TensorBoard is a visualization tool provided by TensorFlow, but it can also be used with PyTorch.

Tracking the training/validation loss

  • Use the torch.utils.tensorboard.SummaryWriter class to log the training process.
  • The add_scalar method is used to log the scalar value.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/example')

for epoch in range(100):
    running_loss = 0.0
    for x_batch, y_batch in train_loader:
        # Compute prediction and loss
        y_pred = model(x_batch)
        loss = loss_fn(y_pred, y_batch)
        # Backpropagation
        loss.backward() # compute the gradient
        optimizer.step() # update the parameters
        optimizer.zero_grad() # clear the gradient stored in the optimizer

        running_loss += loss.item()
    
    avg_loss = running_loss / len(train_loader)
    writer.add_scalar('Loss/Train', avg_loss, epoch + 1)
    writer.flush()

View the network architecture

Use add_graph to visualize the network architecture.

dataiter = iter(train_loader)
x, y = next(dataiter)

writer.add_graph(model, x)
writer.flush()

Using Existing Models

Save and Load Models

  • The model parameters are stored in the state_dict, which can be retrieved with the state_dict() method.
model.state_dict()
OrderedDict([('0.model.0.weight',
              tensor([[-0.4887, -0.8192],
                      [ 0.6946,  0.5337],
                      [ 0.1736,  0.0551],
                      [ 0.4247, -1.0334]])),
             ('0.model.0.bias', tensor([-0.3327,  0.5050, -0.6019, -0.0285])),
             ('0.model.2.weight',
              tensor([[ 0.6055, -0.6635,  0.4054,  0.5843],
                      [ 0.2425, -0.1477, -0.2659,  0.7719],
                      [ 0.1666, -0.4641, -0.0790, -0.1267],
                      [-0.4862, -0.1035, -0.2386, -0.1672]])),
             ('0.model.2.bias', tensor([ 0.0108, -0.1911, -0.0543, -0.2758])),
             ('0.model.4.weight',
              tensor([[ 0.9599,  0.8796, -0.0876,  0.0927],
                      [ 0.2789,  0.1629,  0.4322,  0.0858]])),
             ('0.model.4.bias', tensor([0.7170, 1.1826])),
             ('2.model.0.weight',
              tensor([[ 0.5355, -0.1691],
                      [-0.0610,  0.5382],
                      [-0.7051, -0.3227],
                      [ 0.4620, -0.4787]])),
             ('2.model.0.bias', tensor([ 0.0993,  0.0737, -0.2287, -0.5889])),
             ('2.model.2.weight',
              tensor([[-0.5853,  0.3475,  0.2194, -0.2362],
                      [ 0.0463,  0.2101, -0.2726,  0.0574],
                      [-0.4221, -0.3367, -0.4742,  0.2873],
                      [-0.1169, -0.0678, -0.4661, -0.0395]])),
             ('2.model.2.bias', tensor([-0.0908, -0.4833,  0.5289,  0.1900])),
             ('2.model.4.weight',
              tensor([[-0.0702, -0.2431, -0.2679, -0.3471],
                      [ 0.2825,  0.1204, -0.2197, -0.1854]])),
             ('2.model.4.bias', tensor([ 0.0370, -0.1809])),
             ('4.weight', tensor([[-1.0629,  2.5723]])),
             ('4.bias', tensor([0.0272]))])
  • We can save the model to a file and load it back.
torch.save(model.state_dict(), "my_model.pth")
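
Loading the weights back is the reverse step (a minimal sketch, assuming model has the same architecture as when the state_dict was saved):

model.load_state_dict(torch.load("my_model.pth"))
model.eval()  # set to evaluation mode before inference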

Some available models

  • If we want to use the ResNet model structure for image classification and the image size is 200x200, we can use the following code to get the model structure.
  • Note that this model is defined for ImageNet, which has 1000 classes. If you want to use it for a different number of classes, you need to change the last layer.
import torchvision.models as models
from torchsummary import summary

resnet = models.resnet18(weights=None) # No weights - random initialization
summary(resnet, (3, 200, 200))
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 64, 100, 100]           9,408
       BatchNorm2d-2         [-1, 64, 100, 100]             128
              ReLU-3         [-1, 64, 100, 100]               0
         MaxPool2d-4           [-1, 64, 50, 50]               0
            Conv2d-5           [-1, 64, 50, 50]          36,864
       BatchNorm2d-6           [-1, 64, 50, 50]             128
              ReLU-7           [-1, 64, 50, 50]               0
            Conv2d-8           [-1, 64, 50, 50]          36,864
       BatchNorm2d-9           [-1, 64, 50, 50]             128
             ReLU-10           [-1, 64, 50, 50]               0
       BasicBlock-11           [-1, 64, 50, 50]               0
           Conv2d-12           [-1, 64, 50, 50]          36,864
      BatchNorm2d-13           [-1, 64, 50, 50]             128
             ReLU-14           [-1, 64, 50, 50]               0
           Conv2d-15           [-1, 64, 50, 50]          36,864
      BatchNorm2d-16           [-1, 64, 50, 50]             128
             ReLU-17           [-1, 64, 50, 50]               0
       BasicBlock-18           [-1, 64, 50, 50]               0
           Conv2d-19          [-1, 128, 25, 25]          73,728
      BatchNorm2d-20          [-1, 128, 25, 25]             256
             ReLU-21          [-1, 128, 25, 25]               0
           Conv2d-22          [-1, 128, 25, 25]         147,456
      BatchNorm2d-23          [-1, 128, 25, 25]             256
           Conv2d-24          [-1, 128, 25, 25]           8,192
      BatchNorm2d-25          [-1, 128, 25, 25]             256
             ReLU-26          [-1, 128, 25, 25]               0
       BasicBlock-27          [-1, 128, 25, 25]               0
           Conv2d-28          [-1, 128, 25, 25]         147,456
      BatchNorm2d-29          [-1, 128, 25, 25]             256
             ReLU-30          [-1, 128, 25, 25]               0
           Conv2d-31          [-1, 128, 25, 25]         147,456
      BatchNorm2d-32          [-1, 128, 25, 25]             256
             ReLU-33          [-1, 128, 25, 25]               0
       BasicBlock-34          [-1, 128, 25, 25]               0
           Conv2d-35          [-1, 256, 13, 13]         294,912
      BatchNorm2d-36          [-1, 256, 13, 13]             512
             ReLU-37          [-1, 256, 13, 13]               0
           Conv2d-38          [-1, 256, 13, 13]         589,824
      BatchNorm2d-39          [-1, 256, 13, 13]             512
           Conv2d-40          [-1, 256, 13, 13]          32,768
      BatchNorm2d-41          [-1, 256, 13, 13]             512
             ReLU-42          [-1, 256, 13, 13]               0
       BasicBlock-43          [-1, 256, 13, 13]               0
           Conv2d-44          [-1, 256, 13, 13]         589,824
      BatchNorm2d-45          [-1, 256, 13, 13]             512
             ReLU-46          [-1, 256, 13, 13]               0
           Conv2d-47          [-1, 256, 13, 13]         589,824
      BatchNorm2d-48          [-1, 256, 13, 13]             512
             ReLU-49          [-1, 256, 13, 13]               0
       BasicBlock-50          [-1, 256, 13, 13]               0
           Conv2d-51            [-1, 512, 7, 7]       1,179,648
      BatchNorm2d-52            [-1, 512, 7, 7]           1,024
             ReLU-53            [-1, 512, 7, 7]               0
           Conv2d-54            [-1, 512, 7, 7]       2,359,296
      BatchNorm2d-55            [-1, 512, 7, 7]           1,024
           Conv2d-56            [-1, 512, 7, 7]         131,072
      BatchNorm2d-57            [-1, 512, 7, 7]           1,024
             ReLU-58            [-1, 512, 7, 7]               0
       BasicBlock-59            [-1, 512, 7, 7]               0
           Conv2d-60            [-1, 512, 7, 7]       2,359,296
      BatchNorm2d-61            [-1, 512, 7, 7]           1,024
             ReLU-62            [-1, 512, 7, 7]               0
           Conv2d-63            [-1, 512, 7, 7]       2,359,296
      BatchNorm2d-64            [-1, 512, 7, 7]           1,024
             ReLU-65            [-1, 512, 7, 7]               0
       BasicBlock-66            [-1, 512, 7, 7]               0
AdaptiveAvgPool2d-67            [-1, 512, 1, 1]               0
           Linear-68                 [-1, 1000]         513,000
================================================================
Total params: 11,689,512
Trainable params: 11,689,512
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.46
Forward/backward pass size (MB): 51.08
Params size (MB): 44.59
Estimated Total Size (MB): 96.13
----------------------------------------------------------------

Pre-trained weights

  • We can also use the weights that are trained on the ImageNet dataset.
resnet = models.resnet18(weights="IMAGENET1K_V1")
  • This can be useful when you have a small dataset and you want to use the pre-trained weights to improve the performance.

Fine tuning

  • The last layer of ResNet18 is a fully connected layer with 512 input features and 1000 output features.
resnet.fc
Linear(in_features=512, out_features=1000, bias=True)
  • If we want to use it for a 10-class classification problem, we need to change the last layer.
resnet.fc = nn.Linear(512, 10)
resnet.fc
Linear(in_features=512, out_features=10, bias=True)
  • For efficiency, we can freeze the weights of the earlier layers and train only the last layer (see the optimizer sketch after the code below).
# freeze all layers by setting the requires_grad attribute to False
for param in resnet.parameters():
    param.requires_grad = False

# unfreeze the last layer
for param in resnet.fc.parameters():
    param.requires_grad = True
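
Since only resnet.fc now has requires_grad=True, one option (an illustrative sketch; the learning rate is arbitrary) is to pass just the trainable parameters to the optimizer:

# Only parameters with requires_grad=True are passed, so only the last layer is updated.
optimizer = torch.optim.SGD(
    filter(lambda p: p.requires_grad, resnet.parameters()), lr=0.01
)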

Lightning

PyTorch Lightning

  • PyTorch offers a lot of flexibility, but it can be cumbersome to write the training loop, validation loop, etc.
  • Lightning is a lightweight PyTorch wrapper.
  • A LightningModule is essentially a PyTorch nn.Module, except that it also provides a standard structure for the research code.
  • More specifically, there are two main classes in Lightning: LightningModule and LightningDataModule.

The basic structure of a LightningDataModule is as follows:

import pytorch_lightning as pl
from torch.utils import data

class MyDataModule(pl.LightningDataModule):
    def __init__(self):
        pass

    def prepare_data(self):
        # download, IO, etc. Useful with shared filesystems
        pass

    def setup(self, stage):
        # make assignments here (val/train/test split)
        dataset = RandomDataset(1, 100) # RandomDataset is a placeholder for your own Dataset
        self.train, self.val, self.test = data.random_split(
            dataset, [80, 10, 10], generator=torch.Generator().manual_seed(42)
        )

    def train_dataloader(self):
        return data.DataLoader(self.train)

    def val_dataloader(self):
        return data.DataLoader(self.val)

    def test_dataloader(self):
        return data.DataLoader(self.test)

The basic structure of a LightningModule is as follows:

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x

    def loss(self, pred, label):
        pass

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        pred = self.forward(x)
        loss = self.loss(pred, y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        pred = self.forward(x)
        loss = self.loss(pred, y)
        self.log('val_loss', loss)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

Training a model in Lightning is as simple as follows:

model = MyModel()
data_module = MyDataModule()
trainer = pl.Trainer()

trainer.fit(model, data_module)

Lightning vs. PyTorch

  • Lightning provides a more structured way to write PyTorch code.
  • The Lightning Trainer automates many things, such as:
    • Epoch and batch iteration
    • optimizer.step(), loss.backward(), optimizer.zero_grad() calls
    • Calling of model.eval(), enabling/disabling grads during evaluation
    • Checkpoint Saving and Loading
    • TensorBoard logging (see the logger options)

References