Hint: Model Definition

The starting code for this section looks like this:

class SimpleCNNDepth(nn.Module):
    def __init__(self):
        super(SimpleCNNDepth, self).__init__()
        # TODO: your code here

    def forward(self, x):
        # TODO: your code here
        return x

Overview: Models in PyTorch

PyTorch uses nn.Module to represent one or more layers of a neural network. To define your own PyTorch model, you subclass nn.Module, create the layers in __init__, and write a forward method that says how an input tensor is processed into the output.

Here’s a short example of how you could make a tiny neural network:

import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()

        # Encoder (Feature Extraction)
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1)
        )

    def forward(self, x):
        x = self.layers(x)
        return x

This defines a network that performs the following sequence of operations:

- 2D convolution, using 3x3 filters and zero-padding
- ReLU nonlinearity
- 2D convolution, using 3x3 filters and zero-padding

You’ll also notice parameters that set the number of channels in each layer. The input images have three channels, the first convolution produces 32 channels, and the final convolution produces 64. Because a 3x3 kernel with padding=1 preserves height and width, this example model takes an input tensor of shape (N, 3, H, W) and outputs a tensor of shape (N, 64, H, W).
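A quick way to confirm those shapes is to push a dummy batch through the model. The sketch below repeats the SimpleCNN example so it runs on its own; the batch size and image size (2, 48, 64) are arbitrary values picked for illustration.

```python
import torch
import torch.nn as nn


class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.layers(x)


model = SimpleCNN()

# Dummy batch: N=2 images, 3 channels, H=48, W=64 (arbitrary sizes).
x = torch.randn(2, 3, 48, 64)

# No gradients are needed just to inspect shapes.
with torch.no_grad():
    y = model(x)

print(x.shape, "->", y.shape)
# torch.Size([2, 3, 48, 64]) -> torch.Size([2, 64, 48, 64])
```

Only the channel dimension changes: neither convolution alters H or W, since each uses a 3x3 kernel with one pixel of zero-padding.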

Hints: Writing a Model for Depth Prediction

Many modern models follow an encoder-decoder architecture. Roughly, this means that the network has a front half, called the encoder, that shrinks the feature maps spatially while increasing the number of channels. The back half, called the decoder, does the opposite: it restores the spatial resolution while reducing the number of channels.

The code could be organized like this:

class SimpleCNNDepth(nn.Module):
    def __init__(self):
        super(SimpleCNNDepth, self).__init__()

        self.encoder = nn.Sequential(
            # TODO
        )
        self.decoder = nn.Sequential(
            # TODO
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

So, what goes in the encoder and decoder?

For a good starting point on this project, try this architecture:

Encoder:
- Three blocks
- Each consisting of: 3x3 convolution, ReLU, 2x downsampling
- Increase channel depth with each block. I used 3 (input), 32, 64, 128
Decoder:
- Three blocks
- Each consisting of: 3x3 convolution, ReLU, 2x upsampling
  - but skip the ReLU in the final block
- Decrease channel depth with each block. I used 128, 64, 32, 1 (output)
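
Before writing any layers, it can help to work out what shape the tensor should have after each block. The helper below is a hypothetical, pure-Python sketch (trace_shapes is not a PyTorch function). It assumes that each 3x3 convolution uses padding that preserves height and width, that the input height and width are divisible by 8 so that three halvings lose no pixels, and that the encoder widths are 3, 32, 64, 128, mirroring the decoder's.

```python
def trace_shapes(height, width, encoder_channels, decoder_channels):
    """Return the expected (channels, height, width) after each block.

    Assumes each convolution keeps H and W unchanged, each encoder block
    halves H and W, and each decoder block doubles them.
    """
    h, w = height, width
    shapes = [(encoder_channels[0], h, w)]  # the input itself

    for channels in encoder_channels[1:]:
        h, w = h // 2, w // 2  # 2x downsampling
        shapes.append((channels, h, w))

    for channels in decoder_channels[1:]:
        h, w = h * 2, w * 2  # 2x upsampling
        shapes.append((channels, h, w))

    return shapes


# Example image size of 64x96 (an arbitrary choice divisible by 8).
shapes = trace_shapes(64, 96, [3, 32, 64, 128], [128, 64, 32, 1])
for shape in shapes:
    print(shape)
# (3, 64, 96)
# (32, 32, 48)
# (64, 16, 24)
# (128, 8, 12)
# (64, 16, 24)
# (32, 32, 48)
# (1, 64, 96)
```

The last entry has one channel and the same height and width as the input, which is what a per-pixel depth map needs.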

Bigger Hint

If you’d like more coding help, read on.

Here’s what it could look like if the encoder only had one block:

self.encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2)  # Downsample by 2x
)

and here would be the corresponding one-block decoder:

# Decoder (Upsampling)
self.decoder = nn.Sequential(
    nn.Conv2d(32, 1, kernel_size=3, padding=1),  # Output 1-channel depth map
    nn.ReLU(),  # keep in earlier decoder blocks; remove it from the final (output) block
    nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),
)
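
To see the two halves working together, here is a minimal sketch that wires the one-block encoder and one-block decoder into a model and checks the shapes on a dummy batch. The class name OneBlockDepth and the input size are made up for illustration. Because the single decoder block is also the final block, the ReLU is left out of it, following the note above. This is a shape check only, not a recommended architecture.

```python
import torch
import torch.nn as nn


class OneBlockDepth(nn.Module):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # downsample by 2x
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(32, 1, kernel_size=3, padding=1),  # 1-channel depth map
            # No ReLU here: this is the final decoder block.
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x


model = OneBlockDepth()

# Dummy batch: N=4 RGB images of size 64x96 (arbitrary, even-sized).
x = torch.randn(4, 3, 64, 96)

with torch.no_grad():
    features = model.encoder(x)
    depth = model(x)

print(features.shape)  # torch.Size([4, 32, 32, 48])
print(depth.shape)     # torch.Size([4, 1, 64, 96])
```

The encoder halves the height and width while raising the channel count to 32, and the decoder brings the tensor back to the input resolution with a single channel. Stacking more blocks follows the same pattern, with each block's input channels matching the previous block's output channels.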