Project 3: Depth from Single Image

Figure 1: Depth predictions on eight random images from the NYU-Depth V2 dataset. Row 1 is the input images. Row 2 shows ground truth depths. Row 3 is the predicted depths, using a very simple convolutional model trained for a mere 40 epochs. Hopefully your predictions look better than these.

Spring 2025, Rose-Hulman Institute of Technology

Instructor: Kyle Wilson

Important Dates:

Project out: Thursday 5/01/2025
Check-in artifact: Tuesday 5/13/2025
Code due (on github): Tuesday 5/20/2025

Overview

An image is a 2D representation of a 3D world. The process of taking a photo is lossy. The rich 3D geometry is compressed down into a flat image plane. Information is lost. In general, it’s impossible to reason our way back to a 3D model of the scene from a single image.

In this project we will cheat, and attempt this impossible problem using machine learning. We’ll train a deep learning model to predict the depths of pixels in an image based on statistical patterns in lots of training data.

This is a group project. We’ll be using specialized hardware (GPUs) for training the models, and groups will help make sure there is enough compute to go around.

Hints

Students come to this class with hugely varying levels of experience with machine learning. I want the assignment to adapt to your pre-existing knowledge. So, for each of the tasks below there is a hint link. There’s no grade penalty for using them. Essentially, you get to pick your difficulty level.

Learning Objectives

By doing this project, you will learn to:

Do deep learning on a remote, GPU-equipped server
Work with RGB-D images
Define a convolutional model in PyTorch
Load a vision dataset, train your model on it, and validate your model
Evaluate the quality of a depth map prediction

Technical Requirements

Start by reading the GPU server documentation. Make sure that you can connect to the compute server, clone the bare-bones starting code, and run simple Python commands.

Core Tasks

I recommend completing these in order:

Task 1: Load the NYU Depth v2 dataset

NYU Depth v2 is a high-quality dataset commonly used for depth prediction. You can read the official documentation here. Unfortunately, the official download link for this dataset is down, and I couldn’t find any reputable mirrors. While we wait for that to get fixed, we’ll use a version of the dataset that was uploaded to HuggingFace by user tanganke.

Here are subgoals for this task:

Load the dataset
Combine the stock test/train splits
Resize the data to a nice size of 320 x 240 pixels
Make your own test/train split
Create a torch.utils.data.DataLoader object for each split
Verify that the results look reasonable

Work through these in order, using the framework provided in the starting code.

Remember, there’s a hint available if you’d like. You’re encouraged to use this, especially if you are newer to doing machine learning.

Task 2: Define a convolutional neural network in PyTorch

Next you’ll need to define a neural network to train. Let’s start with the basics. You’re welcome to try fancier models later (see the “extensions” section below).

The recommended starting point is a model with two parts:

An encoder, which repeats this sequence of operations three times: 2D Convolution, ReLU activation, 2x downsampling
A decoder, which repeats this sequence of operations three times: 2D convolution, ReLU activation, 2x upsampling (omit the ReLU for the last layer)

This model is a starting point, and its performance is fairly bad. I’d like you to get something training first, even if it is bad, before you try to improve the model.

It’s not important that you reproduce this model exactly.

Remember, there’s a hint if you’d like to use it.

Task 3: Loss function

Every machine learning project needs at least one loss function. Its job is to quantify error: how different is your model’s prediction from the correct answer?

Follow the instructions in the starting code to write your loss function.

There is a small hint for this section too.

Task 4: Train your model

This section isn’t mainly about writing code. I’ve provided a training loop. If your DataLoaders, your Model, and your LossFunction are all working, then you can run these cells to train your model.

Here’s how you know if it is working:

There shouldn’t be any error messages or warning messages
The “Val Loss” number should be slowly decreasing over time.
The “Train Loss” number should roughly track the validation loss. It’s normal for it to be a little bit bigger or smaller, but if you’re consistently seeing a 15% or 20% difference then there’s a problem.
If you run the inference cells at the bottom of the notebook you would like to see reasonable output, and it should get better the longer you train your model.

I often see a problem where the model outputs all zeroes after only a little training. If you see this, remove the final ReLU in your decoder and try again.

Okay, I can train! But the output quality looks really bad.

The first time you train your model the results will probably be bad. Now it’s time to tweak things. I’ll give suggestions on reasonable things you can try that tend to improve models.

Easy things to try that might help a little:

Increasing the batch size by a modest amount
Increasing or decreasing the learning rate of the optimizer
Adding “weight decay” to the optimizer. (If you try this, start very small, like 1e-4 or less. It is only useful if you are overfitting.)
Train for more epochs
Increase the number of filters in each layer of your model.

Harder things to try that might help a lot:

Training your model to predict inverse depth 1/depth instead of depth. (This tends to be more numerically stable.) I ended up editing my dataloader to invert the depth images when I loaded the dataset.
Adding nn.Dropout layers to your model and retraining (this may artificially inflate your Train Loss so don’t worry if it becomes quite a bit higher than the Validation Loss. It’s still bad if the Train Loss is way lower than Validation Loss.)
Add more layers to your model to make it taller.

The picture at the top of this lab sets a low minimum bar. If your predictions look worse than these, keep working. If yours look about the same or better you’ve passed the minimum bar for this assignment.

Possible Extensions:

(See grading guidelines below. All of these are optional, but earning a higher grade is tied to completing some of them.)

Write code to run inference on your own images. You’ll need to resize them to the input size of your model through some combination of scaling and cropping. (Do this in code, not through Photoshop or the like.) Go take some indoor pictures around campus and show us how your model performs.
Implement simple data augmentations, such as using random crops of the input image instead of resizing, or adding mirror-image versions of real data to the training pipeline.
Dig deeper into your model’s performance. Track down a set of images that it did particularly badly on. Can you characterize the types of scenes that your model struggled with?
Make a change to the model, the loss, or the training process that significantly improves performance. Talk to me if you’re looking for ideas on where to start.
Research “skip connections” between the encoder and decoder. They are commonly used in depth prediction networks. Edit your model definition to use them.
Change your model architecture to use an off-the-shelf pretrained network as your encoder. (I suggest ResNet18.) Modify the rest of the pipeline to work with this change. Do you get better performance?

Submission and Grading

Check-in

The first thing due is your check-in artifact. You’ll need to upload sample output from your network to Moodle. (I’m looking for an image similar the one at the top of this document, but generated by your code.) To meet this milestone your code must be able to load data, do some minimal amount of training, and generate output that is better than random noise. (The quality bar is very low for the check-in.)

Code submission

Commit. Turn in your code by committing to your github repo.
Artifact. Please also commit some sample output, such as the .png files that are saved by the “inference” cell at the end of the starting code notebook. I want to be able to see your results without retraining your network. (Note: by default, the starting code shows model results on test set images. Don’t change that! Don’t submit results on train set images, because that’s doing bad science.)
README. Tell me which extensions you did (if any) in a short text file in your GitHub repo.
Weights. (optional) If you’re feeling particularly code-savvy, you’re welcome to commit a copy of your model weights, so that I can run your model too. I won’t require this from everyone.

Grading

Successfully completing all of the core tasks well will earn a C on this assignment. To earn a B, also complete one extension task of your choice. To get an A, complete two extensions and produce output that is a clear improvement on the sample output at the top of this page.

Per the syllabus, grading will be on the basis of correctness, clarity, and efficiency.