
Figure 1: Depth predictions on eight random images from the NYU-Depth V2 dataset. Row 1 shows the input images, Row 2 the ground-truth depths, and Row 3 the predicted depths from a very simple convolutional model trained for a mere 40 epochs. Hopefully your predictions look better than these.
Instructor: Kyle Wilson
An image is a 2D representation of a 3D world. The process of taking a photo is lossy. The rich 3D geometry is compressed down into a flat image plane. Information is lost. In general, it’s impossible to reason our way back to a 3D model of the scene from a single image.
In this project we will cheat, and attempt this impossible problem using machine learning. We’ll train a deep learning model to predict the depths of pixels in an image based on statistical patterns in lots of training data.
This is a group project. We’ll be using specialized hardware (GPUs) for training the models, and groups will help make sure there is enough compute to go around.
Students come to this class with hugely varying levels of experience with machine learning. I want the assignment to adapt to your pre-existing knowledge. So, for each of the tasks below there is a hint link. There’s no grade penalty for using them. Essentially, you get to pick your difficulty level.
By doing this project, you will learn to:
Start by reading the GPU server documentation. Make sure that you can connect to the compute server, clone the bare-bones starting code, and run cells in the starting code notebook.
I recommend completing these in order:
If you’re working on the CSSE GPU servers, this task is done for you (as described in starting code). But if you’re training models on your own computer you’ll need to prepare the dataset on your machine. Follow the instructions in the starting code, and please reach out if you need help!
Next you’ll need to define a neural network to train. Let’s start with the basics. You’re welcome to try fancier models later (see the “extensions” section below).
The recommended starting point is a model with two parts: an encoder that repeatedly downsamples the image while extracting features, and a decoder that upsamples those features back to a full-resolution, one-channel depth map.
This model is a starting point, and its performance is fairly bad. I’d like you to get something training first, even if it is bad, before you try to improve the model.
It’s not important that you reproduce this model exactly.
Remember, there’s a hint if you’d like to use it.
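To make the two-part idea concrete, here is a minimal sketch in PyTorch (which the nn.Dropout and ReLU references elsewhere in this handout suggest the starting code uses). The layer count, channel widths, and the name TinyDepthNet are my own illustrative choices, not the starting code's:

    import torch.nn as nn

    class TinyDepthNet(nn.Module):
        # A deliberately small encoder-decoder: good enough to start
        # training, not good enough to stop there.
        def __init__(self):
            super().__init__()
            # Encoder: halve the resolution twice while growing channels.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),   # H/2 x W/2
                nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # H/4 x W/4
                nn.ReLU(),
            )
            # Decoder: double the resolution twice, ending in one depth channel.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
                nn.ReLU(),
                nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
                # No final ReLU here -- see the warning in the training section.
            )

        def forward(self, x):
            # x: (N, 3, H, W) RGB batch -> (N, 1, H, W) predicted depth.
            return self.decoder(self.encoder(x))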
Every machine learning project needs at least one loss function. Its job is to quantify error: how different is your model’s prediction from the correct answer?
There’s no one right answer here. There are many reasonable ways to quantify error. It turns out that some approaches work much better than others. I’d like you to try several approaches.
The starting code contains a complete implementation of an L2 loss function. It takes two inputs, the predicted depth and the actual depth, and computes the mean squared difference between the two depth maps.
For this section, I'd also like you to write an L1 loss function (the mean absolute difference between the predicted and actual depths). Finally, write one more interesting loss function of your choice.
There is a small hint for this section too.
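If it helps, here's a sketch of what the L1 loss could look like, assuming it takes the same (predicted, actual) inputs as the provided L2 loss:

    import torch

    def l1_loss(predicted, actual):
        # Mean absolute difference between the predicted and ground-truth
        # depth maps. The (predicted, actual) interface is assumed to match
        # the provided L2 loss.
        return torch.mean(torch.abs(predicted - actual))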
This section isn’t mainly about writing code. I’ve provided a training loop. If your DataLoaders, your Model, and your LossFunction are all working, then you can run these cells to train your model.
Here’s how you know if it is working:
I often see a problem where the model outputs all zeroes after only a little training. If you see this, remove the final ReLU in your decoder and try again.
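For orientation, a PyTorch training epoch generally has the shape below. The provided loop will differ in its details; this is only a sketch so you can recognize the moving parts:

    import torch

    def train_one_epoch(model, loader, loss_fn, optimizer, device):
        # One pass over the training data: the generic recipe, not the
        # provided loop verbatim.
        model.train()
        total_loss = 0.0
        for images, depths in loader:
            images, depths = images.to(device), depths.to(device)
            optimizer.zero_grad()                  # clear old gradients
            loss = loss_fn(model(images), depths)  # forward pass + loss
            loss.backward()                        # backpropagate
            optimizer.step()                       # update the weights
            total_loss += loss.item()
        return total_loss / len(loader)            # average training loss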
The first time you train your model the results will probably be bad. Now it’s time to tweak things. I’ll give suggestions on reasonable things you can try that tend to improve models.
Easy things to try that might help a little:
Harder things to try that might help a lot:
- Predicting 1/depth instead of depth. (This tends to be more numerically stable.) See the sketch after this list.
- Adding nn.Dropout layers to your model and retraining. (This may artificially inflate your Train Loss, so don't worry if it becomes quite a bit higher than the Validation Loss. It's still bad if the Train Loss is way lower than the Validation Loss.)
The picture at the top of this lab sets a low minimum bar. If your predictions look worse than these, keep working. If yours look about the same or better, you've passed the minimum bar for this assignment.
(See grading guidelines below. All of these are optional, but earning a higher grade is tied to completing some of them.)
For most of these I figure you’ll run some searches, find a good example, and adapt it to your problem.
- U-Net.

The first thing due is your check-in artifact. You'll need to upload sample output from your network to Gradescope. (I'm looking for an image similar to the one at the top of this document, but generated by your code.) To meet this milestone your code must be able to load data, do some minimal amount of training, and generate output that is better than random noise. (The quality bar is very low for the check-in.)
Completing all of the core tasks (Tasks 1, 2, and 3) well will earn a C on this assignment. To earn a B, also complete two extension tasks of your choice. To earn an A, complete three extensions and produce output that is a clear improvement on the sample output at the top of this page.
Per the syllabus, grading will be on the basis of correctness, clarity, and efficiency.