CSSE 461 Computer Vision
Kyle Wilson, working alone
I’ll study the paper “VGGT: Visual Geometry Grounded Transformer”, available on PapersWithCode. This paper tackles several computer vision problems with a single model, but I’ll focus on its performance on the depth-from-single-image problem. My current understanding is that the exceptional results largely come from training one network on multiple tasks, which let the researchers draw on more datasets during training.
This project will follow the recommended route:
VGGT comes out of Facebook Research, which publishes training code, model weights, and inference scripts. The model is even on HuggingFace, for maximum user-friendliness. This post suggests that I’ll need at most 7–8 GB of GPU memory to run inference on a single image at a time, which is well within what we have available.
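As a quick feasibility check, here is a minimal sketch of what single-image inference might look like, based on my skim of the repository README. The module paths, the facebook/VGGT-1B checkpoint name, and the output key names are assumptions I still need to verify against the actual code.

```python
# Minimal inference sketch -- module paths, checkpoint name, and output
# keys are assumptions from my skim of the VGGT README, to be verified.
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pull the pretrained weights from HuggingFace (assumed checkpoint name).
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)
model.eval()

# A single input image; VGGT accepts a sequence of frames, so this is
# just the one-frame case.
images = load_and_preprocess_images(["example.jpg"]).to(device)

with torch.no_grad():
    predictions = model(images)

# I expect a per-pixel depth map among the outputs (key name assumed).
depth = predictions["depth"]
print(depth.shape)
```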
If time permits, I may also explore other tasks that VGGT can do, such as estimating surface normals and tracking points across frames.
Another possible extension would be to run VGGT on many images from its test set, display the predictions, and look for patterns in the kinds of errors it tends to make.
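If I get to that extension, displaying the predictions should be straightforward with matplotlib. A rough sketch, assuming the `depth` tensor from the inference snippet above (its exact shape is an assumption):

```python
# Rough visualization sketch for eyeballing error patterns. Assumes
# `depth` from the inference sketch above, with singleton batch/frame
# dims that squeeze down to a (H, W) array.
import matplotlib.pyplot as plt

depth_map = depth.squeeze().cpu().numpy()

plt.imshow(depth_map, cmap="viridis")
plt.colorbar(label="predicted depth")
plt.title("VGGT depth prediction")
plt.savefig("depth_vis.png", dpi=150)
```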