Sample Final Project Proposal

CSSE 461 Computer Vision

Kyle Wilson, working alone

Topic

I’ll study the paper “VGGT: Visual Geometry Grounded Transformer”, available here at PapersWithCode. This paper tackles several computer vision problems with a single network, but I’ll focus on its performance on the depth-from-single-image problem. My current understanding is that its exceptional results largely come from training the same network on multiple tasks, which lets the researchers draw on more datasets for model training.

Proposed Work

This project will follow the recommended route:

Feasibility

VGGT comes out of Facebook Research, which publishes training code, model weights, and inference scripts. The model is even on Hugging Face, for maximum user-friendliness. This post suggests that I’ll need at most 7 or 8 GB of GPU memory to run inference on a single image at a time, which is well within what we have available.

Possible Extensions

If time permits, I may also explore other tasks that VGGT can do, such as estimating surface normals and detecting point tracks.

Another possible extension would be to run VGGT against many images that were part of its testing set, display the predictions, and try to spot patterns in the types of errors that it is prone to making.
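
As a rough sketch of how that error analysis might start, the Python snippet below scores one predicted depth map against ground truth and builds a per-pixel error map that could be displayed next to the image. It does not call VGGT at all: it assumes the prediction and ground truth are already NumPy arrays of the same shape and scale. The function names (depth_error_summary, relative_error_map), the min_depth threshold, and the choice of metrics (absolute relative error and the fraction of pixels within a 1.25 ratio, both common in monocular depth evaluation) are my own assumptions, not something taken from the paper.

```python
import numpy as np


def _valid_mask(pred, gt, min_depth):
    """Pixels where both depths are finite and above a small threshold."""
    return (
        np.isfinite(pred) & np.isfinite(gt) & (pred > min_depth) & (gt > min_depth)
    )


def depth_error_summary(pred, gt, min_depth=1e-3):
    """Summarize depth error over valid pixels (hypothetical helper).

    Returns the mean absolute relative error, the fraction of pixels whose
    prediction is within a factor of 1.25 of ground truth, and the number
    of pixels that were scored.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")

    valid = _valid_mask(pred, gt, min_depth)
    if not valid.any():
        raise ValueError("no valid pixels to evaluate")

    p, g = pred[valid], gt[valid]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    ratio = np.maximum(p / g, g / p)
    delta1 = float(np.mean(ratio < 1.25))
    return {"abs_rel": abs_rel, "delta1": delta1, "num_valid": int(valid.sum())}


def relative_error_map(pred, gt, min_depth=1e-3):
    """Per-pixel |pred - gt| / gt, with NaN wherever a pixel is invalid."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = _valid_mask(pred, gt, min_depth)
    err = np.full(gt.shape, np.nan)
    err[valid] = np.abs(pred[valid] - gt[valid]) / gt[valid]
    return err
```

Viewing the output of relative_error_map as a heatmap for many images would be one way to look for patterns, such as errors clustering at object boundaries or on reflective surfaces. If the predicted depth turns out to be defined only up to scale, the prediction would need to be aligned to the ground truth before these numbers mean much.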