CSSE 461 Computer Vision
Kyle Wilson, working alone
I’ll study the paper “VGGT: Visual Geometry Grounded Transformer”, available on PapersWithCode. This paper tackles several computer vision problems with a single model, but I’ll focus on its performance on the depth-from-single-image problem. My current understanding is that the exceptional results largely come from training one network on multiple tasks, which let the researchers draw on more datasets during training.
This project will follow the recommended route:
VGGT comes out of Facebook Research, which publishes training code, model weights, and inference scripts. The model is even on HuggingFace, for maximum user-friendliness. This post suggests that I’ll need at most 7–8 GB of GPU memory to run inference on a single image at a time, which is well within what we have available.
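As a quick feasibility check, here is a minimal sketch of what single-image inference might look like, based on my skim of the repository README. The module paths, the facebook/VGGT-1B checkpoint name, and the output key names are assumptions I still need to verify against the actual code.

```python
# Minimal inference sketch -- module paths, checkpoint name, and output
# keys are assumptions from my skim of the VGGT README, to be verified.
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pull the pretrained weights from HuggingFace (assumed checkpoint name).
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)
model.eval()

# A single input image; VGGT accepts a sequence of frames, so this is
# just the one-frame case.
images = load_and_preprocess_images(["example.jpg"]).to(device)

with torch.no_grad():
    predictions = model(images)

# I expect a per-pixel depth map among the outputs (key name assumed).
depth = predictions["depth"]
print(depth.shape)
```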
If time permits, I may also explore other tasks that VGGT can do, such as estimating surface normals and tracking points across frames.
Another possible extension would be to run VGGT on many images from its test set, display the predictions, and look for patterns in the kinds of errors it tends to make.
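If I get to that extension, displaying the predictions should be straightforward with matplotlib. A rough sketch, assuming the `depth` tensor from the inference snippet above (its exact shape is an assumption):

```python
# Rough visualization sketch for eyeballing error patterns. Assumes
# `depth` from the inference sketch above, with singleton batch/frame
# dims that squeeze down to a (H, W) array.
import matplotlib.pyplot as plt

depth_map = depth.squeeze().cpu().numpy()

plt.imshow(depth_map, cmap="viridis")
plt.colorbar(label="predicted depth")
plt.title("VGGT depth prediction")
plt.savefig("depth_vis.png", dpi=150)
```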