CSSE 461


Interesting Papers

This is a list of computer vision papers that have been getting the most buzz lately. I haven’t read them all in detail, so please do your due diligence to decide whether they are feasible.

VGGT: Visual Geometry Grounded Transformer

ArXiv. This paper promises the world. Given any number of images of a scene, it computes everything geometric you could ask for: camera parameters, 3D points, depths, and pixel correspondences. It won the 2025 Best Paper Award at CVPR, the top vision venue. Does it live up to the hype?

Depth Anything V2

ArXiv. This 2024 paper claims to be the best method for depth-from-single-image. How did they get it to work so well? Also note the link for “Fine-tune for Metric Depth Estimation”. What’s that about?

SAM 2: Segment Anything in Images and Videos

ArXiv. Segmentation means selecting all of the pixels in an image that are part of a target object. This model works on both images and videos, and it can segment objects that it never saw in the training data. How does that work?
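To make the definition above concrete, here is a minimal sketch of what a segmentation result looks like as data: a true/false mask with the same shape as the image. The toy image and the brightness threshold are made up for illustration. This is not how SAM 2 works internally (it predicts the mask with a neural network); the sketch only shows the output format, plus the coarser bounding box you could derive from it.

```python
# Toy 4x6 grayscale "image": a bright object on a dark background.
# (Hypothetical data, chosen only for illustration.)
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 9, 0, 0],
    [0, 9, 9, 9, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# A segmentation mask has one True/False entry per pixel.
# Here a simple threshold stands in for a real model's prediction.
mask = [[pixel > 5 for pixel in row] for row in image]

# "Selecting the pixels" means collecting every (row, col) where the mask is True.
object_coords = [
    (r, c)
    for r, row in enumerate(mask)
    for c, inside in enumerate(row)
    if inside
]
num_object_pixels = len(object_coords)

# A bounding box (x_min, y_min, x_max, y_max) is a coarser summary of the same object.
rows = [r for r, _ in object_coords]
cols = [c for _, c in object_coords]
bbox = (min(cols), min(rows), max(cols), max(rows))

print(num_object_pixels, bbox)  # 6 (1, 1, 3, 2)
```

The mask keeps the exact object outline, while the box only records its extent. That difference is what separates segmentation models from box-based detectors.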

YOLO-World: Real-Time Open-Vocabulary Object Detection

ArXiv. This is a recent entry in the YOLO series of models, which locate objects in images and put bounding boxes around them. This model is integrated with a language model, so it can locate objects from a text description.

MASt3R: Matching and Stereo 3D Reconstruction

ArXiv. This 2024 paper trains a neural network to do Structure from Motion. The results in the paper look really good!

ViTPose++: Vision Transformer for Generic Body Pose Estimation

ArXiv. This 2024 paper is about estimating human pose (as wireframes) in video. The focus is on real-time framerates, and they have four different sizes of model depending on how powerful your hardware is.

UniMatch: Unifying Flow, Stereo and Depth Estimation

ArXiv. This is another paper that tries to solve several problems in geometric computer vision at once. It restricts its focus to stereo pair problems, which have a particularly strong signal. The results look pretty good!
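One way to see why stereo pairs carry such a strong signal: for a rectified pair from a pinhole camera, depth follows directly from how far a point shifts between the two images (its disparity). The sketch below is the textbook relation, depth = focal length × baseline / disparity, and not UniMatch's method. The function name and the example numbers are hypothetical.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters of a point seen in a rectified stereo pair.

    Assumes a pinhole camera model, with the focal length in pixels and
    the baseline (distance between the two cameras) in meters.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical setup: 800 px focal length, cameras 0.25 m apart.
near = depth_from_disparity(40.0, focal_px=800.0, baseline_m=0.25)
far = depth_from_disparity(4.0, focal_px=800.0, baseline_m=0.25)
print(near, far)  # 5.0 50.0
```

Note that a tenfold drop in disparity means a tenfold increase in depth, so small matching errors matter much more for distant points.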