CSSE 461 - Computer Vision
“Structure from Motion” is the problem of reconstructing the geometry of a 3D scene given many 2D images. As we’ll cover in class, this is a hard math problem, and solving it is an area of active research.
Most SfM solvers are written for researchers, by researchers. The closest thing we have to an end-user codebase is COLMAP. It clocks in at roughly 100,000 lines of code! So we won’t be writing our own SfM solvers in class.
You’ll be creating a report as you go. At the end of the lab, upload your report to Gradescope.
You may work solo or in pairs, as you choose.
On Windows, download the pre-built release and launch COLMAP by running COLMAP.bat. If everything worked, you should see something like this:

If you aren’t completing this lab on a Windows computer, read
COLMAP’s install
docs. Look first for a way to install a pre-built binary. (For
example, Mac OS users can do brew install colmap, and some
Linux users will have packages available.) If none of those apply, read
further down the page for instructions on building the software from
source. Please reach out to me if you hit snags.
This GitHub page has a nice list of standard datasets used in computer vision. The first category is “SfM and MVS”. These datasets are suitable for us (although some will be too large to solve in a reasonable amount of time).
Grab the “Small object dataset” called Bunny, hosted by
the vision lab at TUM. Download the .tar.gz file. In
Windows, right-click to “Extract all”.
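On macOS or Linux, your usual archive tools can unpack it instead. If you’d rather script the extraction, here is a minimal Python sketch (the archive filename is a placeholder; substitute whatever your download is actually called):

import tarfile

# Placeholder filename: use the name of the archive you downloaded.
with tarfile.open("bunny_data.tar.gz", "r:gz") as archive:
    archive.extractall("bunny_data")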
Go back to COLMAP. We’ll run SfM in a series of steps:
Click “File -> New Project”. Next to “Database”, click “New” and choose where to save the database file. (I chose Downloads/bunny_sfm.db.) Click “Save”. Next to “Images”, select the folder containing the dataset’s photos: Downloads/bunny_data/bunny_data/images.
Next we’ll run algorithms that identify keypoints in the images. Click “Processing -> Feature Extraction”. Most of these options are fine at their default values. Note a few that are relevant:
- PINHOLE is the model we covered in class.
- SIMPLE_PINHOLE is the same, but with a square-pixel constraint.
- SIMPLE_RADIAL is like SIMPLE_PINHOLE, but with a radial distortion correction factor. This is a nice accommodation to the real world, and will probably give you better results.

Let’s choose SIMPLE_RADIAL and check “Shared for all images”.
Click “Extract” and wait a moment. When it finishes computing, you can close the Feature Extraction window. COLMAP has just run SIFT, a top-notch feature extractor, and written the extracted keypoint locations to the database we created. You can see the detected keypoints by clicking “Processing -> Database Management”, selecting an image, and clicking the “Show Image” button.
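If you’d rather inspect the raw data, the project database is an ordinary SQLite file. Here is a minimal sketch, assuming the database path we chose earlier and the schema COLMAP uses (each row of the keypoints table holds one image’s keypoints as a float32 blob whose first two columns are the x and y pixel coordinates):

import sqlite3
import numpy as np

# Path assumed from the "New Project" step above.
conn = sqlite3.connect("bunny_sfm.db")
for image_id, rows, cols, data in conn.execute(
        "SELECT image_id, rows, cols, data FROM keypoints LIMIT 1"):
    keypoints = np.frombuffer(data, dtype=np.float32).reshape(rows, cols)
    print(f"image {image_id}: {rows} keypoints; first (x, y) = {keypoints[0, :2]}")
conn.close()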
The next step of processing is to match keypoints across different views. We call a set of keypoints that all refer to the same 3D point a “track”. Here’s a conceptual view of what we’re trying to compute next: we’ve matched the same 3D point across multiple views. We know the pixel locations of this one point as it appears in each image.

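If it helps to picture a track in code, here is a purely conceptual sketch (the image names and pixel coordinates are made up, and this is not COLMAP’s actual data structure):

# One track: the same 3D point observed at a 2D pixel location in each image.
track = {
    "img_0001.jpg": (412.7, 305.2),
    "img_0005.jpg": (388.1, 297.9),
    "img_0012.jpg": (401.4, 310.6),
}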
Click “Processing -> Feature matching”.
The tabs across the top of this window are different algorithms for choosing which images to try to match. Our dataset is small, so stick with the first tab (“Exhaustive”) and click “Run”.
You can see what happened by clicking “Processing -> Database Management”. Select an image, and click “Overlapping images”. Now select one overlapping image from the list and click “Show Matches”.
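These matches live in the same database. Here is a sketch, again assuming COLMAP’s schema: each row of the matches table packs two image ids into a single pair_id and stores the matched keypoint indices as a uint32 blob:

import sqlite3
import numpy as np

MAX_IMAGE_ID = 2**31 - 1  # COLMAP packs two image ids into one pair_id

conn = sqlite3.connect("bunny_sfm.db")
for pair_id, rows, cols, data in conn.execute(
        "SELECT pair_id, rows, cols, data FROM matches WHERE rows > 0 LIMIT 1"):
    image_id1, image_id2 = pair_id // MAX_IMAGE_ID, pair_id % MAX_IMAGE_ID
    matches = np.frombuffer(data, dtype=np.uint32).reshape(rows, cols)
    print(f"images {image_id1} and {image_id2}: {rows} matched keypoint pairs")
conn.close()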
Click “Reconstruction -> Start Reconstruction”. Let it run. Look at the stats in the bottom bar of the COLMAP window, the log messages to the side (and also in the terminal that opened when you launched COLMAP), and the 3D visualization.
Look under “Render -> Render options” for ways to tweak the 3D view. I found “Image connections” interesting: it shows which cameras were able to match some scene content with each other.
Add the following to your report:
Click on “File -> Export model as text”. This will write a bunch of .txt files, so consider making a folder for them. Open images.txt in a text editor. This file contains the computed extrinsics for each image in the reconstruction.
Find the first line of data. I want you to compute the 3D coordinates of the center of projection of this camera. Add this computation to your report. Show your work. You’ll probably write a few lines of code. Show those in the report too.
COLMAP stores 3D rotations as quaternions. Fortunately, we can convert these to rotation matrices easily:
from scipy.spatial.transform import Rotation as R

quat = [w, x, y, z]  # your numbers here, in scalar-first (w, x, y, z) order
rotmat = R.from_quat(quat, scalar_first=True).as_matrix()
print(rotmat)  # 3x3 rotation matrix

Recall the equation that defined \(\mathbf{R}\) and \(\mathbf{t}\):
\[ X_c = \mathbf{R} X_w + \mathbf{t} \]
The camera’s center of projection is the point where \(X_c = \mathbf{0}\). So to find the camera’s position in world coordinates, set \(X_c\) to zero and solve for \(X_w\). (Hint: rotation matrices are special: \(\mathbf{R}^{-1} = \mathbf{R}^\top\).)
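Putting this together, here is a minimal sketch of the whole computation. The quaternion and translation values are placeholders; substitute the numbers from the first data line of your export:

import numpy as np
from scipy.spatial.transform import Rotation as R

# Placeholder values: replace with the QW, QX, QY, QZ, TX, TY, TZ you read off.
qw, qx, qy, qz = 1.0, 0.0, 0.0, 0.0
t = np.array([0.1, -0.2, 1.5])

rotmat = R.from_quat([qw, qx, qy, qz], scalar_first=True).as_matrix()

# Setting X_c = 0 gives 0 = R X_w + t, so X_w = -R^T t.
center = -rotmat.T @ t
print(center)  # the camera's center of projection in world coordinates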
For Part 4, create your own dataset, run SfM, and inspect the reconstruction. Include these in your report:
I’ll offer a small amount of extra credit if you can collect a dataset where the reconstruction fails in an interesting way. I’ll be stingy on this: it needs to really be interesting! Of course you’ll fail to reconstruct a scene where the cameras can’t see any content in common, or where everything looks like a plain white wall.
Upload your report (with answers for Parts 2, 3, and 4 of this lab) as a pdf to Gradescope. If you worked in a pair, use the Team feature on the Gradescope assignment to upload a single report for your team.