At a minimum, you’ll need the following to do this project:

- PyTorch

Unless you are very patient, you’ll also need one more important resource: GPU-accelerated compute. Training a model on your laptop could be 100x to 1000x slower than training that model on a decent CUDA-capable graphics card (GPU).
Rose CSSE has two GPU servers:

- gebru: 8x Nvidia Quadro RTX 6000 GPUs, each with 24GB GPU memory
- gus: 6x Nvidia L40a GPUs, each with 48GB GPU memory

Gebru is first-come, first-served. Gus is part of our new SLURM compute cluster. You can’t log in to Gus directly. The instructions here are for using Gebru, which is easier.
You’re welcome to try any of the following approaches, but none of them is the approach that the course officially supports:
The rest of this document will walk you through the following workflow:
TL;DR: VS Code is a free code editor from Microsoft.
You want to install it on your local computer. You’ll also need several
VS Code extensions: Python, Jupyter, and
Remote - SSH.
1. Download the Installer:
Go to the official VS Code download page: https://code.visualstudio.com/download. Click on the “System Installer” for your OS.
2. Run the Installer:
Open the downloaded .exe file. Accept the license
agreement and click “Next”.
3. Install and Launch
Click “Next” and then “Install”. Once it’s finished, launch VS Code. You should be greeted with the welcome screen.
4. Install VS Code Extensions

Install each of the following:

- Jupyter: find the extension named Jupyter (by Microsoft) and click the “Install” button.
- Python: find and install the official Python extension by Microsoft.
- Remote - SSH: find and install the Remote - SSH extension (also from Microsoft).

Access to Rose CSSE servers is restricted to the Rose-Hulman campus network. If you’re off campus you’ll need to first connect to the Rose-Hulman VPN. This EIT knowledge base article explains how to do this.
Once you’ve installed the Remote - SSH VS Code extension, the Remote Explorer icon will appear in your Activity Bar (on the far left, near the Extensions button). Click it.

1. Make sure SSH is selected in the dropdown at the top.
2. Add a new SSH connection with this command, replacing username with your Rose-Hulman user name: ssh username@gebru.csse.rose-hulman.edu. Then press Enter.
3. It will ask you to select an SSH configuration file; the default option is fine.
4. Press Enter. When the window shows SSH: server.address, you are successfully connected!

From now on, any file you open or terminal you create is on the remote server, not your local machine. Your VS Code window is (not quite literally, but basically) running on the server.
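If you connect often, you can optionally add an entry to the SSH config file on your local machine so the server shows up by name in Remote Explorer. This is a hypothetical convenience, not part of the official setup:

```
# Example ~/.ssh/config entry (on your LOCAL machine, not the server).
# Lets you connect with "ssh gebru" instead of typing the full address.
Host gebru
    HostName gebru.csse.rose-hulman.edu
    User username   # replace with your Rose-Hulman user name
```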
All the commands in this section assume that you’ve opened VS Code, connected to the compute server, and have opened a terminal in VS Code. This terminal lets you run commands directly on the server (not on your local computer).
We’ll manage our Python environment with a slick new tool called
uv. You can read about how to install uv here.
On Linux, just run this command from any folder:
```shell
# Install uv on Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Next, clone the project starting code from GitHub. There’s a
uv configuration file called pyproject.toml in
the starting code. Read this file to get a sense for what will be
installed.
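You don’t need to edit pyproject.toml, but it helps to recognize its shape. The excerpt below is hypothetical (the names and entries are made up for illustration; your starter file is the source of truth):

```toml
# Hypothetical pyproject.toml excerpt -- your actual starter file will differ.
[project]
name = "p3-depth"
requires-python = ">=3.11"
dependencies = [
    "numpy",
    "matplotlib",
    "ipykernel",
]
```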
Install all of these project dependencies with one command:
```shell
# install all of the standard Python dependencies for this project
uv sync
```

We also need to install some trickier project dependencies. This command will choose the correct versions based on your hardware and operating system. It’s normal for this to take a little time.
```shell
# auto-detect and install appropriate pytorch version
uv pip install torch torchvision --torch-backend=auto
```

Finally, we need to register this Python environment as a Python kernel that we can use in VS Code. This is a one-time command. You won’t need to run it again unless you change your Python environment.
```shell
# Install this Python environment as a Python kernel (one-time setup)
uv run python -m ipykernel install --user --name=p3-depth --display-name="P3 Depth"
```

This section assumes that you have VS Code installed, you connected VS Code to the GPU server via SSH, you cloned the starter code to the server, and you finished setting up your Python environment on the server as described above.
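Before moving on, you can optionally sanity-check the environment from the server terminal. This snippet is just a sketch (not part of the required setup); run it with uv run python, and it reports whether PyTorch and CUDA are visible:

```python
def torch_status() -> str:
    """Report the installed torch version and CUDA visibility.

    Returns a short status string; degrades gracefully if the
    `uv pip install` step hasn't happened yet.
    """
    try:
        import torch  # installed by the uv commands above
    except ImportError:
        return "torch not installed yet"
    return f"torch {torch.__version__}, cuda available: {torch.cuda.is_available()}"

print(torch_status())
```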
Now it’s time to start working on your project.
Open the starter code folder with File > Open Folder.... When you open a notebook, select the Python kernel named P3 Depth. It will likely have a (Recommended) tag next to it.

There are more project teams than GPUs. We’re going to have to share. GPUs have two main resources: (1) cores, and (2) memory.
Generally, when you overuse the cores available on a GPU, all the jobs running on that card will run slowly. If you overuse the memory, usually all of the jobs on that GPU will crash. So, it’s your job to check GPU usage before you start a job.
Class Rule: If you crash a classmate’s training run you owe them a cookie.
To check the status of the GPUs on your server, run this command:
```shell
nvidia-smi
```
You’ll receive a snapshot report about GPU usage on the server. If
you’d prefer a continuously-running dashboard display you can try
nvtop instead of nvidia-smi.
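nvidia-smi can also print a machine-readable summary, e.g. nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits. As a sketch of what “choose a card that isn’t full” means in practice, this hypothetical helper picks the card with the most free memory from that CSV output:

```python
import csv
import io

def pick_emptiest_gpu(csv_text: str) -> int:
    """Return the index of the GPU with the most free memory (in MB).

    Expects output from:
      nvidia-smi --query-gpu=index,memory.used,memory.total \
                 --format=csv,noheader,nounits
    """
    best_idx, best_free = -1, -1
    for row in csv.reader(io.StringIO(csv_text)):
        idx, used_mb, total_mb = (int(field) for field in row)
        free_mb = total_mb - used_mb
        if free_mb > best_free:
            best_idx, best_free = idx, free_mb
    return best_idx

# Made-up sample output from a busy 3-GPU server:
sample = "0, 20000, 24576\n1, 3000, 24576\n2, 24000, 24576\n"
print(pick_emptiest_gpu(sample))  # card 1 has the most free memory
```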
Every time you start working on the GPU server, run one of those two commands (nvidia-smi or nvtop). Choose a card that isn’t full. There’s a spot in your code notebook for you to type your chosen card number. Then restart your Python kernel and start training. (The restart is important! You have to restart your Python kernel to switch from one GPU to another.)
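Under the hood, one common way to pin a process to a single card is the CUDA_VISIBLE_DEVICES environment variable, set before any CUDA library is imported. Your starter notebook has its own place for the card number, so treat this only as an illustration of the idea:

```python
import os

# Hypothetical illustration: pin this process to GPU 3. This must happen
# BEFORE importing torch (or any other CUDA library), or it is ignored.
CHOSEN_GPU = 3  # a card that nvidia-smi showed as not full

os.environ["CUDA_VISIBLE_DEVICES"] = str(CHOSEN_GPU)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```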
You can make your jobs smaller to save space. Smaller batch sizes and less-complicated models reduce both core usage and memory usage.
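As a first-order rule of thumb (an assumption, not a measured fact about your model), activation memory scales roughly linearly with batch size, so halving the batch roughly halves that part of your footprint:

```python
def activation_mem_gb(batch_size: int, per_sample_mb: float) -> float:
    """First-order estimate of activation memory for one training step.

    per_sample_mb is a made-up per-sample figure; measure your own
    model's actual usage with nvidia-smi rather than trusting this.
    """
    return batch_size * per_sample_mb / 1024

print(activation_mem_gb(64, 80))  # 5.0 (GB)
print(activation_mem_gb(32, 80))  # 2.5 (GB)
```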
If you have a question about using the GPU servers, please ask me! Email and Teams are best if you can’t ask in-person.
If you think you need some system administration support, like installing system-wide software, or troubleshooting a server that has gone offline, please email both me and the CSSE department sysadmin, Darryl Mouck.