GPU servers – Basic operations
Context
The GPU servers of CL are to be used for development and refinement of jobs that are small enough to run on two GPUs (~48GB of vRAM, i.e. GPU memory) and about 100GB of RAM (CPU memory). Anything that requires more resources should be run on the S3it cluster.
Here are the instructions for some of the more common operations on our GPU servers.
WARNING: Working with Git on the GPU servers can be problematic due to the nature of the NFS storage used. Please refer to Git operations for further information on how to handle this.
Python environment management
All environment management tools need to be installed locally by the user.
Conda
Refer to the page https://www.anaconda.com/docs/getting-started/miniconda/install#linux-terminal-installer and run the command:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
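Once the installer script has been downloaded, run it as described on the linked page, for example:
bash Miniconda3-latest-Linux-x86_64.sh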
Refer to Conda documentation on how to manage your environment.
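As a minimal example of managing an environment with Conda (the environment name my_env and the Python version are only placeholders), you can run:
conda create -n my_env python=3.11
conda activate my_env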
You should also check the conda licensing to make sure that it is suitable for your project(s).
uv
Refer to the page https://docs.astral.sh/uv/#installation and run the command:
curl -LsSf https://astral.sh/uv/install.sh | sh
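As a minimal sketch of using uv, assuming the installation above has put uv on your PATH, you can create a virtual environment in your project directory and install a package into it (the package name is only an example):
uv venv
uv pip install gpustat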
Creating your environment (venv)
cd into the path of your project and run the command:
python3 -m venv <name_of_environment>
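Then activate the environment before installing packages or running your code:
source <name_of_environment>/bin/activate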
Inspecting GPU and computing resources
You can use the command:
nvidia-smi
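For a more compact summary, you can also use nvidia-smi's query options (the queried fields below are only an example):
nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv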
Or you can install the Python library “gpustat” in your Python environment (provided you have made one and sourced it, you can run pip install gpustat to install this library), then run:
python3 -m gpustat --debug
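If you want to keep an eye on usage while a job is running, a simple option is the standard watch utility:
watch -n 1 nvidia-smi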
Fair use of the GPU servers
Some limitations are enforced in order to guarantee continuity and availability of the service:
- No hard quotas on storage usage for the moment
- RAM usage is limited to 100GB per user
- No limits on the number of processes currently, but this may change in the future.
In order to guarantee a fair share of usage to all collaborators, users are expected to:
- Minimize the number of GPUs being used. You can set this at multiple levels, including:
- In bash, at the moment of running a job, it is possible to limit which GPUs your job will be able to access (see the example after this list)
- In Python, as well as in other programming languages and frameworks, in multiple ways
- Minimize the duration of GPU occupancy: queuing systems are deliberately not implemented, in order to keep things simple for all users.
- Monitor the outcome of your jobs regularly. Check that they actually terminate when you expect them to; don't forget about them assuming they will be over at some point. This is especially important when your job is a test.
- Leave enough resources for everyone. Many groups have contributed to this server and many researchers need to access it.
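As an example of the bash-level GPU limiting mentioned in the list above (the device index and script name are placeholders, not a prescribed setup), you can restrict a job to a single GPU like this:
CUDA_VISIBLE_DEVICES=0 python3 my_script.py
Only the GPU with index 0 will then be visible to the job; use nvidia-smi to pick an index that is currently free.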
If you don't know how to check the outcome of a job, or are afraid it might be giving an error, please inform us.
Please make sure your code is functional before running it, and avoid running multiple instances of a test script, or any other job, unless strictly necessary.
Please keep the infrastructure accessible to all.