

Cuda and AI/Learning Tools

Most research computing in the CS department, and much of our instruction, uses GPUs with Nvidia’s Cuda software and applications such as Pytorch and Tensorflow. This page describes these technologies.

GPUs and their allocation

Most of our research and larger instructional systems (servers) have 8 Cuda-capable Nvidia GPUs. Desktop systems generally have one smaller GPU that is still Cuda-capable, and thus could be used for courses in GPU programming or preliminary software development.

GPUs can be shared by more than one user. However, GPU memory is limited (typically 12 GB on public systems), so in practice only one or two users can use a GPU at a time. To avoid having one user dominate our limited GPUs, we require that you use the Slurm Job Scheduling software to run GPU jobs on iLab servers; no GPUs are available on iLab servers without Slurm. Slurm is not required on desktop machines.
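
As a sketch, you can request an interactive session with one GPU through Slurm roughly like this (the exact options, such as partition names, depend on our Slurm configuration; see our Slurm documentation for current settings):

srun --gres=gpu:1 --pty bash

Once the session starts, running nvidia-smi should show the GPU you were allocated.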

Cuda

The Cuda Toolkit is a set of APIs from Nvidia, designed to make it easy to write programs using GPUs. Among other things, it provides a uniform interface that applies to many different models of GPU. There are alternatives, e.g. OpenCL, but Cuda is the most commonly used here. Cuda has bindings for Python and most other major programming languages; work in our department is done primarily in Python.

We install the latest version of Cuda on all systems with appropriate GPUs. Many users have existing code that requires older versions. You can write to us and ask to have a previous version installed on your system; however, this may not be possible. E.g., Ubuntu 22 supports Cuda 12, but no older versions. See the section on containers, below, for a way to use older versions of software.
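
To check what a particular machine has, both of Nvidia's standard query tools work:

nvcc --version
nvidia-smi

The first reports the installed Cuda Toolkit version; the second reports the driver version and the newest Cuda release it supports.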

Python

In the CS department, GPU-based work is done primarily in Python. We would be happy to support users with other languages, but most tools currently installed are for Python. The most common tool is Pytorch, but we also have some usage of Tensorflow.

Pytorch, Tensorflow, and other tools are installed in Anaconda environments. DO NOT simply type “python”: on many systems that will give you an older version of Python, without access to the GPU-related tools. Instead, use the most recent version of Anaconda that you can. Anaconda is a packaged environment for Python that includes most of the major tools; when possible, we can add additional ones on request.

To run Python, see Using Python on CS Linux Machine. For most purposes, it’s sufficient to add the appropriate environment to your PATH, e.g. export PATH=/common/system/anaconda/envs/python38/bin:$PATH

The general-purpose Anaconda environments are located in /common/system/anaconda/envs/. Currently we have python36 through python39, and we’ll add new versions as they become available; type conda env list to see the current list.
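
For example, to use the python39 environment and confirm that Pytorch can see a GPU (assuming Pytorch is installed in that environment, as described above):

export PATH=/common/system/anaconda/envs/python39/bin:$PATH
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"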

If you need a special Python setup and know how to manage Python yourself, we recommend setting up your own Python environment, where you can install your own modules and avoid conflicts with the existing ones.

Adding your own software

We have tried to put all the commonly-used software in our Anaconda environments. If you need more, there are two options:

    • Install individual packages using pip install --user. That installs them in your home directory, under ~/.local/lib/pythonM.N. This is a reasonable approach if you only need a few packages.
    • Install your own Python environment, using either a venv or your own Anaconda distribution. Because Anaconda distributions are large, and our home directories have limited quotas, you may want to put an Anaconda distribution in either /common/users/NETID or /common/home/NETID (if you’re a grad student or faculty). Both options are sketched below.
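
As a rough sketch, the two options might look like this (the package name examplepkg and the directory /common/users/NETID/myenv are placeholders, not real names):

pip install --user examplepkg

/common/system/anaconda/envs/python39/bin/python -m venv /common/users/NETID/myenv
source /common/users/NETID/myenv/bin/activate
pip install examplepkg

The venv here is built from the python39 Anaconda environment described above, so it starts from a recent Python.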

Running Containers with Singularity

As noted above, you may need a specific version of Conda, Pytorch, or other software that differs from what we have installed. For this we recommend using a container.

A container is in some respects like a virtual machine: it has its own set of software. But it’s not as isolated from the underlying operating system; it has the same users, processes, and user file systems. It is really a way of delivering a specific set of software that is different from what is installed on the main system.

Nvidia supplies official containers that have Cuda, Pytorch, Tensorflow, and many other tools. They issue new containers once a month, but keep the old ones archived. That lets you get most reasonable combinations of versions by running the right container.
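
If you’d rather fetch a container yourself, Singularity can pull images directly from Nvidia’s registry. As a sketch (the tag 21.05-py3 is only an illustration; check Nvidia’s catalog for current tags):

singularity pull docker://nvcr.io/nvidia/pytorch:21.05-py3

This writes a .sif file to the current directory, which you can run as described below.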

Because older versions of Cuda won’t install on Ubuntu 20, you’ll have to use a container on Ubuntu 20 if you need Cuda 9 or 10. In the long run, though, we’ll probably use containers even for current software.

We have downloaded the Nvidia containers that we think you’re most likely to want to /common/system/nvidia-containers. That directory also contains INDEX files listing all available containers and the versions of major software they support. If you need a container that we haven’t provided, we can easily download it. To look at the indexes, do

more /common/system/nvidia-containers/INDEX-pytorch

or

more /common/system/nvidia-containers/INDEX-tensorflow

In the table at the end, you’ll see entries like

21.05 1.15.5 or 2.4.0, Ubuntu 20, Cuda 11.3.0, Python 3.8
     21.04 1.15.5 or 2.4.0, Ubuntu 20, Cuda 11.3.0, Python 3.8

21.05 is the container version (May 2021). It uses version 1.15.5 or 2.4.0 of Tensorflow, with Ubuntu 20, Cuda 11.3.0, and Python 3.8.

The versions at the left margin are the ones we have. The indented versions are available from Nvidia and could be downloaded if you need them.

If you do ls /common/system/nvidia-containers you’ll see a list of the files we have. The containers all end in .sif, and the names should match the entries in the index files; e.g. tensorflow:21.05-tf2-py3.sif is 21.05, the version with Tensorflow 2.4.0 (the 1.15.5 version would be tf1).

To use a container, simply run it with Singularity, e.g.

singularity run --nv /common/system/nvidia-containers/tensorflow:21.05-tf2-py3.sif

Once it starts, you’ll be in a bash shell within the container, in your normal home directory. At that point you can do development and run programs as you normally would. For more information, see Singularity Basic Commands or Documentation and Examples.
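
You can also run a single command non-interactively instead of starting a shell. For example, to run a hypothetical script train.py with the container’s Python:

singularity exec --nv /common/system/nvidia-containers/tensorflow:21.05-tf2-py3.sif python train.py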

You can install additional Python software for the container as described above, i.e. using pip install --user. Because your home directory is the same inside the container and outside, this works just as it would outside the container. You can also install your own Python environment; that will work inside the container as well, though you’ll have to make sure the versions of your software match the Cuda version supported by the container.

These containers have software intended to run code; they may not have everything you want for development. (In particular, there’s no emacs text editor.) Thus you may want to keep a separate window on the main machine for everything other than running your program. The user files are the same inside and outside the container. In fact, even the processes you see with ps are the same inside and outside the container (though usernames other than your own won’t show inside the container).

Running Containers with Docker

Running containers with Singularity is the preferred method on CS machines. If you must run a Docker container, please see the Computer Science Docker page.

Running long jobs and GPU jobs

When running a long job, please be aware that we have Limitations Enforced On CS Linux Machines. Please read the instructions there on how to work within those restrictions.

Because we have limited GPUs, we require that you use the Slurm Job Scheduling software to run GPU jobs on iLab servers.
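
As a sketch, a batch job requesting one GPU could be submitted with a script like the following (the script name myjob.sh, the time limit, and the program train.py are placeholders, not real files; see our Slurm documentation for the options we actually require):

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --output=myjob.out
python train.py

Submit it with sbatch myjob.sh, and Slurm will queue the job until a GPU is free.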

For help with our systems, or if you need immediate assistance, visit the LCSR Operator at CoRE 235 or call 848-445-2443. Otherwise, contact the CS HelpDesk. Don’t forget to include your NetID along with a description of your problem.