Scheduler for GPU jobs

Scheduler for GPU and Long CPU Jobs

We have a set of machines intended for GPU jobs and for jobs that require large amounts of memory. Our scheduler is coordinated using the Slurm workload manager. Without Slurm, you won’t be able to access any GPUs on these machines, and the enforced limits apply. Slurm is not limited to GPU jobs. It can also be used for long CPU jobs to avoid the enforced limits.

List of Managed Systems

iLab1.cs.rutgers.edu – iLab4.cs.rutgers.edu
rLab1.cs.rutgers.edu – rLab6.cs.rutgers.edu

with a total of 75 GPUs.

- 32 RTX A4000
- 16 RTX A4500
- 11 RTX A4500 ADA
- 8 GeForce GTX 1080 TI
- 8 Nvidia TITAN X

Why Use Scheduler?

- The scheduler will put your job on a system with free resources. It doesn’t matter which system you log into.
- The scheduler tries to give each user a fair share of the system. It also gives priority to jobs that are shorter or use fewer GPUs.
- If the resources are available, you may run several jobs, up to their limit, at the same time.
- With Slurm, the limitations enforced on CS Linux machines for long jobs and memory do not apply to jobs scheduled via sbatch or srun.
- You can tell the scheduler how many GPUs you need (up to 4), how much memory you need (80GB by default, up to about 1TB), and how long your job will last (up to 7 days).
- The scheduler enforces its own, larger limits. As of Jan 1, 2024, the default memory will be 40GB.

Machine without Scheduler

- All iLab desktop machines have a single GPU. This GPU is not part of the scheduler.

- If you need a machine with multiple GPUs on a first-come, first-served basis, you can use iLabU.cs.rutgers.edu, a special machine we often use to test the latest version of Ubuntu LTS, which has 8 x GeForce RTX 2080 Ti. Slurm does not manage these 8 GPUs, and they are subject to the limitations enforced on CS Linux machines.

Getting Started

You can use the Slurm job scheduler in interactive and batch mode. Each one has its pros and cons.

NOTE: On a system running Slurm, the nvidia-smi command won’t show you anything unless you run it within a Slurm job. If you want to know whether a GPU is available for your batch job, you can run srun -G 1 nvidia-smi. This should indicate that a single GPU is available. A special copy, nvidia-smi-priv, can be used outside Slurm to see the current machine’s GPUs.

Interactive Session (for testing only, please!)

An interactive session must be run from a command line or a terminal.
The simplest way to use Slurm is to ask for an interactive session. Type

srun -G 4 --pty python3 prog.py

This allocates the first four GPUs that become available on a single machine, even though other machines might have free GPUs.

-G indicates how many GPUs you want. Currently, you can get anything from 1 to 4. If GPUs are available, you’ll get them. Note that you may end up on a different computer from where you typed the srun command, depending on where there are free GPUs. Important: You must specify the -G option, or no GPU is assigned to you.

If no GPU is free, it will wait for free GPUs and then start. In this case, you might prefer to submit a batch job. (See next section.)
Note that the command will run in a completely different context. It will start by doing a “cd” to the directory you’re currently in, but if another setup is needed, create a script that does the setup and runs the script rather than running the program directly. Of course, if your setup is done automatically by .bashrc, that will work.

If you need to use graphics, use srun --x11=first -G 4 --pty python3 prog.py. Of course, you’ll need to use srun in a graphical session. Login using RDP or https://weblogin.cs.rutgers.edu.
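
The setup-script approach described above might look like the following minimal sketch. The script name run_prog.sh and the PROJECT_DIR variable are illustrative assumptions, not part of our setup. CUDA_VISIBLE_DEVICES is normally set by Slurm inside a GPU job, so printing it is a quick way to confirm that GPUs were assigned.

```shell
#!/bin/bash -l
# run_prog.sh (hypothetical name): does the setup, then runs the program.
# Start it with:  srun -G 1 --pty ./run_prog.sh

# PROJECT_DIR is an illustrative variable; it falls back to the current directory.
project_dir="${PROJECT_DIR:-$PWD}"
cd "$project_dir" || exit 1

# Report where the job landed and which GPUs it can see.
echo "Host: ${HOSTNAME:-unknown}  GPUs: ${CUDA_VISIBLE_DEVICES:-none}"

# Run the program only if it is present in the project directory.
if [ -f prog.py ]; then
    python3 prog.py
else
    echo "prog.py not found in $project_dir"
fi
```

Make the script executable with chmod +x run_prog.sh before passing it to srun.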

Batch Mode (recommended)

Batch mode requires you to put your commands in a file and run it as a batch job. Once submitted, your job will start as soon as GPUs are available:

- Put the commands you want to execute in a file, e.g., myJob
- Submit the job using sbatch -G 4 myJob, where the number after -G is the number of GPUs you want. (See below for large-memory jobs.)
- You can see what jobs are running using the command squeue
- You can cancel a job using scancel NNN, where NNN is the job number shown in squeue.
- If there are a lot of jobs in the queue, you might want to test your job to ensure you haven’t made a mistake in the file. You can use sbatch myJob, i.e., without -G. However, please cancel the job once you verify it has started properly. These systems should only be used for jobs that use GPUs.
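
The steps above can be sketched as one short script. The file name myJob comes from the list above; the variable names are illustrative. The --parsable option makes sbatch print only the job ID, which is convenient for passing to squeue and scancel. The guard simply avoids errors on a machine without Slurm.

```shell
#!/bin/bash
# Sketch of the submit / inspect / cancel cycle.
# Assumes a batch file named myJob in the current directory.
job_file="myJob"
gpus=1

if command -v sbatch >/dev/null 2>&1; then
    # --parsable prints the job ID (followed by ";cluster" on multi-cluster setups).
    job_id=$(sbatch --parsable -G "$gpus" "$job_file")
    job_id=${job_id%%;*}
    echo "Submitted job $job_id"

    # Show only this job in the queue.
    squeue -j "$job_id"

    # Uncomment to cancel the job once you have checked it:
    # scancel "$job_id"
else
    echo "sbatch not found: run this on a Slurm-managed machine"
fi
```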

What goes in your batch file

The file you submit with sbatch must contain every command you need to execute your program.

- Remember, it may run on a different computer. It needs all the commands you’d have to type after logging in to get to the point where you can run.
  - It must begin with #!/bin/bash. We recommend using #!/bin/bash -l (That’s a lowercase L, not a one.) That will cause it to read your .bash_profile, etc.
  - At a minimum, you need to type cd to get to the directory with your files.
  - If you’re using Python in an anaconda environment, it needs “activate” for that environment.
  - You should probably include #SBATCH --output=FILE unless you prefer to type --output when you submit the job.
- Here’s an example for executing your Python code in YOURENV:

#!/bin/bash -l
#SBATCH --output=logfile
cd YOURDIR
activate YOURENV
python YOURPROGRAM

- Here’s an example of executing a singularity container:

#!/bin/bash -l
#SBATCH --output=logfile
cd YOURDIR
# Important: make sure your code runs automatically when the container starts
#            and terminates when the code completes
singularity run --nv SINGULARITY_CONTAINER.sif

- For more details on sbatch scripts, see the sbatch documentation.

Additional Details

The maximum number of GPUs we let you allocate in a single job is currently 4. That will go up as we add more computers to the system. If you want to use more than 4, submit several jobs.
If a job runs longer than a week, it will be killed if others want to use the GPUs. (There have been a few cases where someone needs to run a job longer than a week. If that’s essential, let us know, and we’ll find a way to avoid killing the job.)
You can use the -c option in sbatch if you need more CPUs. See the sbatch documentation for more details.
The scheduler also controls memory. By default, we allocate jobs 80GB of memory. As of January 1, 2024, the default memory will be 40GB. However, you can specify less. E.g., when using sbatch, add --mem=32g. In a few cases, that might allow a job to run that otherwise couldn’t. You can specify up to 1TB, but if you do that, your job can only run on one of the four systems, and it may have to wait if other jobs are using memory. Please do not specify large amounts of memory unless needed, as it will limit what other people can do.
If you look at Slurm documentation, you’ll see many examples where all the commands in the file start with srun. That’s not necessary or even a good idea here. Use sbatch whenever possible!
Because we have many kinds of GPUs, we have defined features to make it easy for you to specify the kind you want. To see a list of all nodes and their specific features, use sinfo -o "%25N %50f" which shows:
NODELIST                  AVAIL_FEATURES
ilab[1-2]                 a4500,ampere
ilab3,rlab1               rtx4500,ada
ilab4,rlab[2,4]           a4000,ampere
rlab3                     a4000,ampere
rlab5                     1080ti,pascal
rlab6                     titanx,pascal
If you want to use an RTX A4000, you could specify -C a4000, or -C ampere for any Ampere card. If you want a 1080 TI or a TITAN X, specify -C '1080ti|titanx' or -C pascal. Note that OR is specified by |. Pascal and Ampere are the architectures. Cards with the same architecture have the same features but differ in the amount of memory and number of cores.
You can also request a specific node using -w NODE, e.g. -w rlab2.
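
As a small sketch of the -C and -w options, the helper below prints each sbatch command and only submits it when RUN=1 is set. The submit function and the RUN variable are illustrative assumptions, not Slurm features; the feature names and the node name are the ones listed above.

```shell
#!/bin/bash
# Dry-run helper (hypothetical): prints the command, and runs it only if RUN=1.
submit() {
    echo "sbatch $*"
    if [ "${RUN:-0}" = "1" ]; then
        sbatch "$@"
    fi
}

# One RTX A4000.
submit -G 1 -C a4000 myJob

# One Pascal card, either a 1080 TI or a TITAN X. Quote the | so the shell
# does not treat it as a pipe.
submit -G 1 -C '1080ti|titanx' myJob

# One GPU on a specific node.
submit -G 1 -w rlab2 myJob
```

Running the script as is only prints the three commands, which makes it easy to check the options before submitting anything.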
The only resources we control are GPUs and memory. The scheduler does not attempt to schedule CPUs. Slurm can run a single job across multiple computers. We don’t recommend using that. Instead, use multiple jobs. If you want more details on each node’s CPUs, memory, state, weight, and features, use sinfo -Nel.
You will probably want the output of your program to go into a file. To specify an output file, you can use -o FILENAME in the sbatch command.
If you need notifications, both sbatch and srun have the options --mail-type and --mail-user, which specify the events you want to be notified about and where to send the email.
If you log in to one of the systems and don’t use sbatch or srun, you won’t have access to any GPUs. However, nvidia-smi will show you all the GPUs, so you can see what’s happening. From within a batch or srun job, nvidia-smi will only show you the GPUs you have allocated.
You can put options in the file. For example, rather than using sbatch -G 4 -o logfile, you could put

       #SBATCH -G 4
       #SBATCH -o logfile

in the file. All #SBATCH lines must be at the beginning of the file (right after the #!/bin/bash).
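
Putting several of the options from this section into one file might look like the sketch below. The values (2 GPUs, 32GB, 4 CPUs, one day) and YOUR_EMAIL_ADDRESS are placeholders to adapt, and -t is the standard sbatch time-limit option. SLURM_JOB_ID and SLURM_SUBMIT_DIR are environment variables that Slurm sets inside a job.

```shell
#!/bin/bash -l
# Request 2 GPUs, 32 GB of memory, 4 CPUs, and a one-day time limit.
#SBATCH -G 2
#SBATCH --mem=32g
#SBATCH -c 4
#SBATCH -t 1-00:00:00
# Write output to logfile and send email when the job ends or fails.
#SBATCH -o logfile
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=YOUR_EMAIL_ADDRESS

# The fallbacks below only apply if the script is run by hand, outside Slurm.
job_id="${SLURM_JOB_ID:-none}"
work_dir="${SLURM_SUBMIT_DIR:-$PWD}"
cd "$work_dir" || exit 1
echo "Job $job_id running in $work_dir"

# Replace this comment with your real commands, e.g. python YOURPROGRAM
```

With the options in the file, submitting is just sbatch myJob. Options given on the sbatch command line take precedence over the #SBATCH lines.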

Jobs Information

To see a list of jobs currently managed by Slurm, type squeue
To see info about one of your jobs, type: scontrol show job jobid where jobid is obtained from squeue
To see how much memory (shown in MaxVMSize) and other resources your jobs have used in the past six months, type: sacct -u $USER -S now-180days -o JobID,User,MaxRSS,MaxVMsize,ReqMem,Submit,Start,State,AllocTRES,Nodelist,Reason
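
A shorter variant of the sacct command above can help when choosing a --mem value. This is a sketch only: the 30-day default and the reduced field list are arbitrary choices, and the guard just avoids an error on a machine without Slurm.

```shell
#!/bin/bash
# Show memory use for your recent jobs.
# Usage: ./jobmem.sh [DAYS]   (jobmem.sh is a hypothetical script name)
days="${1:-30}"
fields="JobID,MaxRSS,MaxVMSize,ReqMem,Elapsed,State"

if command -v sacct >/dev/null 2>&1; then
    # -S sets the start of the reporting window, -o selects the columns.
    sacct -u "$USER" -S "now-${days}days" -o "$fields"
else
    echo "sacct not found: run this on a Slurm-managed machine"
fi
```

Comparing MaxRSS with ReqMem shows whether a job requested much more memory than it actually used.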

Common Slurm commands

sacct: show accounting data for all jobs and job steps
sacctmgr: view and modify Slurm account information
salloc: set an interactive job allocation
sattach: attach to a running job step
sbatch: submit a batch script to Slurm
scancel: cancel jobs, job arrays, or job steps
scontrol: view or modify Slurm configuration and state
sdiag: show scheduling statistics and timing parameters
sinfo: view information about Slurm nodes and partitions
sprio: show the components of a job’s scheduling priority
squeue: show job queues
sreport: show reports from job accounting and statistics
srun: run task(s) across requested resources
sshare: show the shares and usage for each user
sstat: show the status information of a running job/step
sview: a graphical user interface to view and modify Slurm state

Troubleshooting:

Permission denied error

In some situations, when submitting your job, you may get the following error:

slurmstepd: error: couldn't chdir to `/common/home/netid': Permission denied: going to /tmp instead

To fix this error,

1. Remove the slurm.cs.rutgers.edu entry from your ~/.ssh/known_hosts file. You can remove the entry by opening a terminal window and typing:

% ssh-keygen -R '[slurm.cs.rutgers.edu]:23'

2. Once removed, test to see if you can use slurm by typing:

% srun -G 1 nvidia-smi

This will create a new key entry and show the following prompt. Type ‘yes’ to connect.

The authenticity of host '[slurm.cs.rutgers.edu]:23 ()' can't be established.
ED25519 key fingerprint is SHA256:DfIMx4wqjG+FCG8cnNjSdrphErtN4w0ukDuRBUK9j54.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes

You will see the nvidia-smi output. If you see one GPU, your problem is resolved, and you can try to submit your job again using the sbatch command.

If the problem persists, please report it to CS HelpDesk. It is possible something is not working right with Slurm itself on a specific machine.

Further Reading:

For help with our systems or immediate assistance, visit the LCSR Operator at CoRE 235 or call 848-445-2443. Otherwise, see CS HelpDesk. Don’t forget to include your NetID along with a description of your problem.

Department of Computer Science

Technical Services and Support
