Limitations Enforced On CS Linux Machines
Last Modified Aug 23, 2023
- July 10, 2022 – Added Sessions and GPU limits and removed old methods.
- Sep 3, 2022 – Added slurm exception
- Apr 17, 2023 – Added requirement for Slurm on big jobs and notes on minor jobs.
- June 19, 2023 – Added info for users who need more than 80G of memory
- Aug 23, 2023 – Added info for special requirements for specific versions of software
NOTE: Please consult this page before running big or long jobs. We may change these limits from time to time as we see fit.
The following limitations are enforced on the specified CS Linux machines by default. Please read this page before running a big job so you know how to work around these limitations.
If you use the Scheduler for GPU Jobs (Slurm), the CPU and memory limits described here do not apply to you.
Here is the list of limitations enforced on CS machines.
1. Lifetime of accounts and files
When users leave the University (or Computer Science), their accounts are closed and their files are archived. Archived files are deleted after a year, except for faculty files. Shared directories are deleted or archived based on who owns them.
Sometimes users continue to be associated with the Department after leaving. Their accounts may be continued as guest or retiree accounts. Any faculty member can sponsor a guest; sponsorship for retirees is handled by Computer Science Department Human Resources.
2. Sessions and CPU limits on ALL CS machines
Many X2GO, remote desktop, screen, tmux, terminator, ssh, nohup, etc., users would like to resume their sessions after disconnecting, or keep their programs running after they log out, and many would like to keep big jobs running for days. At the same time, due to the educational nature of our environment, we have many runaway programs that were never properly terminated and are wasting resources.
To manage resources on the CS iLab machines efficiently, we have implemented a simple way to manage user sessions. With a single command, users can control CPU utilization and how long a session should keep running without the risk of being terminated by the system. The command is keep-job N, where N is the number of hours you want your job to continue (values below 24 have no effect, since every job gets a minimum of 24 CPU hours; see the notes below).
To control your session and CPU usage time:
Open a terminal window and type: keep-job 30
where 30 means you get 30 hours to return to your disconnected session. Note: to get a terminal window in JupyterHub, open a notebook of type Terminal.
Note on CPU hours:
- CPU hours means CPU usage time, not wall-clock time. Every job gets a minimum of 24 hours of CPU usage time; if you set N below 24, the system won't terminate the job until its CPU usage goes above 24 hours.
- If you use 4 CPU cores, your process reaches the 24-hour CPU limit after only 6 hours of wall-clock time. After 24 CPU hours, your processes will be terminated unless you renew your time limit. Once you specify a limit, the system won't interrupt any of your jobs until that limit has expired. The time starts when you invoke the keep-job command. If you need more time, rerun keep-job before the time expires.
- To see whether you have a job nearing your CPU-hour limit, type sessions or sessions -l. These commands show the total CPU time for each current session.
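For example, a minimal sketch of the typical workflow (the hours shown and the train.py script are arbitrary placeholders):
# Tell the system to keep this session and its jobs running:
keep-job 48
# Start the long-running program in the background (train.py is a placeholder):
nohup python3 train.py &
# Later, check how much CPU time each of your sessions has used:
sessions -l
# Renew the limit before it expires if the job needs more time:
keep-job 48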
3. MEMORY LIMIT On iLab Machines
The limits below are preset on the system and can't be adjusted by the end user. If you need a lot of memory, pick a machine with the most memory available to you. When you run low on memory, the Linux OOM killer will terminate your job automatically. Our script watches the log and notifies you when this happens, so you are aware of the issue with your code. Here are the current memory limits:
- on ilab*.cs.rutgers.edu, the maximum memory per user is 80GB. These machines use the tuned profile virtual-host to reduce swappiness. If you need more memory, use the Job Scheduler to request it; for example, if you need 120G of memory and no GPU: sbatch --mem=120g myJob (see the sketch below).
- on data*.cs.rutgers.edu and jupyter.cs.rutgers.edu, the maximum memory per user is 32GB. data*.cs.rutgers.edu use the tuned profile virtual-guest to reduce swappiness.
- on other servers and desktops, the maximum memory per user is 50% of physical memory. Desktops use the default tuned profile, balanced, so you could argue that a desktop would be slightly better.
Note: ilab* and data* have swap space on Solid State Drive.
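As a minimal sketch of the Job Scheduler route mentioned above (myJob and bigjob.py are placeholder names; adjust the resources to your needs):
#!/bin/bash
#SBATCH --mem=120g          # request 120G of memory, no GPU
#SBATCH --time=24:00:00     # requested wall-clock time (an assumption; set what your job needs)
python3 bigjob.py           # placeholder for your actual program
Save this as myJob and submit it with sbatch myJob, or override the memory on the command line as in sbatch --mem=120g myJob.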
The sessions command will show the amount of memory you're using. If you have more than one process or thread, this may be less than the sum of the usage of each process, because memory is often shared between processes.
If you want to see the amount of memory each process uses, a reasonable approximation is ps ux, in the RSS column. That value is in KB. However, RSS only shows what is currently in memory: if some of your processes have been swapped out, that portion won't be included. It is unusual for active jobs to be swapped out on most of our systems.
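For example (a sketch; the awk line just converts the RSS column from KB to GB for readability):
ps ux
# per-process resident memory in GB, followed by the command name:
ps ux | awk 'NR>1 {printf "%8.2f GB  %s\n", $6/1048576, $11}'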
Memory is set in different ways by different tools. Generally, it’s a parameter on the command line or a configuration option within the notebook. Please take a look at the documentation for the tool you’re using.
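As a hedged illustration only (whether and how this applies depends entirely on the tool you run; app.jar is a placeholder):
java -Xmx8g -jar app.jar    # cap the JVM heap at 8GB for a Java tool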
4. GPU LIMITS On iLab Server Cluster
As of summer 2022, most of our iLab server machines (iLab1-4, rLab1-6) are managed by the Slurm Job Scheduler, which has its own policy; on those machines you will not get access to any GPUs without Slurm. On non-Slurm-managed machines (iLabU and desktops), the maximum number of GPUs you can use is 4. On login, the nvidia-smi command will show the randomly assigned list of 4 GPUs available to you.
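For example (myscript.py is a placeholder; CUDA_VISIBLE_DEVICES is the standard environment variable most CUDA frameworks honor when selecting GPUs):
nvidia-smi                                   # list the GPUs assigned to this login
CUDA_VISIBLE_DEVICES=0 python3 myscript.py   # restrict your job to the first listed GPU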
On machines with Nvidia RTX A4000 GPUs, users are advised to turn on TF32 to take advantage of the new GPUs. The RTX A4000 also provides two FP32 primary data paths, doubling the peak FP32 throughput.
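A minimal sketch, assuming you are using PyTorch (the two allow_tf32 settings are standard PyTorch options you would place near the top of your own training script; here they are simply wrapped in a quick command-line check):
python3 - <<'EOF'
import torch
# Allow TF32 on Ampere GPUs (such as the RTX A4000) for matmuls and cuDNN convolutions
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
print("TF32 allowed for matmul:", torch.backends.cuda.matmul.allow_tf32)
EOF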
Best GPU Utilization:
For big and long jobs, we require GPU users to go through the Job Scheduler, which avoids the CPU and memory limits described in #2 and #3 above (see the sketches below). For small jobs, you should use the GPUs on iLab desktops.
Specific versions of Nvidia/CUDA/PyTorch
You may need to use a specific version of Conda, PyTorch, or other software different from what we have installed. For this, we recommend running a container using Singularity. If you must use Python, please see Using Python on CS Machines.
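Two minimal sketches (the script names, container image, and resource numbers are placeholders, and we assume GPUs are requested through Slurm's standard --gres option):
# Big job: submit through the Job Scheduler, requesting one GPU and extra memory
sbatch --gres=gpu:1 --mem=64g myJob
# Specific software stack: run inside a Singularity container with GPU access enabled (--nv)
singularity exec --nv docker://pytorch/pytorch:latest python3 myscript.py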
5. Storage quota Limit
Every home directory under /common/home has a disk quota set. However, users can do their work in other storage locations that have much larger quotas, or no quota at all. See the Storage and Technology options page for details on these storage options and their limits.
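To see how much space your home directory currently uses (a sketch; quota -s only reports limits where the standard quota tool is configured for the filesystem):
du -sh ~        # total size of your home directory
quota -s        # your quota and current usage, if available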
6. Blocklisting System: On ALL CS machines
When our machines detect abnormal activity, they may put the remote machine on a block list. This blocklist blocks any connection attempt from a listed machine. If you have an issue connecting to CS machines, check whether your IP address is blocked and how to get around the block.
7. Logging in with SSH Public Private Key
Public/private key login is convenient but has security implications: the convenience applies to attackers as well as users. For security reasons, we don't recommend it.
As of fall 2017, we moved to a Kerberized Network File System, which requires a Kerberos ticket to access your home directory and other network storage. Logging in using public/private keys will not get the user the Kerberos credential, so file access won’t work.
However, we allow users to log in between our machines without additional passwords. This can be very useful if you need SSH access to research machines on private IPs that are not reachable from outside Rutgers. To avoid additional logins, first log in to an iLab machine; once logged in, you can ssh to other research machines without entering another password.
You can also set up Kerberos authentication with your home machine to avoid multiple logins.
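For example (the hostnames and NetID are placeholders; the Kerberos realm shown is an assumption, run klist on an iLab machine to confirm yours):
# Log in to an iLab machine first:
ssh netid@ilab.cs.rutgers.edu
# From there, hop to a research machine on a private IP without another password:
ssh research1.cs.rutgers.edu
# On your own machine, obtaining a Kerberos ticket up front avoids repeated password prompts:
kinit netid@CS.RUTGERS.EDU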