Limitations Enforced On CS Linux Machines
July 10, 2022 – Added Sessions and GPU limits and removed old methods.
Sep 3, 2022 – Added slurm exception
Apr 17, 2023 – Added requirement for Slurm on big and long jobs and notes on small jobs.
June 19, 2023 – Added info for users who need more than 80G of memory
Aug 23, 2023 – Added info on special requirements for specific versions of software
NOTE: Please consult this file before running big or long jobs. We may change these limits as we see fit from time to time.
The following limitations are enforced by default on the specified CS Linux machines. To learn how to overcome these limitations, read this page before running a big job.
If you are using the Scheduler for GPU Jobs (Slurm), the CPU and memory limits described here do not apply to you.
Here is a list of the limitations enforced on CS machines.
1. Lifetime of accounts and files
When users leave the University (or computer science), their accounts are closed. Files are archived. The files will be deleted after a year, except for faculty files. Shared directories will be deleted or archived based on the user that owns them.
Sometimes users will have a continuing association with the Department even after leaving. Accounts may be continued as guest or retiree accounts. Any faculty member can sponsor a guest, but retiree sponsorship is handled by Computer Science Department Human Resources.
2. Sessions and CPU limits on ALL CS machines
Many users of X2GO, Remote Desktop, screen, tmux, terminator, ssh, nohup, etc. would like to resume their sessions after disconnecting, or keep their programs running after they log out. There are also many users who would like to keep big jobs running for days. At the same time, due to the educational nature of our environment, we have many runaway programs that were not properly terminated and are wasting resources.
To manage resources efficiently on CS iLab machines, we have implemented a simple way to manage user sessions. With a single command, users can control CPU utilization and how long a session stays running without risk of being terminated by the system. The command is keep-job N, where N > 24 is the number of daytime hours you want your job to continue.
To control your session and CPU usage time, open a terminal window and type:
keep-job 30
where 30 means you get 30 hours to get back to your disconnected session. Note: to get to a terminal window in JupyterHub, you need to open a notebook of type Terminal.
Note on CPU hours:
- CPU hours means CPU usage time. A job gets a minimum of 24 hours of CPU usage time. If you set N below 24, the system won’t terminate the job until its CPU usage goes above 24 hours.
- If you are using 4 CPU cores, your process will run for only 6 hours before reaching the 24-hour CPU limit. After 24 CPU hours, your processes will be terminated unless you renew your time limit. Once you specify a limit, the system won’t interrupt any of your jobs until that limit has expired. The time starts when you invoke the keep-job command. If you need more time, just rerun the keep-job command before the time expires.
- To see whether you have a job that’s nearing your CPU hours, type sessions -l. This command shows the total CPU time for each current session.
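The CPU-hour arithmetic in the notes above can be sketched in a few lines of Python. This is only an illustration, assuming every core stays fully busy for the whole run; the function name wall_clock_hours is hypothetical and is not part of the keep-job tooling.

```python
def wall_clock_hours(cpu_hour_limit: float, busy_cores: int) -> float:
    """Estimate how long a job can run before it uses up its CPU hours.

    Assumes every core is 100% busy for the whole run, so CPU time
    accumulates busy_cores times faster than wall-clock time.
    """
    if busy_cores < 1:
        raise ValueError("busy_cores must be at least 1")
    return cpu_hour_limit / busy_cores


# The example from the text: 4 busy cores against the 24 CPU-hour minimum.
print(wall_clock_hours(24, 4))  # 6.0
```

A job that keeps its cores only partly busy accumulates CPU time more slowly, so treat the result as a lower bound on the run time.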
3. MEMORY LIMIT On iLab Machines
The memory limits below are preset on the system and can’t be adjusted by end users. If your memory needs are high, make sure you pick the machine with the most memory available to you. When you run low on memory, the Linux OOM killer will terminate your job automatically. We have a script that watches the log and notifies you when this happens, so you are aware of the issue with your code. Here are the current details of the memory limits:
- on ilab*.cs.rutgers.edu, maximum memory per user is 80GB.
ilab*.cs.rutgers.edu use the tuned profile virtual-host to reduce swappiness. If you need more memory, use the Job Scheduler, where you can request more memory. For example, if you need 120G of memory and no GPU:
sbatch --mem=120g myJob
- on data*.cs.rutgers.edu and jupyter.cs.rutgers.edu, maximum memory per user is 32GB. data*.cs.rutgers.edu use the tuned profile virtual-guest to reduce swappiness.
- on other servers or desktops, maximum memory per user is 50% of physical memory. Desktops have the default tuned profile, which is balanced. You could argue that the desktop profile would be slightly better.
Note: ilab* and data* both have swap space on solid-state drives.
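The per-host limits listed above can be captured as a small lookup, which is handy when a script has to decide whether a job fits on a machine or should go to the Job Scheduler. The sketch below is illustrative only: the helper names memory_limit_gb and sbatch_command are hypothetical, the limits are copied from this page and may change, and the only sbatch option used is --mem, as in the example above.

```python
from fnmatch import fnmatch

# Per-user memory caps in GB, copied from the list above (may change).
LIMITS_GB = [
    ("ilab*.cs.rutgers.edu", 80),
    ("data*.cs.rutgers.edu", 32),
    ("jupyter.cs.rutgers.edu", 32),
]


def memory_limit_gb(host, physical_gb=None):
    """Return the per-user memory cap in GB for a host name."""
    for pattern, limit in LIMITS_GB:
        if fnmatch(host, pattern):
            return limit
    # Other servers and desktops: 50% of physical memory.
    if physical_gb is None:
        raise ValueError("physical_gb is required for hosts without a fixed cap")
    return physical_gb * 0.5


def sbatch_command(required_gb, script="myJob"):
    """Build the sbatch argument list for a job that needs required_gb of memory."""
    return ["sbatch", f"--mem={required_gb}g", script]


need_gb = 120
if need_gb > memory_limit_gb("ilab1.cs.rutgers.edu"):
    # Too big for an interactive session: print the scheduler command instead.
    print(" ".join(sbatch_command(need_gb)))
```

The script name myJob is the placeholder used in the text; substitute your own batch script.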
The sessions command shows the amount of memory you’re using. If you have more than one process or thread, this may be less than the sum of the usage of each process, because memory is often shared between processes.
If you want to see the amount of memory used by each process, a reasonable approximation is the RSS column of ps ux. That value is in KB. However, RSS shows only what is in memory. If part of your process has been swapped out, it won’t be included. On most of our systems it’s unusual for active jobs to be swapped out.
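To read the RSS column programmatically, a short Python sketch along these lines may help. It assumes the usual Linux (procps) column order of ps ux, where RSS is the sixth column and the command is the eleventh; the function names and the sample output are made up for illustration.

```python
import subprocess


def parse_ps_ux(text):
    """Return (pid, rss_kb, command) for each process line of `ps ux` output."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 10)   # the 11th field is the full command
        if len(fields) < 11:
            continue
        rows.append((int(fields[1]), int(fields[5]), fields[10]))
    return rows


def my_processes():
    """Run `ps ux` for the current user and parse its output."""
    out = subprocess.run(["ps", "ux"], capture_output=True, text=True, check=True)
    return parse_ps_ux(out.stdout)


# Made-up sample output, so the parser can be tried without running ps.
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
alice     1234  0.0  0.1  10000  2048 pts/0    Ss   10:00   0:00 -bash
alice     5678 99.0  2.0 500000 1048576 pts/0  R+   10:01   5:00 python train.py
"""

for pid, rss_kb, command in parse_ps_ux(SAMPLE):
    print(f"{pid:>7} {rss_kb / 1024:10.1f} MB  {command}")
```

On a real machine, call my_processes() instead of parsing SAMPLE. Summing the RSS values tends to overstate your total, since shared memory is counted once per process.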
Memory is set in different ways by different tools. Generally it’s a parameter on the command line, or a configuration option within the notebook. See documentation for the tool you’re using.
4. GPU LIMITS On iLab Server Cluster
As of Summer 2022, most of our iLab server machines (iLab1-4, rLab1-4) are under the Slurm Job Scheduler, which has its own policy. On these machines you will not get access to any GPUs without Slurm. On machines not managed by Slurm (iLabU and desktops), the maximum number of GPUs you can use is 4.
nvidia-smi will list the 4 GPUs randomly assigned to you at login.
On machines with Nvidia RTX A4000, users are advised to turn on TF32 to take advantage of the new GPUs. The RTX A4000 enables two FP32 primary data paths, doubling the peak FP32 operations.
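How TF32 is turned on depends on the framework. As one possibility, assuming your job uses PyTorch (this page does not prescribe a framework), the two PyTorch flags below enable TF32 for matrix multiplications and for cuDNN convolutions. The wrapper function enable_tf32 is a hypothetical helper; it returns False instead of failing when PyTorch is not installed.

```python
def enable_tf32() -> bool:
    """Turn on TF32 for matmuls and cuDNN convolutions if PyTorch is present.

    Returns True when the flags were set, False when PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return False
    # Both flags are PyTorch settings; they only take effect on GPUs
    # that support TF32 (Ampere and newer).
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    return True


if enable_tf32():
    print("TF32 enabled for matmul and cuDNN")
else:
    print("PyTorch not found; nothing changed")
```

TF32 trades some floating-point precision for speed, so check that your results are still acceptable after enabling it.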
Best GPU Utilization:
For big and long jobs, we require GPU users to use the Job Scheduler, which avoids the CPU and memory limits described in #2 and #3 above.
Specific version of Nvidia/Cuda/PyTorch
You may need a specific version of Conda, PyTorch, or other software that is different from what we have installed. For this we recommend using a Singularity container.
For small jobs, you should utilize GPUs on iLab desktops.
5. Storage quota Limit
Every home directory (/common/home) has a disk quota set. However, there are other disk spaces with much bigger quotas, as well as storage with no quota, that users can use for their work. For details on these storage options and limits, see the Storage and Technology options page.
6. Blacklisting System: On ALL CS machines
When our machines detect abnormal activity, they may put the remote machine on a blacklist. The blacklist blocks any connection attempt from a listed machine. If you have trouble connecting to CS machines, check whether your IP is blocked and how to get around the block.
7. Logging in with SSH Public Private Key
Public/private key login is a convenience with security implications. The convenience applies to both users and attackers. For security reasons, we don’t recommend it.
As of fall 2017, we moved to a kerberized Network File System, which requires a Kerberos ticket to access your home directory and other network storage. Logging in with public/private keys does not give the user Kerberos credentials, so file access won’t work.
We do, however, allow users to log in between our machines without an additional password. This can be very useful if you need to access research machines that are on private IPs and not accessible from outside Rutgers via SSH. To avoid an additional login, first log in to an iLab machine. Once logged in, you can ssh to other research machines without an additional password.
Additionally, you can set up Kerberos authentication with your home machine if you want to avoid multiple logins.