NOTE: Please consult this file before running big or long jobs. We may change these limits as we see fit from time to time.
-Last Modified Oct 8, 2020 – added GPU selection option.
The following limitations are enforced on specified CS Linux machines by default. To learn how to overcome these limitations, please consult this page before running a big job.
Here is a list of the limitations enforced on CS machines.
- X2GO and Remote Desktop Limits: On ALL CS Machines
- MEMORY LIMIT: On iLab machines
- GPU LIMITS: On iLab Server cluster
- CPU LIMITS: On ALL iLab and Hadoop machines
- Storage quota Limit
- Blacklisting System: On ALL CS machines
- Logging in with SSH Public Private Key
1. X2GO and Remote Desktop Limits: On ALL CS Machines
X2GO and Remote Desktop clients allow users to resume their work after disconnecting from a session. Unfortunately, many users never return to their sessions, which wastes resources for long periods of time. Please log out after using the machine. If you don’t log out, by default your X2Go or Remote Desktop session will be terminated and all unsaved work will be lost.
If you would like to resume the same session at a later time, you have to opt in by opening a terminal window and typing:
in bash :
echo 5 > /var/run/user/$UID/KeepSession
in tcsh :
echo 5 > /var/run/user/$uid/KeepSession
The number 5 means you get 5 hours to return to the session. The maximum time you can set is 36 hours. To extend the time, run the command again. The time starts from when you run the command.
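As a convenience, the opt-in can be wrapped in a small shell function that checks the value before writing it. This is only a sketch: `keep_session` is a hypothetical helper name, not a command installed on the CS machines, and rejecting values outside 1 to 36 is an assumption made here because this page does not say how the system treats out-of-range values.

```shell
# keep_session HOURS [DIR]
# Writes HOURS to DIR/KeepSession. DIR defaults to /var/run/user/<your uid>.
# Hypothetical helper: refuses anything that is not a whole number from 1 to 36.
keep_session() {
    local hours="$1"
    local dir="${2:-/var/run/user/${UID:-$(id -u)}}"
    case "$hours" in
        ''|*[!0-9]*)
            echo "usage: keep_session HOURS (1-36)" >&2
            return 1
            ;;
    esac
    if [ "$hours" -lt 1 ] || [ "$hours" -gt 36 ]; then
        echo "hours must be between 1 and 36" >&2
        return 1
    fi
    echo "$hours" > "$dir/KeepSession"
}
```

With the function defined in your shell, `keep_session 5` has the same effect as the bash command above.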
2. MEMORY LIMIT: On iLab machines
The memory limits below are preset on the system and can’t be adjusted by end users. If your memory needs are high, make sure you pick the machine with the most memory available to you. When you run low on memory, the Linux OOM killer will terminate your job automatically. We have a script that watches the log and notifies you when this happens, so you are aware of the issue with your code. Here are the current memory limits:
- on ilab*.cs.rutgers.edu, maximum memory per user is 80GB.
ilab*.cs.rutgers.edu have the tuned profile virtual-host to reduce swappiness.
- on data*.cs.rutgers.edu and jupyter.cs.rutgers.edu (Hadoop cluster), maximum memory per user is 32GB. data*.cs.rutgers.edu have the tuned profile virtual-guest to reduce swappiness.
- on aurora.cs.rutgers.edu, maximum memory per user is 480GB.
- on other servers or desktops, maximum memory per user is 50% of physical memory. Desktops have the default tuned profile, which is balanced. You could argue that the desktop profile would be slightly better.
Note: ilab* and data* both have swap space on solid-state drives.
The sessions command shows the amount of memory you’re using. If you have more than one process or thread, this may be less than the sum of the usage of each process, because memory is often shared between processes.
If you want to see the amount of memory used by each process, a reasonable approximation is the RSS column of ps ux, which is in KB. However, RSS shows only what is in memory; if part of a process has been swapped out, it won’t be included. On most of our systems it’s unusual for active jobs to be swapped out.
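To get a rough total across all of your processes, the RSS column can be summed with awk. This is a minimal sketch: `total_rss_mb` is a hypothetical helper name, and it assumes the standard `ps ux` layout in which RSS is the sixth column. Because shared memory is counted once per process, the result can be higher than what the sessions command reports.

```shell
# total_rss_mb: read `ps ux` output on stdin and print the summed RSS in MB.
# Skips the header line; RSS is column 6 and is reported in KB.
total_rss_mb() {
    awk 'NR > 1 { kb += $6 } END { printf "%.1f\n", kb / 1024 }'
}

# Typical use:
#   ps ux | total_rss_mb
```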
Note to Hadoop users: The Hadoop tools all have options to set the amount of memory used by a job. The default size is 1 GB, though we’ve increased it in Zeppelin to 4 GB. If you run out of memory within Hadoop, it is most likely that you haven’t allocated enough memory within Hadoop, rather than that you’ve reached the per-user memory limit we set. Memory is set in different ways by different tools. Generally it’s a parameter on the command line, or a configuration option within the notebook. See the documentation for the tool you’re using.
3. GPU LIMITS: On iLab Server cluster
On machines with 8 GPUs, the maximum number of GPUs you can use is 4.
On login you are randomly assigned 4 GPUs, which nvidia-smi will list. This selection may include GPUs that are already busy, which can prevent you from running your job. The pickgpu command lists ALL GPUs so you can tell which are busy and which are unused. To pick specific GPUs, for example GPUs 0, 1, 6 and 7, run it as follows:
pickgpu 0 1 6 7
4. CPU LIMITS: On ALL iLab and Hadoop machines
Because jobs sometimes run away, continuing to use computer time without limit, we limit the amount of time your jobs can use, unless you specify that you are intentionally running a long job.
You only need to specify a time limit for jobs that have used more than 24 hours of CPU time for all processes in one session. A session is roughly anything you start from a single login. If you log out and log in again, you’ll have a new session.
That means if you are using 24 cores, your processes will reach the 24-hour CPU limit after only 1 hour of running. After 24 CPU hours, your processes will be terminated unless you have specified a time limit. Once you specify a limit, we won’t interrupt any of your jobs until that limit has expired.
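The arithmetic generalizes to any core count: wall-clock hours until the threshold equals 24 divided by the number of fully busy cores. The function below is an illustrative sketch only; `wall_hours_until_limit` is a hypothetical name, and it assumes every core stays fully busy for the whole run.

```shell
# wall_hours_until_limit CORES [CPU_HOURS]
# Prints the wall-clock hours until CPU_HOURS (default 24) of CPU time
# have been used, assuming all CORES are fully busy the whole time.
wall_hours_until_limit() {
    awk -v cores="$1" -v limit="${2:-24}" \
        'BEGIN { if (cores <= 0) exit 1; printf "%.2f\n", limit / cores }'
}
```

For example, `wall_hours_until_limit 24` prints 1.00 and `wall_hours_until_limit 4` prints 6.00.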
To see whether you have a job that’s nearing 24 hours, type sessions -l. This command shows the total CPU time for each current session. If it looks like any session will go over 24 hours, you need to set a time limit, as described below.
If you expect any session to use more than 24 hours of CPU time, you can declare a time limit. The system won’t terminate the job unless it goes over the limit you have declared. To do that, open a terminal window and type one of the commands below.
Note for Hadoop users: To get to a terminal window in JupyterHub, you need to open a notebook of type Terminal. In the Zeppelin notebook, you need to use a paragraph starting with %sh and type the commands in the paragraph; when you run the paragraph you’ll see the output.
in bash :
echo 48 > /run/user/$UID/LongjobLimit
in tcsh :
echo 48 > /run/user/$uid/LongjobLimit
The number 48 means your jobs will continue up to 48 clock hours. The maximum time you can set is 80 clock hours. (This was chosen to be a bit over 3 days.) Values over 80 are treated as 80. If a job is going to run longer than 3 days, you’ll need to redo this every 3 days. Important: The time runs from when the file was created or updated.
(Yes, the 24 hour threshold is *CPU* time and the limit in /run/user/$UID/LongjobLimit is *wall clock* time.)
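Since the limit counts wall-clock time from when the file was last written, it can help to check how much time is left before renewing it. The sketch below rests on assumptions: `longjob_remaining_secs` is a hypothetical helper, it takes the file's modification time as the start of the clock, and it relies on GNU `stat -c %Y` as found on Linux. How the system itself tracks the time is not documented here.

```shell
# longjob_remaining_secs [FILE]
# Prints the seconds left before the declared limit expires, measured
# from FILE's modification time. Values over 80 are counted as 80.
# FILE defaults to /run/user/<your uid>/LongjobLimit.
longjob_remaining_secs() {
    local file="${1:-/run/user/${UID:-$(id -u)}/LongjobLimit}"
    local hours mtime now
    hours=$(cat "$file") || return 1
    case "$hours" in
        ''|*[!0-9]*) echo "unexpected content in $file" >&2; return 1 ;;
    esac
    if [ "$hours" -gt 80 ]; then
        hours=80
    fi
    mtime=$(stat -c %Y "$file")   # last write, seconds since the epoch
    now=$(date +%s)
    echo $(( mtime + hours * 3600 - now ))
}
```

A negative result would mean the declared limit has already passed.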
5. Storage quota Limit
Every home directory has a disk quota of at least 6GB. However, there is other disk space that users can use for their work, some with a 100GB quota and some with no quota. For details on these storage options, see the Storage and Technology options page.
6. Blacklisting System: On ALL CS machines
When our machines detect abnormal activity, they may put remote machines on a blacklist. The blacklist blocks any connection attempt from a listed machine. If you have trouble connecting to CS machines, check whether your IP is blocked and how to get around the block.
7. Logging in with SSH Public Private Key
Due to security issues, we do not allow users to log in to our machines using a public/private key. We do, however, allow users to log in between our machines without an additional password. This can be very useful if you need to access research machines that are on private IPs and not accessible from outside Rutgers via SSH. To avoid additional logins, first log in to an iLab machine. Once logged in, you can ssh to other research machines without an additional password.
Additionally, you can set up Kerberos authentication on your home machine if you want to avoid multiple logins.