Why is my job pending? Accordion Closed
Most of the time, questions that we get about the cluster related to why jobs are not completing (in the pending or “PD” state). There are often many different contributing factors to this.
When jobs remain pending for long periods of time, you can see why by running the “squeue” command. The reason slurm has declared your job to be “pending” will be listed in parentheses under the (REASON) column. For example, if user abc123 wanting to see all jobs that were pending, he would enter
squeue -u abc123 -s PD
The following questions on this page explain what these different reasons are and how you can avoid them in the future.
Dependency Accordion Closed
This occurs when a job “depends on” another being completed before it is allowed to start. This will only happen if a dependency is specified inside the job script using the — dependency option.
DependencyNeverSatisfied Accordion Closed
This occurs when a job that must run first in a dependency fails. This usually means that there was an error of some kind that caused SLURM to cancel the job. When this happens, any jobs that depend on the failed job will never be allowed to run.
ReqNodeNotAvail Accordion Closed
This occurs when a requested node is either down, or there is currently a reservation in place for it. If you have requested a specific node in your job script using the “–nodelist” command, try removing this option in order to speed your job along. Slurm is very good at allocating resources, so it is often best to let it decide which nodes and processors to use in order to run your job.
To see if the node you requested is down, run
sinfo -N -l
AssocGrpCPURunMinsLimit/AssocGrpMemRunMinsLimit Accordion Closed
This occurs when either the associated account or QOS that you belong to is currently using all available CPU minutes or MEM minutes it has been allotted. Read more about Monsoon’s GrpCPURunMinsLimit.
Resources Accordion Closed
This occurs when any resources that you have requested are not currently available. This could refer to memory, processors, gpus, or nodes. For example, if you have requested the use of 20 processors and 10 nodes, and there is currently a high volume of jobs being run, it is likely that your job will remain in the pending state due to “Resources” for a long time.
Priority Accordion Closed
This occurs when there are jobs ahead of yours that have a higher priority. Not to worry, you can help alleviate this issue by updating your job time limit to a lower estimate (See the man page for the “scontrol” command). For example, if you have put “–time=1-00:00:00” in your script, Slurm will set your time limit to one day. If you know your job will not take longer than four hours, you can set this number accordingly and Slurm will give you a higher priority due to the lower time limit. This is because we have the backfill option turned on in Slurm which will enable small jobs (lower time limit) to fill the unused small time windows.
JobHeld/admin Accordion Closed
This occurs when your job is being held by the administrator for some reason. Email email@example.com to find out why.
Expedite your job’s start time Accordion Closed
Sometimes jobs will remain in the pending state indefinitely due to certain settings in the job script. Here is a list of things you can do to ensure that your job gets going:
- Give your script a lower time limit (if at all possible).
- If you know the job will take a long time, try breaking it up into several different scripts that can be run separately and with shorter time limits. Consider job arrays for this. Also, the dependency option allows you to be sure that one job runs before another.
- Don’t request specific nodes.
- Don’t request more memory than you need.
- Don’t request more cpus than you are launching tasks, or threads for.
- Avoid using the “–contiguous” or “–exclusive” options, as these will limit which nodes your job can run on.
Requesting more memory Accordion Closed
Requesting a specific amount of memory in your script can be done using the “–mem” option. This amount is in megabytes (1000 MB is 1 GB). If you wanted 150 GB of memory, just multiply that number by 1000 to get the amount in MB. The example below shows the line on your script that you would add if you wanted to allocate 150 gigabytes of memory:
Requesting more CPUs Accordion Closed
If you would like your job to use a certain number of cpus, you may request this using the “–cpus-per-task” option. For example, if you would like each task in your script (each instance of srun you have) to use 3 cpus, you would add the following line to the script:
Requesting the GPU Accordion Closed
Monsoon currently has two nodes designated for gpu-accelerated computing. These nodes contains two or more Nvidia Tesla K80s, which provides a total of 16 GPU cores. If you are running a job that can take advantage of these resources, you would add the “–gres” option to your script. For example, to request 2 GPU cores, add this line:
** Access to the gpus requires additional access. Please send an email to firstname.lastname@example.org with your request, in addition to the project, and the software that you hope to use them with.
GPU availability Accordion Closed
Monsoon currently has 16 gpu cores available to SLURM as resources. To see if there are any available run the following command:
squeue -h -t R -O gres | grep gpu
If there are cores in use, you will get one to several lines of output in the format gpu:tesla:. Adding all of the numbers displayed will give the total number of gpus currently in use. If the count adds up to 4, then all of the gpu cores are currently being used. If nothing is displayed, then all of the cores are available.
Job resources Accordion Closed
You can determine the different resources a job has used or requested using the “sacct -X” command. Running this command with no other options will list all of the jobs you have run today with some basic information about one. This information is not very helpful on its own, but you can choose to display information using the “-o” option, followed by a comma-separated list of the different characteristics you would like to display (See the sacct web page for a complete listing).
A very practical use of this command is to display the time limit, how long the job took, the number of cpus used, how much memory was requested, how much was actually used (this is called “maxrss”), and the state of the job. Here is what this command looks like (note that there are no spaces between the comma-separated items):
sacct -X -o
Creating an enterprise group Accordion Closed
Enterprise Groups are a method used by HPC/ARSA to manage who has access to specific folders on the cluster. In regards to project areas on Monsoon, it will allow you as the data manager for your project, to assign who on the cluster has access to your data repository. That way, you can add or remove access at your will and not have to wait for us to add and remove people for you. This assumes that people you add have a monsoon account.
Currently, Enterprise Groups are managed by end-users through the Directory Services tab at my.nau.edu.
Licensing Accordion Closed
Licenses can be requested like any other resource in your SLURM scripts using #SBATCH –licenses:<program>:<# of licenses>. Read the documentation on licensing for more information.
Managing and utilizing python environments on monsoon Accordion Closed
We utilize the Anaconda Python distribution on Monsoon. If you’d like to create a Python environment in your home to install your own packages outside of what is already provided by the distribution, do the following:
- module load anaconda
- conda create –name snowflakes biopython (where snowflakes is your environment name, and biopython is the package to be installed into your new environment)
- source activate snowflakes
With the environment activated, you can install packages locally into this environment via conda, or pip.
To utilize this environment in your jobs, you must call python directly from your environment as show below. “source activate snowflakes” will not work through Slurm.
module load anaconda
Parallelism Accordion Closed
Many programs on Monsoon are written to make use of parallelization, either in the form of threading, MPI, or both. To check what type of parallelization an application supports, you can look at the libraries it was compiled with. First, find where you app is located using the “which” command. Then run “ldd /path/to/my/app.” A list of libraries will be printed out. If you see “libpthread” or “libgomp”, then it is likely that your software is capable of multi-threading (shared memory). Please read our segment on shared-memory parallelism for more info on setting up your Slurm script.
If you see libmpi* listed, then it is likely that your software is capable of MPI (distributed memory). Please read our segment on distributed-memory parallelism for more info on setting up your Slurm job script.
Running X11 Jobs on Monsoon Accordion Closed
Some programs run on the cluster will be more convenient to run and dubug with a GUI. Due to this, the Monsoon cluster supports forwarding X11 windows over your ssh session.
Enable X-forwarding to your local machine is fairly straightforward.
When connecting to monsoon using SSH, add the
ssh email@example.com -Y
On PuTTY, in the left menu naviagate to
SSH->X11 and check
Enable X11 forwarding
After connecting to Monsoon, running the GUI program with
srun will create your window.
For some programs, such as matlab, in order to ensure the window appears, add the
--pty flag to
module load matlab srun --pty -t 5:00 --mem=2000 matlab # Start an interactive matlab session
Restoring Deleted Files Accordion Closed
What if I delete a file that I need?
Snapshot of all home user directories are taken twice a day. What this means is if you accidentally delete a file from your home directory that you need back, if it was created before a snapshot window, you will be able to restore that file. If you do an ls of the snapshot directory (ls /home/.snapshot), you may see something like the following:
@GMT-2019.02.09-05.00.01 @GMT-2019.02.10-18.00.01 @GMT-2019.02.12-05.00.01 latest @GMT-2019.02.09-18.00.02 @GMT-2019.02.11-05.00.01 @GMT-2019.02.12-18.00.01 @GMT-2019.02.10-05.00.01 @GMT-2019.02.11-18.00.01 @GMT-2019.02.13-05.00.01
Each of the directories starting with @GMT are a backup. if my userid is ricky I can check my files like the following:
ls -la /home/.snapshot/@GMT-2019.02.12-18.00.01/ricky
if I know I had my file before 6pm on the 12th. I would see all of the files that I had at the time of the snapshot. If there were a file called dask_test.py that I needed to restore from that snapshot, I could type in the following:
cp /home/.snapshot/@GMT-2019.02.12-18.00.01/ricky/dask_test.py /home/ricky
Your file would then be restored.