SLURM batch software
The Science cn-cluster uses SLURM for batch management. The cluster consists of two parts, determined by the Ubuntu version; each part has its own login node. Currently we have:
Login node | Ubuntu version | Number of nodes |
---|---|---|
cnlogin20 | Ubuntu 20.04 | 22 |
cnlogin22 | Ubuntu 22.04 | 91 |
Typically you log in to the login node and use it to submit your jobs.
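For example, assuming the login nodes are reachable under the science.ru.nl domain (as the compute nodes mentioned later on this page are), logging in to the Ubuntu 22.04 part of the cluster would look like:
$ ssh yourusername@cnlogin22.science.ru.nl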
Terminology
term | meaning |
---|---|
node | A node is a single machine in the cluster. |
cluster | A cluster consists of a set of nodes. |
partition | A partition is a defined subset of nodes of the whole cluster. Besides being a subset, a partition can also limit a job's resources. For each partition, only members of certain Unix groups are allowed to run jobs on it. |
job | A job typically runs in a partition of the cluster. |
job step | A job step is a (possibly parallel) task within a job. |
Slurm cheat sheets
Partitions
Jobs always run within a partition.
The partitions for groups with their own nodes can only be used by members of these groups. These partitions usually have high priority, and jobs in them can run indefinitely (MaxTime=INFINITE). Jobs in these partitions will suspend jobs submitted to the "all" partition and the low-priority "heflowprio" partition.
PartitionName=tcm Nodes=cn../.. AllowGroups=tcm Priority=10
PartitionName=hef Nodes=cn../.. AllowGroups=ehef,thef Priority=10
PartitionName=heflowprio Nodes=../.. AllowGroups=ehef,thef Priority=1
PartitionName=milkun Nodes=cn../.. AllowGroups=milkun Priority=10
PartitionName=thchem Nodes=cn../.. AllowGroups=thchem Priority=10
To see a list of all partitions with their properties:
scontrol -a show partitions
To see a list of only the partitions you can use:
scontrol show partitions
To list a specific partition:
scontrol show partition cnczshort
Jobs in the cnczshort partition also get high priority, but they will be killed if they run for more than 12 hours.
PartitionName=cnczshort Nodes=cn13 MaxTime=12:00:00 Priority=10 Preemptmode=REQUEUE
There is also a cncz partition that may be used by all cluster users for jobs that run for less than a week; it has low priority:
PartitionName=cncz Nodes=cn13 Priority=2 MaxTime=7-00:00:00
There is also an "all" partition containing all nodes; it has the lowest priority, a maximum run time of 12 hours and a memory limit of 2 GB per CPU:
PartitionName=all MaxTime=12:00:00 MaxMemPerCPU=2048 Priority=1
Info
To prevent accidents, before using the all queue, please obtain permission from the other groups to use their nodes and contact C&CZ for access to the clusterall unix group. Use of this queue is restricted to members of the clusterall group.
It is wise to provide the partition as an option, either on the command line as -p partitionname or in the shell script by including a line:
#SBATCH --partition=partitionname
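For example, assuming a job script called my_job.sh, submitting it to the cncz partition from the command line would look like:
$ sbatch -p cncz my_job.sh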
Tutorial
Submitting your first job
To execute your job on the cluster you need to write it in the form of a shell script (don't worry, this is easier than it sounds). A shell script in its most basic form is just a list of commands that you would otherwise type on the command line. The first line of the script should contain an instruction telling the system which type of shell is supposed to execute it. Unless you really need something else, you should always use bash. So, without further ado, here is your first shell script:
#! /bin/bash
#SBATCH --partition=cnczshort
sleep 60
echo "Hello world!"
Type this into an editor and save it as hello.sh. To execute this on the login node, give it executable permissions (not needed when submitting):
$ chmod u+x hello.sh
and run it:
$ ./hello.sh
It will print "Hello world!" to stdout (the screen). If anything goes wrong, an error message will be sent to stderr, which in this case is also the screen.
To execute this script as a job on one of the compute nodes, we submit it to the cluster scheduler, Slurm. This is done with the sbatch command:
$ sbatch hello.sh
The scheduler will put your job in the specified partition and respond by giving you the job number (10 in this example):
Submitted batch job 10
Your job will now wait until a slot on one of the compute nodes becomes available. Then it will be executed and the output is written to a file slurm-10.out in your home directory, unless you specify otherwise as explained later.
Inspecting jobs
At any stage after submitting a job, while it is running, you may inspect its status with squeue:
$ squeue --job 10
It will print some information including its status which can be:
status | meaning |
---|---|
PD | pending |
R | running |
S | suspended |
It will also show the name of the job script, the user who submitted it and the time used so far. To see full information about your job, use the scontrol command:
$ scontrol show job 10
You can get a quick overview of how busy the cluster is with:
$ squeue
This lists all jobs sorted by the nodes they are running on.
To get detailed information on a specific Jobid:
$ scontrol show job -dd Jobid
You may use the scontrol command to get information about the compute nodes, e.g. the number of CPUs, the available memory and requestable resources such as GPUs, if available.
$ scontrol show nodes
You may use the sinfo command to get information on nodes and partitions:
$ sinfo -l
A useful overview of the state of all nodes usable by you can be achieved with:
$ sinfo -Ne -o "%.20n %.15C %.8O %.7t %25f" | uniq
A useful overview of the state of all nodes can be achieved with:
$ sinfo -a -Ne -o "%.20n %.15C %.8O %.7t %25f" | uniq
Finally there is the sall command, which gives a quick overview of all jobs on the cluster. It supports the -r, -s and -p flags for running, suspended and pending jobs respectively.
$ sall -r
Understanding job priority
The scheduler determines which job to start first when resources become available based on the job’s relative priority. This is a single floating point number calculated from various criteria.
Job priority is currently calculated based on the following criteria:
* queue time: jobs that are submitted earlier get a higher priority;
* the job's relative size: larger jobs requesting more resources such as nodes and CPUs get a higher priority because they are harder to schedule;
* fair share: the priority of jobs submitted by a user increases or decreases depending on the resources (CPU time) consumed in the last week.
With the sprio -w command you can view the current weights used in the priority computation.
Job priority formula:
Job_priority = (PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor)
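As a purely illustrative calculation (the weights and factors below are made up; use sprio -w to see the actual weights on this cluster): with PriorityWeightAge=1000, PriorityWeightFairshare=10000 and PriorityWeightJobSize=1000, a job with age_factor 0.5, fair-share_factor 0.2 and job_size_factor 0.1 gets
Job_priority = 1000 * 0.5 + 10000 * 0.2 + 1000 * 0.1 = 2600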
The sall command, when used in combination with the -p flag, lists the job priorities as well as an estimated start time.
Deleting jobs
If, after submitting a job, you discover that you made a mistake and want to delete your job, you may do so with:
$ scancel 10
Submitting multiple jobs
Sometimes you may be able to split your job into multiple independent tasks, for instance when you need to perform the same analysis on multiple data files. Splitting the work up into multiple tasks helps because they can then run simultaneously on different compute nodes. This could be accomplished by writing a loop that generates and submits job scripts, but an easier approach is to use array jobs.
When calling sbatch with the --array flag, an array job consisting of multiple tasks is submitted to the scheduler. The number of the current task is available to the script's execution environment as $SLURM_ARRAY_TASK_ID.
So, for example, the following test_array_job.sh script:
#! /bin/bash
#SBATCH --partition=tcm
echo This is task $SLURM_ARRAY_TASK_ID
can be submitted with:
$ sbatch --array 1-4 ./test_array_job.sh
The four tasks in this job will now be executed independently with the same script, the only difference being the value of $SLURM_ARRAY_TASK_ID.
This value can be used in the script to, for instance, select different input files for your data reduction, as in:
#! /bin/bash
my_command ~/input-$SLURM_ARRAY_TASK_ID
Do note, however, that the tasks will not necessarily be executed in order, so they really need to be independent!
Advanced topics
Limiting the number of simultaneous tasks
If you submit a large array job (containing more than, say, a hundred tasks) you may want to limit the number of simultaneous task executions. You may do this by appending a % followed by the limit to the array specification:
sbatch --array 1-1000%20 ./test_array_job.sh
This will ensure that at most 20 tasks run at the same time.
Setting scheduler parameters in the job script
Instead of entering all flags to sbatch on the command line, one can also choose to write them into the job script itself. For this, simply prefix them with #SBATCH. This is especially useful for the flags described in the following sections.
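For example, assuming a job script called my_job.sh, the command line invocation
$ sbatch --time=01:00:00 my_job.sh
is equivalent to running sbatch my_job.sh with the following line inside the script:
#SBATCH --time=01:00:00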
Changing the default logging location
By default, standard output and error are written to a file named after the job number (e.g. slurm-10.out) in your home directory. This can be changed with the --output and (optionally) --error flags.
#! /bin/bash
# Choose partition to run job in
#SBATCH --partition=hef
# Send output to test.log in your home directory
#SBATCH --output ~/test.log
Getting email updates about your job
Constantly watching squeue output to see when your job starts running and when it's done is not very efficient. Fortunately you can also request the scheduler to send you email in these situations.
#! /bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your mail address>
This will get you email when your job begins, ends or fails. If you only want email when it is done, you can use --mail-type=END.
Time limits
To allow the scheduler to efficiently distribute the available compute time, each job has a one hour time limit. If your job exceeds its limit (counted from the moment execution started on the compute node) it will be killed automatically. What if you know your job will need more than one hour to complete? In this case you can simply request more time with the --time flag.
So, for instance, if you expect to need 12 hours of compute time for your job, add the following to your script:
#! /bin/bash
#SBATCH --time=12:00:00
For multi-day walltime you can simply add an extra days field, as in --time=DD-HH:MM:SS.
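So, for example, a job that needs two and a half days could request:
#! /bin/bash
#SBATCH --time=2-12:00:00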
Memory
If your job needs an unusually large amount of memory to run, you should request it using:
#! /bin/bash
#SBATCH --mem=64G
This requests 64 GB. Please only use this if you know for certain that your job needs much more than 8 GB of memory!
Nodes and cores
If your code is parallelized, using threads, OpenMP or MPI for instance, you may want to request more CPU resources. This can be in the form of more than one core on a single machine/node, more than one machine, or a combination of both. These resources are requested with the -N flag combined with the -n flag as follows.
#! /bin/bash
#SBATCH -N 1 -n 8
Requests 8 cores on one node.
#! /bin/bash
#SBATCH -N 10 -n 16
Requests 16 cores in total, divided over 10 compute nodes. Note that this helps only if your code is parallelized! Even if you think it is, it may still be disabled due to compile-time selections, so always check, for instance by running the code briefly on the login node and inspecting its CPU usage with top: if it rises above 100%, it uses some kind of parallelization.
For MPI jobs you may only care about the total number of CPU cores used, not about whether they are located on one machine or distributed over the system. In that case you may use the -n flag by itself instead.
Requesting large numbers of cores and/or nodes reduces the probability that your job can be scheduled at each scheduling interval, thereby pushing its start time further into the future, so only request what you need!
Advanced node selection
Sometimes you may need more control over which nodes are used (for instance when your data is located on a specific node's /scratch). You can specify explicit node names with the -w flag (optionally multiple, comma-separated) as follows.
#! /bin/bash
#SBATCH -N 2 -n 4 -w cn90,cn91
This will request 4 cores on cn90 and cn91.
Examples
Here are a couple of example scripts (feel free to add your own). Unless otherwise stated, just save them as my_job.sh and submit them to the scheduler with:
$ sbatch my_job.sh
Run a Python script
This job script, when submitted, executes a Python script ~/my_script.py on a random available node.
#! /bin/bash
/usr/bin/python ~/my_script.py
Process a list of data files using an array job
Let's say filenames.txt is a text file containing 100 file names you wish to process, for instance created with:
ls *.dat > filenames.txt
You may now combine array jobs with clever use of the awk command to select the current filename from this file. Submit the following script with sbatch --array 1-100 my_job_script.sh.
#! /bin/bash
#
# Each task needs 1.5 hours of runtime
#SBATCH --time=01:30:00
#
INPUT_FILE=`awk "NR==$SLURM_ARRAY_TASK_ID" filenames.txt`
#
my_command $INPUT_FILE
Run a large data reduction job
This is an example of a job that needs a significant amount of resources on a single specific node, using cn90 in this example:
#! /bin/bash
#
# Request 10 CPU cores and 64GB of memory for 7 days on cn90
#SBATCH --partition=tcm
#SBATCH -N 1 -n 10
#SBATCH --mem=64GB
#SBATCH --time=7-00:00:00
#SBATCH -w cn90
#
# Get email updates
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@science.ru.nl
#
run_my_large_job
This assumes you have your data stored in /scratch on cn90.science.ru.nl. If your application is not I/O-limited, simply store your data on a network share or in your home directory (~) instead, and remove the -w line to allow the scheduler to pick a random node.
Request a GPU
To request a GPU on a node that has one, add the following to your job script:
# set the number of GPU cards to use per node
#SBATCH --gres=gpu:1
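A minimal job script requesting a GPU might look as follows; run_my_gpu_application is a placeholder for your own program, and you should submit to a partition whose nodes actually have a GPU (check scontrol show nodes):
#! /bin/bash
#
# Request 1 GPU card and 4 CPU cores on a single node
#SBATCH --gres=gpu:1
#SBATCH -N 1 -n 4
#
run_my_gpu_application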
Run an OpenMP job
OpenMP is frequently used for multithreaded parallel programming. To use OpenMP on the cluster, make sure your code is compiled using the -fopenmp flag. Any CPU cores requested will then be picked up automatically (i.e. there is no need to set the OMP_NUM_THREADS environment variable). In fact, if you do set it, it will probably lead to slower running code, since forcing the number of threads to be higher than the number of available CPU cores leads to overhead.
So the following example code for a parallel for loop:
#include <omp.h>

int main(int argc, char *argv[]) {
    const int N = 100000;
    int i, a[N];
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2 * i;
    return 0;
}
can be compiled and run on the cluster with the following job script:
#! /bin/bash
#
# Request 4 CPU cores for this OpenMP code
#SBATCH -N 1 -n 4
#
# First compile it
gcc -O2 -mtune=native -fopenmp my_code.c -o my_code
#
# and run
./my_code
Run an MPI job
The cluster scheduler has full built-in support for OpenMPI, therefore it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Instead, you can simply call mpirun from your job script and it will be aware of all nodes and cores requested and use them accordingly.
#! /bin/bash
#
# Request 32 processor cores randomly distributed over nodes
#SBATCH -n 32
#
# And 12 hours of runtime
#SBATCH --time=12:00:00
#
# Get email when it's done
#SBATCH --mail-type=END
#SBATCH --mail-user=<username>@science.ru.nl
#
mpirun my_mpi_application
Copy data to local scratch from external server
This copies data from an external location to local scratch storage using scp. If this takes a significant amount of time, or has to be done regularly, you may want to submit a job to the scheduler to do it overnight.
Generate SSH keys
Info
You only need to do this once!
The job script, when executed, cannot ask you for your password, so we need to set up access via a public/private key pair first. Create a new key pair with:
ssh-keygen -t ed25519
and hit Enter a couple of times. Do not enter a passphrase, or ssh will try to prompt the script for it (which will fail).
Exchange keys
Now append your public key to the ~/.ssh/authorized_keys file on the remote server:
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server.example.com
Test passwordless login
See if it worked:
ssh user@server.example.com
This should give you a shell on the other machine without having to supply your password.
Copy the data using a batch job
Now you may submit the following script to the scheduler:
#! /bin/bash
#
#SBATCH --partition=tcm
# Select the node
#SBATCH -w cn90
#
# Get email when it's done
#SBATCH --mail-type=END
#SBATCH --mail-user=<username>@science.ru.nl
scp -r user@server.example.com:/path/over/there /scratch/$USER/
Interactive jobs
If you need interactive access (a shell), you can request this from the scheduler with the following steps:
- Request an allocation of resources. For instance, 1 core for 2 hours on a random node:
salloc -c 1 --partition=tcm --time 2:00:00
- Attach a shell to this allocation:
srun --pty bash
You then get a prompt on an available node and can start working. This shell will be automatically closed after two hours.
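Alternatively, the allocation and the shell can be requested in a single step; a sketch using the same partition and time limit as above:
$ srun -c 1 --partition=tcm --time=2:00:00 --pty bash
Exiting the shell ends the job and releases the allocated resources.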