SLURM batch software
The Science cn-cluster uses SLURM for batch management. The cluster consists of two parts, determined by the Ubuntu version; each part has its own login node. Currently we have:
Login node | Ubuntu version | Number of nodes |
---|---|---|
cnlogin20 | Ubuntu 20.04 | 22 |
cnlogin22 | Ubuntu 22.04 | 91 |
Typically you log in to the login node and use it to submit your jobs.
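For example, assuming the login nodes are reachable under the science.ru.nl domain (as the compute nodes mentioned later on this page are), logging in to the Ubuntu 22.04 part of the cluster would look like:
$ ssh yourusername@cnlogin22.science.ru.nl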
Terminology
term | meaning |
---|---|
node | A node is a single machine in the cluster. |
cluster | A cluster consists of a set of nodes. |
partition | A partition is a defined subset of nodes of the whole cluster. Besides being a subset, a partition can also limit a job's resources. For each partition, only members of certain Unix groups are allowed to run jobs on it. |
job | A job typically runs in a partition of the cluster. |
job step | A job step is a (possibly parallel) task within a job. |
Slurm cheat sheets
Partitions
Jobs always run within a partition.
The partitions for groups with their own nodes can only be used by members of these groups. These partitions usually have high priority, and jobs in them can run indefinitely (MaxTime=INFINITE). Jobs in these partitions will suspend jobs submitted to the "all" partition and the low-priority "heflowprio" partition.
PartitionName=tcm Nodes=cn../.. AllowGroups=tcm Priority=10
PartitionName=hef Nodes=cn../.. AllowGroups=ehef,thef Priority=10
PartitionName=heflowprio Nodes=../.. AllowGroups=ehef,thef Priority=1
PartitionName=milkun Nodes=cn../.. AllowGroups=milkun Priority=10
PartitionName=thchem Nodes=cn../.. AllowGroups=thchem Priority=10
To see a list of all partitions with their properties:
scontrol -a show partitions
To see a list of only the partitions you can use:
scontrol show partitions
To list a specific partition:
scontrol show partition cnczshort
Jobs in the cnczshort partition also get high priority, but they will be killed if they run for more than 12 hours.
PartitionName=cnczshort Nodes=cn13 MaxTime=12:00:00 Priority=10 Preemptmode=REQUEUE
There is also a cncz partition that may be used by all cluster users for jobs that run for less than a week; it has low priority:
PartitionName=cncz Nodes=cn13 Priority=2 MaxTime=7-00:00:00
There is also an "all" partition containing all nodes; it has the lowest priority, a maximum run time of 12 hours and a memory limit of 2 GB per CPU:
PartitionName=all MaxTime=12:00:00 MaxMemPerCPU=2048 Priority=1
Info
To prevent accidents, before using the all queue, please obtain permission from the other groups to use their nodes and contact C&CZ for access to the clusterall unix group. Use of this queue is restricted to members of the clusterall group.
It is wise to provide the partition as an option, either on the command line as -p partitionname or in the shell script by including a line:
#SBATCH --partition=partitionname
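For example, assuming a job script called my_job.sh, submitting it to the cncz partition from the command line would look like:
$ sbatch -p cncz my_job.sh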
Tutorial
Submitting your first job
To execute your job on the cluster you need to write it in the form of a shell script (don't worry, this is easier than it sounds). A shell script in its most basic form is just a list of commands that you would otherwise type on the command line. The first line of the script should contain an instruction telling the system which type of shell is supposed to execute it. Unless you really need something else, you should always use bash. So, without further ado, here is your first shell script:
#! /bin/bash
#SBATCH --partition=cnczshort
sleep 60
echo "Hello world!"
Type this into an editor and save it as hello.sh. To execute this on the login node, give it executable permissions (not needed when submitting):
$ chmod u+x hello.sh
and run it:
$ ./hello.sh
It will print "Hello world!" to stdout (the screen). If anything goes wrong, an error message will be sent to stderr, which in this case is also the screen.
To execute this script as a job on one of the compute nodes, we submit it to the cluster scheduler, Slurm. This is done with the sbatch command:
$ sbatch hello.sh
The scheduler will put your job in the specified partition and respond by giving you the job number (10 in this example):
Submitted batch job 10
Your job will now wait until a slot on one of the compute nodes becomes available. Then it will be executed and the output is written to a file slurm-10.out in your home directory, unless you specify otherwise as explained later.
Inspecting jobs
At any stage after submitting a job, while it is running, you may inspect its status with squeue:
$ squeue --job 10
It will print some information including its status which can be:
status | meaning |
---|---|
PD | pending |
R | running |
S | suspended |
It will also show the name of the job script, the user who submitted it and the time used so far. To see full information about your job, use the scontrol command:
$ scontrol show job 10
You can get a quick overview of how busy the cluster is with:
$ squeue
This lists all jobs sorted by the nodes they are running on.
To get detailed information on a specific Jobid:
$ scontrol show job -dd Jobid
You may use the scontrol command to get information about the compute nodes, e.g. the number of CPUs, the available memory and requestable resources such as GPUs, if available.
$ scontrol show nodes
You may use the sinfo command to get information on nodes and partitions:
$ sinfo -l
A useful overview of the state of all nodes usable by you can be achieved with:
$ sinfo -Ne -o "%.20n %.15C %.8O %.7t %25f" | uniq
A useful overview of the state of all nodes can be achieved with:
$ sinfo -a -Ne -o "%.20n %.15C %.8O %.7t %25f" | uniq
Finally there is the sall command, which gives a quick overview of all jobs on the cluster. It supports the -r, -s and -p flags for running, suspended and pending jobs respectively.
$ sall -r
Understanding job priority
The scheduler determines which job to start first when resources become available based on the job’s relative priority. This is a single floating point number calculated from various criteria.
Job priority is currently calculated based on the following criteria:
* queue time: jobs that are submitted earlier get a higher priority;
* the job's relative size: larger jobs requesting more resources such as nodes and CPUs get a higher priority because they are harder to schedule;
* fair share: the priority of jobs submitted by a user increases or decreases depending on the resources (CPU time) consumed in the last week.
With the sprio -w command you can view the current weights used in the priority computation.
Job priority formula:
Job_priority = (PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor)
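As a purely illustrative calculation (the weights and factors below are made up; use sprio -w to see the actual weights on this cluster): with PriorityWeightAge=1000, PriorityWeightFairshare=10000 and PriorityWeightJobSize=1000, a job with age_factor 0.5, fair-share_factor 0.2 and job_size_factor 0.1 gets
Job_priority = 1000 * 0.5 + 10000 * 0.2 + 1000 * 0.1 = 2600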
The sall command, when used in combination with the -p flag, lists the job priorities as well as an estimated start time.
Deleting jobs
If, after submitting a job, you discover that you made a mistake and want to delete your job, you may do so with:
$ scancel 10
Submitting multiple jobs
Sometimes you may be able to split your job into multiple independent tasks, for instance when you need to perform the same analysis on multiple data files. Splitting the work up into multiple tasks helps because they can then run simultaneously on different compute nodes. This could be accomplished by writing a loop that generates and submits job scripts, but an easier approach is to use array jobs.
When calling sbatch with the --array flag, an array job consisting of multiple tasks is submitted to the scheduler. The number of the current task is available to the script's execution environment as $SLURM_ARRAY_TASK_ID.
So, for example, the following test_array_job.sh script:
#! /bin/bash
#SBATCH --partition=tcm
echo This is task $SLURM_ARRAY_TASK_ID
can be submitted with:
$ sbatch --array 1-4 ./test_array_job.sh
The four tasks in this job will now be executed independently with the same script, the only difference being the value of $SLURM_ARRAY_TASK_ID.
This value can be used in the script to, for instance, select different input files for your data reduction, as in:
#! /bin/bash
my_command ~/input-$SLURM_ARRAY_TASK_ID
Do note, however, that the tasks will not necessarily be executed in order, so they really need to be independent!
Advanced topics
Limiting the number of simultaneous tasks
If you submit a large array job (containing more than, say, a hundred tasks) you may want to limit the number of simultaneous task executions. You may do this by appending a % followed by the limit to the array specification:
sbatch --array 1-1000%20 ./test_array_job.sh
This will ensure that at most 20 tasks run at the same time.
Setting scheduler parameters in the job script
Instead of entering all flags to sbatch on the command line, one can also choose to write them into the job script itself. For this, simply prefix them with #SBATCH. This is especially useful for the flags described in the following sections.
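For example, assuming a job script called my_job.sh, the command line invocation
$ sbatch --time=01:00:00 my_job.sh
is equivalent to running sbatch my_job.sh with the following line inside the script:
#SBATCH --time=01:00:00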
Changing the default logging location
By default, standard output and error are written to a file named after the job number (e.g. slurm-10.out) in your home directory. This can be changed with the --output and (optionally) --error flags.
#! /bin/bash
# Choose partition to run job in
#SBATCH --partition=hef
# Send output to test.log in your home directory
#SBATCH --output ~/test.log
Getting email updates about your job
Constantly watching squeue output to see when your job starts running and when it's done is not very efficient. Fortunately you can also request the scheduler to send you email in these situations.
#! /bin/bash
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your mail address>
This will get you email when your job begins, ends or fails. If you only want email when it is done, you can use --mail-type=END.
Time limits
To allow the scheduler to efficiently distribute the available compute time, each job has a one hour time limit. If your job exceeds its limit (counted from the moment execution started on the compute node) it will be killed automatically. What if you know your job will need more than one hour to complete? In this case you can simply request more time with the --time flag.
So, for instance, if you expect to need 12 hours of compute time for your job, add the following to your script:
#! /bin/bash
#SBATCH --time=12:00:00
For multi-day walltime you can simply add an extra days field, as in --time=DD-HH:MM:SS.
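So, for example, a job that needs two and a half days could request:
#! /bin/bash
#SBATCH --time=2-12:00:00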
Memory
If your job needs an unusually large amount of memory to run, you should request it using:
#! /bin/bash
#SBATCH --mem=64G
This requests 64 GB. Please only use this if you know for certain that your job needs much more than 8 GB of memory!
Nodes and cores
If your code is parallelized, using threads, OpenMP or MPI for instance, you may want to request more CPU resources. This can be in the form of more than one core on a single machine/node, more than one machine, or a combination of both. These resources are requested with the -N flag combined with the -n flag as follows.
#! /bin/bash
#SBATCH -N 1 -n 8
Requests 8 cores on one node.
#! /bin/bash
#SBATCH -N 10 -n 16
Requests 16 cores in total, divided over 10 compute nodes. Note that this helps only if your code is parallelized! Even if you think it is, it may still be disabled due to compile-time selections, so always check, for instance by running the code briefly on the login node and inspecting its CPU usage with top: if it rises above 100%, it uses some kind of parallelization.
For MPI jobs you may only care about the total number of CPU cores used, not about whether they are located on one machine or distributed over the system. In that case you may use the -n flag by itself instead.
Requesting large numbers of cores and/or nodes reduces the probability that your job can be scheduled at each scheduling interval, thereby pushing its start time further into the future, so only request what you need!
Advanced node selection
Sometimes you may need more control over which nodes are used (for instance when your data is located on a specific node's /scratch). You can specify explicit node names with the -w flag (optionally multiple, comma-separated) as follows.
#! /bin/bash
#SBATCH -N 2 -n 4 -w cn90,cn91
This will request 4 cores on cn90 and cn91.
Examples
Here are a couple of example scripts (feel free to add your own). Unless otherwise stated, just save them as my_job.sh and submit them to the scheduler with:
$ sbatch my_job.sh
Run a Python script
This job script, when submitted, executes a Python script ~/my_script.py on a random available node.
#! /bin/bash
/usr/bin/python ~/my_script.py
Process a list of data files using an array job
Let's say filenames.txt is a text file containing 100 file names you wish to process, for instance created with:
ls *.dat > filenames.txt
You may now combine array jobs with clever use of the awk command to select the current filename from this file. Submit the following script with sbatch --array 1-100 my_job_script.sh.
#! /bin/bash
#
# Each task needs 1.5 hours of runtime
#SBATCH --time=01:30:00
#
INPUT_FILE=`awk "NR==$SLURM_ARRAY_TASK_ID" filenames.txt`
#
my_command $INPUT_FILE
Run a large data reduction job
This is an example of a job that needs a significant amount of resources on a single specific node, using cn90 in this example:
#! /bin/bash
#
# Request 10 CPU cores and 64GB of memory for 7 days on cn90
#SBATCH --partition=tcm
#SBATCH -N 1 -n 10
#SBATCH --mem=64GB
#SBATCH --time=7-00:00:00
#SBATCH -w cn90
#
# Get email updates
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<username>@science.ru.nl
#
run_my_large_job
This assumes you have your data stored in /scratch on cn90.science.ru.nl. If your application is not I/O-limited, simply store your data on a network share or in your home directory (~) instead, and remove the -w line to allow the scheduler to pick a random node.
Request a GPU
To request a GPU on a node that has one, add the following to your job script:
# set the number of GPU cards to use per node
#SBATCH --gres=gpu:1
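A minimal job script requesting a GPU might look as follows; run_my_gpu_application is a placeholder for your own program, and you should submit to a partition whose nodes actually have a GPU (check scontrol show nodes):
#! /bin/bash
#
# Request 1 GPU card and 4 CPU cores on a single node
#SBATCH --gres=gpu:1
#SBATCH -N 1 -n 4
#
run_my_gpu_application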
Run an OpenMP job
OpenMP is frequently used for multithreaded parallel programming. To use OpenMP on the cluster, make sure your code is compiled using the -fopenmp flag. Any CPU cores requested will then be picked up automatically (i.e. there is no need to set the OMP_NUM_THREADS environment variable). In fact, if you do set it, it will probably lead to slower running code, since forcing the number of threads to be higher than the number of available CPU cores leads to overhead.
So the following example code for a parallel for loop:
#include <omp.h>

int main(int argc, char *argv[]) {
    const int N = 100000;
    int i, a[N];
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2 * i;
    return 0;
}
can be compiled and run on the cluster with the following job script:
#! /bin/bash
#
# Request 4 CPU cores for this OpenMP code
#SBATCH -N 1 -n 4
#
# First compile it
gcc -O2 -mtune=native -fopenmp my_code.c -o my_code
#
# and run
./my_code
Run an MPI job
The cluster scheduler has full built-in support for OpenMPI, therefore it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Instead, you can simply call mpirun from your job script and it will be aware of all nodes and cores requested and use them accordingly.
#! /bin/bash
#
# Request 32 processor cores randomly distributed over nodes
#SBATCH -n 32
#
# And 12 hours of runtime
#SBATCH --time=12:00:00
#
# Get email when it's done
#SBATCH --mail-type=END
#SBATCH --mail-user=<username>@science.ru.nl
#
mpirun my_mpi_application
Copy data to local scratch from external server
This copies data from an external location to local scratch storage using scp. If this takes a significant amount of time, or has to be done regularly, you may want to submit a job to the scheduler to do it overnight.
Generate SSH keys
Info
You only need to do this once!
The job script, when executed, cannot ask you for your password, so we need to set up access via a public/private key pair first. Create a new key pair with:
ssh-keygen -t ed25519
and hit Enter a couple of times. Do not enter a passphrase, or ssh will try to prompt the script for it (which will fail).
Exchange keys
Now append your public key to the ~/.ssh/authorized_keys file on the remote server:
ssh-copy-id -i ~/.ssh/id_ed25519.pub user@server.example.com
Test passwordless login
See if it worked:
ssh user@server.example.com
This should give you a shell on the other machine without having to supply your password.
Copy the data using a batch job
Now you may submit the following script to the scheduler:
#! /bin/bash
#
#SBATCH --partition=tcm
# Select the node
#SBATCH -w cn90
#
# Get email when it's done
#SBATCH --mail-type=END
#SBATCH --mail-user=<username>@science.ru.nl
scp -r user@server.example.com:/path/over/there /scratch/$USER/
Interactive jobs
If you need interactive access (a shell), you can request this from the scheduler with the following steps:
- Request an allocation of resources. For instance, 1 core for 2 hours on a random node:
salloc -c 1 --partition=tcm --time 2:00:00
- Attach a shell to this allocation:
srun --pty bash
You then get a prompt on an available node and can start working. This shell will be automatically closed after two hours.
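Alternatively, the allocation and the shell can be requested in a single step; a sketch using the same partition and time limit as above:
$ srun -c 1 --partition=tcm --time=2:00:00 --pty bash
Exiting the shell ends the job and releases the allocated resources.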