SLURM
SLURM
SLURM (Simple Linux Utility for Resource Management) is a job scheduler for clusters.
Check SLURM Availability
Ensure SLURM is installed and accessible:
1
sinfo
This shows available partitions (queues) and their statuses. For example,
1
2
3
4
5
6
> sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
c003t1* up 1-00:00:00 36 idle cpu[1-36]
c003t2 up 7-00:00:00 1 mix cpu37
c003t2 up 7-00:00:00 23 idle cpu[38-60]
b006t up 7-00:00:00 1 idle cpu61
The sinfo output provides details about the available partitions, their status, time limits, and the state of compute nodes. Let’s break it down:
Column | Meaning |
---|---|
PARTITION | The name of the SLURM partition (queue) where jobs can be submitted. |
AVAIL | Availability status (up means available for use). |
TIMELIMIT | Maximum allowed runtime for jobs in this partition (D-HH:MM:SS format). |
NODES | Number of nodes in this state. |
STATE | Status of the nodes (e.g., idle, mix, alloc). |
NODELIST | List of nodes in this partition (ranges are compacted with [ ]). |
Understanding Each Row
- Partition: c003t1 (Default *), 36 nodes are idle
1
c003t1* up 1-00:00:00 36 idle cpu[1-36]
- Default partition (* after name).
- Jobs in this partition can run for up to 1 day.
- 36 nodes are available (idle), named cpu1 to cpu36.
- Partition: c003t2, 1 node (cpu37) is partially used (mix), 23 nodes are idle
1
2
c003t2 up 7-00:00:00 1 mix cpu37
c003t2 up 7-00:00:00 23 idle cpu[38-60]
- Jobs in this partition can run for up to 7 days.
- cpu37 is in a mixed state (mix means partially allocated).
- cpu38 to cpu60 are idle and available.
- Partition: b006t, 1 idle node (cpu61)
1
b006t up 7-00:00:00 1 idle cpu61
- This partition allows jobs for up to 7 days.
- One node (cpu61) is idle and ready for use.
Summary
- c003t1 has 36 idle nodes and is the default partition.
- c003t2 has 1 mixed node (cpu37) and 23 idle nodes.
- b006t has 1 idle node (cpu61).
- If you submit a job without specifying a partition, it will go to c003t1 by default.
Submit a Job
Create a script (e.g., job.sh) with SLURM directives:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
#!/bin/bash
#SBATCH --job-name=my_job # Job name
#SBATCH --output=output.log # Standard output and error log
#SBATCH --error=error.log # Error log
#SBATCH --partition=compute # Partition name (check with `sinfo`)
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1 # Number of tasks (processes)
#SBATCH --cpus-per-task=4 # CPU cores per task
#SBATCH --mem=16G # Memory per node
#SBATCH --time=01:00:00 # Time limit (hh:mm:ss)
#SBATCH --gres=gpu:1 # Request a GPU (if needed)
# Commands to run
echo "Starting job"
python my_script.py
Submit the job:
1
sbatch job.sh
Check Job Status
1
squeue -u $USER
This shows your running or pending jobs.
To check detailed job information:
1
scontrol show job job_id
Cancel a Job
1
scancel job_id
Interactive Session
If you want an interactive session for debugging:
1
srun --partition=compute --ntasks=1 --cpus-per-task=4 --mem=16G --time=01:00:00 --pty bash
Job Dependencies
To run a job only after another job finishes:
1
sbatch --dependency=afterok:job_id job.sh
Further Resources
This post is licensed under CC BY 4.0 by the author.