SLURM

SLURM (Simple Linux Utility for Resource Management) is a job scheduler for clusters.

Check SLURM Availability

Ensure SLURM is installed and accessible:

sinfo

This shows available partitions (queues) and their statuses. For example,

> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
c003t1*      up 1-00:00:00     36   idle cpu[1-36]
c003t2       up 7-00:00:00      1    mix cpu37
c003t2       up 7-00:00:00     23   idle cpu[38-60]
b006t        up 7-00:00:00      1   idle cpu61

The sinfo output provides details about the available partitions, their status, time limits, and the state of compute nodes. Let’s break it down:

Column     Meaning
PARTITION  The name of the SLURM partition (queue) where jobs can be submitted.
AVAIL      Availability status (up means available for use).
TIMELIMIT  Maximum allowed runtime for jobs in this partition (D-HH:MM:SS format).
NODES      Number of nodes in this state.
STATE      Status of the nodes (e.g., idle, mix, alloc).
NODELIST   List of nodes in this partition (ranges are compacted with [ ]).
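
These default columns can be customized. A minimal sketch of two common variants (the %-codes are standard sinfo format specifiers, but verify them against the man page on your cluster):

# One line per node instead of one line per partition/state group
sinfo -N -l

# Custom columns: partition, availability, time limit, node count, state, node list
sinfo -o "%P %a %l %D %t %N"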

Understanding Each Row

  1. Partition: c003t1 (Default *), 36 nodes are idle
c003t1*      up 1-00:00:00     36   idle cpu[1-36]
  • Default partition (* after name).
  • Jobs in this partition can run for up to 1 day.
  • 36 nodes are available (idle), named cpu1 to cpu36.
  2. Partition: c003t2, 1 node (cpu37) is partially used (mix), 23 nodes are idle
c003t2       up 7-00:00:00      1    mix cpu37
c003t2       up 7-00:00:00     23   idle cpu[38-60]
  • Jobs in this partition can run for up to 7 days.
  • cpu37 is in a mixed state (mix means partially allocated).
  • cpu38 to cpu60 are idle and available.
  3. Partition: b006t, 1 idle node (cpu61)
b006t        up 7-00:00:00      1   idle cpu61
  • This partition allows jobs for up to 7 days.
  • One node (cpu61) is idle and ready for use.

Summary

  • c003t1 has 36 idle nodes and is the default partition.
  • c003t2 has 1 mixed node (cpu37) and 23 idle nodes.
  • b006t has 1 idle node (cpu61).
  • If you submit a job without specifying a partition, it will go to c003t1 by default.
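
To target a specific partition instead of the default, pass it explicitly at submission time (the partition name here is taken from the sinfo output above):

sbatch --partition=c003t2 job.sh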

Submit a Job

Create a script (e.g., job.sh) with SLURM directives:

#!/bin/bash
#SBATCH --job-name=my_job        # Job name
#SBATCH --output=output.log       # Standard output log
#SBATCH --error=error.log         # Standard error log
#SBATCH --partition=compute       # Partition name (check with `sinfo`)
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks=1                # Number of tasks (processes)
#SBATCH --cpus-per-task=4         # CPU cores per task
#SBATCH --mem=16G                 # Memory per node
#SBATCH --time=01:00:00           # Time limit (hh:mm:ss)
#SBATCH --gres=gpu:1              # Request a GPU (if needed)

# Commands to run
echo "Starting job"
python my_script.py
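
Inside a batch job, SLURM also exports environment variables describing the allocation. A small, hypothetical sketch of how the script above could use them (the variable names are standard SLURM ones; OMP_NUM_THREADS only matters if your program uses OpenMP):

echo "Job $SLURM_JOB_ID running on $SLURM_JOB_NODELIST"
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match thread count to the requested cores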

Submit the job:

sbatch job.sh
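
sbatch prints the ID of the new job. With the --parsable flag it prints only the numeric ID, which is handy for scripting (a small sketch):

jobid=$(sbatch --parsable job.sh)
echo "Submitted job $jobid"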

Check Job Status

squeue -u $USER

This shows your running or pending jobs.
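
You can also filter by job state, for example (state names follow the squeue man page):

squeue -u $USER -t PENDING    # only jobs waiting in the queue
squeue -u $USER -t RUNNING    # only jobs currently running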

To check detailed job information:

scontrol show job job_id
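
squeue and scontrol mainly cover jobs that are still pending or running. Once a job has finished, sacct can still report it, provided job accounting is enabled on your cluster (the fields below are standard sacct format fields):

sacct -j job_id --format=JobID,JobName,State,Elapsed,MaxRSS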

Cancel a Job

scancel job_id
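
scancel also accepts filters; two common forms (as always, double-check the flags on your installation):

scancel -u $USER          # cancel all of your jobs
scancel --name=my_job     # cancel jobs by job name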

Interactive Session

If you want an interactive session for debugging:

srun --partition=compute --ntasks=1 --cpus-per-task=4 --mem=16G --time=01:00:00 --pty bash
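
An alternative is salloc, which reserves the resources first and drops you into a shell within the allocation (where that shell runs depends on the site configuration); the resource flags mirror the srun example above:

salloc --partition=compute --ntasks=1 --cpus-per-task=4 --mem=16G --time=01:00:00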

Job Dependencies

To run a job only after another job completes successfully:

sbatch --dependency=afterok:job_id job.sh
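
Combined with --parsable, a small pipeline can be chained without copying job IDs by hand (preprocess.sh and analyze.sh are placeholder script names):

first=$(sbatch --parsable preprocess.sh)
sbatch --dependency=afterok:$first analyze.sh

afterok starts the second job only if the first exits successfully; afterany starts it once the first finishes regardless of its exit status.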

Further Resources
