SLURM

Posted Feb 1, 2025

By Hongming Zhu

2 min read

SLURM

SLURM (Simple Linux Utility for Resource Management) is a job scheduler for clusters.

Check SLURM Availability

Ensure SLURM is installed and accessible:

sinfo

This shows available partitions (queues) and their statuses. For example,

> sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
c003t1*      up 1-00:00:00     36   idle cpu[1-36]
c003t2       up 7-00:00:00      1    mix cpu37
c003t2       up 7-00:00:00     23   idle cpu[38-60]
b006t        up 7-00:00:00      1   idle cpu61

The sinfo output provides details about the available partitions, their status, time limits, and the state of compute nodes. Let’s break it down:

Column	Meaning
PARTITION	The name of the SLURM partition (queue) where jobs can be submitted.
AVAIL	Availability status (up means available for use).
TIMELIMIT	Maximum allowed runtime for jobs in this partition (D-HH:MM:SS format).
NODES	Number of nodes in this state.
STATE	Status of the nodes (e.g., idle, mix, alloc).
NODELIST	List of nodes in this partition (ranges are compacted with [ ]).

Understanding Each Row

Partition: c003t1 (Default *), 36 nodes are idle

c003t1*      up 1-00:00:00     36   idle cpu[1-36]

Default partition (* after name).
Jobs in this partition can run for up to 1 day.
36 nodes are available (idle), named cpu1 to cpu36.

Partition: c003t2, 1 node (cpu37) is partially used (mix), 23 nodes are idle

c003t2       up 7-00:00:00      1    mix cpu37
c003t2       up 7-00:00:00     23   idle cpu[38-60]

Jobs in this partition can run for up to 7 days.
cpu37 is in a mixed state (mix means partially allocated).
cpu38 to cpu60 are idle and available.

Partition: b006t, 1 idle node (cpu61)

b006t        up 7-00:00:00      1   idle cpu61

This partition allows jobs for up to 7 days.
One node (cpu61) is idle and ready for use.

Summary

c003t1 has 36 idle nodes and is the default partition.
c003t2 has 1 mixed node (cpu37) and 23 idle nodes.
b006t has 1 idle node (cpu61).
If you submit a job without specifying a partition, it will go to c003t1 by default.

Submit a Job

Create a script (e.g., job.sh) with SLURM directives:

  
#!/bin/bash
#SBATCH --job-name=my_job        # Job name
#SBATCH --output=output.log       # Standard output and error log
#SBATCH --error=error.log         # Error log
#SBATCH --partition=compute       # Partition name (check with `sinfo`)
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks=1                # Number of tasks (processes)
#SBATCH --cpus-per-task=4         # CPU cores per task
#SBATCH --mem=16G                 # Memory per node
#SBATCH --time=01:00:00           # Time limit (hh:mm:ss)
#SBATCH --gres=gpu:1              # Request a GPU (if needed)

# Commands to run
echo "Starting job"
python my_script.py

Submit the job:

sbatch job.sh

Check Job Status

squeue -u $USER

This shows your running or pending jobs.

To check detailed job information:

scontrol show job job_id

Cancel a Job

scancel job_id

Interactive Session

If you want an interactive session for debugging:

  
srun --partition=compute --ntasks=1 --cpus-per-task=4 --mem=16G --time=01:00:00 --pty bash

Job Dependencies

To run a job only after another job finishes:

sbatch --dependency=afterok:job_id job.sh

Further Resources

SLURM quickstart

High Performance Computing, Tutorials

hpc slurm

This post is licensed under CC BY 4.0 by the author.