Scheduling Jobs in a Supercomputer Cluster Using GitHub Actions

A simple setup to programmatically enqueue jobs in a SLURM cluster.

[Image: Titan 3 supercomputer, from Pixabay]

Introduction

If you have worked with a supercomputer cluster before, you know that scheduling batches of jobs, where each job may have not only different settings (e.g., hyperparameters) but also different code, can be a cumbersome task. Moreover, keeping track of all the different versions of the code when analyzing the results can be tricky, and a simple note-taking error can cost many hours of rerunning jobs.

Methods

Part 1: Set Up a Git Repository

  1. If you don’t already have a GitHub account, create one here. It is free.
  2. Select New repository from the top right corner of the page.
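Once the repository exists, you can clone it locally. The repository name below matches the structure used in the next section; substitute your own username:

git clone https://github.com/<your-username>/my_slurm_deploy_repo.git
cd my_slurm_deploy_repo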

Part 2: Structure Your Code

Organize your code as shown below and push it to the git repo.

my_slurm_deploy_repo/
├── .github/
│   └── workflows/
│       └── slurm-enqueue-job.yml
├── src/
│   ├── code_file_1.js
│   ├── ...
│   └── code_file_n.js
├── .gitignore
└── job-sbatcher.sh

The job-sbatcher.sh script contains the SLURM directives and the commands that make up the job itself:
#!/bin/bash
#SBATCH -N 1                                    # run on a single node
#SBATCH --ntasks=1 --cpus-per-task=32           # one task with 32 CPU cores
#SBATCH --gres=gpu:1                            # request one GPU
#SBATCH -C T4                                   # constrain to nodes with the T4 feature
#SBATCH --mem=32G                               # request 32 GB of RAM
#SBATCH --mail-user=your-email@your-domain.com  # where to send job notifications
#SBATCH --mail-type=ALL                         # email on job start, end, and failure

# Load the toolchains the job depends on.
module load cuda10.1/toolkit/10.1.105
module load matlab/R2020a

cd src/

# Log which node(s) the job was allocated.
echo $SLURM_JOB_NODELIST

matlab -r 'your_command_to_run;exit;'
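For reference, enqueuing this script by hand from the cluster’s login node is a single command; the workflow we build next automates exactly this step:

sbatch job-sbatcher.sh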

Part 3: Set Up the GitHub Action

If you are not familiar with GitHub Actions, I explain them in more detail in the article below, or you can visit the official GitHub documentation here.

  1. The step with id: getshortsha is a helper step that produces an 8-character version of the commit hash. It is used later as a unique identifier in the file and folder names we copy over to the cluster; unique names ensure that different deploys do not overwrite one another.
  2. The step with uses: garygrossgarten/github-action-scp@v0.6.0 uses an open-source action that runs an scp command to copy the repository files from the virtual machine running the action to the cluster. The repository of the action can be found here.
  3. The final step uses the library here to SSH into the cluster and run the batching command, much as we would in a manual run. It prepends the Git SHA to the name of the job-sbatcher.sh script so that the run appears with a unique name in the queue. A sketch of the full workflow file follows this list.
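To make the three steps concrete, here is a minimal sketch of what slurm-enqueue-job.yml could look like. The secret names (CLUSTER_HOST, CLUSTER_USER, CLUSTER_PASSWORD), the remote deploys/ path, the trigger branch, and the SSH action chosen for the final step are all assumptions; adapt them to your own cluster and preferences.

name: slurm-enqueue-job

on:
  push:
    branches: [main]  # assumed trigger branch; use the branch you deploy from

jobs:
  enqueue:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      # Helper step: derive an 8-character commit hash for unique names.
      - name: Get short SHA
        id: getshortsha
        run: echo "sha8=${GITHUB_SHA::8}" >> "$GITHUB_OUTPUT"

      # Copy the repository files to a uniquely named folder on the cluster.
      - name: Copy repo to cluster
        uses: garygrossgarten/github-action-scp@v0.6.0
        with:
          local: .
          remote: deploys/${{ steps.getshortsha.outputs.sha8 }}  # assumed remote path
          host: ${{ secrets.CLUSTER_HOST }}
          username: ${{ secrets.CLUSTER_USER }}
          password: ${{ secrets.CLUSTER_PASSWORD }}

      # SSH in and enqueue the job; garygrossgarten/github-action-ssh is one
      # option (assumed here), any SSH action with a command input works.
      - name: Enqueue job
        uses: garygrossgarten/github-action-ssh@release
        with:
          host: ${{ secrets.CLUSTER_HOST }}
          username: ${{ secrets.CLUSTER_USER }}
          password: ${{ secrets.CLUSTER_PASSWORD }}
          command: |
            cd deploys/${{ steps.getshortsha.outputs.sha8 }}
            cp job-sbatcher.sh ${{ steps.getshortsha.outputs.sha8 }}-job-sbatcher.sh
            sbatch ${{ steps.getshortsha.outputs.sha8 }}-job-sbatcher.sh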

Part 4: Make Your First Deploy!

We now have all the code required to enqueue our first job. Just push to the branch you designated in the workflow configuration file, and the GitHub Action should start running automatically.
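For example, assuming the workflow is triggered by pushes to main:

git add .
git commit -m "Update job settings"
git push origin main  # this push triggers the workflow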

Conclusion

We have reviewed a simple GitHub Actions workflow for programmatically enqueuing jobs in a supercomputing cluster. This approach has a few benefits over running jobs manually. It automatically tracks the changes to the code between runs in the git history, which makes it robust to note-taking errors. Furthermore, it allows enqueuing multiple jobs with different versions of the code simultaneously, without manually copying the necessary files or waiting for a job to start before modifying the code.
