Scheduling Jobs in a Supercomputer Cluster Using GitHub Actions

A simple setup to programmatically enqueue jobs in a SLURM cluster.

Titan 3 Supercomputer Image from Pixabay

Introduction

If you have worked with a supercomputer cluster before, you know that scheduling batches of jobs, where each job may have not only different settings (e.g., hyperparameters) but also different code, can be a cumbersome task. Moreover, keeping track of all the different versions of the code when analyzing the results can be tricky, and a simple note-taking error can cost many hours of rerunning jobs.

Here, I demonstrate how to set up a GitHub Actions-based approach to deploying jobs to a supercomputing cluster managed by SLURM. A similar approach could be extended to non-SLURM clusters as well.

The approach here relies on using a Git repository to keep track of code changes. A GitHub Action runs automatically when code is changed and deploys the new code as a job on the cluster, independently from other queued jobs.

Methods

  1. If you don’t already have a GitHub account, create one here. It is free.
  2. Select New repository from the top right corner of the page.
  3. Give your repository a name and description. Set it up as private (unless there is a reason to have a public repository).
  4. Your repository is now set up and ready to receive code.

Organize your code as shown below and push it to the git repo.

my_slurm_deploy_repo/
├── .github/
│   └── workflows/
│       └── slurm-enqueue-job.yml
├── src/
│   ├── code_file_1.js
│   ├── ...
│   └── code_file_n.js
├── .gitignore
└── job-sbatcher.sh

The .github/workflows folder will contain the GitHub Actions configuration file. The src/ folder will contain the source code of the program you are going to run on the cluster. The .gitignore file excludes from the Git repo files that should not be saved there, such as the .DS_Store files created by macOS. Finally, job-sbatcher.sh is a script that will be run on the cluster login node to configure the job. This script should be the same script already in use for manual runs of your program. A sample script is shown below.

#!/bin/bash
#SBATCH -N1
#SBATCH --ntasks=1 --cpus-per-task=32
#SBATCH --gres=gpu:1
#SBATCH -C T4
#SBATCH --mem=32G
#SBATCH --mail-user=your-email@your-domain.com
#SBATCH --mail-type=ALL

module load cuda10.1/toolkit/10.1.105
module load matlab/R2020a

cd src/

echo $SLURM_JOB_NODELIST

matlab -r 'your_command_to_run;exit;'

In the above sample script, we request the resources we need for this job: one node with 32 CPU cores, 32 GB of RAM, and one T4 GPU. The optional email configuration can be useful for receiving notifications when the job is queued, starts, and completes.

The module load commands load shared libraries that may be required for your code to run. The module names shown here are specific to this cluster, but a similar command should exist on yours.

We cd (change directory) into the folder containing our code so that when we run our program we are in the expected directory. The final command, matlab -r ..., runs the actual program, in this case a MATLAB script.

If you are not familiar with GitHub Actions, I explain it in more detail in the article below, or you can visit the official GitHub documentation here.

The YAML file we will be reviewing in this section is responsible for checking out (getting the code from) our Git repo, copying it over to the cluster, and queuing it as a job.

The workflow consists of four steps (a minimal sketch of the file follows the list):

  1. uses: actions/checkout@v2 is an official GitHub Action that pulls the Git repo into the virtual machine that will be running the action.
  2. The step with the id: getshortsha is a helper step that gets an 8 character version of the commit hash. This will be used later as a unique identifier for the file and folder names we will copy over to the cluster. Having unique names ensures that different deploys will not overwrite one another.
  3. The step with uses: garygrossgarten/github-action-scp@v0.6.0 makes use of an open-source action that runs an scp command to copy the repository files from the virtual machine running the action over to the cluster. The repository of the action can be found here.
  4. The final step makes use of the library here to SSH into the cluster and run the batching command in much the same way that we would when performing a manual run. It prepends the Git SHA to the name of the job-sbatcher.sh script so that the run appears with a unique name in the queue.
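
A minimal sketch of what such a workflow file could look like is shown below. The deploys/ target folder, the secret names, and the use of appleboy/ssh-action in step 4 are assumptions for illustration rather than the exact file used here; adapt them to your cluster and preferred actions.

name: slurm-enqueue-job

on:
  push:
    branches: [ main ]   # the branch that should trigger a deploy

jobs:
  enqueue:
    runs-on: ubuntu-latest
    steps:
      # 1. Pull the repository into the virtual machine running the action
      - uses: actions/checkout@v2

      # 2. Derive an 8-character version of the commit hash
      - id: getshortsha
        run: echo "sha8=$(echo $GITHUB_SHA | cut -c1-8)" >> "$GITHUB_OUTPUT"

      # 3. Copy the repository files to a uniquely named folder on the cluster
      - uses: garygrossgarten/github-action-scp@v0.6.0
        with:
          local: .
          remote: deploys/${{ steps.getshortsha.outputs.sha8 }}
          host: ${{ secrets.CLUSTER_HOST }}
          username: ${{ secrets.CLUSTER_USER }}
          password: ${{ secrets.CLUSTER_PASSWORD }}

      # 4. SSH into the login node and enqueue the job under a unique name
      - uses: appleboy/ssh-action@master
        with:
          host: ${{ secrets.CLUSTER_HOST }}
          username: ${{ secrets.CLUSTER_USER }}
          password: ${{ secrets.CLUSTER_PASSWORD }}
          script: |
            cd deploys/${{ steps.getshortsha.outputs.sha8 }}
            cp job-sbatcher.sh ${{ steps.getshortsha.outputs.sha8 }}-job-sbatcher.sh
            sbatch ${{ steps.getshortsha.outputs.sha8 }}-job-sbatcher.sh

Copying job-sbatcher.sh to a SHA-prefixed name before calling sbatch is one simple way to make each run appear under a unique name in the queue, since SLURM uses the script file name as the default job name.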

Note that both steps 3 and 4 make use of GitHub Secrets to store the cluster username, password, and hostname without exposing them as plaintext in the repository. To add secrets, go to the Settings tab of your GitHub repository and find the Secrets link in the side menu.
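
If you prefer the command line over the web interface, the GitHub CLI can create the same secrets. The secret names below are placeholders and must match whatever names your workflow file references:

gh secret set CLUSTER_HOST --body "login.your-cluster.edu"
gh secret set CLUSTER_USER --body "your-username"
gh secret set CLUSTER_PASSWORD   # prompts for the value without echoing it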

We now have all the code required to enqueue our first job. Just push to the branch you designated in the workflow configuration file, and the GitHub Action should start running automatically.
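
For example, assuming the workflow is configured to trigger on pushes to the main branch, a deploy is just an ordinary commit and push:

git add .
git commit -m "Try a larger batch size"   # any change to settings or code
git push origin main                      # this push triggers the action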

To verify your job has been enqueued, SSH into the cluster login node and run squeue -u your-username.

Conclusion

We have reviewed a simple GitHub Actions workflow for programmatically enqueuing jobs in a supercomputing cluster. This approach has a few benefits compared to running jobs manually. It automatically tracks code changes between runs in the Git history, which makes it robust to note-taking errors. Furthermore, it allows enqueuing multiple jobs with different versions of the code at the same time, without manually copying the necessary files or waiting for one job to start before modifying the code for the next.

Hope this was helpful!

Special thanks to PhD candidate Katie Gandomi for help with development.

