Scheduling Jobs in a Supercomputer Cluster Using GitHub Actions

Paulo Carvalho
5 min read · Oct 21, 2020

A simple setup to programmatically enqueue jobs in a SLURM cluster.

Titan 3 supercomputer (image from Pixabay)

Introduction

If you have worked with a supercomputer cluster before, you know that scheduling batches of jobs can be cumbersome when each job may differ not only in its settings (e.g., hyperparameters) but also in its code. Moreover, keeping track of which version of the code produced which results is tricky, and a single note-taking error can cost many hours of rerunning jobs.

Here, I demonstrate how to set up a GitHub Actions-based approach for deploying jobs to a supercomputing cluster managed by SLURM. A similar approach could be extended to non-SLURM clusters as well.

The approach relies on a Git repository to keep track of code changes. A GitHub Actions workflow runs automatically whenever new code is pushed and deploys that code as a job on the cluster, independently of other queued jobs.
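A minimal sketch of how a job can stay independent of later pushes, assuming each job clones the repository and checks out the exact commit that triggered the workflow into its own directory, instead of running from a shared working copy. A later push then changes the repository, but not the commit a queued job will run. The function buildSbatchScript, its parameters and the $HOME/slurm-jobs directory are hypothetical; --job-name, --time and --output are standard sbatch options, and %j in the output file name expands to the job ID.

```javascript
// Hypothetical helper (not from the original article): builds the text of a
// SLURM batch script whose job runs one specific commit of the repository.
function buildSbatchScript({ repoUrl, sha, jobName, command, timeLimit = "01:00:00" }) {
  // Reject anything that is not a (short or full) hexadecimal commit SHA.
  if (!/^[0-9a-f]{7,40}$/.test(sha)) {
    throw new Error(`not a git commit SHA: ${sha}`);
  }
  // Keep job names to characters that are safe in a shell command and a file name.
  if (!/^[\w.-]+$/.test(jobName)) {
    throw new Error(`unsafe job name: ${jobName}`);
  }
  // Assumed layout: one fresh clone per SLURM job, so later pushes cannot
  // change the code of jobs that are still waiting in the queue.
  // $HOME and $SLURM_JOB_ID are left for bash to expand when the job starts.
  const workDir = `$HOME/slurm-jobs/${jobName}-${sha.slice(0, 7)}-$SLURM_JOB_ID`;
  return [
    "#!/bin/bash",
    // #SBATCH directives must come before the first executable command.
    `#SBATCH --job-name=${jobName}`,
    `#SBATCH --time=${timeLimit}`,
    `#SBATCH --output=${jobName}-%j.out`, // %j is replaced by the job ID
    "set -euo pipefail",
    `git clone "${repoUrl}" "${workDir}"`,
    `cd "${workDir}"`,
    `git checkout ${sha}`,
    command,
    "",
  ].join("\n");
}

// Example usage with placeholder values.
console.log(
  buildSbatchScript({
    repoUrl: "git@github.com:your-user/my_slurm_deploy_repo.git", // hypothetical
    sha: "3f2a9c1",
    jobName: "lr-0.01",
    command: "node src/code_file_1.js",
  })
);
```

Since sbatch reads the batch script from standard input when no file name is given, a workflow step could pipe this text straight to sbatch; how the workflow reaches the cluster (for example over SSH, or through a self-hosted runner on a login node) is an assumption left open here.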

Methods

Part 1: Set Up a Git Repository

  1. If you don’t already have a GitHub account, create one at github.com. It is free.
  2. Select New repository from the top-right corner of the page.
  3. Give your repository a name and description. Make it private (unless there is a reason to make it public).
  4. Your repository is now set up and ready to receive code.

Part 2: Structure Your Code

Organize your code as shown below and push it to the Git repository.

my_slurm_deploy_repo/
├── .github/
│   └── workflows/
│       └── slurm-enqueue-job.yml
├── src/
│   ├── code_file_1.js
│   ├── …
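Inside a workflow such as slurm-enqueue-job.yml, GitHub Actions exposes the triggering commit through default environment variables, including GITHUB_SERVER_URL, GITHUB_REPOSITORY, GITHUB_SHA and GITHUB_RUN_NUMBER. The following is a minimal sketch, assuming a small Node.js helper in src/ (the function name jobContextFromEnv is hypothetical), that turns those variables into a SLURM job name recording which commit produced each result, which addresses the bookkeeping problem described in the introduction.

```javascript
// Hypothetical helper: derives SLURM job metadata from default environment
// variables that GitHub Actions sets for every workflow run.
function jobContextFromEnv(env = process.env) {
  const required = ["GITHUB_SERVER_URL", "GITHUB_REPOSITORY", "GITHUB_SHA", "GITHUB_RUN_NUMBER"];
  for (const name of required) {
    if (!env[name]) {
      throw new Error(`missing environment variable ${name}`);
    }
  }
  const shortSha = env.GITHUB_SHA.slice(0, 7);
  return {
    // HTTPS clone URL; a private repository also needs credentials on the cluster.
    repoUrl: `${env.GITHUB_SERVER_URL}/${env.GITHUB_REPOSITORY}.git`,
    sha: env.GITHUB_SHA,
    // Embedding the run number and short commit SHA in the job name lets
    // each result file be traced back to the exact code that produced it.
    jobName: `run${env.GITHUB_RUN_NUMBER}-${shortSha}`,
  };
}
```

A workflow step could run such a helper with node and hand the returned values to whatever builds and submits the batch script, so no commit has to be recorded by hand.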

Want to chat about startups, consulting or engineering? Just send me an email at paulo@avantsoft.com.br.