Scheduling Jobs in a Supercomputer Cluster Using GitHub Actions
A simple setup to programmatically enqueue jobs in a SLURM cluster.
If you have worked with a supercomputer cluster before, you know that scheduling batches of jobs can be cumbersome when each job differs not only in its settings (e.g., hyperparameters) but also in its code. Keeping track of which version of the code produced which results is tricky too, and a single note-taking error can cost many hours of rerunning jobs.
The approach here uses a Git repository to keep track of code changes. A GitHub Actions workflow runs automatically whenever the code changes and deploys the new code as a job on the cluster, independently of other queued jobs.
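To make the idea concrete, here is a minimal sketch of how a per-commit batch script could be generated so that every queued job is tied to exactly one version of the code. It is an illustration, not the workflow used in this article: the function name `buildSbatchScript`, the `baseDir` layout, and the assumption that the job is started with `node` are all hypothetical. Only the `#SBATCH --job-name` and `#SBATCH --output` directives (with `%j` expanding to the job ID) are standard SLURM options.

```javascript
// Hypothetical helper: builds the text of a SLURM batch script for one commit.
// The names buildSbatchScript, baseDir, and entryPoint are illustrative only.
function buildSbatchScript({ commitSha, baseDir, entryPoint }) {
  // A short commit hash is enough to label the job and its directory.
  const shortSha = commitSha.slice(0, 7);

  // Assumption: each commit is copied into its own directory on the cluster,
  // so jobs that are already queued keep running the code they were given.
  const jobDir = `${baseDir}/${shortSha}`;

  return [
    "#!/bin/bash",
    `#SBATCH --job-name=job-${shortSha}`,      // job name records the code version
    `#SBATCH --output=${jobDir}/slurm-%j.out`, // %j is replaced by the SLURM job ID
    `cd ${jobDir}`,
    `node ${entryPoint}`,                      // assumption: a Node.js entry point
  ].join("\n") + "\n";
}

// Example usage with made-up values.
const script = buildSbatchScript({
  commitSha: "0123456789abcdef0123456789abcdef01234567",
  baseDir: "/home/user/jobs",
  entryPoint: "code_file_1.js",
});
console.log(script);
```

Because the commit hash appears in both the job name and the output path, any result file can be traced back to the exact code that produced it without separate notes.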
Part 1: Set Up a Git Repository
1. If you don’t already have a GitHub account, create one at github.com. It is free.
2. Select New repository from the top right corner of the page.
3. Give your repository a name and description. Set it up as private (unless there is a reason to have a public repository).
4. Your repository is now set up and ready to receive code.
Part 2: Structure Your Code
Organize your code as shown below and push it to the Git repository.
├── .github/
│   └── workflows/
│       └── slurm-enqueue-job.yml
│ ├── code_file_1.js
│ ├── …