Scheduling Jobs in a Super Computer Cluster Using Github Actions
A simple setup to programmatically enqueue jobs in a SLURM cluster.
If you have worked with a super computer cluster before, you know that scheduling batches of jobs, where each job may have not only different settings (e.g., hyper-parameters) but also different code, can be a cumbersome task. Moreover, keeping track of all the different versions of code when analyzing the results can be tricky, and a simple note-taking error can cost many hours of rerunning jobs.
The approach here relies on using a Git repository to keep track of code changes. A Github Action runs automatically when code is changed and deploys the new code as a job on the cluster independently from other queued jobs.
Part 1: Set Up a Git Repository
1. If you don’t already have a Github account, create one. It is free.
2. Select New repository from the top right corner of the page.
3. Give your repository a name and description. Set it up as private (unless there is a reason to have a public repository).
4. Your repository is now set up and ready to receive code.
Part 2: Structure Your Code
Organize your code as shown below and push it to the git repo.
├── .github/
│   └── workflows/
│       └── slurm-enqueue-job.yml
├── src/
│   ├── code_file_1.js
│   ├── ...
│   └── code_file_n.js
├── .gitignore
└── job-sbatcher.sh
The .github/workflows folder will contain the Github Actions configuration file. The src/ folder will contain the source code of the program you are going to run on the cluster. The .gitignore file will be used to exclude files that should not be saved in the git repo, such as the .DS_Store files created by macOS. Finally, job-sbatcher.sh is a script that will be run on the cluster login node to configure the job. This should be the same script you already use for manual runs of your program. A sample script is shown below.
#!/bin/bash
#SBATCH --ntasks=1 --cpus-per-task=32
#SBATCH -C T4
module load cuda10.1/toolkit/10.1.105
module load matlab/R2020a
matlab -r 'your_command_to_run;exit;'
In the above sample script, we allocate the resources we require for this job as follows: 1 Node with 32 CPU cores, 32GB of RAM and 1 T4 GPU. The optional email configuration can be useful to receive emails regarding the job being queued, starting and completing.
The module load commands load shared libraries that your code may require. The module names shown here are specific to this cluster, but a similar command should exist on yours. We then cd (change directory) into the folder containing our code so that the program runs from the expected directory. The final command, matlab -r ..., runs the actual program, in this case a MATLAB script.
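If you are writing job-sbatcher.sh from scratch, the resource settings described above can be captured in a small generator. This is only a sketch: the function name write_job_script is hypothetical, and the --mem, --gres and --mail-* lines are assumptions based on the description (32 GB of RAM, 1 GPU, optional email), since the sample above does not show them. Check which flags and module names your own cluster expects.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a SLURM batch script with the resources
# described in the text. Arguments: output file, code folder, optional email.
write_job_script() {
  local out="$1" code_dir="$2" email="${3:-}"
  {
    echo '#!/bin/bash'
    echo '#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=32'
    # Assumed flags: memory and GPU count are not shown in the sample script.
    echo '#SBATCH --mem=32G'
    echo '#SBATCH --gres=gpu:1'
    echo '#SBATCH -C T4'
    if [ -n "$email" ]; then
      # Assumed flags for the optional email notifications.
      echo '#SBATCH --mail-type=BEGIN,END'
      echo "#SBATCH --mail-user=$email"
    fi
    # Module names are specific to the cluster used in this article.
    echo 'module load cuda10.1/toolkit/10.1.105'
    echo 'module load matlab/R2020a'
    echo "cd \"$code_dir\""
    echo "matlab -r 'your_command_to_run;exit;'"
  } > "$out"
}
```

For example, write_job_script job-sbatcher.sh /path/to/your/code you@example.com would produce a script with the email lines included; leaving out the third argument omits them.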
Part 3: Set Up the Github Action
If you are not familiar with Github Actions, I explain them in more detail in the article below. You can also visit the official Github documentation.
Continuous Deployment for ZAT App
A simple continuous deployment (CICD) pipeline for your private Zendesk applications.
The YAML file we will be reviewing in this section is responsible for checking out (getting code from) our Git repo, copying it over to the cluster and queuing it as a job. The full script is shown below:
This script consists of 4 steps:
1. The step with uses: actions/checkout@v2 is an official Github Action that pulls the Git repo into the virtual machine that will be running the action.
2. The step with id: getshortsha is a helper step that gets an 8-character version of the commit hash. This will be used later as a unique identifier for the file and folder names we will copy over to the cluster. Having unique names ensures that different deploys will not overwrite one another.
3. The third step makes use of an open source action that runs an scp command to copy the repository files from the virtual machine running the action to the cluster.
4. The final step makes use of a library to SSH into the cluster and run the batching command in much the same way that we would when performing a manual run. It prepends the Git SHA to the name of the job-sbatcher.sh script so that the run appears with a unique name in the queue.
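To make the last three steps concrete, here is a sketch of roughly equivalent commands you could run by hand. It is not the workflow file itself: it uses plain ssh and scp instead of the actions, prints the commands rather than running them, and the function names and the remote jobs/ folder are hypothetical.

```shell
#!/usr/bin/env bash
# Step 2 equivalent: shorten a commit hash to its first 8 characters.
short_sha() {
  printf '%s' "$1" | cut -c1-8
}

# Steps 3 and 4 equivalent: print the copy and submit commands for one deploy.
# Arguments: full commit hash, cluster username, cluster hostname.
# The remote "jobs" folder is a hypothetical location.
deploy_commands() {
  local sha user="$2" host="$3"
  sha="$(short_sha "$1")"
  # Create a folder named after the commit so deploys never overwrite each other.
  echo "ssh $user@$host 'mkdir -p jobs/$sha'"
  # Copy the source code and the batch script into that folder.
  echo "scp -r src job-sbatcher.sh $user@$host:jobs/$sha/"
  # Prefix the script name with the SHA (the default job name is the script name), then submit.
  echo "ssh $user@$host 'cd jobs/$sha && mv job-sbatcher.sh $sha-job-sbatcher.sh && sbatch $sha-job-sbatcher.sh'"
}
```

Inside a checked-out repository, deploy_commands "$(git rev-parse HEAD)" your-username your-cluster-host prints the three commands for the current commit so you can inspect them before running anything.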
Note that both steps 3 and 4 make use of Github Secrets to store the cluster username, password and hostname without exposing them as plaintext in the repository. To add secrets, go to the Settings tab of your Github repository and find the Secrets link on the side menu.
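As an alternative to the settings page, the secrets can also be added from a terminal with the GitHub CLI (gh), assuming it is installed and authenticated for your repository. The sketch below is only illustrative: the function name and the three secret names are hypothetical, so use whatever names your workflow file actually reads.

```shell
#!/usr/bin/env bash
# Hypothetical helper: store the cluster hostname, username and password
# as repository secrets. The secret names here are assumptions.
set_cluster_secrets() {
  local host="$1" user="$2" password="$3"
  # Each value is piped on stdin, which "gh secret set" reads when no --body is given.
  printf '%s' "$host" | gh secret set CLUSTER_HOST
  printf '%s' "$user" | gh secret set CLUSTER_USERNAME
  printf '%s' "$password" | gh secret set CLUSTER_PASSWORD
}
```

To keep the password out of your shell history, you could read it interactively first, for example with read -rs pw, and then call set_cluster_secrets your-cluster-host your-username "$pw".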
Part 4: Make Your First Deploy!
We now have all the code required to enqueue our first job. Just make a push to the branch you designated in the workflow configuration file and the Github Action should start running automatically.
To verify your job has been enqueued, SSH into the cluster login node and run squeue -u your-username.
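Because each run is named after its commit, you can also look up one specific deploy in the queue. The sketch below assumes the job name starts with the 8-character SHA, as described in Part 3; the helper name job_state_for_sha is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "job-name state" lines on stdin and print the
# state of every job whose name starts with the given short SHA.
# Exits with a non-zero status when no matching job is found.
job_state_for_sha() {
  awk -v sha="$1" 'index($1, sha) == 1 { print $2; found = 1 } END { exit !found }'
}

# On the login node the input would come from squeue, for example:
#   squeue -u your-username -h -o '%j %T' | job_state_for_sha 01234567
# where -h drops the header and '%j %T' prints the job name and state.
```

A job that is still waiting would typically show up as PENDING, and one that has started as RUNNING.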
We have reviewed a simple Github Action workflow for programmatically enqueuing jobs in a super computing cluster. This approach has a few benefits compared to running jobs manually. It automatically tracks changes to the code between runs in the git history, which makes it robust to errors in note taking. Furthermore, it allows enqueuing multiple jobs with different versions of the code simultaneously, without having to manually copy the necessary files or wait for a job to start before modifying code.
Hope this was helpful!
Special thanks to PhD candidate Katie Gandomi for help with development.