Job Submission¶
SLURM - Overview¶
The SC cluster uses SLURM, an open-source workload and resource manager for job scheduling. SLURM is used by many institutions and national laboratories around the world, so if you have prior experience with it from another institution, you should be able to apply that experience here from the get-go, as our installation is fairly standard. If you are new to SLURM, it is easy to get started, and the skills you learn will transfer well across many sectors.
On the SC cluster, each group has its own dedicated compute servers, which are set up as group-specific partitions (or queues); access is limited to users affiliated with the group. That is why a group affiliation is required when applying for access, and it is crucial to update this information when you rotate to or change groups.
Interactive Jobs (shell access)¶
An interactive job is appropriate when real-time interaction with your job is required, most commonly to prototype for a bigger job or to test and debug code. It is NOT an efficient use of the cluster, as these jobs tend to have low utilization relative to the compute capacity they reserve. Most groups have a dedicated -interactive partition, often with reduced walltime and tighter resource limits.
To request an interactive session, SSH to sc.stanford.edu and use the srun command as follows:
srun --account=your_group_account --partition=my_partition --pty bash
Likewise, to request an interactive session with a GPU, add --gres=gpu:1 (the optional --nodelist flag pins the job to a specific node):
srun --account=your_group_account --partition=my_partition --nodelist=node1 --gres=gpu:1 --pty bash
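Once the interactive shell starts, it can be worth confirming the allocation before doing any real work. A quick sanity check, assuming nvidia-smi is available on the GPU nodes:

```shell
# Run these inside the interactive session:
echo $SLURM_JOB_ID            # the job ID SLURM assigned to this session
echo $CUDA_VISIBLE_DEVICES    # index of the GPU granted to this job
nvidia-smi                    # shows the GPU model and current utilization
```

When you are done, type exit to end the session and release the resources for other users.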
Batch Jobs¶
Batch jobs are one of the most efficient ways to interact with the cluster, and are useful when real-time interaction is not required. They have two clear advantages: your job runs automatically as soon as the required resources become available, and placing your setup commands in a shell script lets you efficiently dispatch multiple similar jobs. There are many parameters you can set based on your requirements. You can reference a sample submit script at: /sailhome/software/sample-batch.sh
To start a simple batch job on a given partition, simply use the sbatch command as follows:
sbatch my_script.sh
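As a concrete illustration, a minimal submit script might look like the sketch below. The account, partition, and resource values are placeholders; consult the sample script mentioned above for your group's actual settings.

```shell
#!/bin/bash
# Hypothetical example values -- replace with your group's account and partition.
#SBATCH --account=your_group_account
#SBATCH --partition=my_partition
#SBATCH --job-name=my_job
#SBATCH --output=my_job-%j.out    # %j expands to the job ID
#SBATCH --time=01:00:00           # walltime limit (HH:MM:SS)
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# Your setup and workload commands go here.
echo "Running on $(hostname) as job $SLURM_JOB_ID"
```

Submit it with sbatch as shown above; while it waits or runs, squeue -u $USER shows its status.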
GPU¶
Users can request a specific type of GPU, or specify a vRAM or architecture constraint, if they choose to:
srun --account=your_group_account --partition=mypartition --gres=gpu:h100:1 --pty bash
The above requests one H100 GPU from any node in mypartition.
srun --account=your_group_account --partition=mypartition --gres=gpu:1 --constraint=80G --pty bash
The above requests one GPU with 80G of vRAM from any node in mypartition.
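To see which GPU types and features are actually available in your partition (and therefore which --gres and --constraint values are valid), sinfo can print each node's generic resources and feature list. The partition name here is again a placeholder:

```shell
# %N = node names, %G = generic resources (GPUs), %f = node features
sinfo --partition=mypartition -o "%N %G %f"
```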