Modify Slurm configuration files to optimize scheduling, resource allocation, and job management for your GPU cluster.
Prerequisites
- kubectl CLI installed and configured
- Kubeconfig downloaded from your cluster
- Access to your cluster’s Slurm namespace
Configuration Files
Your Slurm cluster configuration is stored in a Kubernetes ConfigMap with four main files:
| File | Purpose |
|---|---|
| slurm.conf | Main cluster configuration (nodes, partitions, scheduling) |
| gres.conf | GPU and generic resource definitions |
| cgroup.conf | Control group resource management |
| plugstack.conf | SPANK plugin configuration |
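To confirm which files are present in your cluster's ConfigMap, you can list its data keys. This is a minimal sketch that assumes the ConfigMap is named slurm in the slurm namespace, as above, and that jq is installed locally:
# List the file keys stored in the slurm ConfigMap
kubectl get configmap slurm -n slurm -o json | jq -r '.data | keys[]'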
Edit Configuration
Update ConfigMap
Edit the ConfigMap directly:
kubectl edit configmap slurm -n slurm
This opens the ConfigMap in your default editor. Make your changes and save.
Alternative method:
# Export to local file
kubectl get configmap slurm -n slurm -o yaml > slurm-config.yaml
# Edit locally
# ... make your changes ...
# Apply changes
kubectl apply -f slurm-config.yaml
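Whichever method you use, consider saving a copy of the current ConfigMap first so you can roll back quickly. The backup filename here is only illustrative:
# Save a timestamped backup of the current configuration
kubectl get configmap slurm -n slurm -o yaml > slurm-config-backup-$(date +%Y%m%d-%H%M%S).yaml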
Restart Components
After editing the ConfigMap, restart the appropriate components:
For slurm.conf changes:
# Restart controller
kubectl rollout restart statefulset slurm-controller -n slurm
# Restart compute nodes
kubectl rollout restart daemonset slurm-node -n slurm
For gres.conf, cgroup.conf, or plugstack.conf changes:
# Restart compute nodes only
kubectl rollout restart daemonset slurm-node -n slurm
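If you want to block until the compute-node rollout finishes before verifying, a wait along these lines should work:
# Wait for the compute-node daemonset rollout to complete
kubectl rollout status daemonset slurm-node -n slurm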
Verify Changes
# Check rollout status
kubectl rollout status statefulset slurm-controller -n slurm
# Verify configuration in pod
kubectl exec -it slurm-controller-0 -n slurm -- cat /etc/slurm/slurm.conf
# Test Slurm functionality
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config
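As an additional check, you can diff the slurm.conf stored in the ConfigMap against the file the controller pod actually loaded; the two should match once the rollout completes. A minimal sketch using the commands above (requires bash for process substitution):
# Compare the ConfigMap copy of slurm.conf with the file inside the controller pod
diff <(kubectl get configmap slurm -n slurm -o jsonpath='{.data.slurm\.conf}') \
     <(kubectl exec slurm-controller-0 -n slurm -- cat /etc/slurm/slurm.conf)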
Configuration Examples
Define GPU Resources
Edit gres.conf to define GPU resources:
Name=gpu Type=a100 File=/dev/nvidia[0-7]
Name=gpu Type=h100 File=/dev/nvidia[8-15]
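Entries in gres.conf need a matching Gres= field on the corresponding NodeName lines in slurm.conf. After restarting the compute nodes, you can confirm the GPUs are being advertised as expected:
# Show each node together with its configured generic resources
kubectl exec -it slurm-controller-0 -n slurm -- sinfo -o "%N %G"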
Modify Partitions
Edit the partition section in slurm.conf:
PartitionName=gpu Nodes=gpu-nodes State=UP Default=NO MaxTime=24:00:00
PartitionName=cpu Nodes=cpu-nodes State=UP Default=YES
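You can confirm the partition definitions were picked up after the restart (the partition name is taken from the example above):
# Display the gpu partition as the controller sees it
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show partition gpu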
Tune Scheduler
Adjust scheduler parameters in slurm.conf:
SchedulerParameters=batch_sched_delay=10,bf_interval=180,sched_max_job_start=500
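Once the controller has restarted, a quick way to confirm the new values took effect:
# Confirm the scheduler parameters reported by the running controller
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config | grep -i SchedulerParameters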
Update Resource Allocation
Modify resource allocation settings:
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=4096 # 4GB per CPU
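As with the scheduler settings, you can read the values back from the running controller:
# Check the memory and core allocation settings currently in effect
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config | grep -Ei 'SelectTypeParameters|DefMemPerCPU'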
Enable Cgroup Limits
Edit cgroup.conf to enforce resource limits:
CgroupPlugin=cgroup/v1
ConstrainCores=yes
ConstrainRAMSpace=yes
Then update slurm.conf:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
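After restarting both the controller and compute nodes, confirm the cgroup-based plugins are active:
# Verify the process-tracking and task plugins picked up the cgroup settings
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show config | grep -iE 'ProctrackType|TaskPlugin'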
Troubleshooting
Configuration Not Applied
# Verify ConfigMap was updated
kubectl get configmap slurm -n slurm -o yaml
# Check pod age (should be recent after restart)
kubectl get pods -n slurm
# View controller logs
kubectl logs slurm-controller-0 -n slurm
Syntax Errors
# Check controller logs for errors
kubectl logs slurm-controller-0 -n slurm | grep -i error
# View recent events
kubectl get events -n slurm --sort-by='.lastTimestamp'
Pods Not Restarting
# Check rollout status
kubectl rollout status statefulset slurm-controller -n slurm
# Force delete and recreate pod
kubectl delete pod slurm-controller-0 -n slurm
Jobs Failing After Changes
# Check node status
kubectl exec -it slurm-controller-0 -n slurm -- sinfo
# Check specific node details
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show node <nodename>
# View job errors
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show job <jobid>
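If nodes went into a down or drained state after the change, the recorded reason often points at the configuration mismatch:
# List down or drained nodes along with the reason Slurm recorded
kubectl exec -it slurm-controller-0 -n slurm -- sinfo -R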
Quick Reference
View Configurations
# View all Slurm configmaps
kubectl get configmaps -n slurm | grep slurm
# View slurm.conf content
kubectl get configmap slurm -n slurm -o jsonpath='{.data.slurm\.conf}'
# View gres.conf content
kubectl get configmap slurm -n slurm -o jsonpath='{.data.gres\.conf}'
Restart Components
# Restart controller
kubectl rollout restart statefulset slurm-controller -n slurm
# Restart accounting daemon
kubectl rollout restart statefulset slurm-accounting -n slurm
# Restart compute nodes
kubectl rollout restart daemonset slurm-node -n slurm
Monitor Cluster
# Watch pod status
kubectl get pods -n slurm -w
# View logs (follow mode)
kubectl logs -f slurm-controller-0 -n slurm
# Check Slurm cluster status
kubectl exec -it slurm-controller-0 -n slurm -- sinfo
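A queue-level view can complement the node view from sinfo:
# Show pending and running jobs
kubectl exec -it slurm-controller-0 -n slurm -- squeue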
Best Practices
- Back up configurations before making changes
- Test in development before applying to production
- Make incremental changes to isolate issues
- Document your changes for future reference
- Monitor logs and jobs after applying changes
- Use version control to track configuration changes (see the sketch after this list)
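A minimal version-control sketch, assuming a local Git repository named slurm-configs; the repository path and commit message are only illustrative:
# Snapshot the ConfigMap and record it in Git before making changes
kubectl get configmap slurm -n slurm -o yaml > slurm-configs/slurm-configmap.yaml
git -C slurm-configs add slurm-configmap.yaml
git -C slurm-configs commit -m "Snapshot Slurm ConfigMap before scheduler tuning"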
Running jobs are not affected by configuration changes. Changes persist across pod restarts, and rolling restarts minimize downtime.
Additional Resources