About
Computational resource management is a key infrastructure component for large-scale distributed training in AI/ML. Engineers run many kinds of jobs, such as hyperparameter tuning, pretraining, and finetuning, and these jobs must be coordinated across multiple users sharing hundreds or thousands of computational resources. One of the most widely used tools for managing these resources is Slurm. Some of the largest supercomputers in the world have relied on Slurm for their management, and companies like Meta and OpenAI use it in their on-prem infrastructure. While there is plenty of technical information online about how to use Slurm, there is no readily available tool for testing and deploying Slurm clusters on a local computer. Such a tool would let users and developers get familiar with Slurm when they need rapid development or have no access to a deployed Slurm cluster. The aim of this project is to create such a tool, with the same objectives that minikube and kind serve for Kubernetes.