Slurm and DGX1

Working with Slurm and DGX1

Monitoring Slurm Jobs

Working with Slurm and DGX1

The current setup includes a controller machine and compute nodes (currently two). Allowed users can log in to op-controller using their regular credentials. Working with containers is currently done through the easy_ngc program. You can run commands such as the following.

First, connect to the controller:

> ssh op-controller

DL containers from NVIDIA (NGC):

> srun -G 2 --pty easy_ngc tensorflow

Use 2 GPUs and run a container with the latest version of TensorFlow

> srun -G 3 --pty easy_ngc --cmd nvidia-smi pytorch

Use 3 GPUs and run the 'nvidia-smi' command in a container with the latest version of PyTorch

> srun -G 2 --pty easy_ngc --jupyter mxnet

Use 2 GPUs and run a Jupyter notebook server in a container with the latest version of MXNet

> srun -G 5 --mem 120G --pty easy_ngc --version 19.03-py3 tensorflow

Use 5 GPUs, allocate 120G of system memory, and use the March 2019 (19.03) release of the TensorFlow NGC container with Python 3

> srun -G 2 --pty easy_ngc --modules=imagehash --packages=julia tensorflow

Use 2 GPUs and run a container with the latest version of TensorFlow; inside the container, apt-get install julia (--packages) and pip install imagehash (--modules)

> srun -G 2 --pty easy_ngc --modules=/home_dir/requirements.txt tensorflow

Use 2 GPUs and run a container with the latest version of TensorFlow, pip-installing a list of Python packages at their latest or a specific version (you need to create your own requirements.txt file)
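The file passed to --modules is a standard pip requirements file. A minimal example (the package names and versions below are only placeholders):

```
imagehash
numpy==1.16.4
scipy>=1.2
```

Each line names one package, optionally pinned to a specific version (==) or a minimum version (>=).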

Non-DL containers (from NVIDIA or elsewhere):

> srun -G 3 --pty easy_ngc

Pull the VMD HPC container from NGC and run a command line in it (with 3 GPUs)

> srun -G 5 --pty easy_ngc --cmd 'nvidia-smi'

Pull a generic Ubuntu 19.10 container from Dockerhub and run nvidia-smi in it (with 5 GPUs)

Note: For non-DL containers (HPC or non-NGC), use the complete image name along with a version tag. The --version option will not work.
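Following the note above, a non-NGC image is requested by its complete name and version tag. The exact invocation below is an assumption (and ubuntu:19.10 is only an illustrative image name):

```
> srun -G 2 --pty easy_ngc ubuntu:19.10
```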

Note: While running containers with Slurm, please ignore messages like "groups: cannot find name for group ID XXXX" or prompts with text like "I have no name!". This happens because the operating system within the container is unable to translate UIDs to user names and GIDs to group names. The users and groups are still known to the system (as UIDs and GIDs) and everything should work properly.

This is an initial configuration, opened for you to try; it will develop and change gradually. We need your feedback.
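For non-interactive work, the same easy_ngc invocations can presumably be wrapped in a batch script and submitted with sbatch. The sketch below rests on that assumption (train.py, the job name, and the resource values are placeholders):

```shell
#!/bin/bash
#SBATCH -G 2                  # request 2 GPUs
#SBATCH --mem 120G            # request 120G of system memory
#SBATCH --job-name my-train   # name shown in squeue

# Run a command in the latest TensorFlow NGC container
# (no --pty: batch jobs have no interactive terminal)
srun easy_ngc --cmd 'python train.py' tensorflow
```

Submit it with `sbatch my-train.sbatch`; the job then appears in the squeue output described below.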



Monitoring Slurm Jobs

To monitor the status of your jobs in the Slurm partitions, use the squeue command. You will only see your own queued jobs. Options to this command help filter and format the output to meet your needs.

Squeue Option Action
--user=<username> Lists entries belonging only to username (available only to administrators)
--jobs=<job_id> Lists the entry, if any, for job_id
--partition=<partition_name> Lists entries belonging only to partition_name

Example output:

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    312  killable     bash     user  R       1:08      1 rack-gww-dgx1

Squeue Output Column Header Definition
JOBID Unique number assigned to each job
PARTITION Partition the job is scheduled to run on, or is running on
NAME Name of the job, typically the job script name
USER User ID of the job owner
ST Current state of the job (see table below for meanings)
TIME Amount of time the job has been running
NODES Number of nodes the job is scheduled to run across
NODELIST(REASON) If running, the list of nodes the job is running on; if pending, the reason the job is waiting

Valid Job States

Code State Meaning
CA Canceled Job was canceled
CD Completed Job completed
CF Configuring Job resources being configured
CG Completing Job is completing
F Failed Job terminated with non-zero exit code
NF Node Fail Job terminated due to failure of node(s)
PD Pending Job is waiting for compute node(s)
R Running Job is running on compute node(s)
TO Timeout Job terminated upon reaching its time limit