Performance issues on Baobab

If you are asking for help, try to provide information that can help us solve your issue, such as:

what did you try: Train a deep neural model on Baobab

what didn’t work: GPU was underused

what was the expected result: Explanation later

path to the relevant files (logs, sbatch script, etc):

  • logs : /home/users/l/leblancq/pdm/git/git-vizwiz-65733390.out
  • sbatch script : /home/users/l/leblancq/pdm/git/2080.sbatch

Description :

For a project, I am trying to train a deep neural model on Baobab.
For testing purposes, I trained this model on a personal GPU that I have access to.
I noticed that performance on this GPU was better than on the cluster.
The GPU I used was a 2080Ti.
I tried to reproduce the exact same experiment on the cluster, and I was surprised that Baobab was around 2 times slower than the GPU I used for my test.

What could be the reason for this?
The training for this model is long, and performance is really an issue,
particularly when what could take 8h takes 14h instead.

Thank you for the attention you’ll give to this post.

(baobab)-[root@admin1 ~]$ scontrol show job 65733390
JobId=65733390 JobName=compare-git-train
   UserId=XXXXX GroupId=XXXXX MCS_label=N/A
   Priority=1489182 Nice=0 Account=XXXXX QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:53:29 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2023-03-03T11:14:57 EligibleTime=2023-03-03T11:14:57
   AccrueTime=2023-03-03T11:14:57
   StartTime=2023-03-03T11:15:04 EndTime=2023-03-03T23:15:04 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-03T11:15:04 Scheduler=Backfill
   Partition=shared-gpu AllocNode:Sid=login2:103313
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=gpu011
   BatchHost=gpu011
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=16000M,node=1,billing=4,gres/gpu=1,gres/gpu:turing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/users/l/XXXXX/pdm/git/2080.sbatch
   WorkDir=/home/users/l/XXXXX/pdm/git
   StdErr=/home/users/l/XXXX/pdm/git/git-vizwiz-%J.out
   StdIn=/dev/null
   StdOut=/home/users/l/XXXX/pdm/git/git-vizwiz-%J.out
   Power=
   TresPerJob=gres:gpu:turing:1

Hi @quentin.leblanc ,

Your job is running, and I can see that you are using between 40% and 83% of the allocated GPU.

The explanation could be:

  • The frequency of a “server” CPU is lower than that of a “workstation” CPU. What is your CPU (on your PC)?
  • The storage on the cluster is network-based and can be less efficient if you do a lot of I/O. Using the local scratch (SSD) on the machine may improve performance.
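
One way to tell whether either explanation above applies is to measure how long the training loop waits for its next batch compared with how long it spends in the training step itself. The following is only a rough sketch, not taken from the job in question: `profile_input_pipeline`, `batches` and `train_step` are hypothetical names, and it assumes the training loop can be expressed as an iterable of batches plus a function that processes one batch.

```python
import time
from typing import Callable, Iterable, Tuple


def profile_input_pipeline(
    batches: Iterable,
    train_step: Callable,
    max_batches: int = 50,
) -> Tuple[float, float]:
    """Return (seconds waiting for data, seconds inside train_step).

    Both arguments are placeholders for your own code (assumption):
    `batches` is whatever yields training batches, `train_step` is
    whatever runs one forward/backward pass on a batch.
    """
    wait_s = 0.0
    step_s = 0.0
    it = iter(batches)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is CPU preprocessing + I/O
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # time spent here is the model computation
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
    return wait_s, step_s
```

If the waiting time is a large share of the total, the GPU is probably being starved by the CPU or by storage, which would be consistent with the 40% to 83% utilisation seen above. Note that GPU frameworks usually launch work asynchronously, so `train_step` should wait for the GPU to finish (for example by reading the loss value back) for the split to be meaningful. Running the same measurement on the PC and on the cluster would show which side of the loop is slower.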