If you are asking for help, try to provide information that can help us solve your issue, such as:
what did you try: Train a deep neural model on Baobab
what didn’t work: GPU was underused
what was the expected result: Explanation later
path to the relevant files (logs, sbatch script, etc):
- logs : /home/users/l/leblancq/pdm/git/git-vizwiz-65733390.out
- sbatch script : /home/users/l/leblancq/pdm/git/2080.sbatch
Description :
For a project I am trying to train a deep neural model on Baobab.
For testing purposes, I trained this model on a personal GPU that I have access to.
I noticed that performance on this GPU was better than on the cluster.
The GPU I used was a 2080Ti.
I tried to reproduce the exact same experiment on the cluster, and I was surprised to find that Baobab was around 2 times slower than the GPU I used for my test.
What could be the reason for this?
Training this model takes a long time, so performance really matters,
particularly when a run that could take 8 hours takes 14 instead.
Thank you for the attention you’ll give to this post.
(baobab)-[root@admin1 ~]$ scontrol show job 65733390
JobId=65733390 JobName=compare-git-train
UserId=XXXXX GroupId=XXXXX MCS_label=N/A
Priority=1489182 Nice=0 Account=XXXXX QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:53:29 TimeLimit=12:00:00 TimeMin=N/A
SubmitTime=2023-03-03T11:14:57 EligibleTime=2023-03-03T11:14:57
AccrueTime=2023-03-03T11:14:57
StartTime=2023-03-03T11:15:04 EndTime=2023-03-03T23:15:04 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-03T11:15:04 Scheduler=Backfill
Partition=shared-gpu AllocNode:Sid=login2:103313
ReqNodeList=(null) ExcNodeList=(null)
NodeList=gpu011
BatchHost=gpu011
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=16000M,node=1,billing=4,gres/gpu=1,gres/gpu:turing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=4 MinMemoryCPU=4000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/users/l/XXXXX/pdm/git/2080.sbatch
WorkDir=/home/users/l/XXXXX/pdm/git
StdErr=/home/users/l/XXXX/pdm/git/git-vizwiz-%J.out
StdIn=/dev/null
StdOut=/home/users/l/XXXX/pdm/git/git-vizwiz-%J.out
Power=
TresPerJob=gres:gpu:turing:1
Hi @quentin.leblanc ,
Your job is running, and I can see that it uses between 40% and 83% of the allocated GPU.
The explanation could be:
- The frequency of a “server” CPU is lower than that of a “workstation” CPU. What CPU do you have in your PC?
- The storage is network-based and can be less efficient if you do a lot of I/O. Using the local scratch (SSD) on the machine may improve performance.
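
To try the local scratch suggestion, one option is to copy the dataset to node-local storage at the start of the job and point the training script at that copy. The following is only a minimal sketch in Python, not a tested recipe for Baobab: the function name `stage_to_local_scratch` is hypothetical, and it assumes the local scratch directory is either passed in explicitly or exposed through the `TMPDIR` environment variable. Please check the cluster documentation for the actual local scratch path on the compute nodes.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_to_local_scratch(source_dir, scratch_root=None):
    """Copy source_dir to node-local storage and return the path of the copy."""
    source = Path(source_dir).resolve()
    # Assumption: the node-local scratch location is given explicitly or via
    # TMPDIR; otherwise fall back to the system temporary directory.
    root = Path(scratch_root or os.environ.get("TMPDIR") or tempfile.gettempdir())
    target = root / source.name
    # dirs_exist_ok=True (Python 3.8+) lets a restarted job reuse the directory.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

In the training script you would call it once before building the data loader, for example `data_dir = stage_to_local_scratch("/path/to/dataset")` (placeholder path), and then read from `data_dir`. The copy takes some time up front, so it mainly pays off when the same files are read many times over several epochs.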