Performance problems on Yggdrasil?

Hi all,

This is a follow-up to this thread. After @Yann.Sagon managed to install Fall3d for me on Yggdrasil, I’ve been running some benchmarks. On my previous cluster, this benchmark ran on 32 cores in ~7-8 hours. On Yggdrasil, it keeps getting killed at the 12h wall time I’ve set, even when running on 64 cores. No error is reported in any of the log files.

So, can you help me investigate the source of this loss of performance? Is there a way to monitor a job while it is running? My working folder is /home/biasses/Fall3d/Runs/CC_v3.

Thanks a lot

S

My impression is that all the problems that were previously present only on Baobab are now also on Yggdrasil.

I can see a drop in performance on the login node. I can see a drop in performance in data read/write from scratch partition. And I can see a drop in performance of internet connection speed.

Hi @Sebastien.Biasse

If you still have access to your previous cluster, would you mind letting us know which CPUs were available on the compute nodes, so we can try to find an explanation?

The command to list the CPU model is lscpu.

What is strange: you say you were using 32 cores. According to your previous post and your sbatch, you should have used 64 cores, as the problem size you are trying to solve is 4 4 4, and the Fall3d documentation has this note:

For parallel runs the total number of processors must be:
NPX * NPY * NPZ
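
As a quick sanity check of that rule, here is a minimal sketch that computes the task count a given decomposition requires. The helper name `ntasks_for` is made up for illustration and is not part of Fall3d or Slurm; the result is the value that `--ntasks` in the sbatch would need to match.

```shell
#!/bin/sh
# Hypothetical helper (not part of Fall3d or Slurm): number of MPI tasks
# required by a decomposition NPX NPY NPZ, i.e. NPX * NPY * NPZ.
ntasks_for() {
    # $1 = NPX, $2 = NPY, $3 = NPZ
    echo $(( $1 * $2 * $3 ))
}

NPX=4; NPY=4; NPZ=4
NTASKS=$(ntasks_for "$NPX" "$NPY" "$NPZ")
echo "decomposition ${NPX} ${NPY} ${NPZ} needs ${NTASKS} MPI tasks"
```

With 4 4 4 this gives 64 tasks, and with 2 2 2 it gives 8, so a 32-core run would need a different decomposition (for example 4 4 2).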

Are you comparing the run time of your job with your previous cluster for the same job, at the same size?

I wanted to run a benchmark myself following the instructions here.

But I’m unable to run it and as I don’t know the software, I have no clue what is going on.

My sbatch:

#!/bin/sh
#SBATCH --ntasks=8

ml  GCC/9.3.0  OpenMPI/4.0.3 fall3d/8.0.1

# wget "https://gitlab.com/fall3d-distribution/testsuite/-/raw/master/example-8.0/InputFiles/example-8.0.inp?inline=false" -O example-8.0.inp
# wget "https://gitlab.com/fall3d-distribution/testsuite/-/raw/master/example-8.0/InputFiles/example-8.0.pts?inline=false" -O example-8.0.pts
# wget "https://gitlab.com/fall3d-distribution/testsuite/-/raw/master/example-8.0/InputFiles/example-8.0.wrf.nc?inline=false" -O example-8.0.wrf.nc

srun Fall3d.r8.x ALL example-8.0.inp 2 2 2

And the output:

[sagon@login1.yggdrasil bench]$ cat example-8.0.SetDbs.log
---------------------------------------------------------
                      FALL3D suite
  Version : 8.0.1
  Task    : SetDbs

  Copyright (C) 2018 GNU General Public License version x
  See licence for details
---------------------------------------------------------
  Run start time     : 04 mar 2022 at 14:00:48
  Real precision     :        8
  Num. processors    :        8
       npx           :        2
       npy           :        2
       npz           :        2

  INPUT FILES
  Input        file  : example-8.0.inp

  OUTPUT FILES
  Log          file  : ./example-8.0.SetDbs.log
  Meteo (dbs)  file  : ./example-8.0.dbs.nc
  Meteo profil file  : ./example-8.0.dbs.pro

  End time           : 04 mar 2022 at 14:00:48
  CPU time (s)       :         0.
  Number of warnings : 0
  Number of errors   : 1
  ERROR TYPE         : error opening the input file WRF.tbl
  ERROR SOURCE       : dbs_read_dictionary
  Task SetDbs        : ends ABNORMALLY

Hi @Yann.Sagon,

Thanks a lot for your answer! So, lscpu returns this:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
Stepping:              4
CPU MHz:               3200.000
BogoMIPS:              6400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp_epp

I was using 32 cores on the previous cluster. The 4 4 4 option was indeed one of the many tests I’ve done on Yggdrasil with varying numbers of cores.

Ha, and good luck using any of the benchmarks they provide, I think these are there more for display than to be truly useful!!! You can run my benchmark if needed, I don’t mind if it is overwritten.

But again, after discussing with @Jonathan.Lemus, I realised I was previously using 32 cores of a single node. Since the I/O files of Fall3d are fairly heavy (5-10 Gb), I am wondering whether running across several nodes is a factor contributing to the relatively poorer performance. I will redo a test asap.

Thanks for your help (and patience) and have a nice weekend.

Hi,

lscpu was run on a compute node, right? This seems like an extremely high clock speed for a standard CPU.

On Yggdrasil, the CPUs we have are this model: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz.
The AMD CPUs we have are even slower than that.

Going by base clock speed alone (3.20 GHz vs 2.60 GHz), it means that your previous cluster is ~25% faster if you ask for the same number of cores.
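
For reference, the figure comes from the ratio of the two nominal base clocks quoted in this thread. The sketch below just does that division; the function name `clock_ratio` is made up for illustration, and the estimate ignores turbo frequencies, memory bandwidth, and I/O, so treat it as a rough upper-level guess rather than a measurement.

```shell
#!/bin/sh
# Hypothetical helper: ratio of two nominal base clocks in GHz.
# This is a back-of-the-envelope estimate only, not a benchmark.
clock_ratio() {
    # $1 = clock of the previous cluster's CPU, $2 = clock of the new one
    LC_ALL=C awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", old / new }'
}

# Gold 6134 @ 3.20GHz (previous cluster) vs Gold 6240 @ 2.60GHz (Yggdrasil)
clock_ratio 3.20 2.60
```

This prints 1.23, i.e. a bit under 25% in favour of the previous cluster, which on its own would not explain a 7-8 hour job exceeding 12 hours.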

So I’m waiting for your new test on a single node or on two nodes. You may as well try to run your code on a single AMD node, as they have 128 cores.

Best