Strange performance behaviour on Yggdrasil

Hi,

For the last ~2 weeks, I have been observing really strange behaviour on Yggdrasil.

  1. When I use srun on shared-gpu, often after the job has been allocated resources everything hangs and the job doesn’t start. At the same time, I can start an srun job on shared-gpu without any problems from a different location (same code, same data, a different copy of the repository in my scratch directory).

  2. Some jobs are extremely slow for no apparent reason. When I restart the same code from a different location, it works as expected. There are also cases where a job runs normally for several epochs and then suddenly becomes very slow.

This whole situation is problematic because right now there is no reliable cluster to run experiments on. Previously only Baobab was unusable, but Yggdrasil was reliable. Did people migrate, and is that where the performance drop comes from?

Regards,
Maciej

UPDATE: Right now I am experiencing exactly this situation (actually, both of them). I request a job on shared-gpu, get an allocation on gpu002, it hangs for ~15 minutes, and when the actual computation finally starts it is extremely slow. When I cancel and request it on debug-gpu everything goes well, but I won’t fit within the 15-minute limit. Maybe there is something wrong with gpu002?

UPDATE 2: Yes, gpu002 is the problem. All jobs running there are frozen, while the ones running elsewhere (gpu003 for example) have no issues.

Now I have the same problem on gpu007.

@support can you help with the problem?

Best regards,
Maciej

And today I am getting

slurmstepd: error: execve(): : No such file or directory
srun: error: gpu003: task 0: Exited with exit code 2

Previously I had it with gpu002.

@support What is wrong with GPU partitions on Yggdrasil?

Best regards,
Maciej

What is the full command line you typed? Did you previously load any modules?

[sagon@login1.yggdrasil ~]$ srun --partition=shared-gpu hostname
gpu002.yggdrasil

srun.sh:

#!/usr/bin/env bash

if [[ "${#}" == 5 ]]; then
  PART="${1}"
  TIME="${2}"
  MEM="${3}"
  CPU="${4}"
  EXEC="${5}"
  SUFF=""
else
  PART="${1}"
  TIME="${2}"
  MEM="${3}"
  CPU="${4}"
  EXEC="${5}"
  SUFF="${6}"
fi

module load GCC/9.3.0 Singularity/3.7.3-GCC-9.3.0-Go-1.14

srun -p "${PART}" --gpus=turing:1 --time="${TIME}" --mem="${MEM}" --cpus-per-task="${CPU}" "${SUFF}" singularity exec --nv --bind "${PWD}:/${PWD##*/}" --pwd "/${PWD##*/}" --env-file ./.env ./singularity/image.sif bash -c "${EXEC}"

Command executed:

[falkiewi@login1.yggdrasil ~]$ bash ./singularity/srun.sh shared-gpu 2:30:00 24000 8 "<experiment_command>"

Currently, the problem occurs on the nodes in debug-gpu but not on gpu002 from shared-gpu. I haven’t checked the other shared-gpu nodes.

When I previously checked the problematic nodes with the Grafana tool, I observed very high network utilization (tens of GB within one hour), but this might just be a coincidence.

I am also hearing from other members of my team that Yggdrasil has recently been unstable and unreliable, and that nodes “hanging” spontaneously is a common problem.

EDIT: gpu003 gives

slurmstepd: error: execve(): : No such file or directory
srun: error: gpu003: task 0: Exited with exit code 2

right now.

Hi,

I’m not sure I understand the point of submitting a script that does an srun inside, instead of using sbatch.

The way you are doing it, your script hangs until the job finishes. If you use sbatch, once the job is submitted you can close your terminal without any issue, and you should also include the module load inside the sbatch script to keep it clean.
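
For example, a minimal sbatch version of your wrapper could look like this (only a sketch; the resource values, the output pattern and the experiment command are placeholders taken from your srun line):

#!/usr/bin/env bash
#SBATCH --partition=shared-gpu
#SBATCH --gpus=turing:1
#SBATCH --time=2:30:00
#SBATCH --mem=24000
#SBATCH --cpus-per-task=8
#SBATCH --output=%x-%j.out

# load the modules inside the job script so the submission shell stays clean
module load GCC/9.3.0 Singularity/3.7.3-GCC-9.3.0-Go-1.14

# same container invocation as in your srun.sh
srun singularity exec --nv --bind "${PWD}:/${PWD##*/}" --pwd "/${PWD##*/}" \
  --env-file ./.env ./singularity/image.sif bash -c "<experiment_command>"

You then submit it with sbatch (e.g. sbatch ./job.sh), can safely close your terminal, and follow the progress with tail -f on the output file.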

I’m not able to reproduce your issue with a simple test on the same node:

[falkiewi@login1.yggdrasil ~]$ srun --partition=shared-gpu --nodelist=gpu003 singularity
srun: job 8945032 queued and waiting for resources
srun: job 8945032 has been allocated resources
Usage:
  singularity [global options...] <command>

Available Commands:
  build       Build a Singularity image
 [...]
Run 'singularity --help' for more detailed usage information.
srun: error: gpu003: task 0: Exited with exit code 1

This is only for dev purposes. I prefer to have a tmux session with srun and monitor everything nicely, instead of hunting for the right output file from sbatch.
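
Roughly this kind of workflow, for what it is worth (the session name and the debug-gpu parameters are just an example):

tmux new -s dev   # named session on the login node
bash ./singularity/srun.sh debug-gpu 0:15:00 24000 8 "<experiment_command>"
# detach with Ctrl-b d, reattach later with: tmux attach -t dev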

The script hangs before the job starts; once it starts, I can see the progress. But the hanging problem is not there anymore, so I guess it was slow disk access. During that period, read/write wasn’t smooth in other scenarios either.

The problem is that if you try to reproduce it at a different time, it may not be there anymore :slight_smile: The same code, executed from the exact same location in my scratch, behaves differently on different nodes, and the behaviour evolves over time.

I’ve googled the uninformative error message

slurmstepd: error: execve(): : No such file or directory
srun: error: gpu001: task 0: Exited with exit code 2

and unfortunately anything can hide underneath it. This might be an error coming from my code, but I don’t have access to it (the underlying message). Anyway, I would expect the same behaviour on all nodes of the same type, which is not what happens. I will do more tests next week.

In fact this message says exactly what it says :smile:: Slurm is unable to find the executable you specified on the srun line. The reason may be one of the following:

  • you made a typo
  • you specified a binary without the full path and the binary wasn’t found in the $PATH environment variable
  • the binary doesn’t exist
  • the binary exists on the login node but not on the compute node (filesystem not mounted, or it lives on a local disk)

A suggestion: as your srun line is built from variables, print the full line just before executing it; that makes it much easier to debug later.
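
Something like this just before the actual srun call in srun.sh, for instance (only a sketch reusing the variable names from your script; printf '%q ' quotes every argument, so an empty or malformed one shows up explicitly as ''):

# print every argument of the final command, quoted, right before running it
printf '%q ' srun -p "${PART}" --gpus=turing:1 --time="${TIME}" --mem="${MEM}" \
  --cpus-per-task="${CPU}" "${SUFF}" singularity exec --nv \
  --bind "${PWD}:/${PWD##*/}" --pwd "/${PWD##*/}" --env-file ./.env \
  ./singularity/image.sif bash -c "${EXEC}"
echo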

But is it normal that the same command fails on one node and then, executed on another node 5 seconds later, works fine? Shouldn’t the nodes be identical software-wise?

Also, why did it always work ~3 weeks ago, while now it sometimes does and sometimes doesn’t?

EDIT: And I found it :slight_smile: One of my modifications for excluding the slow nodes mentioned at the beginning of the thread was causing the problems. Thank you @Yann.Sagon for pointing me in the right direction!
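
For anyone who hits something similar: if an optional srun flag is kept in a plain variable like SUFF in the script above, a quoted but empty "${SUFF}" expands to an empty argument, and srun then tries to execute that empty string, which produces exactly this kind of execve(): : No such file or directory error. Here is a sketch of a safer variant that keeps the optional flags in a bash array (the --exclude value is only an example):

# collect optional srun flags in an array; if it stays empty, it simply
# disappears from the command line instead of becoming an empty argument
EXTRA_ARGS=()
if [[ "${#}" -ge 6 ]]; then
  EXTRA_ARGS+=("${6}")   # e.g. --exclude=gpu002
fi

srun -p "${PART}" --gpus=turing:1 --time="${TIME}" --mem="${MEM}" \
  --cpus-per-task="${CPU}" "${EXTRA_ARGS[@]}" \
  singularity exec --nv --bind "${PWD}:/${PWD##*/}" --pwd "/${PWD##*/}" \
  --env-file ./.env ./singularity/image.sif bash -c "${EXEC}"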

As a rule of thumb: it’s always the user’s fault :stuck_out_tongue_winking_eye:
