Update on Baobab Maintenance Aug 15-16? + Persistent Post Maintenance GPU Issue

Hi there,

@Yann.Sagon, @Luca.Capello: could you confirm whether the maintenance has been extended to the 17th as well? My colleagues and I cannot log into baobab2.unige.ch or baobab.unige.ch.

Sincerely,
Jason


Looks like the login node is now accessible and jobs can be deployed, but we seem to be hitting the same problem as always: certain GPU nodes don’t have CUDA access. This is a recurring problem and probably deserves a more thorough look, since the temporary fix has to be reapplied after almost every maintenance. Logs below:

ramapur0@login2:~/kanerva_plus_plus/fixed_experiments$ srun --partition=shared-gpu-EL7 --gres=gpu:1 --mem=32000 --cpus-per-task=2 --time=12:00:00 --pty bash
ramapur0@gpu006:~/kanerva_plus_plus/fixed_experiments$ singularity exec --nv /home/ramapur0/docker/pytorch1.1.0_cuda9.simg python -c 'import torch; print(torch.__version__); print("cuda = ", torch.cuda.is_available())'
1.1.0
cuda = True

ramapur0@login2:~/kanerva_plus_plus/fixed_experiments$ srun --partition=kalousis-gpu-EL7 --gres=gpu:1 --mem=32000 --cpus-per-task=2 --time=12:00:00 --pty bash
ramapur0@gpu008:~/kanerva_plus_plus/fixed_experiments$ singularity exec --nv /home/ramapur0/docker/pytorch1.1.0_cuda9.simg python -c 'import torch; print(torch.__version__); print("cuda = ", torch.cuda.is_available())'
1.1.0
cuda = False

ramapur0@login2:~/kanerva_plus_plus/fixed_experiments$ srun --partition=cui-gpu-EL7 --gres=gpu:1 --mem=32000 --cpus-per-task=2 --time=12:00:00 --pty bash
ramapur0@gpu009:~/kanerva_plus_plus/fixed_experiments$ singularity exec --nv /home/ramapur0/docker/pytorch1.1.0_cuda9.simg python -c 'import torch; print(torch.__version__); print("cuda = ", torch.cuda.is_available())'
1.1.0
cuda = False


Hi, I confirm that gpu008 does not work.

I used the “test” case of https://gitlab.unige.ch/hpc/softs/commit/b9973e982654776742faefd79f016777e9ad56e6 .
I also tried with versions 2.4.2 and 3.2.1-1.1.el7 of Singularity.
It would be interesting to know why this happens after every update. In a private conversation, Jason had the hypothesis that it may be something like a Titan vs P100 problem. As gpu011 is down, I was not able to test this hypothesis. gpu011 has a 2080 Ti, so it would be interesting to test there.

In any case, I think the test case needs to be modified to cover at least every different kind of GPU, i.e. one node with a P100, one node with a Titan and one node with a 2080 Ti, for example as sketched below.
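A minimal sketch of what I mean, reusing the existing Singularity test and submitting it once per GPU model (the node-to-model mapping below is my assumption and would need to be double-checked):

# Sketch: run the existing Singularity PyTorch test once per GPU model.
# Node choices are assumptions: gpu002 (Titan X), gpu004 (P100), gpu008 (Titan Xp), gpu011 (2080 Ti).
for NODE in gpu002 gpu004 gpu008 gpu011; do
    sbatch --nodelist=${NODE} \
           --output=./gpu_model_check_-_${NODE}_-_slurm-%j.out \
           ./cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch
done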

Thanks in advance for your help.


Hello,
I have the same problem: I cannot log in to either baobab2.unige.ch or baobab.unige.ch.
Sincerely,
Damien

My access is now restored.
Thank you
Best regards
Damien

Hi there,

Next time, please use the test case we provide (as @Pablo.Strasser did, cf. p/pytorch/cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch: new (b9973e98) · Commits · hpc / softs · GitLab) and post the full output.
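For example, a minimal run could look like this (a sketch; it assumes the sbatch file is in the current directory and that the output goes to the default slurm-<jobid>.out, unless the script sets its own output name):

sbatch ./cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch
# once the job has completed, post the complete content of its output file, e.g.:
cat slurm-<jobid>.out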

There are two questions here:

  1. is CUDA installed and correctly working standalone?
  2. if so, is it working with other software?

I will reply to the two questions separately in a moment.
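In the meantime, a quick standalone check for the first question is to query the driver directly on an allocated GPU node, e.g. (a sketch, not one of our official test cases):

# Sketch: allocate one GPU on a GPU node and ask the driver what it sees
srun --partition=shared-gpu-EL7 --gres=gpu:1 --time=00:05:00 nvidia-smi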

Thx, bye,
Luca

Edit: link to Pablo’s post fixed.

For information, the link you posted (https://gitlab.unige.ch/prods/ies/recherche/hpc/issues/620#note_19506) is not accessible to normal users. Feel free to delete this post once read.

This link is not accessible. The demo I used is a simple MWE that demonstrates this recurring problem. Note that @Pablo.Strasser was able to run your MWE and reproduced the same problem.

Ideally, any MWE that demonstrates the problem should be run on every GPU node after a deployment, since this is a perennial regression that happens after almost every upgrade on kalousis-gpu-EL7 and cui-gpu-EL7.

Edit: it looks like a change made this morning (19 Aug) has resolved the issue.


Hi there,

This can be tested with the upstream deviceQuery tool (cf. https://baobabmaster.unige.ch/enduser/src/enduser/applications.html#nvida-cuda ) and our “internal” CUDA test case (cf. c/cuda/cuda_visible_devices_9.1.85.sbatch · f172a4888789c8f8cdc9c97c5d36d47f5b68f789 · hpc / softs · GitLab ):
ATTENTION: the sbatch asks for a single GPU!

capello@login2:~/scratch/softs/c/cuda (master)$ for I in gpu{002..011}; do \
    sbatch --nodelist=${I} --output=./cuda_visible_devices_9.1.85.sbatch_-_slurm-%j.out ./cuda_visible_devices_9.1.85.sbatch; \
 done
Submitted batch job 19719987
Submitted batch job 19719988
Submitted batch job 19719989
Submitted batch job 19719990
Submitted batch job 19719991
Submitted batch job 19719992
Submitted batch job 19719993
Submitted batch job 19719994
Submitted batch job 19719995
Submitted batch job 19719996
capello@login2:~/scratch/softs/c/cuda (master)$ for I in {87..96}; do grep -e 'hostname:' -e '^Device' cuda_visible_devices_9.1.85.sbatch_-_slurm-197199${I}.out; echo; done
I: full hostname: gpu002.cluster
Device 0: "TITAN X (Pascal)"

I: full hostname: gpu003.cluster
Device 0: "TITAN X (Pascal)"

I: full hostname: gpu004.cluster
Device 0: "Tesla P100-PCIE-12GB"

I: full hostname: gpu005.cluster
Device 0: "Tesla P100-PCIE-12GB"

I: full hostname: gpu006.cluster
Device 0: "Tesla P100-PCIE-12GB"

I: full hostname: gpu007.cluster
Device 0: "Tesla P100-PCIE-12GB"

I: full hostname: gpu008.cluster
Device 0: "TITAN Xp"

I: full hostname: gpu009.cluster
Device 0: "TITAN Xp"

I: full hostname: gpu010.cluster
Device 0: "TITAN Xp"

I: full hostname: gpu011.cluster
Device 0: "GeForce RTX 2080 Ti"

capello@login2:~/scratch/softs/c/cuda (master)$ 

Thx, bye,
Luca

I confirm that, as Jason said in his last edit, it works now. I tested with the “official” PyTorch test.
Is it possible to know what changes were made on the nodes, so that we know what the problem was for next time?


Hi there,

We provide two different tests:

  1. Pytorch only, i.e. everything is provided by the cluster:
    p/pytorch/cuda_9.1.85_-_device_count.sbatch · f172a4888789c8f8cdc9c97c5d36d47f5b68f789 · hpc / softs · GitLab
capello@login2:~/scratch/softs/p/pytorch (master)$ for I in gpu{002..011}; do \
    sbatch --nodelist=${I} --output=./cuda_9.1.85_-_device_count.sbatch_-_slurm-%j.out ./cuda_9.1.85_-_device_count.sbatch; \
 done
Submitted batch job 19720202
Submitted batch job 19720203
Submitted batch job 19720204
Submitted batch job 19720205
Submitted batch job 19720206
Submitted batch job 19720207
Submitted batch job 19720208
Submitted batch job 19720209
Submitted batch job 19720210
Submitted batch job 19720211
capello@login2:~/scratch/softs/p/pytorch (master)$ 
  2. Pytorch via Singularity:
    p/pytorch/cuda_-_matrix_zeros.py · f172a4888789c8f8cdc9c97c5d36d47f5b68f789 · hpc / softs · GitLab
capello@login2:~/scratch/softs/p/pytorch (master)$ ls -l pytorch.simg 
-rwxr-xr-x 1 capello unige 2637373471 Jul  2 16:53 pytorch.simg
capello@login2:~/scratch/softs/p/pytorch (master)$ for I in gpu{002..011}; do \
    sbatch --nodelist=${I} --output=./cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch_-_slurm-%j.out ./cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch; \
 done
Submitted batch job 19720722
Submitted batch job 19720723
Submitted batch job 19720724
Submitted batch job 19720725
Submitted batch job 19720726
Submitted batch job 19720727
Submitted batch job 19720728
Submitted batch job 19720729
Submitted batch job 19720730
Submitted batch job 19720731
capello@login2:~/scratch/softs/p/pytorch (master)$ 

I will report back once the jobs above have been completed.

Thx, bye,
Luca

Some action taken this morning (August 19) has fixed this issue, as per my earlier post. It would be good to have this identified for the next cluster upgrade, though.

Hi @Jason.Ramapuram , @Pablo.Strasser,

FYI, the only change done this morning (and since last Friday afternoon) was fixing the SSH configuration on login2 (as per the announcement, cf. msgid:baobab-announce.46@anonymous).

I understand your frustration; we are trying hard to gather as much information as possible to actually fix any issue you encounter in as little time as we can.

Thank you for all the suggestions both of you gave here (and in the past). The idea behind us providing test scripts is to give us the possibility to run tests before declaring a node (or the whole cluster) functional. And FYI, again, each installation ends with a series of automatic tests (mostly for configurations outside Slurm).

On top of that, however, Friday afternoon I did test the following Slurm jobs (cf. Files · f172a4888789c8f8cdc9c97c5d36d47f5b68f789 · hpc / softs · GitLab ) on a single CPU or GPU node:

  • m/matlab/parallel/launch.sh (graphical standalone MATLAB as well)
  • m/matplotlib/sbatch_3.0.0-Python-3.6.6.sh
  • m/mpi4py/launchMPI4py.sh
  • p/palabos/runPalabosCavity3d.sh
  • p/palabos/runPalabosCavity3d_multi.sh
  • p/pytorch/cuda_9.1.85_-_device_count.sbatch
  • p/pytorch/cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch

The two Pytorch scripts we have (even the one with Singularity, cf. Update on Baobab Maintenance Aug 15-16? + Persistent Post Maintenance GPU Issue - #11 by Luca.Capello) ended up on gpu006.

Given that all CPU/GPU node installations are the same, with each GPU node simply having the upstream CUDA libraries as a plus, I assumed everything was OK.
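One way to double-check that assumption (a sketch, not part of our current test suite; partition selection is deliberately left out for brevity) would be to compare the GPU model and driver version reported on every GPU node:

# Sketch: report GPU model and driver version per node, to spot any node that diverges
for I in gpu{002..011}; do
    sbatch --nodelist=${I} --gres=gpu:1 \
           --output=./driver_check_-_%N_-_slurm-%j.out \
           --wrap='nvidia-smi --query-gpu=name,driver_version --format=csv,noheader'
done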

I will come back as soon as the tests at Update on Baobab Maintenance Aug 15-16? + Persistent Post Maintenance GPU Issue - #11 by Luca.Capello have finished, at least to have a reference for the next maintenance.

Thx, bye,
Luca

Thanks a lot for the information & all the hard work @Luca.Capello ! It is much appreciated!

Hi there,

Here we are:

  1. Pytorch only:
capello@login2:~/scratch/softs/p/pytorch (master)$ for I in {02..11}; do \
    cat cuda_9.1.85_-_device_count.sbatch_-_slurm-197202${I}.out; \
    echo; \
 done
I: full hostname: gpu002.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu003.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu004.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu005.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu006.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu007.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu008.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu009.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu010.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu011.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

capello@login2:~/scratch/softs/p/pytorch (master)$ 
  2. Pytorch via Singularity:
capello@login2:~/scratch/softs/p/pytorch (master)$ for I in {22..31}; do \
    cat cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch_-_slurm-197207${I}.out; \
    echo; \
 done
I: full hostname: gpu002.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu003.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu004.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu005.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu006.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu007.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu008.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu009.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu010.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu011.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

capello@login2:~/scratch/softs/p/pytorch (master)$ 

Thus, the local tests are OK. I will check how to add them to our automatic-installation-is-finished-OK script so that they are logged, at least for maintenances.
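A rough sketch of what such a post-maintenance check could look like (the log directory and its path are placeholders, not the actual installation script):

#!/bin/sh
# Sketch: submit both PyTorch tests on every GPU node and keep the outputs
# in a single directory for later inspection.
LOGDIR=./post_maintenance_gpu_checks_$(date +%Y%m%d)   # placeholder location
mkdir -p "${LOGDIR}"
for I in gpu{002..011}; do
    for TEST in cuda_9.1.85_-_device_count.sbatch \
                cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch; do
        sbatch --nodelist=${I} \
               --output="${LOGDIR}/${TEST}_-_${I}_-_slurm-%j.out" \
               "./${TEST}"
    done
done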

Thx, bye,
Luca

PS, some links to previous discussions:

A small clarification:

I: CUDA_VISIBLE_DEVICES: 0

doesn’t mean that zero devices were found, but that one device was found, with index zero.
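For example, inside a job that was granted a single GPU (a sketch; it assumes PyTorch is available in the job environment, e.g. via the module or Singularity image used in the tests above):

# CUDA_VISIBLE_DEVICES lists the *indices* of the GPUs visible to the job,
# so a single entry "0" means one visible device, whose index is 0.
echo ${CUDA_VISIBLE_DEVICES}
# -> 0
python -c 'import torch; print(torch.cuda.device_count())'
# -> 1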