Hi there,
@Yann.Sagon , @Luca.Capello : just wanted to confirm: has the maintenance been extended to the 17th as well? My colleagues and I cannot log into baobab2.unige.ch or baobab.unige.ch.
Sincerely,
Jason
Looks like the login node is now accessible and jobs can be submitted, but we seem to be hitting the same problem as always: certain GPU nodes don't have CUDA access. This is a recurring problem and probably deserves a more thorough look, since the temporary fix has to be reapplied after almost every maintenance. Logs below:
ramapur0@login2:~/kanerva_plus_plus/fixed_experiments$ srun --partition=shared-gpu-EL7 --gres=gpu:1 --mem=32000 --cpus-per-task=2 --time=12:00:00 --pty bash
ramapur0@gpu006:~/kanerva_plus_plus/fixed_experiments$ singularity exec --nv /home/ramapur0/docker/pytorch1.1.0_cuda9.simg python -c "import torch; print(torch.__version__); print('cuda = ', torch.cuda.is_available())"
1.1.0
cuda = True
ramapur0@login2:~/kanerva_plus_plus/fixed_experiments$ srun --partition=kalousis-gpu-EL7 --gres=gpu:1 --mem=32000 --cpus-per-task=2 --time=12:00:00 --pty bash
ramapur0@gpu008:~/kanerva_plus_plus/fixed_experiments$ singularity exec --nv /home/ramapur0/docker/pytorch1.1.0_cuda9.simg python -c "import torch; print(torch.__version__); print('cuda = ', torch.cuda.is_available())"
1.1.0
cuda = False
ramapur0@login2:~/kanerva_plus_plus/fixed_experiments$ srun --partition=cui-gpu-EL7 --gres=gpu:1 --mem=32000 --cpus-per-task=2 --time=12:00:00 --pty bash
ramapur0@gpu009:~/kanerva_plus_plus/fixed_experiments$ singularity exec --nv /home/ramapur0/docker/pytorch1.1.0_cuda9.simg python -c "import torch; print(torch.__version__); print('cuda = ', torch.cuda.is_available())"
1.1.0
cuda = False
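For reference, here is a non-interactive sketch of the same check (it assumes the image path and partition names from the logs above) that sweeps the three partitions in one go instead of opening an interactive shell on each:

# Sketch only: run the PyTorch/CUDA check on one GPU of each partition.
for PART in shared-gpu-EL7 kalousis-gpu-EL7 cui-gpu-EL7; do
    echo "=== ${PART} ==="
    srun --partition=${PART} --gres=gpu:1 --mem=32000 --cpus-per-task=2 --time=00:05:00 \
        singularity exec --nv /home/ramapur0/docker/pytorch1.1.0_cuda9.simg \
        python -c "import torch; print(torch.__version__); print('cuda = ', torch.cuda.is_available())"
done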
Hi, I confirm that gpu008 does not work.
I used the "test" case of https://gitlab.unige.ch/hpc/softs/commit/b9973e982654776742faefd79f016777e9ad56e6.
I also tried with versions 2.4.2 and 3.2.1-1.1.el7 of Singularity.
It would be interesting to know why this happens at every update. In a private conversation, Jason raised the hypothesis that it may be something like a Titan vs. P100 problem. As gpu011 is down, I was not able to test this; gpu011 has a 2080 Ti, so it would be interesting to test there as well.
In any case, I think the test case needs to be modified to cover at least every different kind of GPU, i.e. one node with a P100, one with a Titan and one with a 2080 Ti (see the sketch below).
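A possible sketch of that idea, submitting the existing Singularity test once per GPU model (the node-to-GPU mapping below is the one reported later in this thread and may change):

# Sketch only: one representative node per GPU model
# (gpu002 = TITAN X, gpu006 = P100, gpu008 = TITAN Xp, gpu011 = RTX 2080 Ti).
for NODE in gpu002 gpu006 gpu008 gpu011; do
    sbatch --nodelist=${NODE} \
        --output=./cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch_-_${NODE}_-_slurm-%j.out \
        ./cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch
done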
Thanks in advance for your help.
Hello,
I have the same problem: I cannot log into baobab2.unige.ch or baobab.unige.ch either.
Sincerely,
Damien
My access is now restored.
Thank you
Best regards
Damien
Hi there,
For the next time, please use the test case we provide (as @Pablo.Strasser did, cf. p/pytorch/cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch: new (b9973e98) · Commits · hpc / softs · GitLab) and post the full output.
There are two questions here:
I will reply to the two questions separately in a moment.
Thx, bye,
Luca
Edit: link to Pablo's post fixed.
For information, the link you posted (https://gitlab.unige.ch/prods/ies/recherche/hpc/issues/620#note_19506) is not accessible to normal users. Feel free to delete this post once read.
This link is not accessible. The demo I used is a simple MWE that demonstrates this recurring problem. Note that @Pablo.Strasser was able to run your MWE and reproduced the same problem.
Ideally, any MWE that demonstrates the problem should be run on every GPU after a deployment, since this is a perennial regression that happens after almost every upgrade on kalousis-gpu-EL7 & cui-gpu-EL7.
Edit: it looks like a change made this morning (19 Aug) has resolved the issue.
Hi there,
This can be tested with the upstream deviceQuery tool (cf. https://baobabmaster.unige.ch/enduser/src/enduser/applications.html#nvida-cuda ) and our "internal" CUDA test case (cf. c/cuda/cuda_visible_devices_9.1.85.sbatch · f172a4888789c8f8cdc9c97c5d36d47f5b68f789 · hpc / softs · GitLab):
ATTENTION: the sbatch asks for one single GPU!
capello@login2:~/scratch/softs/c/cuda (master)$ for I in gpu{002..011}; do \
    sbatch --nodelist=${I} --output=./cuda_visible_devices_9.1.85.sbatch_-_slurm-%j.out ./cuda_visible_devices_9.1.85.sbatch; \
done
Submitted batch job 19719987
Submitted batch job 19719988
Submitted batch job 19719989
Submitted batch job 19719990
Submitted batch job 19719991
Submitted batch job 19719992
Submitted batch job 19719993
Submitted batch job 19719994
Submitted batch job 19719995
Submitted batch job 19719996
capello@login2:~/scratch/softs/c/cuda (master)$ for I in {87..96}; do grep -e 'hostname:' -e '^Device' cuda_visible_devices_9.1.85.sbatch_-_slurm-197199${I}.out; echo; done
I: full hostname: gpu002.cluster
Device 0: "TITAN X (Pascal)"

I: full hostname: gpu003.cluster
Device 0: "TITAN X (Pascal)"

I: full hostname: gpu004.cluster
Device 0: "Tesla P100-PCIE-12GB"

I: full hostname: gpu005.cluster
Device 0: "Tesla P100-PCIE-12GB"

I: full hostname: gpu006.cluster
Device 0: "Tesla P100-PCIE-12GB"

I: full hostname: gpu007.cluster
Device 0: "Tesla P100-PCIE-12GB"

I: full hostname: gpu008.cluster
Device 0: "TITAN Xp"

I: full hostname: gpu009.cluster
Device 0: "TITAN Xp"

I: full hostname: gpu010.cluster
Device 0: "TITAN Xp"

I: full hostname: gpu011.cluster
Device 0: "GeForce RTX 2080 Ti"

capello@login2:~/scratch/softs/c/cuda (master)$
Thx, bye,
Luca
I confirm that, as Jason said in his last edit, it works now. I tested with the "official" test using PyTorch.
Is it possible to know what changes were made on the nodes,
so that we know what the problem was for next time?
Hi there,
We provide two different tests:
capello@login2:~/scratch/softs/p/pytorch (master)$ for I in gpu{002..011}; do \
    sbatch --nodelist=${I} --output=./cuda_9.1.85_-_device_count.sbatch_-_slurm-%j.out ./cuda_9.1.85_-_device_count.sbatch; \
done
Submitted batch job 19720202
Submitted batch job 19720203
Submitted batch job 19720204
Submitted batch job 19720205
Submitted batch job 19720206
Submitted batch job 19720207
Submitted batch job 19720208
Submitted batch job 19720209
Submitted batch job 19720210
Submitted batch job 19720211
capello@login2:~/scratch/softs/p/pytorch (master)$
capello@login2:~/scratch/softs/p/pytorch (master)$ ls -l pytorch.simg
-rwxr-xr-x 1 capello unige 2637373471 Jul 2 16:53 pytorch.simg
capello@login2:~/scratch/softs/p/pytorch (master)$ for I in gpu{002..011}; do \
    sbatch --nodelist=${I} --output=./cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch_-_slurm-%j.out ./cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch; \
done
Submitted batch job 19720722
Submitted batch job 19720723
Submitted batch job 19720724
Submitted batch job 19720725
Submitted batch job 19720726
Submitted batch job 19720727
Submitted batch job 19720728
Submitted batch job 19720729
Submitted batch job 19720730
Submitted batch job 19720731
capello@login2:~/scratch/softs/p/pytorch (master)$
I will report back once the jobs above have been completed.
Thx, bye,
Luca
Some operation performed this morning (August 19) has fixed this issue, as per my earlier post. It would be good to have the cause identified before the next cluster upgrade, though.
Hi @Jason.Ramapuram , @Pablo.Strasser,
FYI, the only change done this morning (and since last Friday afternoon) was fixing the SSH configuration on login2 (as per the announcement, cf. msgid:baobab-announce.46@anonymous).
I understand your frustration; we are trying hard to gather as much information as possible to actually fix any issue you encounter in as little time as we can.
Thank you for all the suggestions both of you have given here (and in the past). The idea behind providing test scripts is to give us the possibility to run tests before declaring a node (or the whole cluster) functional. And FYI, again, each installation ends with a series of automatic tests (mostly for configurations outside Slurm).
On top of that, however, on Friday afternoon I did test the following Slurm jobs (cf. Files · f172a4888789c8f8cdc9c97c5d36d47f5b68f789 · hpc / softs · GitLab) on a single CPU or GPU node:
The two PyTorch scripts we have (even the one with Singularity, cf. Update on Baobab Maintenance Aug 15-16? + Persistent Post Maintenance GPU Issue - #11 by Luca.Capello) ended up on gpu006.
Given that all CPU/GPU node installations are the same, with each GPU node simply having the upstream CUDA libraries as a plus, I assumed everything was OK.
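If it helps for future maintenances, that assumption could itself be sanity-checked with a quick sweep over the GPU nodes, for instance as sketched here (it assumes every GPU node is reachable through shared-gpu-EL7; otherwise the partition has to be adapted per node):

# Sketch only: compare GPU model and driver version across the GPU nodes.
for I in gpu{002..011}; do
    srun --nodelist=${I} --partition=shared-gpu-EL7 --gres=gpu:1 --time=00:02:00 \
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
done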
I will come back as soon as the tests at Update on Baobab Maintenance Aug 15-16? + Persistent Post Maintenance GPU Issue - #11 by Luca.Capello have finished, at least to have a reference for the next maintenance.
Thx, bye,
Luca
Thanks a lot for the information & all the hard work @Luca.Capello ! It is much appreciated!
Hi there,
Here we are:
capello@login2:~/scratch/softs/p/pytorch (master)$ for I in {02..11}; do \
    cat cuda_9.1.85_-_device_count.sbatch_-_slurm-197202${I}.out; \
    echo; \
done
I: full hostname: gpu002.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu003.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu004.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu005.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu006.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu007.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu008.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu009.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu010.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu011.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

capello@login2:~/scratch/softs/p/pytorch (master)$
capello@login2:~/scratch/softs/p/pytorch (master)$ for I in {22..31}; do \
    cat cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch_-_slurm-197207${I}.out; \
    echo; \
done
I: full hostname: gpu002.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu003.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu004.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu005.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu006.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu007.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu008.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu009.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu010.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu011.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

capello@login2:~/scratch/softs/p/pytorch (master)$
Thus, the local tests are OK; I will check how to add them to our automatic installation-is-finished-OK script so that they are logged, at least for maintenances.
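As a rough idea of what that could look like (a sketch only; the log directory below is just an example, not an existing path):

# Sketch only: submit both test sbatch files to every GPU node and keep the
# output under a per-maintenance log directory.
LOGDIR=${HOME}/maintenance-logs/$(date +%Y%m%d)
mkdir -p ${LOGDIR}
for I in gpu{002..011}; do
    for TEST in ./cuda_9.1.85_-_device_count.sbatch ./cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch; do
        sbatch --nodelist=${I} --output=${LOGDIR}/$(basename ${TEST})_-_${I}_-_slurm-%j.out ${TEST}
    done
done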
Thx, bye,
Luca
PS, some links to previous discussions:
A small clarification:
I: CUDA_VISIBLE_DEVICES: 0
does not mean that zero devices were found, but that one device was found, with index zero.
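A quick way to see the difference, for instance from inside the PyTorch Singularity image used above (a sketch; the commented results assume at least one GPU has been allocated to the job):

# CUDA_VISIBLE_DEVICES=0 exposes the single device with index 0...
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.device_count())"   # prints 1
# ...while an empty value hides every device.
CUDA_VISIBLE_DEVICES= python -c "import torch; print(torch.cuda.device_count())"    # prints 0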