Update on Baobab Maintenance Aug 15-16? + Persistent Post Maintenance GPU Issue

Hi there,

Here we are:

  1. PyTorch only:
capello@login2:~/scratch/softs/p/pytorch (master)$ for I in {02..11}; do \
    cat cuda_9.1.85_-_device_count.sbatch_-_slurm-197202${I}.out; \
    echo; \
 done
I: full hostname: gpu002.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu003.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu004.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu005.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu006.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu007.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu008.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu009.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu010.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

I: full hostname: gpu011.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
torch.cuda.device_count: 1

capello@login2:~/scratch/softs/p/pytorch (master)$ 

  2. PyTorch via Singularity:
capello@login2:~/scratch/softs/p/pytorch (master)$ for I in {22..31}; do \
    cat cuda_9.2.148.1_-_matrix_zeros_-_singularity.sbatch_-_slurm-197207${I}.out; \
    echo; \
 done
I: full hostname: gpu002.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu003.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu004.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu005.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu006.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu007.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu008.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu009.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu010.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

I: full hostname: gpu011.cluster
I: CUDA_VISIBLE_DEVICES: 0
=====
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')

capello@login2:~/scratch/softs/p/pytorch (master)$ 

Thus, the local tests are OK on all ten nodes (gpu002 to gpu011). I will check how to add them to our automatic-installation-is-finished-OK script so that the results are logged, at least after each maintenance.
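
As a rough idea of how the logged results could be checked automatically, here is a minimal sketch. It assumes the job outputs keep exactly the format pasted above (a hostname line, a CUDA_VISIBLE_DEVICES line, a ===== separator, then one result line). The function names (parse_gpu_reports, node_ok, failing_nodes) are hypothetical and not part of our actual script.

```python
import re
from typing import Dict, List


def parse_gpu_reports(text: str) -> List[Dict[str, str]]:
    """Split concatenated job outputs into one record per node.

    Assumption: each report starts with an 'I: full hostname:' line,
    as in the outputs pasted above.
    """
    records: List[Dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("I: full hostname:"):
            current = {"hostname": line.split(":", 2)[2].strip(), "result": ""}
            records.append(current)
        elif current is None or not line or line == "=====":
            continue
        elif line.startswith("I: CUDA_VISIBLE_DEVICES:"):
            current["cuda_visible_devices"] = line.split(":", 2)[2].strip()
        elif line.startswith(("torch.cuda.device_count:", "tensor(")):
            # Only the two result shapes seen above are recognised.
            current["result"] = line
    return records


def node_ok(record: Dict[str, str]) -> bool:
    """True if the node saw at least one GPU or produced a CUDA tensor."""
    result = record["result"]
    match = re.fullmatch(r"torch\.cuda\.device_count:\s*(\d+)", result)
    if match:
        return int(match.group(1)) >= 1
    return result.startswith("tensor(") and "device='cuda" in result


def failing_nodes(text: str) -> List[str]:
    """Hostnames whose report is missing or does not show a usable GPU."""
    return [r["hostname"] for r in parse_gpu_reports(text) if not node_ok(r)]


if __name__ == "__main__":
    import sys

    # Usage: python check_gpu_reports.py slurm-*.out
    combined = "\n".join(open(path).read() for path in sys.argv[1:])
    bad = failing_nodes(combined)
    print("GPU check:", "OK" if not bad else "FAILED on " + ", ".join(bad))
    sys.exit(1 if bad else 0)
```

A node whose job printed no result line at all (for example because the job crashed) is reported as failing too, which seems the safer default for a post-maintenance check.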

Thx, bye,
Luca

PS, some links to previous discussions: