Problem with file system on Baobab

Primary information

Username: puttigar
Cluster: Baobab

Description

Dear HPC team,

While trying to read a NumPy file located on scratch from Python, to be used for an ML training, I got the following error:

Starting job:  Fri Dec 8 14:45:03 CET 2023

Due to MODULEPATH changes, the following have been reloaded:
  1) OpenMPI/3.1.4

Number of epochs: 100
2023-12-08 14:45:06.057381: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
[gpu044][[61301,1],0][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx5_0 errno says Protocol not supported
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   gpu044
  Local device: mlx5_0
--------------------------------------------------------------------------
WARNING - Setting seed to 42
Traceback (most recent call last):
  File "CNN_model.py", line 21, in <module>
    arrTest = np.load(sys.argv[2])
  File "/opt/ebsofts/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/SciPy-bundle/2019.10-Python-3.7.4/lib/python3.7/site-packages/numpy/lib/npyio.py", line 436, in load
    magic = fid.read(N)
OSError: [Errno 70] Communication error on send
Job done:  Fri Dec 8 14:57:49 CET 2023

I also tried a simple ls -lh on the directory where the file is located, but it hangs.

Thank you in advance,

Enzo

Dear Enzo,

This is probably related to this issue: [2023] Current issues on HPC Cluster - #15 by Adrien.Albert

As the logs you showed us indicate, the software you are using is very old. Did you try with the latest version? If you are missing a software dependency, we can install it for you.

Best

Dear Yann,

Thanks for the reply. Indeed, it is related to the issue [2023] Current issues on HPC Cluster. In any case, I’ll try with the latest version of the software.

Best.

The storage is working right now.

I don’t think this is still true.
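
For anyone hitting the same OSError while the storage is unstable, here is a minimal sketch of a defensive check, not an official recommendation from the HPC team. It assumes that a hung file system shows up as an os.stat call that never returns, and that a transient error such as Errno 70 may go away after a short wait. The function names path_responds and load_with_retry, and the timeout and delay values, are hypothetical and chosen only for illustration.

```python
import os
import threading
import time


def path_responds(path, timeout_s=10.0):
    """Return True if os.stat(path) succeeds within timeout_s seconds."""
    result = {}

    def probe():
        try:
            os.stat(path)
            result["ok"] = True
        except OSError as exc:
            result["error"] = exc

    # A daemon thread is used so that a stat call stuck on a hung
    # file system does not keep the interpreter from exiting.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return result.get("ok", False)


def load_with_retry(loader, path, attempts=3, delay_s=30.0):
    """Call loader(path), retrying a few times if an OSError is raised."""
    last_error = None
    for attempt in range(1, attempts + 1):
        if path_responds(path):
            try:
                return loader(path)
            except OSError as exc:  # e.g. [Errno 70] Communication error on send
                last_error = exc
        if attempt < attempts:
            time.sleep(delay_s)
    raise OSError(f"could not read {path} after {attempts} attempts") from last_error


# Possible use in a training script (assumes numpy is imported as np):
# arrTest = load_with_retry(np.load, sys.argv[2])
```

This only makes the job fail fast or ride out a short glitch. It does not fix the underlying storage problem, which still has to be resolved on the cluster side.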
