Primary information
Username: puttigar
Cluster: Baobab
Description
Dear HPC team,
While trying to read a NumPy file located on scratch from a Python script used for ML training, I got the following error:
Starting job: Fri Dec 8 14:45:03 CET 2023
Due to MODULEPATH changes, the following have been reloaded:
1) OpenMPI/3.1.4
Number of epochs: 100
2023-12-08 14:45:06.057381: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
[gpu044][[61301,1],0][btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx5_0 errno says Protocol not supported
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: gpu044
Local device: mlx5_0
--------------------------------------------------------------------------
WARNING - Setting seed to 42
Traceback (most recent call last):
  File "CNN_model.py", line 21, in <module>
    arrTest = np.load(sys.argv[2])
  File "/opt/ebsofts/MPI/GCC-CUDA/8.3.0-10.1.243/OpenMPI/3.1.4/SciPy-bundle/2019.10-Python-3.7.4/lib/python3.7/site-packages/numpy/lib/npyio.py", line 436, in load
    magic = fid.read(N)
OSError: [Errno 70] Communication error on send
Job done: Fri Dec 8 14:57:49 CET 2023
I also tried simply running ls -lh on the directory where the file is located, but the command hangs.
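For reference, the hang can probably be reproduced without blocking the shell by running the listing in a child process with a timeout. The snippet below is only a minimal sketch: the function name probe_path and the parameter timeout_s are hypothetical names chosen for illustration, and it assumes that ls is available on the node. If the child ends up in uninterruptible I/O wait it may linger after the kill, so the sketch deliberately does not wait for it.

```python
import subprocess


def probe_path(path, timeout_s=10.0):
    """Run `ls -lhd` on `path` in a child process.

    Returns "ok" if ls succeeds, "error" if ls exits with a non-zero
    status, and "timeout" if it has not finished after `timeout_s` seconds.
    (Hypothetical helper, not part of any cluster tooling.)
    """
    # Output is discarded so that no pipe has to be drained on timeout.
    child = subprocess.Popen(
        ["ls", "-lhd", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Ask the kernel to kill the child but do not wait for it:
        # a process blocked on a hung filesystem may not exit promptly.
        child.kill()
        return "timeout"
    return "ok" if returncode == 0 else "error"
```

Calling probe_path with the directory that holds the file should return "timeout" while the problem persists and "ok" once the filesystem responds again.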
Thank you in advance,
Enzo