Issue with OpenMPI/4.0.3 on Baobab

Hi all,
I am having trouble loading some modules on Baobab, and I think it boils down to OpenMPI/4.0.3. According to module spider, OpenMPI/4.0.3 should only require GCC/9.3.0, but loading it without CUDA/11.0.2 fails.

Here are the details:
weninger@login2 ~ $ module list

Currently Loaded Modules:
  1) GCCcore/9.3.0   2) zlib/1.2.11   3) binutils/2.34   4) GCC/9.3.0

weninger@login2 ~ $ module spider OpenMPI/4.0.3

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  OpenMPI: OpenMPI/4.0.3
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Description:
      The Open MPI Project is an open source MPI-3 implementation.


    You will need to load all module(s) on any one of the lines below before the "OpenMPI/4.0.3" module is available to load.

      GCC/9.3.0
      GCC/9.3.0  CUDA/11.0.2
 
    Help:
      
      Description
      ===========
      The Open MPI Project is an open source MPI-3 implementation.
      
      
      More information
      ================
       - Homepage: https://www.open-mpi.org/
     

weninger@login2 ~ $ module load OpenMPI/4.0.3
Lmod has detected the following error:  The following module(s) are unknown: "UCX/1.8.0"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore-cache load "UCX/1.8.0"

Also make sure that all modulefiles written in TCL start with the string #%Module

Executing this command requires loading "UCX/1.8.0" which failed while processing the following module(s):

    Module fullname  Module Filename
    ---------------  ---------------
    OpenMPI/4.0.3    /opt/ebmodules/all/Compiler/GCC/9.3.0/OpenMPI/4.0.3.lua
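The error above names the dependency that Lmod could not find. As a rough sketch (the function name `unknown_modules` is hypothetical, not part of Lmod), the missing module names could be pulled out of captured error text like this:

```shell
# Hypothetical helper: print each module that Lmod reported as unknown,
# one per line. Reads the captured error text on stdin.
unknown_modules() {
  sed -n 's/.*The following module(s) are unknown: *//p' | tr -d '"' | tr -s ' ' '\n'
}

# Example using the message quoted above; prints UCX/1.8.0
printf '%s\n' 'Lmod has detected the following error:  The following module(s) are unknown: "UCX/1.8.0"' | unknown_modules
```

On the cluster, the function could presumably be fed with `module load OpenMPI/4.0.3 2>&1 | unknown_modules`, assuming Lmod writes the message to stderr, and each printed name then checked with `module spider`.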

If I load OpenMPI/4.0.3 using GCC and CUDA, I can’t load the packages HDF5/1.10.6 and Armadillo/9.900.1.

Here are the logs for loading HDF5/1.10.6 (Armadillo gives the same error):
[weninger@login2.baobab ~]$ module list

Currently Loaded Modules:
  1) GCC/9.3.0       4) binutils/2.34   7) cURL/7.69.1      10) CUDA/11.0.2     13) libxml2/2.9.10     16) libevent/2.1.11          19) UCX/1.8.0-CUDA-11.0.2  22) OpenMPI/4.0.3
  2) GCCcore/9.3.0   5) ncurses/6.2     8) CMake/3.16.4     11) numactl/2.0.13  14) libpciaccess/0.16  17) Check/0.15.2             20) libfabric/1.11.0       23) pkg-config/0.29.2
  3) zlib/1.2.11     6) bzip2/1.0.8     9) CUDAcore/11.0.2  12) XZ/5.2.5        15) hwloc/2.2.0        18) GDRCopy/2.1-CUDA-11.0.2  21) PMIx/3.1.5



[weninger@login2.baobab ~]$ module spider HDF5/1.10.6

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  HDF5: HDF5/1.10.6
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Description:
      HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and
      complex data. 


    You will need to load all module(s) on any one of the lines below before the "HDF5/1.10.6" module is available to load.

      GCC/9.3.0  OpenMPI/4.0.3
      iccifort/2020.1.217  impi/2019.7.217
 
    Help:
      
      Description
      ===========
      HDF5 is a data model, library, and file format for storing and managing data.
       It supports an unlimited variety of datatypes, and is designed for flexible
       and efficient I/O and for high volume and complex data.
      
      
      More information
      ================
       - Homepage: https://portal.hdfgroup.org/display/support

[weninger@login2.baobab ~]$ module load HDF5/1.10.6
Lmod has detected the following error:  These module(s) or extension(s) exist but cannot be loaded as requested: "HDF5/1.10.6"
   Try: "module spider HDF5/1.10.6" to see how to load the module(s).

Hi,

I rebuilt UCX yesterday, but some of the builds failed. I’m rebuilding all the versions right now. It should be fixed in 1-2 hours.

Thank you for the quick fix, it works fine now!
Best, Julian

Hi,

I'm jumping in here because I have a very similar issue right now with the foss/2020b module. When I load it, I get:

Lmod has detected the following error:  The following module(s) are unknown: "UCX/1.9.0"

It seems that only “UCX/1.9.0-CUDA-11.1.1” exists now, so maybe UCX/1.9.0 without CUDA was not built?
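A check along these lines can tell a plain build apart from a suffixed variant. This is only a sketch: `has_exact_module` is a hypothetical helper, and the module list is hard-coded from the message above rather than read from the cluster.

```shell
# Hypothetical helper: succeed only if the exact module name appears
# on a line of its own in the list given on stdin.
has_exact_module() {
  grep -Fxq -- "$1"
}

# The list here contains only the CUDA variant mentioned above,
# so the exact name UCX/1.9.0 is not matched.
if printf '%s\n' 'UCX/1.9.0-CUDA-11.1.1' | has_exact_module 'UCX/1.9.0'; then
  echo "UCX/1.9.0 is present"
else
  echo "UCX/1.9.0 is missing, only other variants are listed"
fi
```

With Lmod, the list could come from something like `module -t avail UCX 2>&1`, assuming terse output prints one full module name per line.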

Hi, I rebuilt this one as well, please check.

No more error message, thanks for the quick fix!