Hi everyone! I am using baobab for semidefinite optimization with a program called sdpb. The specifics don’t really matter to this question, other than the fact that it relies on OpenMPI/4.1.4. If, however, you need to see a script or any other information, do let me know.
I just stumbled upon this UCX warning that crashes the software I’m using:
[1710610763.983411] [cpu319:2620412:0] ib_md.c:1234 UCX WARN IB: ibv_fork_init() was disabled or failed, yet a fork() has been issued.
[1710610763.983419] [cpu319:2620412:0] ib_md.c:1235 UCX WARN IB: data corruption might occur when using registered memory.
The warning is repeated many times until sdpb crashes with this error message:
Process 9 caught error message:
in allocate_blocks() at ../src/sdp_solve/Block_Info/allocate_blocks.cxx:72:
Assertion '!block_indices.empty()' failed:
No SDP blocks were assigned to rank=9. node=9 node_rank=0
I believe this is saying that node 319 (cpu319 in the log, node number 9 of the ten I was using) is not behaving well.
To verify this, I tried running the same batch script with fewer nodes. When I am not assigned node 319, everything runs smoothly. As soon as node 319 is assigned, I run into this error.
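In case it helps with narrowing this down, here is a minimal sketch of how the hosts emitting the warning could be tallied from the job output. It assumes every UCX line has the same `[timestamp] [host:pid:thread]` prefix as the two lines quoted above; the function name `hosts_with_ucx_warnings` and the `sample` list are only for illustration and are not part of sdpb or UCX.

```python
import re
from collections import Counter

# Assumed prefix of a UCX log line, based on the two lines quoted in this post:
# [1710610763.983411] [cpu319:2620412:0] ib_md.c:1234 UCX WARN ...
UCX_WARN_LINE = re.compile(r"^\[\d+\.\d+\]\s+\[([^:\]]+):\d+:\d+\].*UCX\s+WARN")


def hosts_with_ucx_warnings(lines):
    """Count UCX WARN lines per host name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = UCX_WARN_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two warning lines from this post, plus one unrelated line that is ignored.
sample = [
    "[1710610763.983411] [cpu319:2620412:0] ib_md.c:1234 UCX WARN IB: "
    "ibv_fork_init() was disabled or failed, yet a fork() has been issued.",
    "[1710610763.983419] [cpu319:2620412:0] ib_md.c:1235 UCX WARN IB: "
    "data corruption might occur when using registered memory.",
    "Process 9 caught error message:",
]

for host, count in hosts_with_ucx_warnings(sample).items():
    print(host, count)
```

Running the same function over the lines of the full job output file should list every host that produced the warning, which makes it easier to see whether the problem is limited to one node.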
I have tried looking up what this warning means, but it is still very unclear to me. I was hoping some of you might know what it means and how I could avoid running into it.
A secondary question: whom should I contact in the future if I notice that a certain node is not behaving well and might need to be rebooted?
Thank you very much for your time and have a great rest of your weekend.
Edit: more nodes are now presenting the same issue…
Other edit: Added the version of OpenMPI I’m using. Unfortunately I cannot use an earlier version since it clashes with other dependencies. Also, the errors started appearing two days ago and got worse over the weekend.