Issue with cpu116

Hi all,

My job fails with an error whenever it is launched on node cpu116 of the shared-cpu partition on Yggdrasil.
I don’t know why, but it seems to happen only on cpu116: my jobs run without problems on the other nodes of the shared-cpu partition.

Here is the batch script I am using to launch my job:

#!/bin/sh
#SBATCH --partition=shared-cpu

#SBATCH --ntasks=10
#SBATCH --time=10:00:00

#SBATCH --mail-type=ALL
#SBATCH -o slurm_map.%j.out  # STDOUT
#SBATCH -e slurm_map.%j.err   # STDERR


echo "Starting at `date`"
echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running on $SLURM_NPROCS processors."
echo "Current working directory is `pwd`"
echo ""
echo "***** LAUNCHING *****"
echo `date '+%F %H:%M:%S'`
echo ""

# load Anaconda and OpenMPI
module load Anaconda3
module load foss

echo "Loaded Anaconda3 and foss"
echo ""

srun python3 -m cobaya run input_cmb_lensing_map_fullsky.yaml


echo ""
echo "***** DONE *****"
echo `date '+%F %H:%M:%S'`
echo ""

And here is the error I get when the job starts on cpu116:

[cpu116:135582:0:135582] Caught signal 4 (Illegal instruction: illegal operand)
[cpu116:135584:0:135584] Caught signal 4 (Illegal instruction: illegal operand)
[cpu116:135586:0:135586] Caught signal 4 (Illegal instruction: illegal operand)
[cpu116:135583:0:135583] Caught signal 4 (Illegal instruction: illegal operand)
[cpu116:135585:0:135585] Caught signal 4 (Illegal instruction: illegal operand)
[cpu116:135587:0:135587] Caught signal 4 (Illegal instruction: illegal operand)
[cpu116:135589:0:135589] Caught signal 4 (Illegal instruction: illegal operand)
[cpu116:135590:0:135590] Caught signal 4 (Illegal instruction: illegal operand)
[cpu116:135591:0:135591] Caught signal 4 (Illegal instruction: illegal operand)
[cpu116:135588:0:135588] Caught signal 4 (Illegal instruction: illegal operand)
==== backtrace (tid: 135582) ====
 0 0x00000000000213e3 ucs_debug_print_backtrace()  /dev/shm/ebbuild/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
 1 0x0000000000104d80 __fileutils_MOD_tfilestream_openfile()  ???:0
 2 0x00000000000ff789 __fileutils_MOD_tfilestream_open()  ???:0
 3 0x00000000000feffd __fileutils_MOD_readnextcontentline()  ???:0
 4 0x0000000000027c60 __config_MOD_checkloadedhighltemplate()  ???:0
 5 0x00000000000069dd ffi_call_unix64()  :0
 6 0x0000000000006067 ffi_call_int()  ffi64.c:0
 7 0x000000000001097a _call_function_pointer()  /usr/local/src/conda/python-3.8.3/Modules/_ctypes/callproc.c:871
 8 0x000000000001097a _ctypes_callproc()  /usr/local/src/conda/python-3.8.3/Modules/_ctypes/callproc.c:1199
 9 0x00000000000110db PyCFuncPtr_call()  /usr/local/src/conda/python-3.8.3/Modules/_ctypes/_ctypes.c:4201
10 0x000000000013d25f _PyObject_MakeTpCall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:159
11 0x00000000001c15e5 _PyObject_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Include/cpython/abstract.h:125
12 0x00000000001c15e5 _PyEval_EvalFrameDefault()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:3500
13 0x000000000020a04d function_code_fastcall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:283
14 0x00000000000ff819 _PyObject_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Include/cpython/abstract.h:127
15 0x00000000000ff819 call_function()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:4963
16 0x00000000000ff819 _PyEval_EvalFrameDefault()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:3500
17 0x000000000018a2a2 _PyEval_EvalCodeWithName()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:4298
18 0x000000000018b054 PyEval_EvalCodeEx()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:4327
19 0x00000000002195bc PyEval_EvalCode()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:718
20 0x000000000024e6f3 builtin_exec_impl.isra.14()  /tmp/build/80754af9/python_1593706424329/work/Python/bltinmodule.c:1033
21 0x000000000024e6f3 builtin_exec()  /tmp/build/80754af9/python_1593706424329/work/Python/clinic/bltinmodule.c.h:396
22 0x0000000000140039 cfunction_vectorcall_FASTCALL()  /tmp/build/80754af9/python_1593706424329/work/Objects/methodobject.c:422
23 0x000000000013ca41 PyVectorcall_Call()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:199
24 0x00000000001c6611 do_call_core()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:4983
25 0x00000000001c6611 _PyEval_EvalFrameDefault()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:3559
26 0x000000000018a2a2 _PyEval_EvalCodeWithName()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:4298
27 0x000000000018b243 _PyFunction_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:435
28 0x00000000000ff58e _PyObject_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Include/cpython/abstract.h:127
29 0x00000000000ff58e call_function()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:4963
30 0x00000000000ff58e _PyEval_EvalFrameDefault()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:3469
31 0x000000000018b16b function_code_fastcall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:283
32 0x000000000018b16b _PyFunction_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:410
33 0x00000000000ff56d _PyObject_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Include/cpython/abstract.h:127
34 0x00000000000ff56d call_function()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:4963
35 0x00000000000ff56d _PyEval_EvalFrameDefault()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:3486
36 0x000000000018b16b function_code_fastcall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:283
37 0x000000000018b16b _PyFunction_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:410
38 0x00000000000ff819 _PyObject_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Include/cpython/abstract.h:127
39 0x00000000000ff819 call_function()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:4963
40 0x00000000000ff819 _PyEval_EvalFrameDefault()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:3500
41 0x000000000018b16b function_code_fastcall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:283
42 0x000000000018b16b _PyFunction_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:410
43 0x00000000000ff819 _PyObject_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Include/cpython/abstract.h:127
44 0x00000000000ff819 call_function()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:4963
45 0x00000000000ff819 _PyEval_EvalFrameDefault()  /tmp/build/80754af9/python_1593706424329/work/Python/ceval.c:3500
46 0x000000000018b16b function_code_fastcall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:283
47 0x000000000018b16b _PyFunction_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:410
48 0x000000000007e299 _PyObject_Vectorcall()  /tmp/build/80754af9/python_1593706424329/work/Include/cpython/abstract.h:127
49 0x000000000007e299 _PyObject_FastCall()  /tmp/build/80754af9/python_1593706424329/work/Include/cpython/abstract.h:147
50 0x000000000007e299 object_vacall()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:1186
51 0x000000000017d397 _PyObject_CallMethodIdObjArgs()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:1244
52 0x000000000012f786 import_find_and_load()  /tmp/build/80754af9/python_1593706424329/work/Python/import.c:1698
53 0x000000000012f786 PyImport_ImportModuleLevelObject()  /tmp/build/80754af9/python_1593706424329/work/Python/import.c:1798
54 0x00000000001c7eda builtin___import__()  /tmp/build/80754af9/python_1593706424329/work/Python/bltinmodule.c:279
55 0x000000000017f706 cfunction_call_varargs()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:742
56 0x000000000017f706 PyCFunction_Call()  /tmp/build/80754af9/python_1593706424329/work/Objects/call.c:772
=================================
srun: error: cpu116: task 0: Illegal instruction

Hi,

This is typically caused by a binary that was compiled on a different CPU architecture or generation than the one it runs on.

cpu116 is a node with an AMD EPYC 7742.
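
One way to check whether the CPU is the cause is to look at the instruction-set flags a node advertises. The following is a minimal sketch, assuming a Linux node that exposes /proc/cpuinfo; `has_flag` is a hypothetical helper and the two flags tested (avx2, avx512f) are only examples. The EPYC 7742 is a Zen 2 processor and does not implement AVX-512, so a binary built with AVX-512 instructions would stop there with this kind of signal. Whether AVX-512 is the offending instruction in this particular job is an assumption, not something the backtrace proves.

```shell
#!/bin/sh
# Report the CPU model and whether selected instruction-set flags are present.

# has_flag FLAG "FLAGS": succeed if FLAG is one of the space-separated FLAGS.
has_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Take the flags of the first CPU listed; all cores of a node normally match.
cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)

grep -m1 'model name' /proc/cpuinfo

# Example flags only; adjust to whatever the binary was built for.
for f in avx2 avx512f; do
    if has_flag "$f" "$cpu_flags"; then
        echo "$f: yes"
    else
        echo "$f: no"
    fi
done
```

To run it on the node in question, something like `srun --partition=shared-cpu --nodelist=cpu116 --ntasks=1 sh check_flags.sh` should work, where check_flags.sh is a placeholder name for the script above.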

I don’t know whether the issue is related to Anaconda. I’ve installed a new Anaconda version that you could try: Anaconda3/2021.05

About cobaya: are you using it with mpi4py? Where did you install it? On the login node?

Our FAQ covers a similar issue: hpc:faq [eResearch Doc]

My deepest apologies for the very late answer…
My solution was to exclude cpu[116-119] in the batch job script.
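
For anyone landing here with the same error, the exclusion can be written as one extra directive in the script header. This is a sketch based on the script posted above, using Slurm's `--exclude` option; the node range is the one that applied to my jobs and may differ for yours.

```shell
#!/bin/sh
#SBATCH --partition=shared-cpu
#SBATCH --ntasks=10
#SBATCH --time=10:00:00
# Keep the job off the nodes where it hit "Illegal instruction"
#SBATCH --exclude=cpu[116-119]

# ... rest of the batch script unchanged ...
```

The same option can be given on the command line instead, for example `sbatch --exclude=cpu[116-119] job.sh`, where job.sh is a placeholder for the script name.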
I guess you are right: the problem comes from some of my codes having been compiled in a way that works only on Intel CPUs and not on AMD.
I do use cobaya with mpi4py, and I installed it from the login node.
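
To see which instruction sets the build host has and the failing node lacks, the two flag lists can be compared. This is a rough sketch, assuming the flags of each host were saved beforehand from /proc/cpuinfo; `missing_flags` and the file names in the comments are hypothetical.

```shell
#!/bin/sh
# List CPU flags that the build host has but the run host lacks.
# $1 = flags of the build host, $2 = flags of the run host (space-separated).
missing_flags() {
    for flag in $1; do
        case " $2 " in
            *" $flag "*) ;;              # present on both hosts
            *) printf '%s\n' "$flag" ;;  # only on the build host
        esac
    done
}

# Hypothetical usage: on each host, save the flags to a file first, e.g.
#   grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2 > flags.$(hostname)
# then compare the two files (names are placeholders):
#   missing_flags "$(cat flags.login1)" "$(cat flags.cpu116)"
```

Any flag printed is a candidate for the illegal instruction if the code was compiled with host-specific optimisation on the login node. Rebuilding on the target node, or with a more generic target architecture, would then be the alternative to excluding nodes.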