Node saturated?

Dear all,

I have some single-core tasks from a job array running on node 174.

I have noticed that they take far longer to complete than other tasks of the array that happen to run on other nodes. I looked at what’s happening on the node and saw that my tasks don’t get to run at 100% of their CPU. You can see this in htop.

It looks like more resources than are available have been allocated on this node (to other users’ tasks or jobs).

Is that correct?

Hi there,

Actually, the real cause is the CPU frequency, which is well below the minimum the CPU should run at:

[root@node174 ~]# lscpu | \
 grep -E '^CPU (|max|min)[ ]?MHz: '
CPU MHz:               582.092
CPU max MHz:           3100.0000
CPU min MHz:           1200.0000
[root@node174 ~]# 
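The comparison above can also be scripted. The following is only a minimal sketch, not the tooling actually used on the cluster: the function name `check_cpu_freq` is hypothetical, and it assumes `lscpu` prints the `CPU MHz` and `CPU min MHz` lines in the format shown above (other `lscpu` versions may use different field names).

```shell
#!/usr/bin/env bash
# check_cpu_freq (hypothetical helper): reads lscpu-style output on stdin and
# reports whether the current CPU frequency is below the advertised minimum.
# Exit status: 0 = OK, 1 = below minimum, 2 = expected fields not found.
check_cpu_freq() {
    awk -F: '
        /^CPU MHz:/     { cur = $2 + 0 }   # current frequency
        /^CPU min MHz:/ { min = $2 + 0 }   # advertised minimum
        END {
            if (cur == "" || min == "") { print "UNKNOWN"; exit 2 }
            if (cur < min) {
                printf "THROTTLED cur=%.3f min=%.3f\n", cur, min
                exit 1
            }
            printf "OK cur=%.3f min=%.3f\n", cur, min
        }'
}
```

It would be used as `lscpu | check_cpu_freq`. A reading between the minimum and the maximum can simply reflect normal frequency scaling, so only a value below the minimum is flagged.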

I have put the node in DRAIN to allow further investigation.

Thx, bye,
Luca

Oh Ok!

I see that I also have tasks stuck on nodes 175 and 176, so maybe it is the same problem. This is something I have experienced in the past without paying too much attention to it, so there might be other affected nodes beyond those.

Thanks a lot!

Hello,

It seems that I have observed an issue similar to the one Davide Cucci reported.
One example is job 32778628, which I ran on May 10th.
I ran 100 single-threaded jobs, and about half of them took 6 hours longer to finish while still showing a similar total CPU time.

Best wishes,
Gilles Vilmart

Hi there,

Thank you for the report. It is indeed the same issue, so node[175-176] are now in DRAIN.

That job ran on at least three of the nodes @Davide.Cucci reported (specifically node[173-174,176]), the others being OK:

 [root@login2 ~]# for I in $(sacct -n -j 32778628 --format=Nodelist | \
                              sort -u); do \
    echo "${I}: $(ssh "${I}" "lscpu | \
                               grep -e '^CPU MHz'")"; \
 done
node161: CPU MHz:               2822.961
node169: CPU MHz:               2438.610
node172: CPU MHz:               2094.567
node173: CPU MHz:               458.959
node174: CPU MHz:               444.860
node176: CPU MHz:               678.234
node191: CPU MHz:               2812.841
node202: CPU MHz:               3678.283
node209: CPU MHz:               1934.399
node216: CPU MHz:               3233.569
node217: CPU MHz:               2126.452
node251: CPU MHz:               2765.307
[root@login2 ~]# 
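To pick out the affected nodes from a listing like the one above, something along these lines could be used. This is a sketch under assumptions: the helper name `slow_nodes` is hypothetical, and the default threshold of 1200 MHz is the `CPU min MHz` value reported by node174 earlier in the thread, which may not apply to every node model.

```shell
#!/usr/bin/env bash
# slow_nodes (hypothetical helper): reads "nodeNNN: CPU MHz: <value>" lines on
# stdin and prints the nodes whose frequency is below a threshold in MHz.
# The default of 1200 is an assumption taken from node174's "CPU min MHz".
slow_nodes() {
    local threshold="${1:-1200}"
    # Split on ":" plus any following spaces: $1=node, $2=label, $3=value.
    awk -v t="$threshold" -F': *' '
        $2 == "CPU MHz" && ($3 + 0) < t { print $1 }
    '
}
```

Piping the output of the loop above into `slow_nodes` would print only the nodes reporting a frequency below the threshold; a different threshold can be passed as the first argument, e.g. `slow_nodes 1000`.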

Thx, bye,
Luca

Hi there,

From a first analysis, node[173-176] share the same chassis, for which I found power errors.

We are going to check remotely. If manual intervention is needed, we will have to wait for next week's maintenance (cf. Baobab scheduled maintenance: 27 May 2020).

Thx, bye,
Luca