I am running large array jobs (10000 jobs), and about 10-20% of the jobs are killed unexpectedly, with no reason given by Slurm other than the FAILED state. They are not exceeding their run time; in fact, they generally die quite quickly, when the computations start after the IO is done. They aren’t using more memory than I requested either. I have also checked the nodes running the failed jobs (though not exhaustively), and they don’t seem to be running out of memory. I get something like the following in my output:
Example from jobID: 55873818_4772
/var/spool/slurmd/job55885108/slurm_script: line 4: 154199 Killed intqtl gwas --vcf ukb22828_autosomes_b0_v3.v30000.maf01.info3.bcf --bed ukbio_chol.txtt --cov ukbio_pcs.txtt --normal --out b03/chol.$c.intqtl --double --chunk $c all.15000.chunks --fixed-vcf
Interestingly, when I resubmit the failed jobs exactly as before, once the original array job has completed, they complete successfully. So it seems to me that this has something to do with the load on the node running the jobs. I.e., when I submit 10000 jobs there are many more jobs running on a given node, which seems to cause them to die, whereas when only the failed jobs are submitted there are far fewer jobs per node. These are computationally intensive jobs using AVX instructions.
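The resubmission step described above can be sketched as follows. This is only an illustration, not the workflow actually used here: it assumes the accounting output was obtained with something like `sacct -X -n -P -j 55873818 --format=JobID,State`, and the sample text, the helper names, and `job.sh` are all hypothetical.

```python
def failed_array_indices(sacct_text, array_job_id):
    """Return sorted array task indices whose State is FAILED.

    Expects pipe-separated 'JobID|State' lines (assumed sacct -P output).
    """
    prefix = f"{array_job_id}_"
    indices = set()
    for line in sacct_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 2 or fields[1] != "FAILED":
            continue
        job_id = fields[0]
        if job_id.startswith(prefix) and job_id[len(prefix):].isdigit():
            indices.add(int(job_id[len(prefix):]))
    return sorted(indices)


def compress_ranges(indices):
    """Collapse sorted integers into an --array spec such as '3-5,9'."""
    parts = []
    start = prev = None
    for i in indices:
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = i
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


# Hypothetical accounting output, made up for illustration only.
sample = (
    "55873818_4772|FAILED\n"
    "55873818_4773|FAILED\n"
    "55873818_4774|FAILED\n"
    "55873818_4775|COMPLETED\n"
    "55873818_9001|FAILED\n"
)
spec = compress_ranges(failed_array_indices(sample, "55873818"))
print(f"sbatch --array={spec} job.sh")  # job.sh is a placeholder script name
```

Compressing the indices into ranges keeps the `--array` argument short when many neighbouring tasks failed.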
Do you have any idea what could be wrong? Is it a thermal or power issue? Or am I missing something?
Thanks in advance,
Using your job ID and checking on the compute node (node174), it seems your job ran out of memory:
[7600877.944498] Out of memory: Kill process 96814 (intqtl) score 70 or sacrifice child
[7600877.953142] Killed process 96814 (intqtl) total-vm:7114872kB, anon-rss:7080836kB, file-rss:28kB, shmem-rss:0kB
(baobab)-[root@admin1 ~]$ sacct --format=Start,AveCPU,State,MaxRSS,JobID,NodeList,ReqMem --units=G -j 55885108
Start               AveCPU     State      MaxRSS     JobID        NodeList        ReqMem
------------------- ---------- ---------- ---------- ------------ --------------- ----------
2022-03-01T02:49:23            FAILED                55873818_55+ node174         8G
2022-03-01T02:49:23 00:05:15   FAILED     6.75G      55873818_55+ node174
2022-03-01T02:49:23 00:00:00   COMPLETED  0.00G      55873818_55+ node174
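As a cross-check, the `anon-rss` value in the kernel message and the `MaxRSS` reported by sacct can be compared once they are in the same unit. The sketch below assumes the kernel reports kB as multiples of 1024 bytes and that `--units=G` is also 1024-based; the helper name is hypothetical.

```python
import re


def oom_rss_gib(log_line):
    """Extract anon-rss (in kB) from a kernel 'Killed process' line, as GiB."""
    match = re.search(r"anon-rss:(\d+)kB", log_line)
    if match is None:
        raise ValueError("no anon-rss field found")
    return int(match.group(1)) / 1024**2


line = (
    "[7600877.953142] Killed process 96814 (intqtl) total-vm:7114872kB, "
    "anon-rss:7080836kB, file-rss:28kB, shmem-rss:0kB"
)
rss = oom_rss_gib(line)
print(f"{rss:.2f} GiB")  # 6.75 GiB, below the 8G request
```

Under those assumptions both sources give about 6.75G, so the process was killed while still under its 8G request.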
I checked our Slurm configuration and I saw this notice in our logs:
No memory enforcing mechanism configured.
Ahah! This could explain why your jobs were able to run when the node wasn’t fully loaded!
I’ve updated the Slurm configuration with the fix. Thanks for the notification.
I am requesting 8G per job and none of them go above 7G, so the jobs are not going over the memory limit set at job submission. So what I understand from your answer is that the nodes were being scheduled more jobs than the memory available on the node could support, and hence some were getting killed because the node was running out of memory. I.e., the node was not respecting the memory requirement set; e.g., a node with 64G of free memory should have executed at most 8 of these jobs (8x8G = 64G), but it was executing more.
Am I understanding this correctly? I am asking because it’s not clear to me whether I am doing something wrong in my job submission (not requesting enough memory).
If the job was killed, this may not be true: Slurm gathers memory usage at regular intervals, so a job can ask for more memory and be killed in the meantime without Slurm noticing. With the change I made, if your job is killed due to memory usage, this should appear as the reason in the job output.
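A toy illustration of why periodic sampling can miss a short spike is sketched below. The memory trace and the 30-second polling interval are made-up assumptions for the example, not measurements from the cluster.

```python
def sampled_peak(trace_gb, interval):
    """Peak memory seen by a poller that reads every `interval`-th sample."""
    return max(trace_gb[::interval])


# Hypothetical per-second memory trace in GB: steady 6.8 with a 3-second spike.
trace = [6.8] * 120
trace[45:48] = [9.5, 9.5, 9.5]

true_peak = max(trace)
seen_peak = sampled_peak(trace, interval=30)  # polls at t = 0, 30, 60, 90
print(true_peak, seen_peak)  # 9.5 6.8
```

The poller never lands on the spike, so the recorded maximum stays at 6.8 even though the process briefly reached 9.5.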
The job scheduling was correct. What wasn’t correct was the memory limit enforcement. If a job asked for 8G but was instead using 16G, it wasn’t killed by Slurm, but that job or another one could be killed by the OOM killer once the node’s memory was exhausted. I guess you have nothing to change in your script; maybe increase your memory requirement a little if some of your jobs still crash.
In your example, you are talking about 8x8G on a 64G compute node. This isn’t possible in real life, as the OS uses some RAM as well and we limit the memory usage to 95% of the total.
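The arithmetic behind that remark can be sketched as follows. The 95% figure comes from the reply above; the function name is hypothetical, and the real schedulable memory of a node depends on the site configuration, so treat this as a rough estimate only.

```python
import math


def max_jobs_per_node(node_mem_gb, req_mem_gb, usable_fraction=0.95):
    """Number of jobs whose memory requests fit in the usable share of a node."""
    return math.floor(node_mem_gb * usable_fraction / req_mem_gb)


# 64G node, 8G per job: 64 * 0.95 = 60.8G usable, so 7 jobs rather than 8.
print(max_jobs_per_node(64, 8))  # 7
```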
Ok thanks for the clarification and your help.
OK, this issue still persists, i.e. I am still getting FAILED jobs. Example JobID: 56040904_3185. I have checked the node this job was failing on, which is node187. It seems this node is running out of memory although all my running jobs use <7G of the requested 8G of memory. Attached is a screen capture from this node:
This node can run 9 of my jobs but the 10th one always fails as the node OOMs. And as you can see all of my running jobs use 6.8G of memory. This node has 96G of memory, and I should be reserving 80G of it (as it is constantly trying to execute 10 of my jobs concurrently), but there seems to be actually ~67G of free memory available.
Further info that may be helpful: the program I am running is one that I have written, so I know exactly how its memory allocation works. I already know that it will not use more than 7G, and that after the IO is over it will not allocate more memory (or at least not any substantial amount), so there shouldn’t be any momentary spikes in memory usage. Though I know it will use less than 7G, to be on the safe side I am requesting 8G.
Maybe I am wrong but there seems to be an issue with memory allocation of slurm, or I am once again missing something.