Problem with job allocation

Hello team,

I have attached an image showing the error.


I was trying to allocate an interactive job on the private-wesolowski-bigmem partition, and it seemed to be granted, then just failed.

Thanks in advance,

Cristina

The same problem seems to appear on the public-interactive-cpu partition:

salloc: Nodes cpu003 are ready for job
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "cpu003"
srun: error: fwd_tree_thread: can't find address for host cpu003, check slurm.conf
srun: error: Task launch for StepId=10034249.0 failed on node cpu003: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
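
The `getaddrinfo() failed` lines above suggest that the node name could not be resolved on the host where `srun` ran. Here is a minimal sketch for checking that by hand, assuming Python 3 is available on the login node. The helper name `resolve_node` is made up for illustration, and the node name is simply the one from the log.

```python
import socket


def resolve_node(hostname):
    """Return the sorted IP addresses a hostname resolves to, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Same class of failure that srun reports as "Name or service not known".
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    node = "cpu003"  # node name taken from the log above
    addresses = resolve_node(node)
    if addresses:
        print(f"{node} resolves to {', '.join(addresses)}")
    else:
        print(f"{node} does not resolve from this host")
```

If this reports that the node does not resolve, the problem is most likely in name resolution (DNS or hosts file) rather than in the job request itself, which would be something for the HPC team to fix.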

This might be an undesired consequence of the maintenance… :cry:


Quick follow-up, @Cristina.GonzalezEspinoza: things seem to work now on the public-interactive-cpu partition. It may be the case for your private one as well, although I have no idea where the problem was coming from.

Best,
Stefano

Hi Stefano,

Yes, indeed, it works now, thanks for the heads-up!!

Cristina

Hi,
I’m not able to reproduce the issue. The cause was probably some glitch after the reinstall of the compute nodes.


Dear HPC team,

Today I ran into an error on Yggdrasil similar to the one described in this post.
The command line I used and the error message are below.

Any idea?

Kind regards,
Julien

(yggdrasil)-[prados@login1 oxa-48]$ salloc --partition=shared-cpu --time=4:00:00 --mem=12G --ntasks=1 --cpus-per-task=4
salloc: Pending job allocation 13923014
salloc: job 13923014 queued and waiting for resources
salloc: job 13923014 has been allocated resources
salloc: Granted job allocation 13923014
salloc: Waiting for resource configuration
salloc: Nodes cpu116 are ready for job
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "cpu116"
srun: error: fwd_tree_thread: can't find address for host cpu116, check slurm.conf
srun: error: Task launch for StepId=13923014.interactive failed on node cpu116: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted
salloc: Relinquishing job allocation 13923014

Hi, this is fixed: Current issues on Baobab and Yggdrasil - #96 by Yann.Sagon