How to keep data from the local node storage after runtime

The HPC group asked me to provide a summary of my experiences with the local node storage here.

Each node has its own local storage, which is faster to access and creates less network traffic between the nodes. Hence, it is useful for files which need to be accessed often during run time; see the HPC documentation about the local storage.

In practice, you need to keep at least some, if not all, of the created data. It should be moved or copied from the local storage (mounted as /scratch on each node, only accessible from that node, and freed after your job has finished) to your home (mounted as $HOME/scratch) or your group storage (usually mounted under /srv/beegfs/scratch/).
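
A minimal sketch of such a copy step in python, assuming python 3.8 or newer (for dirs_exist_ok). The function name copy_results and the paths in the usage comment are made up for illustration; adjust them to your own job.

```python
import shutil
from pathlib import Path


def copy_results(src, dst):
    """Copy the directory tree below src into dst.

    dst (and missing parent directories) are created if needed.
    Files in dst with the same name are overwritten.
    """
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return Path(dst)


# Hypothetical usage at the end of a job:
# copy_results("/scratch/my_run", Path.home() / "scratch" / "my_run")
```

Copying instead of moving leaves the files on the node untouched, which is harmless because the local storage is freed after the job anyway.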

When all works well, it is simple to add a command to copy/move the data at the end of your slurm script or at the end of your program.
But this won’t work when your code, e.g., hits the end time and gets killed by slurm.

In such a case, you need to take special care to keep your data.

  1. Signal handler

The best option is to use a signal handler. This usually exists in all programming languages; please check the documentation for the language your code is written in. In python, the module signal provides this functionality. Hence, you should set up a handler listening for, e.g., SIGTERM.

signal.signal(signal.SIGTERM, your_function)
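
To make this a bit more concrete, here is a minimal sketch of what your_function could look like. The names make_handler and install, as well as the paths in the usage comment, are my own placeholders and not part of any library. It assumes python 3.8 or newer.

```python
import shutil
import signal
import sys


def make_handler(src, dst):
    """Build a signal handler that copies src to dst and then exits."""
    def handler(signum, frame):
        # Save the data first, then stop the program cleanly.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        sys.exit(0)
    return handler


def install(src, dst, signals=(signal.SIGTERM,)):
    """Register the copy handler for every signal in `signals`."""
    handler = make_handler(src, dst)
    for sig in signals:
        signal.signal(sig, handler)


# Hypothetical usage near the start of your program:
# install("/scratch/my_run", "/srv/beegfs/scratch/my_group/my_run")
```

Note that python only runs the handler in the main thread, and only between python instructions, so a long call into external code can delay it.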

Usually, slurm will first send a SIGTERM and wait a bit before sending SIGKILL. Because the second signal will forcibly stop your job, you should already react to the SIGTERM.
If you have larger amounts of data and worry that the time slurm gives you is not enough to get all your data copied, there is the option to let slurm send an additional signal earlier, before the end time:

#SBATCH --signal=[{R|B}:]<sig_num>[@sig_time]

sig_num is the signal you ask slurm to send for you; I recommend using “USR1” (called “SIGUSR1” at the OS level and usually in the signal handlers). sig_time is the time in seconds before the end time of your job. For more details, see the slurm documentation. To get the signal delivered reliably, you should run your code via srun, which you can put in a slurm sbatch script, too.
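
With such an early signal (for example requested via something like --signal=USR1@120, where the 120 seconds are just an example value), you have enough time to finish the current step before saving. A minimal sketch of this idea: the handler only sets a flag, and the main loop checks it between work steps. StopFlag, install and run are hypothetical names chosen for this example.

```python
import signal


class StopFlag:
    """Signal handler object that only remembers that a signal arrived."""

    def __init__(self):
        self.is_set = False

    def __call__(self, signum, frame):
        self.is_set = True


def install(stop, sig=signal.SIGUSR1):
    """Register the flag as the handler for `sig`."""
    signal.signal(sig, stop)


def run(work_items, do_step, save, stop):
    """Process work items until done or until a stop was requested.

    `save` is always called once at the end, so the data is kept
    in both cases. Returns the number of completed steps.
    """
    done = 0
    for item in work_items:
        if stop.is_set:
            break
        do_step(item)
        done += 1
    save()
    return done


# Hypothetical usage:
# stop = StopFlag()
# install(stop)
# run(range(1000), compute_one_step, save_my_data, stop)
```

This way the data is never copied in the middle of a step, at the price that one step has to fit into the time given by sig_time.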

  2. Child process

There might be cases where the signal handler does not listen actively, e.g., when your python code is a wrapper to run another, more complicated and/or fixed code you can’t modify. Be aware that the following will only work for kills due to the slurm end time.
A standard, though not recommended, way to do this is via os.system(). This runs your other code, but it causes your python code to not listen for signals anymore. Hence, a signal sent by slurm will only reach the other program, but not your python wrapper, which in most cases is the part meant to do the file handling. Here, we need to get a bit more creative to find a solution. The way I found is to use os.fork() to create a child process of the python script. In the main script you continue to do your normal stuff. The child I send to sleep, and it copies the data once it wakes up. In this case, we need to take care of the timing ourselves. Slurm provides $SLURM_JOB_END_TIME, which gives you the end time of the job. Hence, you can pass it as an input to your python script to tell it when slurm will kill your job. Because sleep takes a duration, it is easier to work with the difference of $SLURM_JOB_END_TIME and $SLURM_JOB_START_TIME.
WARNING: Your slurm job won’t finish until both the child and the main program have ended. Hence, you should take care of the child process at the end of your main program. A simple and quick solution is to let the main program kill the child. Here is some example code (using the modules os, sys, and time):

import os
import sys
import time

child_pid = os.fork()
if child_pid > 0:  # only entered by the main program
    # insert main program here, including moving the data
    os.kill(child_pid, 9)  # kill the child via its process ID (9 = SIGKILL)
    os.waitpid(child_pid, 0)  # collect the killed child so no zombie process is left
elif child_pid == 0:  # only entered by the child
    time.sleep(100)  # put in the time you calculated for when the data should be copied
    # copy your data here
    sys.exit(0)  # this makes sure your child won't execute any other code

It is OK to let the child just copy the data, because the data on the node will be freed at the end of your job anyway, and moving it could cause more issues for your main program.
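
For the sleep time of the child, a minimal sketch of the calculation could look as follows. It assumes that both slurm variables are set and hold UNIX timestamps in seconds; the function name seconds_until_copy and the safety margin of 300 seconds are my own choices, so pick a margin that fits the amount of data you have to copy.

```python
import os


def seconds_until_copy(env=None, margin=300):
    """Return how long the child should sleep before copying.

    The result is the job length (end time minus start time) minus a
    safety margin, and never negative.
    """
    if env is None:
        env = os.environ
    start = int(env["SLURM_JOB_START_TIME"])
    end = int(env["SLURM_JOB_END_TIME"])
    return max(0, end - start - margin)


# Hypothetical usage in the child process:
# time.sleep(seconds_until_copy())
```

Because the difference is measured from the job start, the margin should also cover the time that passes between the start of the job and the fork.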


Dear @Matthias.Kruckow, many thanks for the detailed feedback; this is much appreciated!