what did you try:
I was trying to run GPU-based inference (predictions) with a trained neural network model.
what didn’t work:
I submitted the job to the private-schaer-gpu and shared-gpu partitions with the maximum memory per CPU that was available (16000 MB), but the job is killed with an Out Of Memory error. Training completes without problems; the job fails during the testing/prediction step. I also tried reducing the size of the testing dataset, but I still get the same OOM error.
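Since the cgroup OOM killer is enforcing the job's memory request rather than the node's physical limit, one possible workaround is to raise the job's memory allocation in the sbatch header. This is only a sketch, not the contents of the actual run_predict.sh; the partition, time, and CPU values below are placeholder assumptions. With --mem-per-cpu capped at 16000, requesting more CPUs per task multiplies the memory available to the job's cgroup; alternatively, --mem requests total memory per node directly:

```shell
#!/bin/bash
#SBATCH --job-name=nn_predict
#SBATCH --partition=shared-gpu        # or private-schaer-gpu
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00               # placeholder walltime

# Option A: keep --mem-per-cpu at its cap but ask for more CPUs,
# so the cgroup limit becomes 4 x 16000 MB = 64000 MB.
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=16000

# Option B (instead of A): request total memory for the job directly.
# #SBATCH --mem=64000

srun python predict.py   # placeholder for the actual prediction command
```

Note that --mem and --mem-per-cpu are mutually exclusive in Slurm, so only one of the two options should be active at a time.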
what was the error message:
srun: error: gpu007: task 0: Out Of Memory
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=51911687.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
path to the relevant files (logs, sbatch script, etc):
- logs: /home/share/schaer2/neural-network-f/neural_network_test/5_sec_NN/slurm-51911687.out
- sbatch script: /home/share/schaer2/neural-network-f/neural_network_test/5_sec_NN/run_predict.sh
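Since training succeeds but prediction runs out of memory, a likely cause is that the test set is being pushed through the model in one pass, so all inputs and outputs sit in memory at once. A minimal, framework-agnostic sketch of chunked prediction (the `predict_fn` here is a hypothetical stand-in for the real model call; if the model is in PyTorch, the per-batch call would additionally go inside a `torch.no_grad()` block so no gradient buffers are kept):

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks of a sequence."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(predict_fn, inputs, batch_size=64):
    """Run predict_fn on small chunks and collect the results,
    so peak memory scales with batch_size, not the full test set."""
    outputs = []
    for batch in batched(inputs, batch_size):
        # For a PyTorch model, wrap this call in `with torch.no_grad():`
        outputs.extend(predict_fn(batch))
    return outputs


# Usage with a toy stand-in for the model:
preds = predict_in_batches(lambda b: [x * 2 for x in b], list(range(10)), batch_size=4)
```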