CUDA problem when running a Python script

Primary information

Username: $bugatti
Cluster: $Yggdrasil


I would like to run a Python script with CUDA in order to reduce its running time. The script uses a package called pyechelle, which produces image files in FITS format.

Steps to Reproduce

I log into Yggdrasil. Then I load the modules:
$ ml load GCCcore CUDA Python
I launch my file:
srun python

The file content is shown below:

from pyechelle.simulator import Simulator
from pyechelle.sources import Constant
from pyechelle.spectrograph import ZEMAX
sim = Simulator(ZEMAX("MaroonX"))
# Enable cuda and set a specific random seed.
sim.set_cuda(True)
sim.set_output('02_cuda.fits', overwrite=True)

If I run it without the line sim.set_cuda(True) (which tells pyechelle to run the simulation on the GPU), it works. If I run it with that line, I get this error:

CUDA driver library cannot be found.
If you are sure that a CUDA driver is installed,
try setting environment variable NUMBA_CUDA_DRIVER
with the file path of the CUDA driver shared library.
srun: error: cpu001: task 0: Exited with exit code 1

I tried to solve it by doing:
$ locate cuda

$ export NUMBA_CUDA_DRIVER=/usr/lib64/

Then I ran the script again:
srun python
But then a different error appears:

File "/home/users/b/bugatti/.local/lib/python3.11/site-packages/numba/cuda/cudadrv/", line 381, in absent_function
    raise CudaDriverError(f'Driver missing function: {fname}')
numba.cuda.cudadrv.error.CudaDriverError: Driver missing function: cuInit

Expected Result

The code should run without errors and create a .fits file when it is done. It works if I don’t enable CUDA.

Dear @Maddalena.Bugatti

Your srun command is missing several options: partition, time limit, number of CPUs, and number of GPUs.

Since you want to use a GPU, you need to use a GPU partition, for example shared-gpu, and request one GPU.

Example: srun --partition shared-gpu --gpus=1 python
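The error output above shows the task running on cpu001, a node name that suggests no GPU was allocated. To confirm that a job step really landed on a GPU node before enabling CUDA, a small check like the following may help. It is only a sketch using the Python standard library: it assumes Slurm exports SLURMD_NODENAME and sets CUDA_VISIBLE_DEVICES when GPUs are allocated (this depends on the cluster configuration), and the function names are made up for illustration.

```python
import ctypes.util
import os
import shutil


def gpu_environment_report(env=None):
    """Collect hints about whether this process can see a GPU.

    `env` defaults to the real environment; passing a dict makes
    the function easy to try out without a cluster.
    """
    env = os.environ if env is None else env
    return {
        # Node the job step runs on (assumption: exported by Slurm).
        "node": env.get("SLURMD_NODENAME", "unknown"),
        # Usually set by Slurm when GPUs are allocated (assumption:
        # depends on how the cluster's GPU resources are configured).
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
        # Name of the CUDA driver library if the system linker knows it.
        "driver_library": ctypes.util.find_library("cuda"),
        # Path to nvidia-smi if the NVIDIA tools are on PATH.
        "nvidia_smi": shutil.which("nvidia-smi"),
    }


def gpu_looks_available(report):
    """Heuristic: a GPU was allocated and a driver library was found."""
    return bool(report["cuda_visible_devices"]) and (
        report["driver_library"] is not None
    )


if __name__ == "__main__":
    report = gpu_environment_report()
    for key, value in report.items():
        print(f"{key}: {value}")
    print("GPU looks available:", gpu_looks_available(report))
```

Run through srun with and without the GPU options, the report should differ: without them the driver library and CUDA_VISIBLE_DEVICES are likely to be missing, which would match the "CUDA driver library cannot be found" message.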

Check here for more details: hpc:slurm [eResearch Doc]

Two suggestions:

  • write an sbatch script instead of calling srun directly
  • always specify a version when you load a module, e.g. ml GCCcore/12.3.0, instead of relying on the latest one.
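As a rough illustration of both suggestions, the sketch below uses Python to render an sbatch script with the options mentioned above. The partition shared-gpu, --gpus=1 and GCCcore/12.3.0 come from this thread. The script name my_simulation.py, the time limit, the CPU count and the <version> placeholders for CUDA and Python are hypothetical and need to be replaced with real values (ml spider lists the versions installed on the cluster).

```python
def render_sbatch_script(python_script, modules, partition="shared-gpu",
                         gpus=1, cpus=1, time_limit="00:15:00"):
    """Return the text of an sbatch script that runs `python_script`.

    All default values are illustrative, not cluster recommendations.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gpus={gpus}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --time={time_limit}",
        "",
        # Load modules with explicit versions, as suggested above.
        "ml " + " ".join(modules),
        "",
        f"srun python {python_script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = render_sbatch_script(
        "my_simulation.py",  # hypothetical file name
        # <version> is a placeholder: fill in versions that exist on the cluster.
        ["GCCcore/12.3.0", "CUDA/<version>", "Python/<version>"],
    )
    print(script)
```

The printed text can be saved to a file and submitted with sbatch. Keeping the options in the script rather than on the srun command line makes the run easier to reproduce later.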

Thank you very much @Yann.Sagon, now it works properly.