NumPy multithreading on Baobab

Hi all:

I’m running Python programs on the cluster that perform exact diagonalization and matrix operations using NumPy array operations. As far as I know, NumPy multithreads automatically, which makes such programs rather fast on my PC. However, I noticed that the execution time of these programs on the cluster was equal to the single-thread time on my PC, so even though I was requesting e.g. 8 CPUs per task with my batch commands, apparently only one was being used. For this reason I tried setting the environment variable OPENBLAS_NUM_THREADS. The sequence of batch commands I use is, for example:

#SBATCH --job-name=asg-$filename
#SBATCH --output=$jobfolder/job.out
#SBATCH --error=$jobfolder/job.err
#SBATCH --partition=dpt-EL7
#SBATCH --time=00-04:00:00
#SBATCH --cpus-per-task=8
#SBATCH --nodes=1
module load GCC/7.3.0-2.30 OpenMPI/3.1.1 Python/2.7.15
export OPENBLAS_NUM_THREADS=8
srun python $inputname
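
To check whether the variable has any effect at all, one can time a large matrix product (a quick sketch; the matrix size is arbitrary). As far as I understand, the variable must be set before NumPy is imported, because OpenBLAS creates its thread pool when the library is loaded:

import os
os.environ.setdefault("OPENBLAS_NUM_THREADS", "8")  # no effect if numpy was already imported
import time
import numpy as np

a = np.random.rand(3000, 3000)  # arbitrary size, large enough to show the scaling
t0 = time.time()
a.dot(a)
print("matmul: %.2f s" % (time.time() - t0))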

This does reduce the execution time and makes me think that multithreading is working, but then I look at the actual results and they have changed numerically (with respect to, for example, the outcome of the same program on my PC), so I really don’t know what the issue is here. Maybe I’m not enforcing the simultaneous use of the CPUs properly? Eventually I will need to deal with arrays that my PC cannot handle, so I need to know whether the results I obtain from the cluster are reliable…
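
To quantify “changed numerically” rather than comparing plots by eye, something like the following could diff two saved runs (a sketch; the file names are placeholders, and I assume each run’s output array was saved with np.save):

import numpy as np

one = np.load("run_1thread.npy")     # placeholder file names for the
eight = np.load("run_8threads.npy")  # 1-thread and 8-thread outputs
print("max abs difference: %g" % np.max(np.abs(one - eight)))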

Some help would be appreciated, please, my computing skills are a bit primitive…

Best,
Adrián S.G.

Dear Adrian,

Is there a particular reason to use an old NumPy/Python version?

Are you using the same NumPy version on Baobab and on your PC?

To check which NumPy build you are using on Baobab:

[sagon@login2 ~] $ module load GCC/7.3.0-2.30 OpenMPI/3.1.1 Python/2.7.15
[sagon@login2 ~] $ python -c "import numpy; numpy.show_config()"
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/ebsofts/Compiler/GCC/7.3.0-2.30/OpenBLAS/0.3.1/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/ebsofts/Compiler/GCC/7.3.0-2.30/OpenBLAS/0.3.1/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/ebsofts/Compiler/GCC/7.3.0-2.30/OpenBLAS/0.3.1/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blis_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/ebsofts/Compiler/GCC/7.3.0-2.30/OpenBLAS/0.3.1/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
[sagon@login2 ~]

So indeed it seems you can control the parallelism with OPENBLAS_NUM_THREADS.

If you set OPENBLAS_NUM_THREADS=1, do you get the expected result?
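
If you need finer control than a process-wide environment variable, the threadpoolctl package can limit the BLAS pool around a specific block of code. A minimal sketch, assuming threadpoolctl is installed (it targets Python 3, so it would not help with Python 2.7):

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(2000, 2000)
h = a + a.T  # a symmetric matrix, so eigvalsh applies

with threadpool_limits(limits=1, user_api="blas"):
    evals = np.linalg.eigvalsh(h)  # this call runs on a single BLAS thread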

About this line in your script:

#SBATCH --job-name=asg-$filename

Is this really working? How do you get the $filename value? Is it a variable set outside of the script?

Dear Yann:

Thanks for your reply. I was a bit schematic when quoting the batch commands; the whole batch script I’m using is the following:

#!/bin/bash

if [ $# -ne 1 ]; then
   echo 'file required'
   exit 1   # stop here if no input file was given
fi

inputname=$1
filename=$(basename $inputname)
jobfolder=jobdata-$filename
jobscript=$jobfolder/run.sbatch

mkdir -p $jobfolder

echo "#!/bin/bash
#SBATCH --job-name=asg-$filename
#SBATCH --output=$jobfolder/job.out
#SBATCH --error=$jobfolder/job.err
#SBATCH --partition=dpt-EL7
#SBATCH --time=00-04:00:00
#SBATCH --cpus-per-task=x
#SBATCH --nodes=1
module load GCC/7.3.0-2.30 OpenMPI/3.1.1 Python/2.7.15
export OPENBLAS_NUM_THREADS=x
srun python $inputname" > $jobscript

sbatch $jobscript

where instead of “x” I write 1 or 8, depending on whether I want a single-thread run or a multithreaded run on 8 CPUs.
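
Incidentally, instead of hard-coding the same number in two places, the thread count could be read from the allocation inside the Python program itself: Slurm sets SLURM_CPUS_PER_TASK inside a job when --cpus-per-task is given. A sketch (it must run before the first numpy import):

import os
# Use the CPU count Slurm allocated; fall back to 1 outside of a job.
os.environ.setdefault("OPENBLAS_NUM_THREADS",
                      os.environ.get("SLURM_CPUS_PER_TASK", "1"))
import numpy as np  # import only after the variable is set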

I’ve also written a little toy program that I think reproduces the problem I pointed out in the first post. It creates a 1D array whose entries depend on their position in the array. I then use this array to construct a tridiagonal matrix M and, for several values of a variable t, I compute the (weighted) average of the elements of the first column of exp(M*t), which I call “complexity” and denote C(t) in the plots below. Explicitly, C(t) = sum_n n*|phi_n(t)|^2, where phi(t) is the first column of exp(M*t). The Python script is the following:

import numpy as np
from scipy.sparse import diags
from scipy.special import binom
import numpy.linalg as la
import time


def bSequence(L, N):
    
    D = int(binom(L,N))
       
    b = [None]*(D**2)
    
    for n in range(D**2):
        
        if (n <= L):
            
            b[n] = n / 7.
            
        elif(n>L):
            
            b[n] = b[L]-(n-L)*(b[L])/(float(D**2)-L)
            
    return b

def makeM(b):
    
    b = np.delete(b,0)
    b = np.delete(b,np.where(b == 0))
    diagonals = [b,-b]
    M = diags(diagonals, [-1, 1]).toarray()
    
    return M

def Spectrum(M): # Compute the spectrum of M, which will be used to exponentiate it.
    
    assert la.norm(M + np.conj(M.T)) < 1e-8 # This method is only correct for skew-Hermitian matrices.

    M = 1j*M

    assert la.norm(M - M.conj().T) < 1e-8 # We created a Hermitian matrix out of the original one.
    
    evals,P = la.eigh(M) # The method la.eigh() is faster than la.eig()
    
    Pinv = P.conj().T # The eigenvectors matrix is unitary!
    
    evals = -1j*evals # We recover the eigenvalues of the original matrix. The eigenvectors are just the same as those of the "artificial" hermitian matrix.
    
    return [evals, P, Pinv]

def Complexity(evals,P,Pinv,t0,dt,ts,su):
    
    v = Pinv[:,0] # Take only the first column of the inverse eigenvectors matrix.
        
    t = t0 - dt
    deltat = dt
    timesteps = ts
    speedUp = su
    
    tVals = [None]*(timesteps*speedUp)
    CVals = [None]*(timesteps*speedUp)
    
    i=-1
    
    for k in range(speedUp):
        actualStep = deltat*(2**k)
        
        for j in range(timesteps):
            
            i += 1
            
            t += actualStep
            
            tVals[i] = t
            
            Diagonal = diags(np.exp(t*evals),0).toarray() # Exponentiate every eigenvalue (item-wise) and put the resulting list in a diagonal matrix.
            
            phit = P.dot(Diagonal).dot(v) # First column of exp(M*t)
            
            C = 0.
            
            for n in range(1,len(phit)):
                
                C += n * np.conj(phit[n]) * phit[n]
                
            CVals[i] = np.real(C) # K-complexity is real, so this makes sure to remove possible residual imaginary parts due to numerical precision.
            
    return [tVals,CVals] # I will want to plot CVals vs tVals, see below.

if __name__ == '__main__':
    
    L = 6
    
    N = 3
    
    start_time = time.time()
    
    t0 = 0 # Initial time value.
    
    dt = 0.01 # Value of the time step.
    
    ts = 100 # Number of time steps (for every scale).
    
    su = 5 # Number of scales spanned.
    
    b = bSequence(L,N)
    
    M = makeM(b)
    
    evals,P,Pinv = Spectrum(M)
    
    result = Complexity(evals,P,Pinv,t0,dt,ts,su) # result is a list with two elements, each element is a 1D array.
    
    filename = 'Toy_Complexity_L'+str(L)+'_N'+str(N)+'_Tin'+str(t0)+'_dt'+str(dt)+'_ts'+str(ts)+'_su'+str(su)
           
    elapsed_time = time.time() - start_time 

    print (elapsed_time)
    
    np.save(filename,result)
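
To decide which result to trust, one possible cross-check, independent of the eigendecomposition, is scipy.linalg.expm. A sketch that could be appended after the functions defined above (same L and N as in the script; the test time t is arbitrary):

import numpy as np
from scipy.linalg import expm

M = makeM(bSequence(6, 3))  # same L = 6, N = 3 as above
evals, P, Pinv = Spectrum(M)

t = 0.5  # arbitrary test time
phi_expm = expm(t * M)[:, 0]  # first column of exp(M*t), computed directly
phi_spec = P.dot(np.diag(np.exp(t * evals))).dot(Pinv[:, 0])
print(np.max(np.abs(phi_expm - phi_spec)))  # tiny if the two methods agree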

After running this program on the cluster (the single-thread execution time is roughly 30 seconds), I download the created file containing the two arrays and plot the second array (result[1]) against the first (result[0]). I find different results depending on whether I use a single thread or multithreading. When using 8 CPUs there is some seemingly “non-deterministic” noise (I call it non-deterministic because it changes after every execution, even though the script is exactly the same and no variables are randomly assigned). See the plot below:

The result I obtain from running the Python script on my PC, which multithreads using 4 CPUs, agrees with the red line (i.e. the 1-thread result from the cluster). The Python version on my computer is 3.7. I’ve also tried the newer Python version you suggested in your post:

ml GCC/8.2.0-2.31.1 OpenMPI/3.1.3 Python/3.7.2 SciPy-bundle/2019.03

But in that case I get error messages of the form:

srun: error: node092: task 0: Illegal instruction

… are some nodes problematic for this Python version?

Thanks for your time and help.
Adrián S.G.