Hello all,
I’m trying to run TensorFlow/Keras code whose loss function includes an eigenvalue calculation. The code runs fine on my personal computer, but on Baobab I get the following error message:
File "run.py", line 313, in customLoss
e,_= tf.linalg.eigh(matrix)
File "/opt/ebsofts/MPI/GCC/6.4.0-2.28/OpenMPI/2.1.2/TensorFlow/1.7.0-Python-3.6.4/lib/python3.6/site-packages/tensorflow/python/ops/linalg_ops.py", line 348, in self_adjoint_eig
e, v = gen_linalg_ops.self_adjoint_eig_v2(tensor, compute_v=True, name=name)
File "/opt/ebsofts/MPI/GCC/6.4.0-2.28/OpenMPI/2.1.2/TensorFlow/1.7.0-Python-3.6.4/lib/python3.6/site-packages/tensorflow/python/ops/gen_linalg_ops.py", line 1639, in self_adjoint_eig_v2
"SelfAdjointEigV2", input=input, compute_v=compute_v, name=name)
File "/opt/ebsofts/MPI/GCC/6.4.0-2.28/OpenMPI/2.1.2/TensorFlow/1.7.0-Python-3.6.4/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/opt/ebsofts/MPI/GCC/6.4.0-2.28/OpenMPI/2.1.2/TensorFlow/1.7.0-Python-3.6.4/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/opt/ebsofts/MPI/GCC/6.4.0-2.28/OpenMPI/2.1.2/TensorFlow/1.7.0-Python-3.6.4/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Got info = 8 for batch index 0, expected info = 0. Debug_info = heevd
[[Node: loss/activation_1_loss/SelfAdjointEigV2 = SelfAdjointEigV2[T=DT_FLOAT, compute_v=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](loss/activation_1_loss/mul_3)]]
It’s weird because I have very similar code using eigenvalue calculations that runs fine on Baobab. The cause is rather obscure, so I tried to pinpoint it by reducing my code as much as possible. I already get the same error with a loss function like this:
def customLoss(x, y_pred):
    y_pred = tf.slice(y_pred, begin=(0, 0), size=(batch_size, 9))  # extract relevant elements
    matrix = K.reshape(y_pred, (-1, 3, 3))  # reshape to 3x3 matrices
    matrix = matrix + K.permute_dimensions(matrix, (0, 2, 1))  # add transpose to each matrix in batch
    e, _ = tf.linalg.eigh(matrix)  # get eigenvalues
    return -K.min(e)
In fact, with this reduced loss function, training runs fine for a few epochs and only then fails with the same error.
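Since the failure only shows up after a few epochs, one possibility (an assumption, not something confirmed here) is that the network output has drifted to NaN or Inf by then, and the GPU solver (heevd) reports a nonzero info for such input. A minimal sketch of how this could be checked offline with NumPy, assuming the y_pred values of a failing batch have been dumped to an array; the helper name check_symmetrized_batch and the sample data are hypothetical:

```python
import numpy as np


def check_symmetrized_batch(y_pred, n=3):
    """Rebuild the matrices from customLoss and look for non-finite entries.

    Returns the indices of samples whose matrix contains NaN/Inf, and the
    eigenvalues of the remaining (finite) matrices.
    """
    y_pred = np.asarray(y_pred, dtype=np.float32)
    # Same steps as the loss: keep the first n*n columns, reshape, add transpose.
    matrix = y_pred[:, :n * n].reshape(-1, n, n)
    matrix = matrix + np.transpose(matrix, (0, 2, 1))
    # A sample is "finite" only if all of its n*n entries are finite.
    finite = np.isfinite(matrix).all(axis=(1, 2))
    bad_indices = np.flatnonzero(~finite)
    # CPU eigenvalues for the finite matrices only, in ascending order per matrix.
    eigvals = np.linalg.eigvalsh(matrix[finite])
    return bad_indices, eigvals


# Hypothetical dump of two predictions with 10 outputs each; the second holds a NaN.
sample = np.array(
    [[1, 0, 0, 0, 2, 0, 0, 0, 3, 9],
     [np.nan, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
    dtype=np.float32,
)
bad_indices, eigvals = check_symmetrized_batch(sample)
print("samples with NaN/Inf:", bad_indices)
print("eigenvalues of finite samples:", eigvals)
```

If the list of bad samples is non-empty, the problem is upstream of the eigenvalue call (for example a diverging loss); if it is empty, the matrices themselves are worth comparing between the CPU and GPU solvers.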
Has anyone run into problems like this related to getting eigenvalues of matrices?
Thanks!
P.S. I load the following modules for this:
## TensorFlow
module load GCC/6.4.0-2.28 OpenMPI/2.1.2 TensorFlow/1.7.0-Python-3.6.4 matplotlib/2.1.2-Python-3.6.4 Keras/2.1.6-Python-3.6.4
## CUDA
module load cuDNN/7.0.5-CUDA-9.1.85