Hello,
I am using Kraken, version 4.3.13. Prior to training, I compiled my data into Kraken's binary (Arrow) format. I am working with a fairly large dataset (a bit more than 2,000 pages, to train an OCR model), which is why I cannot share it here:
(kraken-env) (yggdrasil)-[gabays@login1 ~]$ stat dataset.arrow
File: dataset.arrow
Size: 18305066626 Blocks: 35752084 IO Block: 524288 regular file
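For context, a binary dataset like this is typically produced with `ketos compile`; the sketch below shows the kind of invocation involved (the paths and the `-f xml` format are placeholders, not my exact command):

```shell
# Compile XML ground truth (ALTO/PageXML) into Kraken's binary Arrow format.
# Input paths are illustrative; -o sets the output file used for training below.
ketos compile -f xml -o dataset.arrow path/to/ground_truth/*.xml
```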
Before using sbatch, I experimented a bit to find a working configuration and to understand how things behave. I requested an interactive allocation with salloc:
salloc --partition=shared-gpu --time=12:00:00 --gpus=2 --mem=24GB --cpus-per-task=12 --gres=gpu:2,VramPerGpu:24GB
Things go fine for the first epoch, then it suddenly stops:
(kraken-env) (yggdrasil)-[gabays@gpu002 ~]$ ketos train -f binary -d cuda:0 -B 16 --workers 1 -r 0.0004 -u NFC dataset.arrow --freq 1
scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1+cu117 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
┏━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ┃ Name ┃ Type ┃ Params ┃ In sizes ┃ Out sizes ┃
┡━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0 │ val_cer │ _CharErrorRate │ 0 │ ? │ ? │
│ 1 │ val_wer │ _WordErrorRate │ 0 │ ? │ ? │
│ 2 │ net │ MultiParamSequential │ 4.1 M │ [[1, 1, 120, 400], '?'] │ [[1, 258, 1, 50], '?'] │
│ 3 │ net.C_0 │ ActConv2D │ 1.3 K │ [[1, 1, 120, 400], '?'] │ [[1, 32, 120, 400], '?'] │
│ 4 │ net.Do_1 │ Dropout │ 0 │ [[1, 32, 120, 400], '?'] │ [[1, 32, 120, 400], '?'] │
│ 5 │ net.Mp_2 │ MaxPool │ 0 │ [[1, 32, 120, 400], '?'] │ [[1, 32, 60, 200], '?'] │
│ 6 │ net.C_3 │ ActConv2D │ 40.0 K │ [[1, 32, 60, 200], '?'] │ [[1, 32, 60, 200], '?'] │
│ 7 │ net.Do_4 │ Dropout │ 0 │ [[1, 32, 60, 200], '?'] │ [[1, 32, 60, 200], '?'] │
│ 8 │ net.Mp_5 │ MaxPool │ 0 │ [[1, 32, 60, 200], '?'] │ [[1, 32, 30, 100], '?'] │
│ 9 │ net.C_6 │ ActConv2D │ 55.4 K │ [[1, 32, 30, 100], '?'] │ [[1, 64, 30, 100], '?'] │
│ 10 │ net.Do_7 │ Dropout │ 0 │ [[1, 64, 30, 100], '?'] │ [[1, 64, 30, 100], '?'] │
│ 11 │ net.Mp_8 │ MaxPool │ 0 │ [[1, 64, 30, 100], '?'] │ [[1, 64, 15, 50], '?'] │
│ 12 │ net.C_9 │ ActConv2D │ 110 K │ [[1, 64, 15, 50], '?'] │ [[1, 64, 15, 50], '?'] │
│ 13 │ net.Do_10 │ Dropout │ 0 │ [[1, 64, 15, 50], '?'] │ [[1, 64, 15, 50], '?'] │
│ 14 │ net.S_11 │ Reshape │ 0 │ [[1, 64, 15, 50], '?'] │ [[1, 960, 1, 50], '?'] │
│ 15 │ net.L_12 │ TransposedSummarizingRNN │ 1.9 M │ [[1, 960, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 16 │ net.Do_13 │ Dropout │ 0 │ [[1, 400, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 17 │ net.L_14 │ TransposedSummarizingRNN │ 963 K │ [[1, 400, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 18 │ net.Do_15 │ Dropout │ 0 │ [[1, 400, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 19 │ net.L_16 │ TransposedSummarizingRNN │ 963 K │ [[1, 400, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 20 │ net.Do_17 │ Dropout │ 0 │ [[1, 400, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 21 │ net.O_18 │ LinSoftmax │ 103 K │ [[1, 400, 1, 50], '?'] │ [[1, 258, 1, 50], '?'] │
└────┴───────────┴──────────────────────────┴────────┴──────────────────────────┴──────────────────────────┘
Trainable params: 4.1 M
Non-trainable params: 0
Total params: 4.1 M
Total estimated model params size (MB): 16
stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━ 10992/10992 0:31:20 • 0:00:00 6.07it/s val_accuracy: 0.765 early_stopping: 0/10
val_word_accuracy: 0.328 0.76487
stage 1/∞ ━━━━━━━╸━━━━━━━━━━━━━━━━━ 3515/10992 0:10:01 • 0:21:18 5.85it/s val_accuracy: 0.765 early_stopping: 0/10
stage 1/∞ ━━━━━━━━━━━━━━━━━━━━━━━╸━ 10519/10992 0:30:16 • 0:01:22 5.81it/s val_accuracy: 0.765 early_stopping: 0/10
val_word_accuracy: 0.328 0.76487 Killed
I then tried to resume training from the first saved model (--load model_0.mlmodel), reducing the batch size (-B) from 16 to 8. Same issue, only faster:
(kraken-env) (yggdrasil)-[gabays@gpu002 ~]$ ketos train -f binary -d cuda:0 -B 8 --workers 1 -r 0.0003 -u NFC dataset.arrow --freq 1 --load model_0.mlmodel
scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1+cu117 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
┏━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ┃ Name ┃ Type ┃ Params ┃ In sizes ┃ Out sizes ┃
┡━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0 │ val_cer │ _CharErrorRate │ 0 │ ? │ ? │
│ 1 │ val_wer │ _WordErrorRate │ 0 │ ? │ ? │
│ 2 │ net │ MultiParamSequential │ 4.1 M │ [[1, 1, 120, 400], '?'] │ [[1, 258, 1, 50], '?'] │
│ 3 │ net.C_0 │ ActConv2D │ 1.3 K │ [[1, 1, 120, 400], '?'] │ [[1, 32, 120, 400], '?'] │
│ 4 │ net.Do_1 │ Dropout │ 0 │ [[1, 32, 120, 400], '?'] │ [[1, 32, 120, 400], '?'] │
│ 5 │ net.Mp_2 │ MaxPool │ 0 │ [[1, 32, 120, 400], '?'] │ [[1, 32, 60, 200], '?'] │
│ 6 │ net.C_3 │ ActConv2D │ 40.0 K │ [[1, 32, 60, 200], '?'] │ [[1, 32, 60, 200], '?'] │
│ 7 │ net.Do_4 │ Dropout │ 0 │ [[1, 32, 60, 200], '?'] │ [[1, 32, 60, 200], '?'] │
│ 8 │ net.Mp_5 │ MaxPool │ 0 │ [[1, 32, 60, 200], '?'] │ [[1, 32, 30, 100], '?'] │
│ 9 │ net.C_6 │ ActConv2D │ 55.4 K │ [[1, 32, 30, 100], '?'] │ [[1, 64, 30, 100], '?'] │
│ 10 │ net.Do_7 │ Dropout │ 0 │ [[1, 64, 30, 100], '?'] │ [[1, 64, 30, 100], '?'] │
│ 11 │ net.Mp_8 │ MaxPool │ 0 │ [[1, 64, 30, 100], '?'] │ [[1, 64, 15, 50], '?'] │
│ 12 │ net.C_9 │ ActConv2D │ 110 K │ [[1, 64, 15, 50], '?'] │ [[1, 64, 15, 50], '?'] │
│ 13 │ net.Do_10 │ Dropout │ 0 │ [[1, 64, 15, 50], '?'] │ [[1, 64, 15, 50], '?'] │
│ 14 │ net.S_11 │ Reshape │ 0 │ [[1, 64, 15, 50], '?'] │ [[1, 960, 1, 50], '?'] │
│ 15 │ net.L_12 │ TransposedSummarizingRNN │ 1.9 M │ [[1, 960, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 16 │ net.Do_13 │ Dropout │ 0 │ [[1, 400, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 17 │ net.L_14 │ TransposedSummarizingRNN │ 963 K │ [[1, 400, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 18 │ net.Do_15 │ Dropout │ 0 │ [[1, 400, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 19 │ net.L_16 │ TransposedSummarizingRNN │ 963 K │ [[1, 400, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 20 │ net.Do_17 │ Dropout │ 0 │ [[1, 400, 1, 50], '?'] │ [[1, 400, 1, 50], '?'] │
│ 21 │ net.O_18 │ LinSoftmax │ 103 K │ [[1, 400, 1, 50], '?'] │ [[1, 258, 1, 50], '?'] │
└────┴───────────┴──────────────────────────┴────────┴──────────────────────────┴──────────────────────────┘
Trainable params: 4.1 M
Non-trainable params: 0
Total params: 4.1 M
Total estimated model params size (MB): 16
stage 0/∞ ━━━━━━━━━━━━━━━━━━━━━━━━━ 21984/21984 0:38:37 • 0:00:00 9.69it/s val_accuracy: 0.855 early_stopping: 0/10
val_word_accuracy: 0.508 0.85462
stage 1/∞ ━━━━━━━━━━━━━━━━━━╺━━━━━━ 16021/21984 0:30:43 • 3:05:24 0.54it/s val_accuracy: 0.855 early_stopping: 0/10
val_word_accuracy: 0.508 0.85462 Killed
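For reference, a bare Killed with no Python traceback usually points at the host's out-of-memory killer (the process receiving SIGKILL) rather than a CUDA error; under Slurm the job's peak host memory can be checked after the fact, along these lines (the job ID below is a placeholder):

```shell
# Peak resident memory (MaxRSS) vs. requested memory for a finished/killed job.
sacct -j 31170677 --format=JobID,State,MaxRSS,ReqMem,Elapsed

# One-page efficiency summary, including memory utilisation.
seff 31170677
```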
I then started a training run with sbatch (still reducing the batch size, since I suspected it was the main reason for the Killed):
#!/bin/env bash
#SBATCH --partition=shared-gpu
#SBATCH --time=12:00:00
#SBATCH --gpus=1
#SBATCH --output=kraken-%j.out
#SBATCH --mem=24GB
#SBATCH --cpus-per-task=12
#SBATCH --gres=gpu:1,VramPerGpu:24GB
module load fosscuda/2020b Python/3.8.6
source ~/kraken-env/bin/activate
echo "KETOS training"
srun ketos train -f binary -d cuda:0 -B 2 -r 0.0001 -u NFC dataset.arrow
and I get this output:
(yggdrasil)-[gabays@login1 ~]$ cat kraken-31170677.out
KETOS training
scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1+cu117 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
┏━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ ┃ Name ┃ Type ┃ Params ┃ In sizes ┃ Out sizes ┃
┡━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ 0 │ val_cer │ _CharErrorRate │ 0 │ ? │ ? │
│ 1 │ val_wer │ _WordErrorRate │ 0 │ ? │ ? │
│ 2 │ net │ MultiParamSequ… │ 4.1 M │ [[1, 1, 120, │ [[1, 258, 1, │
│ │ │ │ │ 400], '?'] │ 50], '?'] │
│ 3 │ net.C_0 │ ActConv2D │ 1.3 K │ [[1, 1, 120, │ [[1, 32, 120, │
│ │ │ │ │ 400], '?'] │ 400], '?'] │
│ 4 │ net.Do_1 │ Dropout │ 0 │ [[1, 32, 120, │ [[1, 32, 120, │
│ │ │ │ │ 400], '?'] │ 400], '?'] │
│ 5 │ net.Mp_2 │ MaxPool │ 0 │ [[1, 32, 120, │ [[1, 32, 60, │
│ │ │ │ │ 400], '?'] │ 200], '?'] │
│ 6 │ net.C_3 │ ActConv2D │ 40.0 K │ [[1, 32, 60, │ [[1, 32, 60, │
│ │ │ │ │ 200], '?'] │ 200], '?'] │
│ 7 │ net.Do_4 │ Dropout │ 0 │ [[1, 32, 60, │ [[1, 32, 60, │
│ │ │ │ │ 200], '?'] │ 200], '?'] │
│ 8 │ net.Mp_5 │ MaxPool │ 0 │ [[1, 32, 60, │ [[1, 32, 30, │
│ │ │ │ │ 200], '?'] │ 100], '?'] │
│ 9 │ net.C_6 │ ActConv2D │ 55.4 K │ [[1, 32, 30, │ [[1, 64, 30, │
│ │ │ │ │ 100], '?'] │ 100], '?'] │
│ 10 │ net.Do_7 │ Dropout │ 0 │ [[1, 64, 30, │ [[1, 64, 30, │
│ │ │ │ │ 100], '?'] │ 100], '?'] │
│ 11 │ net.Mp_8 │ MaxPool │ 0 │ [[1, 64, 30, │ [[1, 64, 15, │
│ │ │ │ │ 100], '?'] │ 50], '?'] │
│ 12 │ net.C_9 │ ActConv2D │ 110 K │ [[1, 64, 15, │ [[1, 64, 15, │
│ │ │ │ │ 50], '?'] │ 50], '?'] │
│ 13 │ net.Do_10 │ Dropout │ 0 │ [[1, 64, 15, │ [[1, 64, 15, │
│ │ │ │ │ 50], '?'] │ 50], '?'] │
│ 14 │ net.S_11 │ Reshape │ 0 │ [[1, 64, 15, │ [[1, 960, 1, │
│ │ │ │ │ 50], '?'] │ 50], '?'] │
│ 15 │ net.L_12 │ TransposedSumm… │ 1.9 M │ [[1, 960, 1, │ [[1, 400, 1, │
│ │ │ │ │ 50], '?'] │ 50], '?'] │
│ 16 │ net.Do_13 │ Dropout │ 0 │ [[1, 400, 1, │ [[1, 400, 1, │
│ │ │ │ │ 50], '?'] │ 50], '?'] │
│ 17 │ net.L_14 │ TransposedSumm… │ 963 K │ [[1, 400, 1, │ [[1, 400, 1, │
│ │ │ │ │ 50], '?'] │ 50], '?'] │
│ 18 │ net.Do_15 │ Dropout │ 0 │ [[1, 400, 1, │ [[1, 400, 1, │
│ │ │ │ │ 50], '?'] │ 50], '?'] │
│ 19 │ net.L_16 │ TransposedSumm… │ 963 K │ [[1, 400, 1, │ [[1, 400, 1, │
│ │ │ │ │ 50], '?'] │ 50], '?'] │
│ 20 │ net.Do_17 │ Dropout │ 0 │ [[1, 400, 1, │ [[1, 400, 1, │
│ │ │ │ │ 50], '?'] │ 50], '?'] │
│ 21 │ net.O_18 │ LinSoftmax │ 103 K │ [[1, 400, 1, │ [[1, 258, 1, │
│ │ │ │ │ 50], '?'] │ 50], '?'] │
└────┴───────────┴─────────────────┴────────┴────────────────┴─────────────────┘
Trainable params: 4.1 M
Non-trainable params: 0
Total params: 4.1 M
Total estimated model params size (MB): 16
SLURM auto-requeueing enabled. Setting signal handlers.
And then nothing happens.
Do you have any idea what is going on?