Nvidia A100 Ampere architecture with MIG

Dear users,

we now have one compute node with A100 cards on Baobab, with more to come.

This card is very powerful and has a lot of RAM. For this reason it isn’t easy to “saturate” it.

One way to address this is to use MIG (Multi-Instance GPU).
MIG is supported on the new Nvidia Ampere architecture. It allows an A100 GPU card to be split into up to seven fully isolated GPU instances. This is useful if a job doesn't saturate the GPU. This mechanism doesn't harvest unused GPU capacity, but it gives each user predictable throughput and latency.

The GPU can be shared in two different ways, which can also be mixed together.

vGPU

  • temporal partitioning: shared compute resources
  • if overall usage is low, a single job gets more resources
  • throughput and latency are not predictable
  • the A100 is split into 10 instances of 4 GB each

MIG

  • spatial partitioning: dedicated compute resources
  • heterogeneous instance sizes
  • a job is not perturbed by other jobs
  • not all of the GPU's memory and/or cores can be used

For now we did some testing on gpu020 and split one of the two cards as follows:

[root@gpu020 ~]# nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-9d699867-b051-7fc1-bd12-558372f8959a)
GPU 1: A100-PCIE-40GB (UUID: GPU-a7449be4-8516-9501-f69d-1e5841e103ce)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-a7449be4-8516-9501-f69d-1e5841e103ce/1/0)
  MIG 2g.10gb Device 1: (UUID: MIG-GPU-a7449be4-8516-9501-f69d-1e5841e103ce/5/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-a7449be4-8516-9501-f69d-1e5841e103ce/13/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-a7449be4-8516-9501-f69d-1e5841e103ce/14/0)
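
For reference, a layout like this is created with the nvidia-smi MIG commands. Below is a minimal sketch of the admin-side steps, assuming the standard nvidia-smi MIG workflow (not necessarily the exact commands used on gpu020):

# enable MIG mode on GPU 1 (needs a GPU reset and no running jobs)
nvidia-smi -i 1 -mig 1

# list the available GPU instance profiles and their IDs
nvidia-smi mig -i 1 -lgip

# create one 3g.20gb, one 2g.10gb and two 1g.5gb GPU instances,
# plus the corresponding compute instances (-C)
nvidia-smi mig -i 1 -cgi 3g.20gb,2g.10gb,1g.5gb,1g.5gb -C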

This means you can specify the GPU type you want, like this (example to request a 1g.5gb instance):

--gpus=1g.5gb:1

See the MIG user guide referenced below for the meaning of the profile names.
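
As an illustration, a minimal batch script requesting one 1g.5gb instance could look like the sketch below (the shared-gpu partition and the 2h walltime are placeholders to adapt to your account):

#!/bin/bash
#SBATCH --partition=shared-gpu
#SBATCH --time=0-02:00:00
#SBATCH --gpus=1g.5gb:1

# show the GPU devices visible to the job
srun nvidia-smi -L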

The current issues:

  • the integration in Slurm isn't dynamic; some work would be needed if we want this.
  • not the whole GPU can be used.
  • as the configuration isn't dynamic, we aren't sure it is worth enabling.

If you are interested in the topic, you can cast your vote here:

Do you think it is worth splitting the card?
  • Yes, I don’t need that much power
  • No, I want to have the full A100 power
  • Maybe, I don’t know how much memory I need


And feel free to reply to this thread to open the discussion!

MIG reference: NVIDIA Multi-Instance GPU User Guide :: NVIDIA Tesla Documentation


I would like to try MIG but I don't know what I have to write in my batch script.
One instance with 5 GB is probably enough for me.
Then I have to add this line to my batch script:

--gpus=1g.5gb:1

but where? Before, after, or instead of this line?

--gres=gpu:ampere:1

Thank you

This line is the “old school” way to request GPUs. If you want to use the 1g.5gb card, you can replace this line with

--gpus=1g.5gb:1

Or if you want to use the full A100 card, use this line:

--gpus=ampere:1

Because I split one A100 card as a test, I created four smaller cards:

3g.20gb 
2g.10gb 
1g.5gb 
1g.5gb

Each card is present only once. This means that if you want to launch four jobs on the four cards, you need to specify the model for each job/step.

The idea is that you can test whether the smallest card is enough. If it is, you know that you can request any GPU type on this node.

There is unfortunately no way to specify that you want, for example, either a 3g.20gb or a 2g.10gb card.
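
For example, to put a job on each of the four cards you could submit something like the following (a sketch; run_sim.sh is a hypothetical job script):

sbatch --gpus=3g.20gb:1 run_sim.sh
sbatch --gpus=2g.10gb:1 run_sim.sh
sbatch --gpus=1g.5gb:1 run_sim.sh
# the 1g.5gb profile exists twice on gpu020, so a second job fits on it
sbatch --gpus=1g.5gb:1 run_sim.sh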

Best

Here are the results for my simulation.

Number of time steps done in 24h:

  • Full A100 → 518k

  • 3g.20gb (43% of the SMs, 50% of the memory) → 259k (50% performance w.r.t. the full A100)

  • 2g.10gb (29%, 40%) → 185k (36%)

  • 1g.5gb (14%, 20%) → 93k (18%)

The good point is that, overall, the performance is better with MIG for my simulation: the four instances together deliver more time steps than the full card (259k + 185k + 93k + 93k ≈ 630k vs 518k).
However, it takes too long to complete one run of 500k time steps: with the 1g.5gb card I would need more than 5 days… I don't know if it is relevant to have these small GPUs while P100s are faster and available.

For my use, I would prefer one full A100 rather than MIG. However, maybe we could try this configuration:

3g.20gb
4g.20gb

It could be more convenient as we have many P100s for “shorter” jobs.

Best

Hi,

unfortunately the combination you suggest isn't valid. The nearest would be two times 3g.20gb.

Anyway, as you seem to be able to “saturate” the full A100, there is indeed no good reason to split it for your use case.

The issue we'll face is that other GPU jobs with lower resource needs will use a full A100 for no good reason. Slurm is missing a way to exclude this kind of GPU when the user only asks for a generic GPU, for example.

As you seem to have an application that can do some “real world” benchmarking on GPUs, if you have some spare time I would be very interested in a comparison with the other GPUs we provide: RTX or V100 for example.

I’ll revert the A100 to full in the next few days.

Hi,

Thank you,

I think it is not relevant to run my simulation on RTX because I need double precision.
There is no V100 on Baobab, but for the P100 → 106k time steps done in 24h.

Best

Hi

Ok, good to know.

Feel free to use it anyway on Yggdrasil if needed.

Even high end GPUs become obsolete quickly :roll_eyes:

Hi,

Yes, I am a bit surprised by the factor of \sim 5 in performance (518k vs 106k time steps); theoretically there is only a factor of \sim 2 in double-precision TFLOPS.

Do you know when I'll be able to use the second A100 (without MIG)?

Also, I get a PartitionTimeLimit error when I launch a 24h job. I receive an email saying that my job will not run, and then it either runs or gets cancelled.

Hi,

I’ll try to do that soon.

Can you show us your sbatch please?

Thank you.

#!/bin/env bash
#SBATCH --partition=private-kruse-gpu,shared-gpu
#SBATCH --time=0-24:00:00
#SBATCH --gres=gpu:ampere:1
#SBATCH --mail-user=ludovic.dumoulin@unige.ch
#SBATCH --mail-type=END
#SBATCH --output=slurm-%J.out
#SBATCH --mem=3000
module load Julia
cd /home/users/d/dumoulil/code2/
srun julia --optimize=3 FD-Jacobi-friction.jl

This is the issue. As you are requesting more than 12h, the shared-gpu partition is an invalid selection and you must remove it from your partition list.
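
Concretely, a sketch of the relevant header lines for a 24h job, keeping only the private partition:

#SBATCH --partition=private-kruse-gpu   # shared-gpu is limited to 12h, so it is dropped
#SBATCH --time=0-24:00:00
#SBATCH --gres=gpu:ampere:1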

ok, thank you, now it works :slight_smile:

Hi,

I still cannot use the second A100, is it a bug?

Best,

Hi, this is not a bug, just a lack of time on our side :(

At worst this will be done during the next Baobab maintenance at the end of the month. We’ll try to do it before.

Best

Hi there,

Indeed, it was done during last week's Baobab maintenance (cf. Baobab scheduled maintenance: 30th of June - 01st of July 2021 - #3 by Yann.Sagon).

Thx, bye,
Luca