Nvidia A100 Ampere architecture with MIG

Dear users,

we now have one compute node with A100 cards on Baobab, with more to come.

This card is very powerful and has a lot of RAM. For this reason it isn’t easy to “saturate” it.

One way to address this is to use MIG (Multi-Instance GPU).
MIG is supported on the new Nvidia Ampere architecture. It allows an A100 GPU card to be split into up to seven fully isolated GPU instances. This is useful if a job doesn't saturate the GPU. This mechanism doesn't harvest unused GPU capacity, but it gives each user predictable throughput and latency.

The GPU can be shared in two different ways, which can also be mixed together.

vGPU

  • temporal partitioning: shared compute resources
  • if overall usage is low, a single job gets more resources
  • throughput and latency are not predictable
  • the A100 is split into 10 instances of 4 GB each

MIG

  • spatial partitioning: dedicated compute resources
  • heterogeneous instance sizes
  • a job is not perturbed by other jobs
  • not all of the GPU's memory and/or cores can be used

For now we did some testing on gpu020 and split one of the two cards as follows:

[root@gpu020 ~]# nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-9d699867-b051-7fc1-bd12-558372f8959a)
GPU 1: A100-PCIE-40GB (UUID: GPU-a7449be4-8516-9501-f69d-1e5841e103ce)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-a7449be4-8516-9501-f69d-1e5841e103ce/1/0)
  MIG 2g.10gb Device 1: (UUID: MIG-GPU-a7449be4-8516-9501-f69d-1e5841e103ce/5/0)
  MIG 1g.5gb Device 2: (UUID: MIG-GPU-a7449be4-8516-9501-f69d-1e5841e103ce/13/0)
  MIG 1g.5gb Device 3: (UUID: MIG-GPU-a7449be4-8516-9501-f69d-1e5841e103ce/14/0)
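
For reference, a layout like this is created with the nvidia-smi MIG commands. Below is a minimal sketch of the admin-side steps, assuming the standard nvidia-smi MIG workflow (not necessarily the exact commands used on gpu020):

# enable MIG mode on GPU 1 (needs a GPU reset and no running jobs)
nvidia-smi -i 1 -mig 1

# list the available GPU instance profiles and their IDs
nvidia-smi mig -i 1 -lgip

# create one 3g.20gb, one 2g.10gb and two 1g.5gb GPU instances,
# plus the corresponding compute instances (-C)
nvidia-smi mig -i 1 -cgi 3g.20gb,2g.10gb,1g.5gb,1g.5gb -C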

This means you can specify the GPU type you want, like this (example to request a 1g.5gb instance):

--gpus=1g.5gb:1

See the MIG user guide referenced below for the meaning of the profile names.
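
As an illustration, a minimal batch script requesting one 1g.5gb instance could look like the sketch below (the shared-gpu partition and the 2h walltime are placeholders to adapt to your account):

#!/bin/bash
#SBATCH --partition=shared-gpu
#SBATCH --time=0-02:00:00
#SBATCH --gpus=1g.5gb:1

# show the GPU devices visible to the job
srun nvidia-smi -L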

The current issues:

  • the integration in Slurm isn't dynamic; some work would be needed if we want this.
  • not the whole GPU can be used.
  • as the configuration isn't dynamic, we aren't sure it is worth enabling.

If you are interested in the topic, you can cast your vote here:

Do you think it is worth splitting the card?
  • Yes, I don’t need that much power
  • No, I want to have the full A100 power
  • Maybe, I don’t know how much memory I need


And feel free to reply to this thread to open the discussion!

MIG reference: NVIDIA Multi-Instance GPU User Guide :: NVIDIA Tesla Documentation


I would like to try MIG but I don't know what I have to write in my batch script.
One instance with 5 GB is probably enough for me.
Then I have to add this line to my batch script:

--gpus=1g.5gb:1

but where? Before, after, or instead of this line?

--gres=gpu:ampere:1

Thank you

This line is the “old school” way to request GPUs. If you want to use the 1g.5gb card, you can replace this line with

--gpus=1g.5gb:1

Or if you want to use the full A100 card, use this line:

--gpus=ampere:1

Because I split one A100 card as a test, I created four smaller cards:

3g.20gb 
2g.10gb 
1g.5gb 
1g.5gb

Each card is present only once. This means that if you want to launch four jobs on the four cards, you need to specify the model for each job/step.

The idea is that you can test whether the smallest card is enough. If it is, you know that you can request any GPU type on this node.

There is unfortunately no way to specify that you want, for example, either a 3g.20gb or a 2g.10gb card.
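
For example, to put a job on each of the four cards you could submit something like the following (a sketch; run_sim.sh is a hypothetical job script):

sbatch --gpus=3g.20gb:1 run_sim.sh
sbatch --gpus=2g.10gb:1 run_sim.sh
sbatch --gpus=1g.5gb:1 run_sim.sh
# the 1g.5gb profile exists twice on gpu020, so a second job fits on it
sbatch --gpus=1g.5gb:1 run_sim.sh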

Best

Here are the results for my simulation.

Number of time steps done in 24h:

  • Full A100 → 518k

  • 3g.20gb (43% of the SMs, 50% of the memory) → 259k (50% performance w.r.t. the full A100)

  • 2g.10gb (29%, 40%) → 185k (36%)

  • 1g.5gb (14%, 20%) → 93k (18%)

The good point is that, overall, the performance is better with MIG for my simulation: the four instances together deliver more time steps than the full card (259k + 185k + 93k + 93k ≈ 630k vs 518k).
However, it takes too long to complete one run of 500k time steps: with the 1g.5gb card I would need more than 5 days… I don't know if it is relevant to have these small GPUs while P100s are faster and available.

For my use, I would prefer one full A100 rather than MIG. However, maybe we could try this configuration:

3g.20gb
4g.20gb

It could be more convenient as we have many P100s for “shorter” jobs.

Best

Hi,

unfortunately the combination you suggest isn't valid. The nearest would be two times 3g.20gb.

Anyway, as you seem to be able to “saturate” the full A100, there is indeed no good reason to split it for your use case.

The issue we'll face is that other GPU jobs with lower resource needs will use a full A100 for no good reason. Slurm is missing a way to exclude this kind of GPU when the user only asks for a generic GPU, for example.

As you seem to have an application that can do some “real world” benchmarking on GPUs, if you have some spare time I would be very interested in a comparison with the other GPUs we provide: RTX or V100 for example.

I’ll revert the A100 to full in the next few days.

Hi,

Thank you,

I think it is not relevant to run my simulation on RTX because I need double precision.
There is no V100 on Baobab, but for the P100 → 106k time steps done in 24h.

Best

Hi

Ok, good to know.

Feel free to use it anyway on Yggdrasil if needed.

Even high end GPUs become obsolete quickly :roll_eyes:

Hi,

Yes, I am a bit surprised by the factor of \sim 5 in performance (518k vs 106k time steps); theoretically there is only a factor of \sim 2 in double-precision TFLOPS.

Do you know when I'll be able to use the second A100 (without MIG)?

Also, I get a PartitionTimeLimit error when I launch a 24h job. I receive an email saying that my job will not run, and then it either runs or gets cancelled.

Hi,

I’ll try to do that soon.

Can you show us your sbatch please?

Thank you.

#!/bin/env bash
#SBATCH --partition=private-kruse-gpu,shared-gpu
#SBATCH --time=0-24:00:00
#SBATCH --gres=gpu:ampere:1
#SBATCH --mail-user=ludovic.dumoulin@unige.ch
#SBATCH --mail-type=END
#SBATCH --output=slurm-%J.out
#SBATCH --mem=3000
module load Julia
cd /home/users/d/dumoulil/code2/
srun julia --optimize=3 FD-Jacobi-friction.jl

This is the issue. As you are requesting more than 12h, the shared-gpu partition is an invalid selection and you must remove it from your partition list.
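
Concretely, a sketch of the relevant header lines for a 24h job, keeping only the private partition:

#SBATCH --partition=private-kruse-gpu   # shared-gpu is limited to 12h, so it is dropped
#SBATCH --time=0-24:00:00
#SBATCH --gres=gpu:ampere:1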

ok, thank you, now it works :slight_smile:

Hi,

I still cannot use the second A100, is it a bug?

Best,

Hi, this is not a bug, just a lack of time on our side :(

At worst this will be done during the next Baobab maintenance at the end of the month. We’ll try to do it before.

Best

Hi there,

Indeed, it was done during last week's Baobab maintenance (cf. Baobab scheduled maintenance: 30th of June - 01st of July 2021 - #3 by Yann.Sagon).

Thx, bye,
Luca