Dear @Yann.Sagon,
Thank you for your reply.
The A100 GPU on gpu022 has 40 GB of memory, whereas the GPU on gpu03.bamboo has 80 GB. The latter also has faster memory. According to NVIDIA, this can yield a speed-up depending on the application being run.
While it is true that the A100 80 GB has ~25% more memory bandwidth, which would translate into improved performance on memory-bound tasks, the task I used to test the performance is actually compute-bound. I am certain of this because I was monitoring the GPU power usage and CUDA core utilization, both of which were constantly at 100%.
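For reference, this is roughly how I was watching the cards during the runs. The query fields below are standard nvidia-smi properties; the 1-second sampling interval is just a convenient choice:

```shell
# Sample power draw, SM clock, GPU utilization and temperature once per second
# (Ctrl-C to stop). All fields are standard nvidia-smi query properties.
nvidia-smi --query-gpu=timestamp,power.draw,clocks.sm,utilization.gpu,temperature.gpu \
           --format=csv -l 1
```

Logging `clocks.sm` alongside utilization is useful here: if utilization stays at 100% while the SM clock drops, the slowdown is throttling rather than a starved pipeline.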
Under compute-bound tasks, both versions of the A100 should behave the same, since they are essentially the same chip; in the NVIDIA specs, the peak FLOPS figures are identical for both.
Moreover, for the first iterations both GPUs deliver the same performance, which would be impossible if the 80 GB model were faster simply because of strictly better hardware.
I believe some kind of throttling is kicking in on the A100 40 GB PCIe cards, but I have not been able to identify the cause.
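One way to narrow this down (assuming shell access on the node while a job is running): nvidia-smi reports the currently active throttle reasons directly.

```shell
# Dump the performance section, which includes "Clocks Throttle Reasons"
# (SW Power Cap, HW Slowdown, SW Thermal Slowdown, etc.) for each GPU.
nvidia-smi -q -d PERFORMANCE
```

If, for example, "SW Thermal Slowdown" shows Active only on the 40 GB cards once they warm up, that would explain why they start at full speed and then decay.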
On gpu022, there is a parameter named NPS (NUMA nodes Per Socket) that is set to 1, whereas on gpu003 I believe it is set to 4. I have set gpu022 to drain in order to change this parameter. This may improve the bandwidth between memory, CPU, and PCIe. After the change, you can rerun your application to see if the speed improves.
Thanks. I am currently running a long job on gpu022, but I will try to profile it once this parameter has been changed. That said, I doubt it will change much, since, as I said before, the workload is compute-bound. Moreover, the dataset I am loading from RAM to the GPU is only 3 MB in total, so it is extremely improbable that the PCIe link is the bottleneck.
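Once the NPS change is applied, a quick way to confirm the new NUMA layout from a shell on the node would be (standard numactl and nvidia-smi commands):

```shell
# Show the host's NUMA layout; the number of nodes reported should
# reflect the NPS setting (1 node per socket vs. 4).
numactl --hardware

# Show the PCIe topology and the CPU/NUMA affinity of each GPU.
nvidia-smi topo -m
```

This would at least verify that the BIOS change took effect and which NUMA node each A100 is attached to.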
I have run more experiments on other A100s from Baobab, both 40 GB and 80 GB models (image attached below). The 80 GB models are able to sustain their performance under constant load, but all the 40 GB models start at the same performance as the 80 GB models and then quickly drop to a lower level. Interestingly, gpu031's performance decays much more slowly.
I have rented a private node from Vast.ai to compare its performance with our HPC. Vast.ai offers containerized environments hosted by private individuals, which usually translates into inconsistent performance. Even so, the GPU I rented (blue line) is able to sustain performance similar to the Bamboo nodes whenever the host's shared resources are not under contention.
I hope you find this information useful. Please let me know if I can help in any way.
Best regards,
Ramon.