NVIDIA’s new GPU accelerator, the NVIDIA® Tesla® P100, is a strong option for both High Performance Computing (HPC) and Deep Learning workloads. It comes in two form factors: PCIe, with either 12GB or 16GB of HBM2 memory, and SXM2, with 16GB of HBM2 memory and the NVIDIA NVLink™ high-speed interconnect.
Here is the breakdown. At the beginning of this quarter, NVIDIA started shipping the Tesla P100 as the first member of its new Pascal architecture family. Pascal replaces both the Kepler architecture, which was optimized for double precision performance (K40, K80), and the Maxwell architecture, which was optimized for single precision performance (M40, M4). Double precision performance is required for HPC and scientific applications, while single precision performance is needed for rendering and deep learning applications. Note that NVIDIA skipped a generation on the HPC side: the previous Maxwell architecture has no hardware optimized for double precision, which scientific applications need.
Besides substantial improvements in single precision (SP, FP32) and double precision (DP, FP64) performance, the P100 supports a new native data type, half precision (HP, FP16). Why half precision? The short answer is deep learning (DL). The accuracy of artificial neural networks generally does not improve with increased precision, but computation time can be cut by reducing the precision. In place of one SP operation, the P100 can perform two HP operations simultaneously, so 11 TFLOPS SP turn into 22 TFLOPS HP, roughly a 3x increase over NVIDIA’s Tesla M40, the highest performing enterprise-grade DL card of the previous generation. NVIDIA is currently updating its CUDA libraries and frameworks like Caffe to support the new data type. In that sense the P100 is not only an HPC card but even more so a true DL card. The table below summarizes double, single, and half precision performance and other metrics for the top-of-the-line cards of the Kepler, Maxwell and Pascal families:
As shown in the table, the Tesla P100 comes in two form factors. The PCIe form factor features the traditional PCIe 3.0 x16 interface for card-to-card and card-to-CPU communication. For some applications the PCIe interface can become a bottleneck that limits overall system performance. To address this, NVIDIA added NVLink, a new interconnect co-developed by IBM and NVIDIA. It is 5x faster than PCIe 3.0 x16 and enables significantly faster communication between GPUs and between GPU and CPU. NVLink is not available on cards with the PCIe form factor; it is only available on boards with the new mezzanine-card-like SXM2 form factor. Each SXM2 card supports 4 NVLink channels for both card-to-card and card-to-CPU communication.
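To get a feel for what the 5x figure means in practice, the following minimal sketch estimates how long it takes to move a payload over each interconnect. The 5x ratio comes from the text above; the 16 GB/s baseline for PCIe 3.0 x16 is an assumed approximate theoretical peak per direction, and the function names are hypothetical. Real transfers reach only a fraction of peak bandwidth, so treat the results as lower bounds.

```python
# Assumption: approximate theoretical peak of PCIe 3.0 x16, per direction, in GB/s.
PCIE3_X16_GB_PER_S = 16.0
# From the article: NVLink is about 5x faster than PCIe 3.0 x16.
NVLINK_SPEEDUP = 5.0


def transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time for a payload at a given peak bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return payload_gb / bandwidth_gb_per_s


def compare_interconnects(payload_gb: float) -> tuple[float, float]:
    """Return (pcie_seconds, nvlink_seconds) for the same payload."""
    pcie = transfer_seconds(payload_gb, PCIE3_X16_GB_PER_S)
    nvlink = transfer_seconds(payload_gb, PCIE3_X16_GB_PER_S * NVLINK_SPEEDUP)
    return pcie, nvlink


if __name__ == "__main__":
    # Example payload: the full 16GB of HBM2 memory on one P100.
    pcie_s, nvlink_s = compare_interconnects(16.0)
    print(f"PCIe 3.0 x16: {pcie_s:.2f} s, NVLink: {nvlink_s:.2f} s")
```

Under these assumptions, filling or draining an entire 16GB card takes about a second over PCIe and a fifth of that over NVLink, which is why multi-GPU workloads that exchange data frequently benefit most.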
Several x86 board manufacturers, including SMC and Quanta, will support the SXM2 form factor. On x86-based systems, communication between GPU and CPU remains PCIe based. Currently the maximum number of supported P100 cards is 4 per CPU and 8 per server. Probably the most prominent example of an x86-based system is the NVIDIA DGX-1.
If needed, IBM’s OpenPOWER platform provides further acceleration. Starting with the POWER8 CPU family, IBM supports NVLink for faster data transfer between GPU and CPU. The first systems from Wistron are now available.
In summary, NVIDIA’s new Tesla P100 GPU accelerator is a well-rounded high-end processor for both HPC and DL applications. A new native data type, half precision or FP16, is introduced to essentially double the TFLOPS performance for DL applications. The P100 is offered as a dual-width PCIe card without NVLink and in the SXM2 form factor with 4 NVLink channels. While x86 platforms only support GPU-to-GPU NVLink communication, IBM’s new OpenPOWER platform and POWER8 CPU also enable CPU-to-GPU communication via NVLink.
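The half precision trade-off can be illustrated on the CPU with a minimal sketch. It uses Python's standard `struct` module, whose `e` format encodes IEEE 754 half precision, to show that two FP16 values occupy the same 32 bits as one FP32 value, and that the price is reduced precision. This is not GPU code and says nothing about how the P100 schedules operations; the helper names are hypothetical, and the doubling rule in `peak_fp16_tflops` simply restates the 2x figure from the article.

```python
import struct

FP32_BYTES = struct.calcsize("<f")  # size of one single precision value
FP16_BYTES = struct.calcsize("<e")  # size of one half precision value


def round_trip_fp16(x: float) -> float:
    """Round x to the nearest representable FP16 value and return it."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def pack_fp16_pair(a: float, b: float) -> bytes:
    """Pack two FP16 values side by side into a single 32-bit word."""
    return struct.pack("<2e", a, b)


def peak_fp16_tflops(peak_fp32_tflops: float) -> float:
    """Article's rule of thumb: two HP operations replace one SP operation."""
    return 2 * peak_fp32_tflops


if __name__ == "__main__":
    print(f"FP32: {FP32_BYTES} bytes, FP16: {FP16_BYTES} bytes")
    print(f"two FP16 values packed: {len(pack_fp16_pair(1.0, 2.0))} bytes")
    print(f"0.1 stored as FP16: {round_trip_fp16(0.1)!r}")
    print(f"11 TFLOPS SP -> {peak_fp16_tflops(11)} TFLOPS HP")
```

Values such as 1.0 survive the round trip exactly, while 0.1 comes back slightly off. Neural network training and inference usually tolerate this level of rounding, which is what makes the throughput gain worthwhile for DL workloads.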