The newest GPGPU flagship, the Tesla K20, was announced by NVIDIA at the Supercomputing conference in Salt Lake City yesterday (BTW, you can meet Roman Pavlyuk, ELEKS' CTO, and Oleh Khoma, Head of HPC Unit, there). Thanks to our partnership with NVIDIA, we got access to the K20 a couple of months ago and ran a lot of performance tests. Today we're going to tell you more about its performance in comparison with several other NVIDIA accelerators that we have here at ELEKS.
We implemented a set of synthetic micro-benchmarks that measure the performance of the following basic GPGPU operations:
- Host/Device kernel operations latency
- Reduction time (SUM)
- Dependent/Independent FLOPs
- Memory management
- Memory transfer speed
- Device memory access speed
- Pinned memory access speed
You can find more information and benchmark results below. Our set of tests is available on GitHub, so you can run them on your own hardware if you want. We ran these tests on seven different test configurations:
- GeForce GTX 580 (PCIe-2, OS Windows, physical box)
- GeForce GTX 680 (PCIe-2, OS Windows, physical box)
- GeForce GTX 680 (PCIe-3, OS Windows, physical box)
- Tesla K20Xm (PCIe-3, ECC ON, OS Linux, NVIDIA EAP server)
- Tesla K20Xm (PCIe-3, ECC OFF, OS Linux, NVIDIA EAP server)
- Tesla M2050 (PCIe-2, ECC ON, OS Linux, Amazon EC2)
- Tesla M2050 (PCIe-2, ECC ON, OS Linux, PEER1 HPC Cloud)
One of the goals was to determine the difference between the K20 and older hardware configurations in terms of overall system performance. Another goal was to understand the difference between virtualized and non-virtualized environments. Here is what we got:
Host/Device kernel operations latency
One of the new features of the K20 is Dynamic Parallelism (DP), which allows you to launch kernels from other kernels. We wrote a benchmark that measures the latency of kernel scheduling and execution with and without DP. The results without DP look like this:
Surprisingly, the new Tesla is slower than both the old one and the GTX 680, probably because the driver was still in beta when we measured performance. The AWS GPU instances are also much slower than the closer-to-hardware PEER1 ones, which we attribute to virtualization.
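For readers who want to try this kind of measurement themselves, here is a minimal sketch of how scheduling and execution latency could be timed. It is not the code from our benchmark suite: the names `empty_kernel`, `schedule_latency_us` and `execution_latency_us`, the 1x1 launch configuration and the iteration count are all assumptions made for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                             \
    do {                                                                        \
        cudaError_t err_ = (call);                                              \
        if (err_ != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                            \
        }                                                                       \
    } while (0)

// Empty kernel (hypothetical name): anything we measure is overhead, not computation.
__global__ void empty_kernel() {}

// Average host time to enqueue one asynchronous launch, in microseconds.
double schedule_latency_us(int iterations) {
    empty_kernel<<<1, 1>>>();            // warm-up launch absorbs one-time setup cost
    CHECK(cudaDeviceSynchronize());
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        empty_kernel<<<1, 1>>>();        // returns as soon as the launch is queued
    auto stop = std::chrono::steady_clock::now();
    CHECK(cudaDeviceSynchronize());      // drain the queue and surface launch errors
    return std::chrono::duration<double, std::micro>(stop - start).count() / iterations;
}

// Average time to launch one kernel and wait for it to finish, in microseconds.
double execution_latency_us(int iterations) {
    empty_kernel<<<1, 1>>>();
    CHECK(cudaDeviceSynchronize());
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        empty_kernel<<<1, 1>>>();
        CHECK(cudaDeviceSynchronize());  // block until this launch has completed
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / iterations;
}

int main() {
    const int iterations = 1000;  // assumed value, small enough to stay within the launch queue
    std::printf("schedule:  %.2f us per launch\n", schedule_latency_us(iterations));
    std::printf("execution: %.2f us per launch\n", execution_latency_us(iterations));
    return 0;
}
```

Keep the iteration count modest in the scheduling loop: once the driver's launch queue fills up, the host starts waiting for the device and the number stops reflecting pure scheduling cost.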
Then we ran a similar benchmark with DP on:
Obviously, we couldn't run these tests on older hardware because it doesn't support DP. Surprisingly, DP scheduling is slower than traditional scheduling. DP execution time is pretty much the same as traditional with ECC ON, while traditional execution is faster with ECC OFF. We expected DP latency to be lower than traditional latency. It is hard to say what causes this slowness; we suppose it could be the driver, but that is only our assumption.
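To make the idea of a device-side launch concrete, here is a small sketch of a parent kernel that launches a child kernel and records how many clock cycles the launch call took. The kernel names, the single-thread configuration and the use of `clock64()` are our illustrative assumptions, not the benchmark's actual method. Building it should require a device of compute capability 3.5 or higher and relocatable device code, for example `nvcc -rdc=true file.cu -lcudadevrt` with a suitable `-arch` flag.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                             \
    do {                                                                        \
        cudaError_t err_ = (call);                                              \
        if (err_ != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                            \
        }                                                                       \
    } while (0)

// Child kernel (hypothetical): leaves a visible trace that it really ran.
__global__ void child(int* counter) { atomicAdd(counter, 1); }

// Parent kernel (hypothetical): launches the child from the device and
// records the cycles spent issuing that launch (not the child's run time).
__global__ void parent(int* counter, long long* launch_cycles) {
    long long t0 = clock64();
    child<<<1, 1>>>(counter);  // device-side launch: the Dynamic Parallelism feature
    long long t1 = clock64();
    *launch_cycles = t1 - t0;
}

// Runs one parent/child pair and returns how many times the child executed.
int run_parent_child(long long* cycles_out) {
    int* d_counter = nullptr;
    long long* d_cycles = nullptr;
    CHECK(cudaMalloc(&d_counter, sizeof(int)));
    CHECK(cudaMalloc(&d_cycles, sizeof(long long)));
    CHECK(cudaMemset(d_counter, 0, sizeof(int)));

    parent<<<1, 1>>>(d_counter, d_cycles);
    CHECK(cudaGetLastError());
    // The parent grid only counts as finished once its child grid has finished.
    CHECK(cudaDeviceSynchronize());

    int counter = 0;
    CHECK(cudaMemcpy(&counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(cycles_out, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_counter));
    CHECK(cudaFree(d_cycles));
    return counter;
}

int main() {
    long long cycles = 0;
    int runs = run_parent_child(&cycles);
    std::printf("child ran %d time(s), device-side launch took %lld cycles\n", runs, cycles);
    return 0;
}
```

Cycle counts from `clock64()` depend on the device clock, so compare them between cards only after converting to time.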
Reduction time (SUM)
The next thing we measured was reduction execution time: basically, we calculated the sum of an array. We did it with different array and grid sizes (Blocks x Threads x Array size):
Here we got the expected results. The new Tesla K20 is slower on small data sets, probably because of its lower clock frequency and immature drivers. It becomes faster when we work with big arrays and use as many cores as possible.
Regarding virtualization, we found that the virtualized M2050 is comparable with the non-virtualized one on small data sets, but much slower on large data sets.
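For context, a SUM reduction parameterized by blocks, threads and array size can be sketched as below. This is a textbook shared-memory tree reduction, not necessarily the kernel used in our tests; `block_sum`, `gpu_sum` and the 64 x 256 x 2^20 configuration are hypothetical choices.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                             \
    do {                                                                        \
        cudaError_t err_ = (call);                                              \
        if (err_ != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                            \
        }                                                                       \
    } while (0)

// Each thread accumulates a grid-stride slice of the input, then the block
// combines its per-thread sums with a tree reduction in shared memory.
__global__ void block_sum(const float* in, float* partial, size_t n) {
    extern __shared__ float cache[];
    unsigned tid = threadIdx.x;
    float acc = 0.0f;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + tid; i < n;
         i += (size_t)blockDim.x * gridDim.x)
        acc += in[i];
    cache[tid] = acc;
    __syncthreads();
    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = cache[0];
}

// Sums `host` on the GPU. `threads` is assumed to be a power of two,
// which the halving loop in block_sum relies on.
float gpu_sum(const std::vector<float>& host, int blocks, int threads) {
    float* d_in = nullptr;
    float* d_partial = nullptr;
    CHECK(cudaMalloc(&d_in, host.size() * sizeof(float)));
    CHECK(cudaMalloc(&d_partial, blocks * sizeof(float)));
    CHECK(cudaMemcpy(d_in, host.data(), host.size() * sizeof(float),
                     cudaMemcpyHostToDevice));

    block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, host.size());
    CHECK(cudaGetLastError());

    // The blocking copy also waits for the kernel to finish.
    std::vector<float> partial(blocks);
    CHECK(cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
                     cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_partial));

    float total = 0.0f;  // the final pass over one value per block is done on the host
    for (float p : partial) total += p;
    return total;
}

int main() {
    std::vector<float> data(1 << 20, 1.0f);  // assumed test input: 2^20 ones
    std::printf("sum = %.0f\n", gpu_sum(data, 64, 256));
    return 0;
}
```

With few blocks and a small array, most of the device sits idle, which is one way to read the small-data-set results above.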
Dependent/Independent FLOPs
Peak theoretical performance is one of the most misunderstood properties of computing hardware. Some people say it means nothing, others say it is critical. The truth is usually somewhere in between. We tried to measure performance in FLOPs using several basic operations. We measured two types of operations, dependent and independent, in order to determine whether the GPU automatically parallelizes independent operations. Here's what we got:
Overall, Teslas are much faster than GeForces when you work with double precision floating point numbers, which is expected: consumer accelerators are optimized for single precision because double precision is not required in computer games, the primary software they were designed for. FLOPs are also highly dependent on clock speed and the number of cores, so newer cards with more cores are usually faster. The one exception is the GTX 580 versus the GTX 680 in double precision: the 580 is faster because of its higher clock frequency.
Virtualization doesn't affect FLOPs performance at all.
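The distinction between dependent and independent operations can be illustrated with two kernels that execute the same number of multiply-adds per thread: one as a single chain, the other as four chains that do not depend on each other. This is a sketch under our own assumptions (single precision, four chains, the constants, and the names `dependent_fma`, `independent_fma` and `gflops` are all hypothetical), not the exact kernels from the benchmark.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                             \
    do {                                                                        \
        cudaError_t err_ = (call);                                              \
        if (err_ != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                            \
        }                                                                       \
    } while (0)

// Dependent: every multiply-add needs the result of the previous one.
__global__ void dependent_fma(float* out, int iters) {
    float a = threadIdx.x * 1e-3f;
    for (int i = 0; i < iters; ++i)
        a = a * 0.999f + 0.001f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;  // store so the loop is not optimized away
}

// Independent: four separate chains that the hardware is free to overlap.
// Performs the same `iters` multiply-adds per thread when iters is a multiple of 4.
__global__ void independent_fma(float* out, int iters) {
    float a = threadIdx.x * 1e-3f, b = a + 1.0f, c = a + 2.0f, d = a + 3.0f;
    for (int i = 0; i < iters; i += 4) {
        a = a * 0.999f + 0.001f;
        b = b * 0.999f + 0.001f;
        c = c * 0.999f + 0.001f;
        d = d * 0.999f + 0.001f;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c + d;
}

// Times one launch with CUDA events and converts it to GFLOP/s,
// counting each multiply-add as two floating point operations.
double gflops(void (*kernel)(float*, int), int blocks, int threads, int iters) {
    float* d_out = nullptr;
    CHECK(cudaMalloc(&d_out, (size_t)blocks * threads * sizeof(float)));
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    kernel<<<blocks, threads>>>(d_out, iters);  // warm-up run, not timed
    CHECK(cudaEventRecord(start));
    kernel<<<blocks, threads>>>(d_out, iters);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_out));
    return 2.0 * iters * blocks * threads / (ms * 1e6);
}

int main() {
    const int blocks = 1024, threads = 256, iters = 1 << 16;  // assumed sizes; iters % 4 == 0
    std::printf("dependent:   %.1f GFLOP/s\n", gflops(dependent_fma, blocks, threads, iters));
    std::printf("independent: %.1f GFLOP/s\n", gflops(independent_fma, blocks, threads, iters));
    return 0;
}
```

Switching `float` to `double` in both kernels would give the double precision variant of the same comparison.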
Memory management
Another critical thing for HPC is basic memory management speed. Since several memory models are available in CUDA, it is also critical to understand the implications of using each of them. We wrote a test that allocates and releases 16 B, 10 MB and 100 MB blocks of memory in different models. Please note: the results in this benchmark vary widely, so it makes sense to show them on charts with a logarithmic scale. Here they are:
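As a note on methodology, an allocate-and-release timing of this kind could be sketched as follows. The choice of the four memory types (pageable host memory, device memory, page-locked memory and write-combined memory), the repetition count and the helper name `alloc_release_us` are assumptions for illustration; the benchmark itself may cover a different set.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                             \
    do {                                                                        \
        cudaError_t err_ = (call);                                              \
        if (err_ != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                            \
        }                                                                       \
    } while (0)

// Average time of one allocate + release pair, in microseconds.
template <typename Alloc, typename Release>
double alloc_release_us(Alloc alloc, Release release, size_t bytes, int reps) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        void* p = alloc(bytes);
        release(p);
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / reps;
}

int main() {
    const size_t sizes[] = {16, (size_t)10 << 20, (size_t)100 << 20};  // 16 B, 10 MB, 100 MB
    const int reps = 10;  // assumed repetition count
    CHECK(cudaFree(0));   // create the CUDA context up front so it is not timed

    for (size_t bytes : sizes) {
        double pageable = alloc_release_us(
            [](size_t n) { return std::malloc(n); },
            [](void* p) { std::free(p); }, bytes, reps);
        double device = alloc_release_us(
            [](size_t n) { void* p = nullptr; CHECK(cudaMalloc(&p, n)); return p; },
            [](void* p) { CHECK(cudaFree(p)); }, bytes, reps);
        double pinned = alloc_release_us(
            [](size_t n) { void* p = nullptr; CHECK(cudaMallocHost(&p, n)); return p; },
            [](void* p) { CHECK(cudaFreeHost(p)); }, bytes, reps);
        double write_combined = alloc_release_us(
            [](size_t n) {
                void* p = nullptr;
                CHECK(cudaHostAlloc(&p, n, cudaHostAllocWriteCombined));
                return p;
            },
            [](void* p) { CHECK(cudaFreeHost(p)); }, bytes, reps);
        std::printf("%zu bytes: pageable %.1f us, device %.1f us, pinned %.1f us, "
                    "write-combined %.1f us\n",
                    bytes, pageable, device, pinned, write_combined);
    }
    return 0;
}
```

Note that `malloc` may return without touching the pages, so the pageable figure can look unrealistically small next to the others; this is one reason a logarithmic scale helps.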
Memory transfer speed
Another important characteristic of an accelerator is the speed of data transfer from one memory model to another. We measured it by copying 100 MB blocks of data between host and GPU memory in both directions using regular, page-locked and write-combined memory access models. Here's what we got:
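As a note on methodology, the three host memory types named above map onto `malloc`, `cudaMallocHost` and `cudaHostAlloc` with the `cudaHostAllocWriteCombined` flag. A minimal sketch of the measurement, timed with CUDA events, might look like this; the helper names `copy_gbps` and `report` and the single warm-up copy are our assumptions rather than the benchmark's exact procedure.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                             \
    do {                                                                        \
        cudaError_t err_ = (call);                                              \
        if (err_ != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                            \
        }                                                                       \
    } while (0)

// Times one cudaMemcpy with CUDA events and returns the bandwidth in GB/s.
double copy_gbps(void* dst, const void* src, size_t bytes, cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(dst, src, bytes, kind));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return bytes / (ms * 1e6);  // bytes per millisecond -> GB/s
}

// Prints host-to-device and device-to-host bandwidth for one host buffer.
void report(const char* label, void* host, void* device, size_t bytes) {
    std::memset(host, 1, bytes);  // touch every page before timing
    copy_gbps(device, host, bytes, cudaMemcpyHostToDevice);  // warm-up copy, result ignored
    double h2d = copy_gbps(device, host, bytes, cudaMemcpyHostToDevice);
    double d2h = copy_gbps(host, device, bytes, cudaMemcpyDeviceToHost);
    std::printf("%-15s H2D %.2f GB/s, D2H %.2f GB/s\n", label, h2d, d2h);
}

int main() {
    const size_t bytes = (size_t)100 << 20;  // 100 MB, as in the text
    void* device = nullptr;
    CHECK(cudaMalloc(&device, bytes));

    void* pageable = std::malloc(bytes);
    if (pageable == nullptr) return EXIT_FAILURE;
    void* pinned = nullptr;
    void* write_combined = nullptr;
    CHECK(cudaMallocHost(&pinned, bytes));
    CHECK(cudaHostAlloc(&write_combined, bytes, cudaHostAllocWriteCombined));

    report("regular", pageable, device, bytes);
    report("page-locked", pinned, device, bytes);
    report("write-combined", write_combined, device, bytes);

    std::free(pageable);
    CHECK(cudaFreeHost(pinned));
    CHECK(cudaFreeHost(write_combined));
    CHECK(cudaFree(device));
    return 0;
}
```

The copy time here is bounded by the PCIe link, which is why the PCIe-2 and PCIe-3 configurations are listed separately.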
Device memory access speed
We also measured device memory access speed for each configuration we have. Here are the results:
Pinned memory access speed
The last metric we measured was pinned memory access speed, when the device interacts with host memory directly. Unfortunately, we weren't able to run these tests on the GTX 680 with PCIe-3 due to an issue with allocating big memory blocks in Windows.
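One way a kernel can work with host memory directly is through mapped (zero-copy) pinned memory. The sketch below assumes that this is the access pattern in question and uses a simple copy kernel to read a mapped host buffer from the device; the buffer size, launch configuration and the names `copy_kernel` and `mapped_read_gbps` are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                             \
    do {                                                                        \
        cudaError_t err_ = (call);                                              \
        if (err_ != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE);                                            \
        }                                                                       \
    } while (0)

// Grid-stride copy: when `src` points at mapped host memory, every read
// travels over PCIe.
__global__ void copy_kernel(const float* src, float* dst, size_t n) {
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += (size_t)blockDim.x * gridDim.x)
        dst[i] = src[i];
}

// Bandwidth in GB/s of a kernel reading `n` floats from mapped pinned host memory.
double mapped_read_gbps(size_t n) {
    const size_t bytes = n * sizeof(float);
    float* h_mapped = nullptr;
    float* d_view = nullptr;  // device-side address of the same host buffer
    float* d_dst = nullptr;
    CHECK(cudaHostAlloc((void**)&h_mapped, bytes, cudaHostAllocMapped));
    for (size_t i = 0; i < n; ++i) h_mapped[i] = 1.0f;
    CHECK(cudaHostGetDevicePointer((void**)&d_view, h_mapped, 0));
    CHECK(cudaMalloc(&d_dst, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    copy_kernel<<<256, 256>>>(d_view, d_dst, n);  // warm-up run, not timed
    CHECK(cudaEventRecord(start));
    copy_kernel<<<256, 256>>>(d_view, d_dst, n);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_dst));
    CHECK(cudaFreeHost(h_mapped));
    return bytes / (ms * 1e6);
}

int main() {
    // Request host-memory mapping before the CUDA context is created.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));
    const size_t n = ((size_t)100 << 20) / sizeof(float);  // 100 MB, an assumed size
    std::printf("mapped pinned read: %.2f GB/s\n", mapped_read_gbps(n));
    return 0;
}
```

Large page-locked allocations can fail on some systems, so a smaller buffer is a reasonable fallback when the allocation is rejected.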