High Performance Computing (HPC) Update

Our HPC effort at VMware has been very active in recent months, and we have lots of information to share, including new performance results. Rather than cramming all of that content into a single mega blog entry, I’ve decided instead to give a preview here of some of the most significant developments. I will then delve into each of these areas in more detail in a series of upcoming monthly blog posts.

InfiniBand Performance


This spring we installed a four-node InfiniBand HPC cluster in our lab in Cambridge, MA. The system includes four HP DL380p Gen8 servers, each with 128 GB of memory, two 3.0 GHz Intel E5-2667 eight-core processors, and Mellanox ConnectX-3 cards that support both FDR (56 Gb/s) InfiniBand and 40 Gb/s RoCE. The nodes are connected with a Mellanox MSX6012F-1BFS 12-port switch.


Na Zhang, a PhD student at Stony Brook University, did an internship with us this summer. She accomplished a prodigious amount of performance tuning and benchmarking, looking at a range of benchmarks and full applications. We have lots of performance data to share, including IB, RoCE, and SR-IOV results over a range of configurations.



FDR IB latencies: native, ESX 5.5u1, ESX prototype



In addition to testing on ESX 5.5u1, we have been working closely with our R&D teams to evaluate performance on experimental builds of ESX. The graphs above show InfiniBand latencies measured with VM Direct Path I/O on ESX 5.5u1 and on an engineering build of ESX. As you can see, the 15-20% latency overhead for very small messages measured on ESX 5.5u1 has been eliminated in the engineering build, an important advance for VMware’s HPC efforts!
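
For readers who want a feel for how numbers like these are gathered, here is a minimal MPI ping-pong latency sketch in C. It is similar in spirit to, but much simpler than, the standard micro-benchmarks we ran; the message size, iteration count, and output format are arbitrary choices for illustration only.

```c
/* Minimal MPI ping-pong latency sketch (illustrative only).
 * Rank 0 sends a small message to rank 1, which echoes it back;
 * the average one-way latency is the round-trip time divided by two. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS    10000
#define MSG_SIZE 8        /* "very small message" size in bytes (arbitrary) */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];
    memset(buf, 0, sizeof(buf));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("avg one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```

Launched with two ranks placed on different nodes, this exercises the interconnect path and reports the kind of small-message one-way latency shown in the graphs above.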


Na continues to work with us part time during this school year and will rejoin us for another internship next summer. We expect to continue to publish a wide variety of performance results over the coming months. We are also in the process of doubling the size of our cluster, and so will be able to test at higher scale as well.


Single Root IO Virtualization (SR-IOV)


In addition to testing InfiniBand and RoCE with VM Direct Path I/O (passthrough), we have been working closely with our partner Mellanox to evaluate an early version of InfiniBand SR-IOV support for ESX 5.5. Unlike passthrough mode, which makes an entire device directly visible within a virtual machine, SR-IOV (Single Root I/O Virtualization) allows a single hardware device to appear as multiple virtual devices, each of which can be assigned to a different VM, as illustrated in the diagram below. The PF (physical function) driver is an early version provided to us by Mellanox, and the VF (virtual function) driver is included in the latest releases of the Mellanox OFED distribution.



SR-IOV allows sharing of a single device among multiple VMs



One of the primary HPC use cases for SR-IOV is to allow multiple VMs on a host to access high-performance cluster file systems like Lustre and GPFS by sharing the host’s single physical InfiniBand connection to the storage system. We will be demonstrating this capability in the EMC booth at SC’14 in New Orleans. Stop by and say hello.
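
From inside a guest, an SR-IOV virtual function looks like an ordinary RDMA device, so standard verbs-based tools and file-system clients can use it unchanged. As a rough illustration (not code from our testing), the following sketch uses the libibverbs API to enumerate whatever InfiniBand devices the VM sees, whether a full passthrough device or a VF.

```c
/* Illustrative sketch: list the RDMA devices visible inside a VM
 * using the standard libibverbs API. Compile with -libverbs. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr attr;
        if (ibv_query_device(ctx, &attr) == 0)
            printf("%s: %d port(s), max_qp=%d\n",
                   ibv_get_device_name(devs[i]),
                   attr.phys_port_cnt, attr.max_qp);

        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}
```

On a VM with an SR-IOV virtual function assigned, the Mellanox VF should show up in this list just as a physical HCA would in a passthrough configuration.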


Application Performance


Beyond micro-benchmarks and several well-known higher-level benchmarks (HPCC, LINPACK, NPB), we have tested a few full applications used in the Life Sciences and elsewhere. In particular, we’ve evaluated NAMD, LAMMPS, and NWChem, and seen generally good results. As a teaser, here are some NAMD test results that illustrate how well this molecular dynamics code runs on our test cluster:



NAMD performance using 1 to 16 MPI processes per VM and one VM per host on ESX 5.5u1



Intel Xeon Phi


Using a test system supplied by Intel, we’ve run Intel Xeon Phi performance tests with an engineering build of ESX and VM Direct Path I/O. We’ve seen almost identical performance relative to bare metal, as shown in the graph below, which compares virtual and bare-metal performance using two different Intel programming models (pragma and native). While we will cover Xeon Phi in more detail in a subsequent post, it should be noted that we used an engineering build of ESX because Xeon Phi is not usable with the shipping versions of ESX 5.5 due to PCI limitations. So, as they say, don’t try this test at home (yet).



Double- and single-precision GEMM results using Intel Xeon Phi in passthrough mode with a prototype ESX build. Virtual and bare-metal performance is close to identical for both the native and pragma programming models
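
To make the two programming models concrete: in the native model the whole program runs on the coprocessor, while in the pragma (offload) model the host directs selected regions of code to the card. The sketch below is a hypothetical example of the offload style, pairing Intel's offload pragma with an MKL DGEMM call; it is not the benchmark behind the results above, and the matrix size is arbitrary.

```c
/* Hypothetical sketch of the "pragma" (offload) programming model for
 * Xeon Phi: the host allocates the matrices and offloads a DGEMM to the
 * coprocessor. Requires the Intel compiler with offload support and MKL;
 * this is not the benchmark used for the results shown above. */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    const int n = 4096;                      /* arbitrary matrix size */
    double *a = malloc(sizeof(double) * n * n);
    double *b = malloc(sizeof(double) * n * n);
    double *c = malloc(sizeof(double) * n * n);

    for (long i = 0; i < (long)n * n; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
        c[i] = 0.0;
    }

    /* Offload the matrix multiply to the first coprocessor. */
    #pragma offload target(mic:0) in(a, b : length(n*n)) inout(c : length(n*n))
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    }

    printf("c[0] = %f\n", c[0]);             /* expect 2.0 * n */
    free(a); free(b); free(c);
    return 0;
}
```

In the native model, by contrast, essentially the same DGEMM code would be compiled for and run entirely on the coprocessor, with no offload pragma involved.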



HPC People


I’m very pleased to announce that Andrew Nelson joined our HPC effort recently. Andy has broad and deep expertise with VMware’s products from his previous role as an SE in our field organization. For the past four years he has worked on initiatives related to distributed systems, networking, security, compliance, and HPC. Andy is now focused on prototyping the ability to self-provision virtual HPC clusters within a private cloud environment, much the way Big Data Extensions (BDE) and Serengeti now support provisioning of Hadoop clusters. Andy will join me in blogging about HPC, so expect to learn a lot more about this project and other HPC initiatives from him.


Matt Herreras, who in his day job is a Senior Systems Engineering Manager, is another key member of the VMware HPC team. Matt plays a critical role in bringing our field organization and Office of the CTO together to allow us to team effectively to address the rapidly growing interest we are seeing in HPC from our customers. Matt pioneered the ideas of Scientific Agility and Science as a Service at VMware and has been invaluable in helping to move our overall HPC effort forward.


Bhavesh Davda, my long-time collaborator and colleague in the Office of the CTO, leads our Telco effort and also continues to lend his technical expertise to the HPC program in a number of important areas, most notably RDMA, InfiniBand, RoCE, low latency, and jitter reduction. His deep platform experience and mentoring have been and continue to be crucial to the success of our HPC effort.


Conferences


Matt and I spoke at both VMworld USA and VMworld Europe this year, giving a talk titled How to Engage with your Engineering, Science, and Research Groups about Virtualization and Cloud Computing. As the title suggests, the presentation was aimed at helping our primary constituency — IT management and staff — talk with their colleagues who are responsible for running HPC workloads within their organization about the benefits of virtualization and cloud computing for those environments. For those with access to VMworld content, this link should take you directly to the audio recording of our presentation.


This year Andy, Matt, and I will all be at SC’14 in New Orleans. We will have a demo station in the EMC booth where we will be showing a prototype of our approach to self-provisioning of virtual HPC clusters in a vRA private cloud as well as demonstrating use of SR-IOV to connect multiple VMs to a remote file system via InfiniBand. If you will be attending SC and want to meet, please stop by the booth or send me a note at simons at vmware.com.





