PGAS and Hybrid MPI+PGAS Programming Models on Modern HPC clusters with Accelerators
Multi-core processors, accelerators (GPGPUs), co-processors (Xeon Phis) and high-performance interconnects (Infini- Band, 10-40 GigE/iWARP and RoCE) with RDMA support are shaping the architectures for next generation clusters. Efficient programming models to design applications on these clusters as well as on future exascale systems are still evolving. The new MPI-3 standard brings enhancements to Remote Memory Access Model (RMA) as well as introduce non-blocking collectives. Partitioned Global Address Space (PGAS) Models provide an attractive alternative to the MPI model owing to their easy to use global shared memory abstractions and light-weight one-sided communication. At the same time, Hybrid MPI+PGAS programming models are gaining attention as a possible solution to programming exascale systems. These hybrid models help the transition of codes designed using MPI to take advantage of PGAS models without paying the prohibitive cost of re-designing complete applications. They also enable hierarchical design of applications using the different models to match modern architectures.
In this tutorial, we provide an overview of the research and development taking place along the programming models (MPI, PGAS, and Hybrid MPI+PGAS) and discuss associated opportunities and challenges in designing the associated runtimes as we head toward exascale computing with accelerator-based systems. We start with an in-depth overview of modern system architectures with multi-core processors, GPU accelerators, Xeon Phi co-processors and high-performance interconnects. We present an overview of the new MPI-3 RMA model, language based (UPC and CAF) and library based (OpenSHMEM) PGAS models. We introduce MPI+PGAS hybrid programming models and the associated unified runtime concept. We examine and contrast different challenges in designing high-performance MPI-3 compliant, OpenSHMEM and hybrid MPI+OpenSHMEM runtimes for both host-based and accelerator (GPU- and MIC-) based systems. We present case-studies using application kernels, to demonstrate how one can exploit hybrid MPI+PGAS programming models to achieve better performance without rewriting the complete code. Using the publicly available MVAPICH2-X, MVAPICH2-GDR and MVAPICH-MIC libraries, we present the challenges and opportunities to design efficient MPI, PGAS and hybrid MPI+PGAS runtimes for next generation systems. We introduce the concept of ‘CUDA-Aware MPI/PGAS’ to combine high productivity and high performance. We present how to take advantage of GPU features such as Unified Virtual Address, CUDA-IPC and GPUDirect RDMA technologies to design efficient MPI, OpenSHMEM, and Hybrid MPI+OpenSHMEM runtimes. Similarly, using MVAPICH2-MIC runtime, we expose optimized data movement schemes for different system configurations including multiple MICs per-node in the same socket and/or different sockets configurations.
More details are available from: http://web.cse.ohio-state.edu/~panda/ppopp17_hybrid_tutorial.html