Mathematical modeling plays a key role in modern astrophysics as the universal tool for studying non-linear evolutionary processes in the universe. Modeling complex astrophysical processes at high resolution requires the most powerful supercomputers. The AstroPhi project at Novosibirsk State University (NSU) develops astrophysical code for massively parallel supercomputers built on Intel Xeon Phi processors. The project helps students learn to create numerical simulation code for massively parallel supercomputers and introduces them to modern HPC hardware architectures, preparing them to develop tomorrow's exascale supercomputers.
The team designed the project using a numerical method shown in the figure below. The benefits of this high-order method included:
- The absence of artificial viscosity
- Galilean-invariant solution
- Entropy non-decrease guarantee
- Simple parallelization
- Potentially “infinite” scalability (weak scalability)
The first three benefits are key to realistic modeling of all the significant physical effects in astrophysical problems. The simplicity of the method, together with the small number of MPI send/receive operations, enables efficient parallelization and potentially "infinite" scalability in the weak-scaling sense.
Massively Parallel Architecture
The team co-designed the new solver for massively parallel architecture based on Intel Xeon Phi processors. Designed to help eliminate node bottlenecks and simplify code modernization, the bootable processors provided the power efficiency the team needed to handle the most demanding high-performance computing applications.
The team based the solver on Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions, which deliver 512-bit SIMD support and enable programs to pack eight double-precision or 16 single-precision floating-point numbers, or eight 64-bit integers, or 16 32-bit integers within the 512-bit vectors. This enables processing of 2X the number of data elements that AVX/AVX2 can process with a single instruction, and 4X that of SSE.
"The use of Intel Advanced Vector Extensions 512 for Intel Xeon Phi processors gave us the maximum code performance compared with other architectures available on the market," said Igor Kulikov, assistant professor at NSU.
Optimizing the Code
A key aspect of the AstroPhi project was optimizing the code for maximum performance on the Intel Xeon Phi processors. Before optimization, the team had some problems with vector dependencies and vector sizes. The goals for optimizing the code were to remove vector dependencies and optimize memory load operations, efficiently adapting vector and array sizes for the Intel Xeon Phi architecture. The team used Intel Advisor and Intel Trace Analyzer and Collector, two tools that are part of Intel® Parallel Studio XE, for the optimization.
Intel Parallel Studio XE is a comprehensive software development suite that helps developers maximize application performance on current and future processors by taking advantage of the ever-increasing processor core count and vector register width.
Intel Advisor is a software tool built around a key fact about modern processors: realizing their full performance potential requires both vectorizing (using AVX* or SIMD* instructions) and threading the software. Using this tool, the team performed a roofline analysis that highlighted poor-performing loops and showed the performance headroom for each loop, identifying which could be improved and which were worth improving.
"Intel Advisor made it easier to find the cause of bottlenecks and decide on next optimization steps," explained Igor Chernykh, assistant professor at NSU. "It provided data to help us forecast the performance gain before we invested significant effort in implementation."
Intel Advisor sorted loops by potential gain, made compiler reports easier to read by showing messages on the source, and gave the project team tips for effective vectorization. It also provided key data, such as trip counts, data dependencies, and memory access patterns, that make vectorization safe and efficient.
Intel Trace Analyzer and Collector was another help in optimizing the code. This graphical tool helped the team understand MPI application behavior, quickly find bottlenecks, improve correctness, and, ultimately, maximize the application's performance on Intel® architecture. It includes MPI communications profiling and analysis features that helped improve both weak and strong scaling.
After all the improvements and optimizations, the team achieved 190 GFLOPS performance and 0.3 FLOP/byte arithmetic intensity, with 100 percent mask utilization and 573 GB/s memory bandwidth.
"Using Intel Advisor and Intel Trace Analyzer and Collector, we were able to remove vector dependencies, optimize load operations, and adapt vector and array sizes for the Intel Xeon Phi architecture," explained Kulikov. "This optimization gave us the opportunity to run 3X more variants of astrophysical tests."
Download your free 30-day trial of Intel® Parallel Studio XE