The objective of this WP is to evaluate the software resulting from WPs 3-4-5-6-7 with regard to its performance and scaling on the different HPC architectures. Continuous scaling experiments on different platforms, with feedback to domain experts, will be performed throughout the duration of the WP. Moreover, this WP aims to provide feedback to WPs 3-4-5-6-7 in order to tune and optimize the behaviour of the resulting software packages (pre-process, solvers, post-process).
Traditional parallel application analysis and optimization tools have focused on conventional objects: MPI, OpenMP or pthreads where parallel programming paradigms are concerned, and Fortran, C, C++… where compilers are under study. On the hardware side, the typical variables were CPU FLOPS, disk I/O operations, bandwidth between CPU and memory, and the latency of cluster networks. Future exascale supercomputers will not be built just by adding more nodes with many cores, but by combining traditional CPUs with accelerators or other types of architecture; these new heterogeneous architectures, with massive numbers of cores, raise new challenges that must be addressed. In addition, new programming models (PGAS, OpenACC, OpenCL, CUDA…) are being developed, and evolving standards (MPI, OpenMP) with new features will have to be studied. Finally, these changes also entail new performance problems, such as load imbalance between CPU and GPU, inefficient data transfer between host and accelerator, and problems related to the massive number of cores and the volume of data involved.
The new computing environment requires a new family of tools for the benchmarking and performance analysis of the software resulting from the previous WPs.
This task will evaluate the most up-to-date analysis and performance tools, and will select and install those most appropriate for the architectures in use, in order to obtain performance information on the software released by the other WPs.
A large collection of performance tools is available. Some are commercial, but many others are distributed under free or GPL licenses. Vendor-specific tools (such as CrayPat or SGI Performance Suite) can often go deeper than vendor-neutral tools, but portable approaches allow results to be compared across architectures.
A summary list of the tools that will be evaluated includes TAU, HPCToolkit, PerfSuite and UNITE (suites of integrated tools for parallel performance analysis), KCachegrind/Cachegrind (profiling, cache analysis and visualization), Marmot (tool for checking MPI usage at runtime), MUST (MPI runtime error detection), PAPI (portable performance counter utilities), Vampir/VampirTrace (event tracing tools), Paraver/Extrae (trace generation and visualization), MAQAO (binary analyzer), Score-P (profiling and analysis), and Scalasca (automatic performance analysis).
Task leader: LUH-HLRN. Partners involved: CIMNE, CESCA, LUH-HLRN
The aim of this task is to perform an exhaustive study of the performance of the software obtained in WPs 3-7, taking into account its scaling behaviour. For this purpose, selected profiling and/or tracing methods will be used to collect the information, and analysis tools will be used to evaluate the results.
The benchmarking will be performed and presented considering aspects such as CPU and memory usage (FLOP rate and memory bandwidth measurements), overhead and scalability (for OpenMP), and wall time spent in communication, load imbalance detection and message size analysis (for MPI), among others.
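As an illustration of how the scaling behaviour mentioned above is commonly summarised, the following minimal sketch computes strong-scaling speedup and parallel efficiency from measured wall times. The function name `strong_scaling` and the sample timings are hypothetical and are not taken from the project; the smallest measured core count is assumed to serve as the baseline.

```python
def strong_scaling(timings):
    """Return (cores, speedup, efficiency) rows for a strong-scaling run.

    timings: dict mapping core count -> wall time in seconds for the
    same fixed problem size. The smallest core count is the baseline.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        # Speedup relative to the baseline run.
        speedup = base_time / timings[cores]
        # Efficiency: achieved speedup divided by the ideal speedup.
        efficiency = speedup * base_cores / cores
        rows.append((cores, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Hypothetical wall times (seconds), for illustration only.
    sample = {1: 100.0, 2: 50.0, 4: 40.0}
    for cores, speedup, efficiency in strong_scaling(sample):
        print(f"{cores:>4} cores  speedup {speedup:.2f}  efficiency {efficiency:.2f}")
```

An efficiency well below 1 at higher core counts would indicate that communication, synchronisation or serial sections are starting to dominate, which is the kind of finding the tracing tools are then used to explain.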
Post-processing algorithms, including description of commonality developments.
Task leader: CESCA. Partners involved: CIMNE, CESCA, LUH-IKM, LUH-HLRN, NTUA
Once the performance of the software running on the selected architectures has been analysed and benchmarked, this task will be in charge of studying the results. The aim of the task will be to find the performance problems, including those well known in the tera-world (disk performance, I/O bottlenecks, buffer overhead, network performance…) but also those appearing in the peta-exa era, such as load imbalance within the heterogeneous architecture (CPU/GPU…) or inefficient data transfer between host and accelerator.
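One common way to quantify the load imbalance mentioned above is the ratio of the busiest worker's time to the mean time, minus one. The sketch below assumes that per-worker busy times (for MPI ranks, threads or CPU/GPU devices) have already been extracted from a profile; the function name `load_imbalance` and the sample values are hypothetical.

```python
def load_imbalance(busy_times):
    """Return the load imbalance factor max/mean - 1.

    busy_times: per-worker busy times in seconds (e.g. one value per
    MPI rank or per device). 0.0 means perfectly balanced; 0.25 means
    the slowest worker takes 25% longer than the average worker.
    """
    if not busy_times:
        raise ValueError("busy_times must not be empty")
    mean = sum(busy_times) / len(busy_times)
    if mean <= 0:
        raise ValueError("mean busy time must be positive")
    return max(busy_times) / mean - 1.0


if __name__ == "__main__":
    # Hypothetical busy times (seconds) for a CPU worker and a GPU worker.
    print(f"imbalance: {load_imbalance([8.0, 12.0]):.2f}")
```

Because the whole run waits for the slowest worker, this factor gives a rough upper bound on the time that could be recovered by redistributing work.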
The next step will be to perform the final trials of the study and to give advice on how to optimize hardware-specific parameters. The final goal is to improve and tune the software of the WPs for best performance on the various architectures where it could be used.
Finally, this task will also develop a performance monitoring assessment for the software in production.
Task leader: CESCA. Partners involved: All (CIMNE, CESCA, LUH-IKM, LUH-HLRN, NTUA, QUANTECH)
Lead beneficiary: LUH
Lead beneficiary: CESCA
Lead beneficiary: CESCA
The research leading to these results has received funding from the European Community's Seventh Framework Programme under grant agreement n° 611636