The development of parallel systems on heterogeneous clusters, especially FPGA-populated ones, is still an active area of research. Emerging FPGA-based HPC clusters often lack support in parallel programming models.
Thus, in the scope of the OPTIMA project, the programming environment has been carefully selected to use industry-proven software capable of supporting both efficient programming of FPGAs and the intercommunication between them. As a result, within OPTIMA we will utilize and further extend CARME, FPGA-ML, PARALLELWARE, GASPI/GPI-2, and MaxJ. All of these programming paradigms, except MaxJ, have been developed within EC-funded projects, and each has unique characteristics, as described in detail in the next sections.
- CARME is a multi-user software stack for deep learning on HPC clusters. At its current stage it mainly supports GPU clusters and provides customized containers for machine learning (e.g. for TensorFlow, PyTorch, scikit-learn). Currently, CARME is mainly used by the deep learning group at Fraunhofer ITWM and in external training courses held on our internal GPU cluster.
- FPGA-ML is being developed and tested based on the requirements of mainly two HPC applications: an oil reservoir modelling application and a city traffic analysis application. Based on the overall requirements analysis of ECOSCALE and the evaluation of the final prototype, certain parts of the programming environment and the runtime (e.g. the support for deep learning schemes and the MPI-based intercommunication mechanism) have room for improvement.
- PARALLELWARE ANALYZER is the first static code analyzer specializing in performance. The Parallelware Artificial Intelligence (AI) engine leverages the expertise of senior performance optimization engineers who have carried out such optimizations manually for decades. It provides actionable insights through performance optimization reports that help ensure best practices are followed when speeding up code through parallel computing on modern heterogeneous multicore chips. At the current stage, it mainly supports the C, C++ and Fortran programming languages and the OpenMP and OpenACC parallel programming APIs, and it targets vectorization as well as multicore CPU and GPU hardware platforms.
- GASPI/GPI-2 is industry-proven code. In the scope of the ExaNoDe and EuroExa projects, an Ethernet version providing host-triggered FPGA communication has been deployed. This version is based on TCP sockets. Communication relies on GASPI segments, which are allocated and registered in memory regions that the DMA controller can access. The data offload to the FPGA is then handled by the DMA controller. In the scope of the EPiGRAM-HS project we are investigating methods to offload with OpenCL directives.
- MaxJ is an industry-proven solution that has been used commercially both by Maxeler internally and by its customers. Furthermore, it is used by a large community of academic users (see https://www.maxeler.com/solutions/universities/). MaxJ has been successfully used to accelerate applications from a range of domains, including seismic modelling, finance, machine learning, genomics, and various scientific applications with complex code bases.
An application in MaxJ is a combination of one or more kernels and a manager. The kernels express the computation, and the manager describes the connections between the kernels and external interfaces such as PCIe, DRAM or networking. The concept behind MaxJ is to provide the right level of abstraction for productive development and high performance. MaxJ hides the tedious and time-consuming aspects found in conventional hardware description languages while retaining enough control over the key aspects of the application so that the developer can reason about computation and I/O and push the application towards maximum performance. For example, a developer working in MaxJ is not concerned with low-level details and optimisations of the dataflow graph, such as the scheduling of operations, but does control pipelining, parallelism, and the balance between computation and communication. MaxJ designs are fully predictable in terms of performance, and developers typically analyse, plan and optimise a design before implementing it. It has been shown that this model can result in performance equivalent to a hand-optimised low-level VHDL implementation while requiring less development time than a high-level synthesis approach.
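The separation between kernels and manager can be illustrated with a small toy model. The sketch below is plain Python and does not use the MaxJ or MaxCompiler API; the names `scale_kernel`, `window_sum_kernel`, `Manager`, `host_in` and `host_out` are hypothetical and chosen only for illustration. It assumes that each kernel consumes one input stream and produces one output stream, and that connections are declared in the order in which data flows.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Stream = Iterator[float]
KernelFn = Callable[[Iterable[float]], Stream]


def scale_kernel(x: Iterable[float], factor: float = 2.0) -> Stream:
    """Computation only: multiply every stream element by a constant."""
    for value in x:
        yield value * factor


def window_sum_kernel(x: Iterable[float]) -> Stream:
    """Computation only: add each element to its predecessor (initially 0)."""
    previous = 0.0
    for value in x:
        yield value + previous
        previous = value


class Manager:
    """Connectivity only: records which kernel reads from which named stream."""

    def __init__(self) -> None:
        self._links: List[Tuple[str, KernelFn, str]] = []

    def connect(self, source: str, kernel: KernelFn, sink: str) -> None:
        # Links must be added in dataflow order (a simplifying assumption).
        self._links.append((source, kernel, sink))

    def run(self, inputs: Dict[str, Iterable[float]], output: str) -> List[float]:
        streams: Dict[str, Iterable[float]] = dict(inputs)
        for source, kernel, sink in self._links:
            streams[sink] = kernel(streams[source])
        # Pulling from the output stream drives the whole chain lazily.
        return list(streams[output])


manager = Manager()
manager.connect("host_in", scale_kernel, "scaled")
manager.connect("scaled", window_sum_kernel, "host_out")
result = manager.run({"host_in": [1.0, 2.0, 3.0]}, "host_out")
print(result)
```

In this model the kernels contain no knowledge of where their data comes from, so the same kernel could be rewired to a different source or sink by changing only the manager. This mirrors the division of responsibilities described above, not the actual tool flow.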
The MaxCompiler tools for MaxJ provide extensive functionality for application analysis, optimisation, simulation and debugging. One of the key challenges in developing FPGA accelerators is the extremely long place-and-route time of the FPGA vendor backend, which can exceed 24 hours for large FPGA devices. To avoid unnecessary place-and-route runs, MaxCompiler allows MaxJ code to be simulated and debugged without placing and routing it. Furthermore, the tools assist with powerful but complex optimisation procedures such as the use of custom number formats.
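The kind of question that arises when choosing a custom number format can be explored in software before any hardware build. The following is a minimal sketch in plain Python, not part of MaxCompiler; the helper names `to_fixed` and `max_abs_error` are hypothetical. It assumes a signed fixed-point format with round-to-nearest and saturation, where `int_bits` includes the sign bit.

```python
from typing import Iterable


def to_fixed(value: float, int_bits: int, frac_bits: int) -> float:
    """Quantise value to a signed fixed-point grid, saturating at the range limits.

    int_bits includes the sign bit; frac_bits is the number of fractional bits.
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    max_code = (1 << (total_bits - 1)) - 1
    min_code = -(1 << (total_bits - 1))
    code = round(value * scale)  # Python rounds exact ties to the even integer
    code = max(min_code, min(max_code, code))
    return code / scale


def max_abs_error(samples: Iterable[float], int_bits: int, frac_bits: int) -> float:
    """Largest absolute quantisation error over the given samples."""
    return max(abs(s - to_fixed(s, int_bits, frac_bits)) for s in samples)


# Sample values in [-3, 3], well inside the range of a format with 4 integer bits.
samples = [i / 100.0 for i in range(-300, 301)]
errors = {fb: max_abs_error(samples, 4, fb) for fb in (4, 8, 12)}
for frac_bits, error in errors.items():
    print(f"{frac_bits} fractional bits: max error {error:.6f}")
```

For values inside the representable range, the error is bounded by half of one quantisation step, so every additional fractional bit halves the worst-case error at the cost of a wider datapath. A sweep of this kind helps narrow down candidate formats before a simulation or place-and-route run.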
Another key benefit of MaxJ is that it is completely FPGA vendor and device agnostic. MaxJ code is not bound to any particular FPGA vendor tool chain or FPGA device architecture, and therefore provides a simple route for moving applications between accelerators that use different FPGA devices. For example, older-generation MAX4 DFEs are based on Intel Stratix-V devices, and the MaxJ code developed for these devices can be recompiled for the new MAX5 DFEs, which use Xilinx UltraScale+ FPGAs. MaxCompiler also supports further Xilinx UltraScale+ based third-party platforms from Amazon Web Services and Xilinx. This is a significant benefit, since a common obstacle to the wider adoption of FPGA accelerators is the assumption that code development and optimisation are locked to one particular target device and that portability to other or newer devices is limited or even impossible. This can be avoided with device-agnostic code and tools.
Progress beyond the State-of-the-Art
CARME will be tested for its full support of FPGAs. Within OPTIMA we will also develop a library of FPGA kernels for deep learning and mathematical operations that can be run with CARME, thus extending CARME beyond the scope of a pure machine learning software stack. We propose that the frontend of this new library be written in Python to provide an easy-to-use interface that is widely accepted in the machine learning community. FPGA-ML will be improved by being combined with the other components of the OPTIMA programming environment. In particular, by replacing the CPU-demanding MPI intercommunication in FPGA-ML with the much lighter GASPI/GPI-2, we expect to increase the performance of the applications running on top of it. Moreover, CARME will allow FPGA-ML to properly support machine learning and deep learning applications, while PARALLELWARE will provide the means for better task distribution and debugging. PARALLELWARE will be extended to fully support FPGAs, in addition to its existing support for multicore CPUs and GPUs.
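The communication style that GASPI offers in place of two-sided MPI messaging is one-sided: the initiator writes directly into a registered segment of the target and then raises a notification, and the target only checks that notification. The toy model below illustrates this pattern in plain Python within a single process. It is not the GPI-2 API; the class `Segment` and the functions `write_notify` and `notify_test_and_reset` are hypothetical stand-ins loosely modelled on the GASPI operations for notified writes and notification handling.

```python
from typing import Dict


class Segment:
    """Toy stand-in for a registered memory segment with a notification table."""

    def __init__(self, size: int) -> None:
        self.memory = bytearray(size)
        self.notifications: Dict[int, int] = {}


def write_notify(local: Segment, local_offset: int,
                 remote: Segment, remote_offset: int, size: int,
                 notification_id: int, notification_value: int) -> None:
    """One-sided write into the remote segment, followed by a notification.

    The remote side executes no matching receive call.
    """
    if notification_value == 0:
        raise ValueError("notification value must be non-zero")
    if min(local_offset, remote_offset, size) < 0:
        raise ValueError("offsets and size must be non-negative")
    if (local_offset + size > len(local.memory)
            or remote_offset + size > len(remote.memory)):
        raise ValueError("transfer exceeds segment bounds")
    remote.memory[remote_offset:remote_offset + size] = \
        local.memory[local_offset:local_offset + size]
    # The notification becomes visible only after the data has been written.
    remote.notifications[notification_id] = notification_value


def notify_test_and_reset(segment: Segment, notification_id: int) -> int:
    """Return the pending notification value (0 if none) and clear it."""
    return segment.notifications.pop(notification_id, 0)


sender = Segment(16)
receiver = Segment(16)
sender.memory[0:4] = b"data"
write_notify(sender, 0, receiver, 8, 4, notification_id=1, notification_value=42)

value = notify_test_and_reset(receiver, 1)
if value:
    print(bytes(receiver.memory[8:12]), value)
```

Because the target does not take part in the transfer itself, its CPU remains free for computation until it chooses to check the notification. In a real deployment the copy would be carried out by the network or DMA hardware, not by a Python slice assignment.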