In this post we describe how the Codee tool can be used to identify code patterns for Maxeler DFEs. Maxeler uses a dataflow-oriented programming model for its FPGA-based Dataflow Engines (DFEs). In a dataflow computing model, operations are naturally concurrent: data flows from memory or the host CPU into a chip where arithmetic operations are carried out by chains of functional units, statically interconnected in a topology corresponding to the implemented functionality. Data streams from one functional unit directly to the next without the need for complex control mechanisms. Data arrives just in time, when it is needed, and the final results flow back into memory. This dataflow model of computation maps well to reconfigurable devices such as FPGAs, and building the accelerator architecture around a model of minimal data movement reduces the potential for memory bottlenecks.
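To make this concrete, below is a minimal MaxJ sketch of such a pipeline, modelled on the classic moving-average example (kernel and stream names are illustrative). Each arithmetic operator becomes a functional unit on the chip, and `stream.offset` taps neighbouring elements of the stream without any explicit control logic:

```java
import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
import com.maxeler.maxcompiler.v2.kernelcompiler.KernelParameters;
import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEVar;

// Illustrative MaxJ kernel: a 3-point moving average as a dataflow pipeline.
class MovingAverageKernel extends Kernel {
    MovingAverageKernel(KernelParameters parameters) {
        super(parameters);
        DFEVar x    = io.input("x", dfeFloat(8, 24)); // data streams in from memory/CPU
        DFEVar prev = stream.offset(x, -1);           // previous element of the stream
        DFEVar next = stream.offset(x,  1);           // next element of the stream
        DFEVar y    = (prev + x + next) / 3;          // chained functional units
        io.output("y", y, dfeFloat(8, 24));           // results stream back out
    }
}
```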
Codee currently offers built-in support for identifying code patterns suitable for GPU offload: https://www.codee.com/catalog/.
As part of the OPTIMA project, MAX analysed if and how these existing patterns are applicable to Maxeler DFEs. Superficially, GPUs and FPGA-based accelerators such as Maxeler DFEs may seem similarly suited to accelerating and offloading compute-intensive tasks. There are, however, some important differences between the capabilities of GPUs and DFEs. While a detailed comparison of these very different architectures is beyond the scope of this post, we can identify some key points:
- GPUs are characterised by a massively parallel SIMD architecture which works well for workloads that can be highly parallelised or batched.
- FPGAs offer fine-grained reconfigurable logic which works well for mapping more irregular compute problems, e.g. with very different pipeline stages.
- The fine-grained FPGA architecture supports many application-specific optimisations such as custom number formats, encodings and computational styles (serial vs. parallel vs. pipelined), because custom compute pipelines can be created from the fine-grained logic (see the sketch after this list).
- The dataflow model is ideal for throughput-oriented compute models where data transfers from and to DMA buffers can be fully overlapped with the actual compute on the chip.
- The static dataflow model employed in Maxeler DFEs is fully deterministic, which can be leveraged both during development (e.g. making the implementation plannable and allowing the developer to optimise an architecture that has not been implemented yet) and at runtime (e.g. building accelerators that operate with known, fixed throughput and latency, which can be essential for networking applications).
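As an illustration of the custom number formats mentioned in the list above, a MaxJ kernel can pick an application-specific type per pipeline stage. The sketch below is ours, and the concrete bit widths are assumptions rather than recommendations:

```java
import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
import com.maxeler.maxcompiler.v2.kernelcompiler.KernelParameters;
import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEFix.SignMode;
import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEVar;

// Illustrative: per-stage custom number formats on the fine-grained FPGA fabric.
class CustomFormatKernel extends Kernel {
    CustomFormatKernel(KernelParameters parameters) {
        super(parameters);
        // 20-bit signed fixed point with 12 fraction bits instead of a full float
        DFEVar x = io.input("x", dfeFixOffset(20, -12, SignMode.TWOSCOMPLEMENT));
        // a narrow custom float (5 exponent / 11 mantissa bits) for a later stage
        DFEVar y = x.cast(dfeFloat(5, 11)) * 0.5;
        io.output("y", y, dfeFloat(5, 11));
    }
}
```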
Due to these differences, GPU coding styles may not always be appropriate for DFE accelerators. In the following, we analyse all existing GPU-oriented coding patterns in Codee. For the patterns that are applicable to DFEs, we give examples of MaxJ code implementations that would be suitable to realise the pattern on a Maxeler DFE. We also suggest an additional pattern for which support could be added in the future.
The following list gives an overview of the performance issues and coding suggestions for GPU offloading that are currently supported in Codee. All of these are based on static code analysis:
- Applying the offloading parallelism to the loop containing the forall pattern (PWR055; illustrated in the sketch after this list)
- Applying the offloading parallelism to the loop containing the sparse reduction by handling the race condition issue (PWR057)
- Offloading the scalar reduction (PWR056)
- Suggesting SIMD vectorisation directives for the offloaded region (RMK008)
- Hybrid parallelisation suggestion for nested loops (GPU offloading and vectorisation) (RMK002)
- Pragma annotation suggestions for OpenACC and OpenMP (PWR026, PWR027)
- Distributing the offloaded work by applying OpenMP teams (PWR009)
- Prevention of copying unused data to the GPU (PWR015)
- Offloading optimisation of derived data types (PWR012)
- Optimising the data transfer to the GPU (PWD005, PWD006, PWD003)
- Preventing unwanted data movements to the GPU (PWR013)
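As an example of how the first of these patterns maps to a DFE, a forall loop such as `for (int i = 0; i < n; ++i) y[i] = a * x[i];` translates almost directly into a MaxJ pipeline. The following sketch is ours (names are illustrative); widening the input to a vector type would create several parallel pipelines with SIMD-like behaviour:

```java
import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
import com.maxeler.maxcompiler.v2.kernelcompiler.KernelParameters;
import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEVar;

// Illustrative forall pattern (PWR055): one element per cycle flows through
// the pipeline; no race conditions or synchronisation are needed.
class ForallKernel extends Kernel {
    ForallKernel(KernelParameters parameters) {
        super(parameters);
        DFEVar a = io.scalarInput("a", dfeFloat(8, 24)); // loop-invariant scalar
        DFEVar x = io.input("x", dfeFloat(8, 24));       // x[i], one element per cycle
        io.output("y", a * x, dfeFloat(8, 24));          // y[i] = a * x[i]
    }
}
```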
The following table gives an overview of how these coding patterns are applicable to DFE acceleration via MaxJ, and suggests what the MaxJ pattern would cover.
Catalog ID | Description | DFE offload | Note / MaxJ suggestion
---|---|---|---
PWR055 | Applying the offloading parallelism to the loop containing the forall pattern | Directly applicable | Can be implemented as a single compute pipeline, or as multiple parallel pipelines for better performance.
PWR057 | Applying the offloading parallelism to the loop containing the sparse reduction by handling the race condition issue | Partially applicable | Can be implemented using mapped memories. This implementation introduces a loop in the design, which impacts performance.
PWR056 | Offloading the scalar reduction | Directly applicable | Can be implemented as a single reduction pipeline, or as multiple pipelines for better performance.
RMK008 | SIMD vectorisation directives for the offloaded region | Directly applicable | With FPGAs the hardware can be configured as needed; the PWR055 example shows SIMD behaviour.
RMK002 | Hybrid parallelisation suggestion for the nested loops | Directly applicable | The PWR055 example shows both offloading and SIMD behaviour.
PWR026 | Pragma annotation suggestions for OpenACC | Not applicable | SLiC is a C/C++ library, not a compiler extension.
PWR027 | Pragma annotation suggestions for OpenMP | Not applicable | SLiC is a C/C++ library, not a compiler extension.
PWR009 | Distributing the offloaded work by applying OpenMP teams | Partially applicable | Similar behaviour can be implemented in MaxJ using multiple threads and multiple kernels, but this has greater overhead than a wider input to a single kernel with multiple SIMD pipelines.
PWR015 | Prevention of copying unused data to the GPU | Directly applicable | Streaming unused data impacts performance.
PWR012 | Offloading optimisation of the derived data types | Directly applicable | Helps code analysis; otherwise the same as PWR015.
PWD005 | Optimising the data transfer to the GPU | Directly applicable | Not streaming enough data will cause MaxJ kernels to stall.
PWD006 | Optimising the data transfer to the GPU | Directly applicable | Same explanation as for GPUs.
PWD003 | Optimising the data transfer to the GPU | Directly applicable | Same explanation as for GPUs.
PWR013 | Preventing unwanted data movements to the GPU | Directly applicable | Streaming unused data impacts performance, as does setting unused mapped memories or mapped registers.
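To illustrate one of the directly applicable patterns from the table, below is a hedged MaxJ sketch of the scalar reduction (PWR056) as a single accumulation pipeline; all names are ours. An integer accumulator keeps the feedback offset at -1, whereas floating-point adders are more deeply pipelined and would typically accumulate several interleaved partial sums (the "multiple pipelines" variant noted in the table):

```java
import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
import com.maxeler.maxcompiler.v2.kernelcompiler.KernelParameters;
import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEVar;

// Illustrative scalar reduction (PWR056): a running sum carried around
// a feedback edge, with the final value emitted on the last cycle only.
class SumKernel extends Kernel {
    SumKernel(KernelParameters parameters) {
        super(parameters);
        DFEVar n = io.scalarInput("n", dfeUInt(32));       // number of elements
        DFEVar i = control.count.simpleCounter(32);        // current cycle index
        DFEVar x = io.input("x", dfeUInt(32));

        DFEVar carriedSum = dfeUInt(32).newInstance(this); // feedback edge
        DFEVar sum = (i === 0) ? x : x + carriedSum;       // reset on first cycle
        carriedSum <== stream.offset(sum, -1);             // loop the sum back

        io.output("sum", sum, dfeUInt(32), i === n - 1);   // controlled (enabled) output
    }
}
```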