These guidelines aim to help software developers start using the OOPS library quickly and run their applications on the OPTIMA HPC platform with minimal effort. It is assumed that the reader is already familiar with the concepts of software / hardware codesign, reconfigurable hardware, and the AMD Vitis development environment. Moreover, these guidelines summarize a set of tips that can help developers jump-start using the OOPS library kernels to quickly offload BLAS and other functions to the Alveo U55C FPGA card.
Supported platforms
The OOPS library supports HPC systems that host Alveo U55C FPGA cards. Its current version is built with Vitis IDE version 2022.1.
Online repository import
The Vitis IDE GUI allows past projects to be quickly imported into a workspace. This option does not require developers to go through the process of selecting hardware parts or development kits, since this information is already included in the project. To create a new workspace with the OOPS library, developers can follow these steps:
- Go to the OOPS online repository and download the latest zip-based project that is available.
- Start the Vitis IDE GUI and then go to File ⇒ Import.
- Select the zip-based project. Although the OOPS library is developed with Vitis IDE version 2022.1, newer releases should be able to import it as well.
How to use the OOPS library kernels
Software / hardware codesign requires specialized knowledge from both domains. Application developers, however, need a quick way to port legacy code to new HPC platforms and easily deploy tasks to the available hardware accelerators. For this reason, the OOPS library has been designed to require as few code changes as possible.
For example, all BLAS kernels retain the exact interface of the original BLAS functions. The only difference is the hardware kernel function name, which is the original name prefixed with “oops_”. For example, to offload the BLAS L1 dot function onto the FPGA, developers simply need to replace the “dot” call with “oops_dot”. As mentioned in D5.2, the Underground Analysis application is a solid example of easily porting software for FPGA-supported execution on the OPTIMA HPC platform, simply by renaming its BLAS L1 functions as described above.
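A minimal sketch of this rename is shown below. The prototypes are assumptions made for illustration (CBLAS-style, double-precision arguments); the exact OOPS header, precision, and argument style are defined by the library itself.

```cpp
// Hypothetical prototypes, shown only to illustrate the rename; the actual
// OOPS header, precision and argument style may differ.
extern "C" {
double cblas_ddot(int n, const double* x, int incx,
                  const double* y, int incy);   // original (CBLAS) dot call
double oops_dot(int n, const double* x, int incx,
                const double* y, int incy);     // OOPS FPGA kernel wrapper
}

double inner_product(const double* x, const double* y, int n) {
    // CPU version:  return cblas_ddot(n, x, 1, y, 1);
    // FPGA offload: only the function name changes, the argument list stays the same.
    return oops_dot(n, x, 1, y, 1);
}
```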
Moreover, the Vitis project of the OOPS library allows a quick transition for legacy-code-based applications. As shown in the above figure, each hardware kernel is wrapped by a software function that hides FPGA buffer allocation, data transfers, and OpenCL-based runtime interactions, and exposes only the function's data input / output parameters (plus configuration parameters, where applicable). In addition, each kernel is accompanied by a full “main” program that demonstrates end-to-end usage (i.e. memory allocation on the host, data transfers, and results acquisition), which can be used as a quick “jump-start” template by software developers. As such, developers can call a kernel’s main function to run the example code either in software / hardware emulation or directly on the FPGA (after, of course, running the full synthesis-translate-place-route-bitstream flow).
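The following host-side sketch illustrates the structure of such an end-to-end example; it is not the actual OOPS “main” program, and the oops_dot prototype is the same assumption as in the rename sketch above.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Assumed wrapper prototype (see the rename sketch above); the real OOPS
// example programs may use different names, precision and argument style.
extern "C" double oops_dot(int n, const double* x, int incx,
                           const double* y, int incy);

int main() {
    const int n = 1024;
    std::vector<double> x(n, 1.0), y(n, 2.0);   // host-side input data
    // (For page-aligned buffers, OOPS_malloc from the next section is preferable.)

    // The wrapper hides FPGA buffer allocation, host/card data transfers and
    // the OpenCL runtime interactions; only the BLAS-style arguments are exposed.
    const double result = oops_dot(n, x.data(), 1, y.data(), 1);

    const double expected = 2.0 * n;            // reference dot product
    std::printf("fpga = %f, expected = %f\n", result, expected);
    return std::fabs(result - expected) < 1e-9 ? 0 : 1;
}
```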
Buffer allocation
The FPGA DMA engines require that all buffers used by an application are aligned to 4 KiB page boundaries. If a buffer is not so aligned, the XRT performs an extra copy to ensure alignment, increasing the memory overhead. For this reason, the OOPS library provides the void* OOPS_malloc(size_t alloc_bytes) function; OOPS_malloc accepts the buffer size in bytes as input and returns a void pointer that is aligned to a 4 KiB boundary. Developers are therefore strongly advised to use this function to avoid any extra data copies by the XRT during application execution. The standard free(void* ptr) function can be used to free the allocated buffers.
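A short usage sketch of this allocation pattern is given below; the header that declares OOPS_malloc is an assumption here, and the kernel call is left as a placeholder.

```cpp
#include <cstddef>
#include <cstdlib>

// Declared by the OOPS library (the exact header is an assumption here).
extern "C" void* OOPS_malloc(size_t alloc_bytes);

int main() {
    const size_t n = 1 << 20;

    // 4 KiB-aligned allocations: the XRT can DMA these buffers directly,
    // without creating intermediate aligned copies.
    double* x = static_cast<double*>(OOPS_malloc(n * sizeof(double)));
    double* y = static_cast<double*>(OOPS_malloc(n * sizeof(double)));
    if (!x || !y) return 1;

    for (size_t i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    // ... call the desired oops_* kernel wrapper on x and y here ...

    free(x);   // OOPS buffers are released with the standard free()
    free(y);
    return 0;
}
```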
| debug flow | build time | logic validation | pragmas validation | execution time |
|---|---|---|---|---|
| software emulation | low | software-level | no | fast |
| hardware emulation | medium | hardware-level | yes | slow |
| hardware implementation | high | real hardware | yes | real time |
Tips on application debugging
The Vitis IDE suite allows developers to test applications using either software / hardware emulation or the actual FPGA. The table above summarizes the benefits and limitations of each debug flow. Software emulation enables testing the application functionality, validating the software / hardware algorithms, and checking FIFO-related issues (e.g. full FIFOs or leftover data after kernel completion). However, it does not take into account any hardware-related aspects, such as latency. Hardware emulation, on the other hand, is based on the QEMU platform and provides detailed timing of the application as executed on the FPGA, without requiring that the card is actually installed.
For this reason, hardware emulation requires long running times, so developers are advised to use small data sets during hardware emulation. An important benefit of hardware emulation over running the application on the FPGA is that the Vitis flow requires considerably less time to generate the final executables. For example, the hardware emulation flow can take less than 10 minutes to create all executables for a BLAS L1 kernel, whereas the hardware implementation flow can require up to 1 hour.
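Since the emulation mode is selected through the XCL_EMULATION_MODE environment variable used by the Vitis / XRT flows, a generic host-side helper (not part of OOPS) can shrink the data set automatically for emulation runs, for instance:

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Generic helper, not part of the OOPS API: use a small data set when the
// application runs under software or hardware emulation, as advised above.
static size_t choose_problem_size(size_t full_size) {
    const char* mode = std::getenv("XCL_EMULATION_MODE");   // set for sw_emu / hw_emu runs
    if (mode && (std::strcmp(mode, "sw_emu") == 0 || std::strcmp(mode, "hw_emu") == 0))
        return 4096;        // small vectors keep hardware emulation runs short
    return full_size;       // real FPGA run: use the full data set
}
```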
This shorter build time is crucial for hardware debugging, which is not available during software emulation. For example, the dataflow directive has no effect during software emulation; hardware emulation, on the other hand, will reveal hardware-related issues both during its build flow (e.g. a dataflow directive is not correctly applied and Vitis HLS cannot generate hardware) and during application emulation (e.g. FIFOs getting full and / or empty due to the dataflow hardware structure, revealing imbalances between HLS functions). As such, to reduce compilation time overheads, developers are encouraged to go through software / hardware emulation before running the application on the FPGA.
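The following generic Vitis HLS fragment (illustrative only, not OOPS kernel code) shows the kind of structure involved: under the dataflow directive the functions communicate through FIFOs, and a rate mismatch between producer and consumer only becomes visible once the FIFO depth is actually modeled, i.e. in hardware emulation.

```cpp
#include <hls_stream.h>

// Generic producer / consumer pair communicating through a FIFO; the function
// names and the FIFO depth are illustrative assumptions, not OOPS sources.
static void produce(const float* in, hls::stream<float>& s, int n) {
    for (int i = 0; i < n; ++i)
        s.write(in[i]);
}

static void consume(hls::stream<float>& s, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = s.read() * 2.0f;
}

extern "C" void dataflow_example(const float* in, float* out, int n) {
#pragma HLS DATAFLOW
    hls::stream<float> fifo("fifo");
#pragma HLS STREAM variable=fifo depth=64
    // If produce() outpaces consume(), the FIFO fills up and the producer
    // stalls; hardware emulation exposes such full / empty conditions, whereas
    // software emulation simply runs the two functions one after the other.
    produce(in, fifo, n);
    consume(fifo, out, n);
}
```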
Enabling multiple CUs
All OOPS kernels can utilize multiple CUs to improve performance. The BLAS L1, L3 and Jacobi preconditioner kernels can be configured through the software API with respect to the number of CUs to be used. To change the number of CUs of the BLAS L2 / L3 and LU decomposition kernels, developers need to instantiate or remove the corresponding high-level functions within the HLS kernel code. The reason these compute-demanding kernels cannot be configured in terms of CUs from the software API is that “hardwired” CUs exhibit lower invocation latency than multiple CUs enabled through the OpenCL runtime.
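What “instantiating the corresponding high-level functions” can look like is sketched below; the function names, the partitioning scheme, and the interface pragmas are assumptions made for illustration and are not taken from the OOPS sources.

```cpp
// Illustrative HLS kernel: each call to scale_lane() becomes one "hardwired"
// compute instance, so adding or removing calls changes the CU count at build
// time. Names, partitioning and pragmas are assumptions, not OOPS code.
static void scale_lane(const float* in, float* out, int n, float alpha) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = alpha * in[i];
    }
}

extern "C" void scal_kernel(const float* in0, const float* in1,
                            float* out0, float* out1, int n, float alpha) {
#pragma HLS INTERFACE m_axi port=in0  bundle=gmem0
#pragma HLS INTERFACE m_axi port=in1  bundle=gmem1
#pragma HLS INTERFACE m_axi port=out0 bundle=gmem2
#pragma HLS INTERFACE m_axi port=out1 bundle=gmem3
#pragma HLS DATAFLOW
    const int half = n / 2;
    const int rest = n - half;
    // Two instances of the datapath, each working on its own half of the
    // vector through its own memory interface (e.g. separate HBM channels).
    scale_lane(in0, out0, half, alpha);
    scale_lane(in1, out1, rest, alpha);
}
```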
Overall, the ability to configure the number of CUs allows developers to balance between area utilization and performance, which is very useful when many functions are to be offloaded to hardware kernels. For example, if an application is based on many different vector-vector operations, developers can instantiate most (if not all) kernels on the FPGA, using a different number of CUs for each function based on its workload requirements.
To enable multiple CUs from the Vitis IDE GUI for the BLAS L1 and Jacobi preconditioner kernels, developers can go to the Assistant view ⇒ expand oops_library_system_hw_link ⇒ Hardware ⇒ and double-click on binary_container_1. From there, they can configure the number of CUs, along with the HBM channels to be used by each of them. As a final step, developers need to set the corresponding software API parameter that designates the number of CUs to be used during application execution.