OPTIMA’s main goal is to prove that several HPC applications can take advantage of future, highly heterogeneous, FPGA-populated HPC systems and that, by using the newly introduced tools and runtimes, application porting and development can be almost as simple as developing software for conventional HPC systems incorporating GPUs. In a nutshell, OPTIMA aims to:
- develop optimized versions of applications and open-source libraries to be executed on FPGA-based HPC systems at a significantly higher performance-to-energy ratio, and/or producing more accurate results, than existing HPC systems, including those consisting of low-power CPUs (e.g. ARM) and/or GPUs, and
- provide guidelines and reference open-source designs so that third parties can port applications to FPGA-based heterogeneous platforms in a time similar to that needed to port an HPC application to systems utilizing GPUs and/or many-core processors.

A large set of scientific and industrial applications is based on vector operations, linear/differential equations, and matrix multiplications. Consequently, to enhance performance, the project will provide the Optima OPen Source (OOPS) library: an optimized set of software routines that industrial/scientific software and applications can use to take advantage of the OPTIMA hardware platforms. OOPS will drastically reduce the effort of mapping primitive computation kernels onto the reconfigurable logic integrated in the OPTIMA hardware platforms, and it will improve the execution time and energy efficiency of the mapped computations. OOPS will be integrated into the OPTIMA toolflow and programming environment, enabling software developers to seamlessly utilize the available hardware resources without requiring advanced skills or extensive experience in hardware development.
Figure 1 illustrates how the OOPS library will be integrated into the OPTIMA toolflow. OOPS will expose an API in the form of function prototypes towards the application layer and target the OPTIMA hardware platforms. Developers will be able to utilize OOPS kernels by including a small set of files in the application source code. On the other hand, OOPS kernels will leverage the device vendor's runtime layer to (see the sketch after this list):
- transfer data from the host processor to hardware kernels,
- initiate and monitor data processing, and
- send output results back to the application layer.
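The three steps above map naturally onto an OpenCL-style vendor runtime. The following is a minimal, hypothetical host-side sketch of that flow; the function name oops_axpy and its signature are illustrative placeholders, not the published OOPS API:

```c
/* Hypothetical host-side flow of an OOPS kernel call, assuming an
 * OpenCL-style vendor runtime layer. oops_axpy and its signature are
 * illustrative placeholders, not the published OOPS API. */
#include <CL/cl.h>
#include <stddef.h>

int oops_axpy(cl_context ctx, cl_command_queue q, cl_kernel krn,
              size_t n, const float *x, float *y)
{
    cl_int err = CL_SUCCESS;
    size_t bytes = n * sizeof(float);
    cl_uint n_arg = (cl_uint)n;

    /* 1. Transfer data from the host processor to the hardware kernel. */
    cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               bytes, (void *)x, &err);
    cl_mem dy = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               bytes, y, &err);

    /* 2. Initiate and monitor data processing. */
    clSetKernelArg(krn, 0, sizeof(cl_mem), &dx);
    clSetKernelArg(krn, 1, sizeof(cl_mem), &dy);
    clSetKernelArg(krn, 2, sizeof(cl_uint), &n_arg);
    size_t gsz = n;
    clEnqueueNDRangeKernel(q, krn, 1, NULL, &gsz, NULL, 0, NULL, NULL);

    /* 3. Send output results back to the application layer. */
    err = clEnqueueReadBuffer(q, dy, CL_TRUE, 0, bytes, y, 0, NULL, NULL);

    clReleaseMemObject(dx);
    clReleaseMemObject(dy);
    return err == CL_SUCCESS ? 0 : -1;
}
```

An application developer would never see this plumbing; OOPS hides it behind the C prototypes described below.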
The OOPS library set will implement a large subset of the BLAS Level 1, 2, and 3 subroutines [2], a sparse matrix-vector multiplication (SpMV) kernel, and a subset of the PETSc [1] suite that supports the Jacobi preconditioner, LU factorization, and the Krylov-subspace Conjugate Gradient (CG) algorithm.
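To make the functional contract of one of these kernels concrete, the following plain-C routine gives the reference semantics of SpMV (y = A·x) for a matrix stored in Compressed Sparse Row (CSR) form; the hardware kernel would reproduce exactly this behavior, though the signature shown is illustrative rather than the final OOPS prototype:

```c
/* Reference semantics of the SpMV kernel: y = A*x, with A held in
 * Compressed Sparse Row (CSR) form. The signature is illustrative,
 * not the final OOPS prototype. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const float *val, const float *x, float *y)
{
    for (int i = 0; i < n_rows; ++i) {
        float acc = 0.0f;
        /* Accumulate the nonzero entries of row i. */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            acc += val[k] * x[col_idx[k]];
        y[i] = acc;
    }
}
```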
The OOPS library will expose a C-based API to users for easy integration with existing development environments. A standard C interface serves the key goal of releasing an open-source library set that can be easily integrated or combined with existing frameworks (e.g. Parallelware, GASPI) from varying application domains. Other widely used programming languages (e.g. Python) will also be evaluated in order to further broaden the availability of the OOPS library. As such, application developers that use the OPTIMA framework will be able to integrate the OOPS library by simply including its API header file in their software code, as the sketch below illustrates. The full OOPS API will be available in the OPTIMA code repository.
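The following sketch shows what such an integration could look like from the application side. The header and function names (oops.h, oops_init, oops_saxpy, oops_finalize) are hypothetical stand-ins for whatever the released API will define:

```c
/* Hypothetical application-side usage of the OOPS C API. The header
 * and function names below are placeholders, not the released API. */
#include <stdio.h>
#include "oops.h"   /* single header exposing the OOPS function prototypes */

int main(void)
{
    float x[4] = {1, 2, 3, 4};
    float y[4] = {4, 3, 2, 1};

    oops_init();                 /* set up the FPGA device and runtime  */
    oops_saxpy(4, 2.0f, x, y);   /* y = 2*x + y, offloaded to the FPGA  */
    oops_finalize();             /* release device resources            */

    for (int i = 0; i < 4; ++i)
        printf("%.1f ", y[i]);   /* expected output: 6.0 7.0 8.0 9.0 */
    return 0;
}
```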

Finally, Figure 2 illustrates the general structure of the OOPS kernels that will be implemented on the OPTIMA platforms. Kernels will expose a set of input/output arguments to developers. Before initiating data processing, each kernel will utilize vendor-specific APIs to allocate memory for all input/output arguments on the FPGA, as well as to transfer data between the host processor and the attached FPGA device. To optimize performance and resource utilization, all OOPS kernels will include vendor-specific High-Level Synthesis (HLS) directives. These directives allow CAD synthesis and implementation tools to identify and unroll for-loops, exploit data parallelism, set a target pipelining initiation interval, and optimize data-transfer mechanisms (e.g., using multiple DMA channels and/or double buffering).
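As one concrete illustration, the device-side sketch below uses Xilinx Vitis HLS-style pragmas; the interface layout and the particular directives are examples of the vendor-specific optimizations mentioned above, not the actual OOPS kernel code:

```c
/* Device-side sketch of an OOPS-style kernel, annotated with Xilinx
 * Vitis HLS-style directives as one example of vendor-specific pragmas.
 * The interface layout and directive choices are illustrative. */
void oops_saxpy_kernel(const float *x, float *y, float alpha, int n)
{
/* Separate AXI bundles let the tools infer independent DMA channels. */
#pragma HLS INTERFACE m_axi     port=x bundle=gmem0
#pragma HLS INTERFACE m_axi     port=y bundle=gmem1
#pragma HLS INTERFACE s_axilite port=alpha
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return

    for (int i = 0; i < n; ++i) {
/* Target an initiation interval of one cycle and partially unroll the
 * loop to exploit data parallelism across four lanes. */
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=4
        y[i] = alpha * x[i] + y[i];
    }
}
```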
References
[1] Balay S., Gropp W.D., McInnes L.C., Smith B.F. (1997) Efficient Management of Parallelism in Object-Oriented Numerical Software Libraries. In: Arge E., Bruaset A.M., Langtangen H.P. (eds) Modern Software Tools for Scientific Computing. Birkhäuser, Boston, MA.
[2] BLAS (Basic Linear Algebra Subprograms) – http://www.netlib.org/blas/