Issue
The loop containing the forall pattern can be made faster by offloading it to an FPGA using the Xilinx programming environment.
Action
Implement a version of the forall loop using the application programming interface (API) provided by the HLS compiler of the Xilinx programming environment.
Relevance
A loop often accounts for a computationally expensive part of a program, and an FPGA accelerator can be used to speed it up. A loop without loop-carried dependencies can be mapped to an HLS implementation (see code example ‘axpy’ below). The programmer needs to explicitly provide the HLS compiler with annotations to guide the high-level synthesis process.
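To make the "no loop-carried dependencies" condition concrete, the following minimal C++ sketch contrasts a forall-style loop with a loop that does carry a dependency. The function names `scale_add` and `running_sum` are hypothetical and chosen only for illustration; they are not part of any Xilinx API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// forall pattern: iteration i reads and writes only element i, so the
// iterations are independent and may run in any order or in parallel.
void scale_add(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        y[i] += a * x[i];
    }
}

// Counter-example: iteration i reads y[i - 1], which the previous iteration
// has just written. This loop-carried dependency forces sequential order.
void running_sum(std::vector<float>& y) {
    for (std::size_t i = 1; i < y.size(); ++i) {
        y[i] += y[i - 1];
    }
}
```

Only the first loop qualifies as a forall pattern; the second would need to be restructured before its iterations could be executed concurrently.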
Code examples
CPU:
void axpy(int n, float da, float *x, float *y)
{
    int i = 0;
    /* Each iteration touches only element i: no loop-carried dependencies. */
    for (i = 0; i < n; i++)
    {
        y[i] += da * x[i];
    }
}
Xilinx HLS:
#include "../../include/global.hpp"
#include "../../include/common.hpp"
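// v_dt (a wide word with a data[] array of VDATA_SIZE elements) and VDATA_SIZE
// are assumed to be defined in the project headers included above.
inline void write_vector_wide_axpy(v_dt* out, hls::stream<v_dt>& Xtemp, hls::stream<v_dt>& Ytemp, const int N, const int incy, const float alpha) {
    v_dt X;
    v_dt Y;
    v_dt temp;
    // Each iteration reads one wide word from each stream and writes one wide word.
mem_wr:
    for (int i = 0; i < N; i += VDATA_SIZE) {
#pragma HLS pipeline II=1
        X = Xtemp.read();
        Y = Ytemp.read();
        // Fully unrolled: the lane computation is replicated VDATA_SIZE times.
        for (int j = 0; j < VDATA_SIZE; j++) {
#pragma HLS unroll
            // Lanes beyond element N in the last wide word are zero-filled.
            if (i + VDATA_SIZE <= N || j < (N % VDATA_SIZE)) {
                temp.data[j] = Y.data[j] + alpha * X.data[j];
            } else {
                temp.data[j] = 0;
            }
        }
        out[i / VDATA_SIZE] = temp;
    }
}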
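The lane and tail handling of the HLS kernel can be checked on a CPU with a plain C++ model. The sketch below is only an illustration under stated assumptions: `WideWord` and `kLanes` are hypothetical stand-ins for `v_dt` and `VDATA_SIZE`, the streams are replaced by vectors, and `pack` and `axpy_wide` are helper names introduced here, not part of the original sources or of the Xilinx tools.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kLanes = 4;  // hypothetical stand-in for VDATA_SIZE

struct WideWord {          // hypothetical stand-in for v_dt
    float data[kLanes];
};

// Pack a flat array into wide words; unused lanes of the last word stay zero.
std::vector<WideWord> pack(const std::vector<float>& v) {
    std::vector<WideWord> out((v.size() + kLanes - 1) / kLanes, WideWord{});
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i / kLanes].data[i % kLanes] = v[i];
    }
    return out;
}

// Software model of the kernel loop nest: one wide word per outer iteration,
// one lane per inner iteration, with the same tail condition as the kernel.
std::vector<WideWord> axpy_wide(const std::vector<WideWord>& x,
                                const std::vector<WideWord>& y,
                                int n, float alpha) {
    assert(x.size() == y.size());
    assert(n >= 0 && static_cast<std::size_t>(n) <= x.size() * kLanes);
    std::vector<WideWord> out(x.size(), WideWord{});
    for (int i = 0; i < n; i += kLanes) {
        const WideWord& xw = x[i / kLanes];
        const WideWord& yw = y[i / kLanes];
        for (int j = 0; j < kLanes; ++j) {
            // A lane is valid in every full word, and in the first
            // n % kLanes lanes of a partial last word.
            const bool valid = (i + kLanes <= n) || (j < n % kLanes);
            out[i / kLanes].data[j] = valid ? yw.data[j] + alpha * xw.data[j] : 0.0f;
        }
    }
    return out;
}
```

For example, with `n = 6` and `kLanes = 4` the model produces two wide words, and the last two lanes of the second word are zero because they lie past the end of the vector.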
References