Issue
A loop that follows the forall pattern can be accelerated by offloading it to a Dataflow Engine (DFE) accelerator using the Maxeler programming environment.
Action
Implement a version of the forall loop using the Application Programming Interface (API) provided by the MaxJ compiler of the Maxeler programming environment.
Relevance
A loop often accounts for a computationally expensive part of a program, and an FPGA-based DFE accelerator can be used to speed it up. A loop without loop-carried dependencies maps directly to a simple MaxJ implementation (see code example V1 below) and can be further optimised by parallelisation (see code example V2). The programmer does not need to manage data transfers between the CPU and the DFE accelerator explicitly, but the IO overhead of these transfers must be taken into account when developing the accelerator.
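To illustrate the distinction drawn above, the following C sketch contrasts a forall loop, whose iterations are independent, with a loop that carries a dependency from one iteration to the next. The function names scale_add and prefix_sum are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Forall pattern: iteration i reads and writes only element i,
   so the iterations can run in any order or in parallel. */
void scale_add(double *D, const double *X, const double *Y, size_t n, double a)
{
    assert(D != NULL && X != NULL && Y != NULL);
    for (size_t i = 0; i < n; ++i) {
        D[i] = a * X[i] + Y[i];
    }
}

/* Loop-carried dependency: iteration i needs the running sum produced by
   iteration i - 1, so this loop does not fit the forall pattern as written. */
void prefix_sum(double *D, const double *X, size_t n)
{
    assert(D != NULL && X != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += X[i];
        D[i] = acc;
    }
}
```

Only the first function is a direct candidate for the mapping described here; the second would need to be restructured before it could be treated the same way.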
Code examples
CPU:
void example(double *D, double *X, double *Y, int n, double a)
{
    for (int i = 0; i < n; ++i) {
        D[i] = a * X[i] + Y[i];
    }
}
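As a rough way to gauge the IO overhead mentioned under Relevance, the data volume and arithmetic work of the CPU loop can be estimated as below. This is a back-of-the-envelope sketch: the helper names and the bandwidth parameter are assumptions for illustration, and real transfer times also depend on latency and driver overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved per call of example(): X and Y are streamed in, D is streamed out. */
size_t stream_bytes(size_t n)
{
    return 3 * n * sizeof(double);
}

/* Floating-point operations per call: one multiply and one add per element. */
size_t flop_count(size_t n)
{
    return 2 * n;
}

/* Lower bound on the transfer time in seconds, assuming a link that
   sustains bandwidth_bytes_per_s (a hypothetical, user-supplied figure). */
double min_transfer_seconds(size_t n, double bandwidth_bytes_per_s)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return (double)stream_bytes(n) / bandwidth_bytes_per_s;
}
```

With only two floating-point operations for every three doubles transferred, this particular loop is likely to be limited by data movement rather than by arithmetic, which is why the transfer cost deserves attention.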
MaxJ:
V1 (simple solution)
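// IEEE 754 double precision: 11 exponent bits, 53 mantissa bits (including the implicit bit)
DFEType type = dfeFloat(11, 53);
DFEVar X = io.input("X", type);
DFEVar Y = io.input("Y", type);
DFEVar a = io.scalarInput("a", type);
// One result per kernel tick: D = a * X + Y, matching the CPU loop
io.output("D", type) <== a * X + Y;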
V2 (optimised solution)
// N is a compile-time constant that can be tuned to the available FPGA resources
DFEType type = dfeFloat(11, 53);
DFEVectorType<DFEVar> vectorType = new DFEVectorType<DFEVar>(type, N);
DFEVector<DFEVar> X = io.input("X", vectorType);
DFEVector<DFEVar> Y = io.input("Y", vectorType);
DFEVar a = io.scalarInput("a", type);
DFEVector<DFEVar> output = io.output("D", vectorType);
// The loop is unrolled at compile time into N parallel multiply-add data paths
for (int i = 0; i < N; i++) {
    output[i] <== a * X[i] + Y[i];
}
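The effect of the V2 vectorisation can be modelled in plain C, which may help when checking results on the CPU. In this sketch the parameter lanes plays the role of N, each outer iteration stands for one kernel tick, and the function names are hypothetical. It assumes that the stream length must be a whole number of vectors, so the host would pad the arrays to a multiple of lanes.

```c
#include <assert.h>
#include <stddef.h>

/* Software model of the V2 kernel: every tick processes `lanes` elements,
   one per parallel data path. */
void example_lanes(double *D, const double *X, const double *Y,
                   size_t n, double a, size_t lanes)
{
    assert(lanes > 0);
    assert(n % lanes == 0); /* assumption: data is padded to a multiple of lanes */
    for (size_t tick = 0; tick < n / lanes; ++tick) {
        for (size_t i = 0; i < lanes; ++i) {
            size_t idx = tick * lanes + i;
            D[idx] = a * X[idx] + Y[idx];
        }
    }
}

/* Smallest multiple of lanes that is greater than or equal to n
   (hypothetical helper for sizing padded buffers). */
size_t padded_length(size_t n, size_t lanes)
{
    assert(lanes > 0);
    return (n + lanes - 1) / lanes * lanes;
}
```

For example, ten elements processed with four lanes would need buffers of padded_length(10, 4) elements, that is twelve, and would take three ticks instead of ten.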
References