Issue
The loop containing the scalar reduction pattern can be made faster by offloading it to an accelerator.
Action
The code example provides a MaxJ solution that could be applied to a scalar reduction pattern.
Relevance
Offloading a loop to a DFE (dataflow engine) is one way to speed it up. DFEs offer substantial compute resources, but to be efficient the MaxJ implementation must account for the accumulation step of the computation: each new value of the sum depends on the previous one, which creates a loop-carried dependency in the hardware pipeline. Controlling the degree of pipelining lets you trade off the latency, throughput and hardware resources of the implementation.
Code examples
CPU:
double example(int *A, int n) {
    double sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += A[i];
    }
    return sum;
}
MaxJ:
// N is a variable that can be tuned depending on available FPGA resources
DFEVectorType<DFEVar> vectorType = new DFEVectorType<DFEVar>(dfeUInt(32), N);
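// Cycle counter: used to reset the sums on the first cycle and to enable the output at cycle n
DFEVar cycle = control.count.simpleCounter(32);
DFEVar enableOutput = cycle === n;
DFEVector<DFEVar> A = io.input("A", vectorType);
DFEVector<DFEVar> output = io.output("sum", vectorType, enableOutput);
for (int i = 0; i < N; i++) {
    DFEVar carriedSum = dfeUInt(32).newInstance(this); // sourceless stream
    // Start from zero on the first cycle, otherwise use the sum carried over from the previous cycle
    DFEVar sum = cycle === 0 ? 0 : carriedSum;
    // Keep the adder unpipelined so its result is available on the next cycle
    optimization.pushNoPipelining();
    DFEVar newSum = A[i] + sum;
    optimization.popNoPipelining();
    // Close the loop: feed the previous cycle's result back into carriedSum
    DFEVar newSumOffset = stream.offset(newSum, -1);
    carriedSum <== newSumOffset;
    output[i] <== sum;
}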
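The MaxJ kernel above keeps N independent running sums, one per vector lane, so the scalar result still has to be obtained by adding the N lane sums together. The following is a minimal CPU-side sketch in C of that arithmetic, not part of the original example. The names `N_LANES`, `lane_sums` and `combine_lanes` are hypothetical, and the sketch assumes 32-bit unsigned elements and that n is a multiple of the lane count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lane count; plays the role of N in the MaxJ kernel. */
#define N_LANES 4

/* CPU model of the lane-wise accumulation: lane i sums
 * A[i], A[i + N_LANES], A[i + 2 * N_LANES], ...
 * Assumption: n is a multiple of N_LANES. */
static void lane_sums(const uint32_t *A, size_t n, uint32_t lanes[N_LANES]) {
    assert(n % N_LANES == 0);
    for (size_t i = 0; i < N_LANES; ++i) {
        lanes[i] = 0;
    }
    /* Each iteration of the outer loop models one vector of N_LANES inputs. */
    for (size_t base = 0; base < n; base += N_LANES) {
        for (size_t i = 0; i < N_LANES; ++i) {
            lanes[i] += A[base + i];
        }
    }
}

/* Final combine step: add the per-lane partial sums into one scalar. */
static uint32_t combine_lanes(const uint32_t lanes[N_LANES]) {
    uint32_t total = 0;
    for (size_t i = 0; i < N_LANES; ++i) {
        total += lanes[i];
    }
    return total;
}
```

For the input 1, 2, ..., 8 with four lanes, `lane_sums` yields the partial sums 6, 8, 10 and 12, and `combine_lanes` returns 36. Splitting the reduction this way removes the dependency between lanes, which is what allows the N adders to run in parallel; the dependency that remains is within each lane.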
References