Issue
A loop containing a scalar reduction with a non-associative operator cannot be parallelized, because each iteration depends on the result of the previous one. It can still be made faster through loop unrolling.
Action
The code example provides a MaxJ solution that can be applied to a scalar reduction pattern.
Relevance
Loop unrolling is a common and important coding construct for DFEs (dataflow engines). It applies to loops with a loop-carried dependency, where each iteration needs a value produced by the previous one. Because of this dependency, the loop cannot be parallelized into independent blocks of computation, but it can be unrolled into a pipeline in which each iteration becomes its own stage, offering a high degree of concurrency and throughput.
Code examples
CPU:
void example(float *input, float *output, int numItems) {
    int count, iteration;
    for (count = 0; count < numItems; count += 1) {
        float d = input[count];
        float v = 2.9142 - 2 * d;  /* initial value */
        /* Each pass needs the v computed by the previous pass. */
        for (iteration = 0; iteration < 4; iteration += 1) {
            v = v * (2.0 - d * v);
        }
        output[count] = v;
    }
}
MAXJ:
DFEVar d = io.input("input", dfeFloat(8, 24));
DFEVar v = 2.9142 - 2.0 * d;
for (int iteration = 0; iteration < 4; iteration += 1) {
    v = v * (2.0 - d * v);
}
io.output("output", v, dfeFloat(8, 24));
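Unrolled (illustrative C sketch):
To make the effect of unrolling visible, the inner loop of the CPU example can be written out by hand as four chained statements. The sketch below is only an illustration, not part of the original pattern: the function names reciprocal_loop and reciprocal_unrolled are hypothetical, and the four-pass count is taken from the example above. Each intermediate value v0 to v4 corresponds to one stage of the chain that the MaxJ loop would produce, and each stage depends only on the stage before it.

```c
/* Hypothetical helper: the loop form, as in the CPU example. */
float reciprocal_loop(float d) {
    float v = 2.9142f - 2.0f * d;
    int iteration;
    for (iteration = 0; iteration < 4; iteration += 1) {
        v = v * (2.0f - d * v);  /* depends on v from the previous pass */
    }
    return v;
}

/* Hypothetical helper: the same computation unrolled by hand.
 * Every statement reads only the value produced by the statement
 * before it, so the chain maps naturally onto pipeline stages. */
float reciprocal_unrolled(float d) {
    float v0 = 2.9142f - 2.0f * d;    /* initial value */
    float v1 = v0 * (2.0f - d * v0);  /* stage 1 */
    float v2 = v1 * (2.0f - d * v1);  /* stage 2 */
    float v3 = v2 * (2.0f - d * v2);  /* stage 3 */
    float v4 = v3 * (2.0f - d * v3);  /* stage 4 */
    return v4;
}
```

The update v * (2 - d * v) is the Newton-Raphson step for the reciprocal 1/d, so both functions should return a value close to 1/d when the initial value is a reasonable guess (for example, for d between 0.5 and 1). On a CPU the two forms do the same work; the benefit of the unrolled form appears on a DFE, where a new input d can enter the first stage while earlier inputs are still moving through the later stages.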
References