Issue
The loop containing the sparse reduction pattern can be sped up by offloading it to an accelerator, provided the race condition on the reduced array is handled so that the result stays correct.
Action
The code example below shows a MaxJ implementation that can be applied to a sparse reduction pattern.
Relevance
Offloading a loop to a DFE (dataflow engine) is one way to speed it up. A sparse reduction pattern can be implemented with a mapped memory on the DFE, which handles the data-dependent access pattern of the computation efficiently.
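The race condition comes from repeated indices: when two iterations target the same element of A, the second must see the value written by the first. The following C sketch illustrates this on the CPU. It is not part of the original example; sparse_reduce_overlapped is a hypothetical model of an unsafe overlap in which two iterations read A before either one writes back, and both function names are chosen here for illustration only.

```c
#include <assert.h>

/* Sequential sparse reduction: the same update as the loop this
 * recommendation targets. */
void sparse_reduce(int *A, const int *nodes1, int n) {
    for (int nel = 0; nel < n; ++nel) {
        A[nodes1[nel]] += nel;
    }
}

/* Hypothetical model of an unsafe overlap (an assumption for illustration):
 * iterations run in pairs, and both iterations of a pair read A before
 * either one writes its result back. */
void sparse_reduce_overlapped(int *A, const int *nodes1, int n) {
    assert(n % 2 == 0); /* this sketch only handles an even iteration count */
    for (int nel = 0; nel < n; nel += 2) {
        int first  = A[nodes1[nel]];     /* read for iteration nel     */
        int second = A[nodes1[nel + 1]]; /* read for iteration nel + 1 */
        A[nodes1[nel]]     = first + nel;
        /* If both indices match, 'second' is stale and this write
         * discards the update made on the previous line. */
        A[nodes1[nel + 1]] = second + (nel + 1);
    }
}
```

With nodes1 = {1, 3, 3, 3} and A initialised to zero, the sequential version leaves A[3] equal to 6, while the overlapped model leaves A[3] equal to 4, because the contribution of iteration 2 is overwritten. Any accelerated version has to avoid this kind of lost update.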
Code examples
CPU:
void example(int *A, int *nodes1, int n) {
    for (int nel = 0; nel < n; ++nel) {
        A[nodes1[nel]] += nel * 1;
    }
}
MaxJ:
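// On-chip memory holding A (512 entries of 32 bits), mapped so the CPU can access it as "A"
Memory<DFEVar> mappedMemoryA = mem.alloc(dfeUInt(32), 512);
mappedMemoryA.mapToCPU("A");
// Consume a new index from the input stream only on ticks where the counter is 0
DFEVar enableInput = control.count.makeCounter(control.count.makeParams(4).withMax(2)).getCount() === 0;
DFEVar nodes1 = io.input("nodes1", dfeUInt(32), enableInput);
// Narrow the index to the address width of the 512-entry memory
nodes1 = nodes1.cast(dfeUInt(MathUtils.bitsToAddress(512)));
DFEVar nel = control.count.simpleCounter(32);
// Read-modify-write of A[nodes1]; pipelining is disabled around the addition
DFEVar A = mappedMemoryA.read(nodes1);
optimization.pushNoPipelining();
A = A + nel;
optimization.popNoPipelining();
// Write back the address and value from two ticks earlier
mappedMemoryA.write(stream.offset(nodes1, -2), stream.offset(A, -2), constant.var(true));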
References