The default vector.contract lowering essentially yields a series of sdot/ddot
operations. However, for some layouts a series of saxpy/daxpy operations,
chained through fma, is more efficient. This CL introduces a choice between
the two lowering paths. A default heuristic is to follow.
Some preliminary AVX2 performance numbers for matrix-times-vector follow.
Here, sdot performs best for 64x64 A x b and saxpy for 64x64 A^T x b.
------------------------------------------------------------
          A x b                      A^T x b
------------------------------------------------------------
GFLOPS    sdot (reassoc)    saxpy    sdot (reassoc)    saxpy
------------------------------------------------------------
1x1       0.6               0.9      0.6               0.9
2x2       2.5               3.2      2.4               3.5
4x4       6.4               8.4      4.9               11.8
8x8       11.7              6.1      5.0               29.6
16x16     20.7              10.8     7.3               43.3
32x32     29.3              7.9      6.4               51.8
64x64     38.9              -        -                 79.3
128x128   32.4              -        -                 40.7
------------------------------------------------------------
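
For readers less familiar with the two strategies, the difference between the
dot-style and axpy-style lowering of a matrix-times-vector product can be
sketched in plain Python. This is only a scalar model for illustration: the
function names matvec_dot and matvec_axpy are hypothetical and are not part of
MLIR, and the real lowering operates on vector ops rather than Python lists.

```python
# Illustrative scalar model of the two strategies; not the MLIR lowering itself.
# Both function names are hypothetical.

def matvec_dot(A, b):
    """Dot style: each output element is one reduction over a row of A."""
    return [sum(a_ij * b_j for a_ij, b_j in zip(row, b)) for row in A]


def matvec_axpy(A, b):
    """AXPY style: y = b[j] * A[:, j] + y, one multiply-add chain per column."""
    y = [0.0] * len(A)
    for j, b_j in enumerate(b):
        for i in range(len(A)):
            # On real hardware this multiply-add would map to a fused fma.
            y[i] = A[i][j] * b_j + y[i]
    return y
```

Both functions compute the same result. The dot form ends every output element
with a horizontal reduction, while the axpy form keeps a whole accumulator
vector live and updates it elementwise, which is why the better choice depends
on the layout of A.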
AXPY and OuterProduct are fundamentally the same strategy, applied to the 1dx2d -> 1d / 2dx1d -> 1d and 2dx2d -> 2d cases, respectively.
It would therefore be good to use a single entry point.
Lowering progressively through vector.outerproduct also makes it possible to compose with patterns that operate on vector.outerproduct.
vector.outerproduct then lowers to the appropriate extract/insert + splat + FMA.
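
As a rough illustration of this strategy, a 2d x 2d contraction can be modeled
in plain Python as a sum of outer products, each of which is computed per row
by extracting a scalar, splatting it, and doing an elementwise multiply-add.
The helper names outerproduct_acc and matmul_via_outerproducts are hypothetical
and only mimic the shape of the lowering; they are not MLIR APIs.

```python
# Scalar model of contraction via accumulating outer products.
# All names here are hypothetical and for illustration only.

def outerproduct_acc(a, b, acc):
    """Return acc + outer(a, b), built row by row."""
    out = []
    for i, a_i in enumerate(a):       # "extract" element i of a
        splat = [a_i] * len(b)        # "splat" it to the width of b
        # Elementwise multiply-add (fma on real hardware), then "insert" row i.
        out.append([s * b_j + c for s, b_j, c in zip(splat, b, acc[i])])
    return out


def matmul_via_outerproducts(A, B):
    """C = A * B as a sum over k of outer(A[:, k], B[k, :])."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p in range(k):
        col = [A[i][p] for i in range(m)]   # column p of A
        C = outerproduct_acc(col, B[p], C)  # accumulate one rank-1 update
    return C
```

With a single-column right-hand side, each rank-1 update degenerates to a
vector scaled by a scalar and added to the accumulator, which is the AXPY form
of the matvec case.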
For the matvec special case, I was thinking of relaxing the semantics of vector.outerproduct to accept any of 1d x 1d -> 2d, scalar x 1d -> 1d, or 1d x scalar -> 1d.
Thoughts?