This operation will be used to transform MATMUL(TRANSPOSE(a), b). The
transformation proceeds in the following steps:
1. Lower to hlfir.transpose and hlfir.matmul
2. Canonicalise to hlfir.matmul_transpose
3. Lower hlfir.matmul_transpose to FIR as a new runtime library call
Step 2 (and this operation) are included for consistency with the other
hlfir intrinsic operations and to avoid mixing concerns in the intrinsic
lowering pass.
In step 3, a new runtime library call is used because this operation is
most easily implemented in one go (the transposed indexing actually
makes the indexing simpler than for a normal matrix multiplication). In
the long run, it is intended that HLFIR will allow the same buffer
to be shared between different runtime calls without temporary
allocations, but in this specific case we can do even better than that
with a dedicated implementation.
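To illustrate why the transposed indexing is simpler, here is a minimal
sketch of a fused kernel. It is not the Flang runtime implementation: the
name matmulTranspose, the flat std::vector storage and the double element
type are all assumptions made for illustration. With column-major (Fortran)
storage, element (i, j) of MATMUL(TRANSPOSE(a), b) is the dot product of
column i of a with column j of b, so the inner loop walks both operands
contiguously and no transposed temporary is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the Flang runtime entry point.
// Computes C = MATMUL(TRANSPOSE(A), B) for column-major (Fortran-order) data.
//   A is n x m, with A(k, i) stored at a[k + i * n]
//   B is n x p, with B(k, j) stored at b[k + j * n]
//   C is m x p, with C(i, j) stored at c[i + j * m]
std::vector<double> matmulTranspose(const std::vector<double> &a,
                                    const std::vector<double> &b,
                                    std::size_t n, std::size_t m,
                                    std::size_t p) {
  assert(a.size() == n * m && b.size() == n * p);
  std::vector<double> c(m * p, 0.0);
  for (std::size_t j = 0; j < p; ++j) {
    const double *bCol = b.data() + j * n; // column j of B, contiguous
    for (std::size_t i = 0; i < m; ++i) {
      const double *aCol = a.data() + i * n; // column i of A, contiguous
      double sum = 0.0;
      // C(i, j) = sum over k of A(k, i) * B(k, j): a plain dot product
      // of two contiguous columns, with no strided access.
      for (std::size_t k = 0; k < n; ++k)
        sum += aCol[k] * bCol[k];
      c[i + j * m] = sum;
    }
  }
  return c;
}
```

By contrast, a plain MATMUL(a, b) over the same layout reads a row of a in
the inner loop, which is a strided access.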
This should speed up galgel from SPEC2000 (but this has not been tested
yet). The optimization was implemented in Classic Flang.