This pattern is not specific to nvgpu; I intend to use it in SPIR-V codegen. VectorTransforms seems like a more generally useful place for it.
In addition:
- Fix a bug in the second condition (the dimensions were swapped for RHS).
- Add tests.
- Add support for externally provided filter functions, similar to other vector transforms.
- Prefer to transpose before zero/sign-extending inputs.
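
To illustrate the last item: transposition only permutes elements and extension is elementwise, so the two commute, and doing the transpose first lets it operate on the narrower element type. The sketch below uses a hypothetical `Matrix` struct and hypothetical `transpose` / `signExtend` helpers written for this example; they are not MLIR APIs and do not reflect how the pattern itself is implemented. The motivation (shuffling narrower elements) is an assumption, not something stated in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row-major matrix type, for illustration only.
template <typename T>
struct Matrix {
  std::size_t rows, cols;
  std::vector<T> data;
};

// Plain transpose: element (r, c) moves to (c, r).
template <typename T>
Matrix<T> transpose(const Matrix<T> &m) {
  Matrix<T> out{m.cols, m.rows, std::vector<T>(m.data.size())};
  for (std::size_t r = 0; r < m.rows; ++r)
    for (std::size_t c = 0; c < m.cols; ++c)
      out.data[c * m.rows + r] = m.data[r * m.cols + c];
  return out;
}

// Elementwise sign extension from i8 to i32.
Matrix<int32_t> signExtend(const Matrix<int8_t> &m) {
  Matrix<int32_t> out{m.rows, m.cols, {}};
  out.data.assign(m.data.begin(), m.data.end());
  return out;
}
```

For any `Matrix<int8_t> m`, `signExtend(transpose(m))` and `transpose(signExtend(m))` hold the same elements; the first form transposes 8-bit values, the second 32-bit values.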
Excuse my ignorance... What are MMT and TNT? Perhaps adding a comment would help.