Adds an optional attribute to support tensor cores on the F32 datatype by lowering to mma.sync with TF32 operands. Since TF32 is not a native datatype in LLVM, we add tf32Enabled as an attribute so the IR is aware of the MmaSyncOp datatype. Additionally, this patch adds placeholders for an nvgpu-to-nvgpu transformation targeting the higher-precision tf32x3 mode.
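For illustration, a minimal sketch of how the attribute could appear on nvgpu.mma.sync in the IR. The m16n8k8 shape and per-thread fragment vector types are assumptions for the example, not prescribed by this patch:

```mlir
func.func @mma_sync_tf32(%a: vector<4x1xf32>, %b: vector<2x1xf32>,
                         %c: vector<2x2xf32>) -> vector<2x2xf32> {
  // The unit attribute tf32Enabled marks this op for lowering to
  // mma.sync with tf32 operands; omitting it keeps the plain f32 path.
  %d = nvgpu.mma.sync(%a, %b, %c) {mmaShape = [16, 8, 8], tf32Enabled}
      : (vector<4x1xf32>, vector<2x1xf32>, vector<2x2xf32>) -> vector<2x2xf32>
  return %d : vector<2x2xf32>
}
```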
For mma.sync on f32 inputs using tensor cores, there are two possibilities:
(a) tf32 (1 mma.sync per warp-level matrix-multiply-accumulate)
(b) tf32x3 (3 mma.sync per warp-level matrix-multiply-accumulate)
Typically, tf32 tensor core acceleration comes at a cost in accuracy from the missing precision bits: while f32 has 23 precision bits, tf32 has only 10. tf32x3 aims to recover the lost precision by splitting each operand into two tf32 values and issuing three mma.sync tensor core operations, as sketched below.
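A rough sketch of the tf32x3 decomposition (notation is mine, not from this patch): each f32 operand is split into a "big" tf32 part and a "small" tf32 correction, and the product is approximated with three tensor core multiplies, dropping the small-by-small term:

```latex
a_{\mathrm{big}} = \mathrm{tf32}(a), \quad a_{\mathrm{small}} = \mathrm{tf32}(a - a_{\mathrm{big}})
b_{\mathrm{big}} = \mathrm{tf32}(b), \quad b_{\mathrm{small}} = \mathrm{tf32}(b - b_{\mathrm{big}})
a \cdot b \approx \underbrace{a_{\mathrm{small}} \cdot b_{\mathrm{big}}}_{\text{mma.sync 1}}
              + \underbrace{a_{\mathrm{big}} \cdot b_{\mathrm{small}}}_{\text{mma.sync 2}}
              + \underbrace{a_{\mathrm{big}} \cdot b_{\mathrm{big}}}_{\text{mma.sync 3}}
```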
I think just putting `UnitAttr` without the `OptionalAttr` should be sufficient.