This patch adds builtins/intrinsics for the following variants of FMA:
- f16
  - rn
  - rn_ftz
  - rn_sat
  - rn_ftz_sat
  - rn_relu
  - rn_ftz_relu
- f16x2
  - rn
  - rn_ftz
  - rn_sat
  - rn_ftz_sat
  - rn_relu
  - rn_ftz_relu
- bf16
  - rn
  - rn_relu
- bf16x2
  - rn
  - rn_relu
They all require PTX 7.0 and SM_80.
ptxas (Cuda compilation tools, release 11.0, V11.0.194) is happy with the generated assembly.
Depends on D117887
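For readers unfamiliar with the suffixes, here is a rough sketch of the variant matrix listed above and of what the sat and relu modifiers do to a result, as I understand the PTX semantics (sat clamps to [0.0, 1.0] and turns NaN into +0.0; relu clamps negative results to 0 and leaves NaN as NaN). The helper names and the `fma_<variant>_<type>` naming scheme are hypothetical and used only for illustration; they are not the builtin names from this patch. ftz and the actual f16/bf16 rounding are not modeled.

```python
import math

# Variant matrix copied from the list in the description.
F16_VARIANTS = ["rn", "rn_ftz", "rn_sat", "rn_ftz_sat", "rn_relu", "rn_ftz_relu"]
BF16_VARIANTS = ["rn", "rn_relu"]
VARIANTS = {
    "f16": F16_VARIANTS,
    "f16x2": F16_VARIANTS,
    "bf16": BF16_VARIANTS,
    "bf16x2": BF16_VARIANTS,
}


def variant_names():
    """Enumerate every (variant, type) pair.

    The name format is a hypothetical scheme for illustration only.
    """
    return [f"fma_{v}_{t}" for t, vs in VARIANTS.items() for v in vs]


def apply_modifier(x, variant):
    """Apply the sat/relu post-processing step to an FMA result x.

    Assumes x is the already-rounded result; ftz is ignored here.
    """
    parts = variant.split("_")
    if "sat" in parts:
        # sat: NaN becomes +0.0, everything else is clamped to [0.0, 1.0].
        if math.isnan(x):
            return 0.0
        return min(max(x, 0.0), 1.0)
    if "relu" in parts:
        # relu: NaN stays NaN, negative results become 0.0.
        if math.isnan(x):
            return math.nan
        return x if x > 0.0 else 0.0
    return x


if __name__ == "__main__":
    names = variant_names()
    print(len(names), "variants")
    for name in names:
        print(name)
```

The enumeration makes it easy to check that a test file covers every combination in the list.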
I think the default should be the most useful/common and the least surprising value.
I'd argue that in this case it would be []. This would give the reader a reasonable idea of what's going on even without looking at the FMA_TUPLE implementation.