Add clang builtins for the WMMA instructions, with tests. Extra code
changes are needed to handle function overloads.
WavefrontSize 32:
__builtin_amdgcn_wmma_f32_16x16x16_f16_w32
__builtin_amdgcn_wmma_f32_16x16x16_bf16_w32
__builtin_amdgcn_wmma_f16_16x16x16_f16_w32
__builtin_amdgcn_wmma_bf16_16x16x16_bf16_w32
__builtin_amdgcn_wmma_i32_16x16x16_iu8_w32
__builtin_amdgcn_wmma_i32_16x16x16_iu4_w32
WavefrontSize 64:
__builtin_amdgcn_wmma_f32_16x16x16_f16_w64
__builtin_amdgcn_wmma_f32_16x16x16_bf16_w64
__builtin_amdgcn_wmma_f16_16x16x16_f16_w64
__builtin_amdgcn_wmma_bf16_16x16x16_bf16_w64
__builtin_amdgcn_wmma_i32_16x16x16_iu8_w64
__builtin_amdgcn_wmma_i32_16x16x16_iu4_w64
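
For orientation only, here is a host-side sketch of the tile operation
that the 16x16x16 in the names refers to: D = A * B + C, with M = N = K = 16.
It is not the device code added by this change. The name
wmma_ref_f32_16x16x16 and the Tile type are hypothetical, plain float
stands in for the f16 inputs, and the way operands are distributed
across the lanes of a wavefront is ignored.

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side reference; this is NOT the clang builtin.
constexpr std::size_t kDim = 16;  // M = N = K = 16, as in "16x16x16"
using Tile = std::array<std::array<float, kDim>, kDim>;

// Returns D = A * B + C for a single 16x16 tile.
Tile wmma_ref_f32_16x16x16(const Tile& a, const Tile& b, const Tile& c) {
  Tile d = c;  // start from the accumulator
  for (std::size_t i = 0; i < kDim; ++i)
    for (std::size_t j = 0; j < kDim; ++j)
      for (std::size_t k = 0; k < kDim; ++k)
        d[i][j] += a[i][k] * b[k][j];
  return d;
}
```

A reference like this could be used to check device results element by
element in a test, assuming the device output is first gathered back
into row-major tiles.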