Add warp-synchronous matrix-multiply-accumulate (WMMA) ops to the GPU and NVVM dialects.
Add the following three ops to the GPU dialect:
  1. subgroup_mma_load_matrix
  2. subgroup_mma_store_matrix
  3. subgroup_mma_compute
Add the following three ops to the NVVM dialect:
  1. wmma.m16n16k16.load
  2. wmma.m16n16k16.store
  3. wmma.m16n16k16.mma
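
For readers unfamiliar with the load / compute / store pattern these ops follow, here is a scalar C++ reference sketch of what one 16x16x16 tile operation computes (D = A * B + C). It is only an illustration: the function names load_matrix, mma_compute, and store_matrix are hypothetical stand-ins that mirror the op names, float is used for portability rather than whatever element types the ops accept, and the real ops work on fragments distributed across the threads of a warp rather than on whole tiles held by one thread.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Tile shape taken from the op names (m16n16k16).
constexpr std::size_t M = 16, N = 16, K = 16;

// A dense R x C tile stored row-major. Assumption: float elements, chosen
// for portability; this is not a statement about the ops' element types.
template <std::size_t R, std::size_t C>
using Tile = std::array<float, R * C>;

// Hypothetical stand-in for a "load" op: copy an R x C tile out of
// row-major memory whose rows are `ld` elements apart.
template <std::size_t R, std::size_t C>
Tile<R, C> load_matrix(const float* src, std::size_t ld) {
  Tile<R, C> t{};
  for (std::size_t i = 0; i < R; ++i)
    for (std::size_t j = 0; j < C; ++j)
      t[i * C + j] = src[i * ld + j];
  return t;
}

// Hypothetical stand-in for the "compute"/"mma" op: D = A * B + C.
Tile<M, N> mma_compute(const Tile<M, K>& a, const Tile<K, N>& b,
                       const Tile<M, N>& c) {
  Tile<M, N> d = c;  // start from the accumulator
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j)
      for (std::size_t k = 0; k < K; ++k)
        d[i * N + j] += a[i * K + k] * b[k * N + j];
  return d;
}

// Hypothetical stand-in for a "store" op: write an R x C tile back to
// row-major memory whose rows are `ld` elements apart.
template <std::size_t R, std::size_t C>
void store_matrix(const Tile<R, C>& t, float* dst, std::size_t ld) {
  for (std::size_t i = 0; i < R; ++i)
    for (std::size_t j = 0; j < C; ++j)
      dst[i * ld + j] = t[i * C + j];
}
```

A caller would load the A, B, and C tiles, pass them to mma_compute, and store the result, which is the same three-step sequence the new ops express at warp granularity.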