WgmmaMmaAsync Op generates wgmma.mma_async PTX instruction that uses the same registers as read and write with mapping. Therefore, the registers count needs to be increased 2 times for the following registers.
This works changes this:
llvm.inline_asm has_side_effects asm_dialect = att "{wgmma.mma_async... {$0, $1, $2, $3, $4}, $5, $6, p", "=f,=f,=f,=f,0,1,2,3,l,l"
Into this one below. The only different is the number of registers ($8 and $9) that comes after read/write.
llvm.inline_asm has_side_effects asm_dialect = att "{wgmma.mma_async... {$0, $1, $2, $3, $4}, $8, $9, p", "=f,=f,=f,=f,0,1,2,3,l,l"