PTX 6.3 (CUDA-10.0) extends the wmma instructions to support s8/u8/s4/u4/b1 -> s32.
All of the new instructions are still handled mostly by tablegen. I've slightly
refactored the code to drive intrinsic/instruction generation from a master
list of supported variants, so all irregularities have to be implemented in one place only.
The test generation script wmma.py has been refactored in a similar way.
I've added additional checks to verify the sanity of the set of tests generated
by the script for a particular PTX and SM combination.
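
For illustration only, here is a rough sketch of the "master list plus sanity
check" idea described above. It is not the actual wmma.py or tablegen code:
the VARIANTS table, supported() and check_test_set() are hypothetical names,
and the minimum PTX/SM numbers are assumptions that should be checked against
the PTX ISA documentation.

```python
# Hypothetical master list of variants:
# (geometry, input type, accumulator type, min PTX version, min SM).
# The version numbers are illustrative assumptions, not taken from the patch.
VARIANTS = [
    ("m16n16k16", "f16", "f32", 60, 70),
    ("m16n16k16", "s8", "s32", 63, 72),
    ("m16n16k16", "u8", "s32", 63, 72),
    ("m8n8k32", "s4", "s32", 63, 75),
    ("m8n8k32", "u4", "s32", 63, 75),
    ("m8n8k128", "b1", "s32", 63, 75),
]


def supported(ptx, sm):
    """Return the variants available for a given PTX version and SM."""
    return [v for v in VARIANTS if ptx >= v[3] and sm >= v[4]]


def check_test_set(ptx, sm):
    """Sanity-check the variant set that tests would be generated for."""
    variants = supported(ptx, sm)
    in_types = {v[1] for v in variants}
    # Integer variants must appear exactly when PTX and SM both allow them.
    assert ("s8" in in_types) == (ptx >= 63 and sm >= 72)
    # Sub-byte and single-bit variants need the newer SM in this sketch.
    assert ("b1" in in_types) == (ptx >= 63 and sm >= 75)
    # No variant may be listed twice.
    assert len(set(variants)) == len(variants)
    return sorted(in_types)
```

Keeping every irregularity in one table means both the generator and its
checks read from the same source, so a missing or extra variant shows up as a
failed assertion rather than a silently skipped test.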