This revision refactors the implementation of mapping to threads to additionally allow warps and linear ids to be specified.
warp_dims is currently specified along with block_dims as a transform attribute.
Linear ids, on the other hand, use the flattened block_dims to predicate on the first (linearized) k threads.
An additional GPULinearIdMappingAttr is added to the GPU dialect to allow specifying loops mapped to this new scheme.
Various implementation and transform op semantics cleanups are also applied.
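
As a rough illustration of the linear id scheme described above, the sketch below flattens a 3-D thread id against block_dims and predicates on the first k linearized threads. It is not the code in this revision: `Dim3`, `linearize`, and `isActive` are hypothetical names, and the x-fastest linearization order is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration only (not an MLIR API).
struct Dim3 {
  int64_t x, y, z;
};

// Flatten a 3-D thread id into a linear id.
// Assumption: x is the fastest-varying dimension.
int64_t linearize(Dim3 tid, Dim3 blockDims) {
  assert(tid.x >= 0 && tid.x < blockDims.x);
  assert(tid.y >= 0 && tid.y < blockDims.y);
  assert(tid.z >= 0 && tid.z < blockDims.z);
  return tid.x + blockDims.x * (tid.y + blockDims.y * tid.z);
}

// True when the thread is among the first k linearized threads,
// i.e. when the mapped loop body should execute on this thread.
bool isActive(Dim3 tid, Dim3 blockDims, int64_t k) {
  return linearize(tid, blockDims) < k;
}
```

With block_dims of (4, 2, 1) and k = 6, threads whose linear id is 0 through 5 run the mapped loop body and the remaining two threads are predicated off.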
This doesn't seem to have any test cases here. Is this PR ready for review?