Introduce a set of functions that promote a memref argument of a
gpu.func to workgroup memory using memory attribution. The promotion
boils down to additional loops that copy from the original argument to
the attributed memory at the beginning of the function and back at the
end of the function, using all available threads for both copies. The loop
bounds are specified so as to adapt to any size of the workgroup. These
utilities are intended to compose with other existing utilities (loop
coalescing and tiling) in cases where the distribution of work across
threads is uneven, e.g. copying a 2D memref with only the threads along
the "x" dimension. Similarly, specialization of the kernel to specific
launch sizes should be implemented as a separate pass combining constant
propagation and canonicalization.
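
As a rough illustration of the copy loops described above, here is a
minimal Python simulation of a single workgroup operating on a 1D buffer.
It is a sketch under assumptions, not the code the utilities generate: the
names promoted_kernel, thread_indices and double_in_place are hypothetical,
the strided distribution (start at the thread id, step by the workgroup
size) is one common way to make loop bounds independent of the workgroup
size, and the barriers a real kernel would need are only marked by comments.

```python
def thread_indices(tid, size, block_dim):
    """Indices handled by thread `tid`: tid, tid + block_dim, ... below `size`."""
    return list(range(tid, size, block_dim))


def promoted_kernel(arg, block_dim, body):
    """Simulate one workgroup of `block_dim` threads running a promoted kernel.

    `arg` stands in for the original memref argument; `workgroup` stands in
    for the buffer added through memory attribution (both are assumptions
    made for illustration).
    """
    size = len(arg)
    workgroup = [None] * size

    # Copy-in at function entry, distributed over all threads.
    for tid in range(block_dim):
        for i in thread_indices(tid, size, block_dim):
            workgroup[i] = arg[i]
    # A real kernel would synchronize the workgroup here.

    # The original function body now reads and writes the promoted buffer.
    body(workgroup)

    # A real kernel would synchronize again before copying back.
    # Copy-out at function exit, using the same distribution.
    for tid in range(block_dim):
        for i in thread_indices(tid, size, block_dim):
            arg[i] = workgroup[i]


def double_in_place(buf):
    """Hypothetical kernel body: double every element."""
    for i in range(len(buf)):
        buf[i] *= 2


if __name__ == "__main__":
    data = list(range(10))
    # The same loops work whether the workgroup is smaller or larger
    # than the buffer.
    for block_dim in (1, 3, 10, 32):
        buf = list(data)
        promoted_kernel(buf, block_dim, double_in_place)
        print(block_dim, buf)
```

Because each thread starts at its own id and steps by the workgroup size,
threads whose id exceeds the buffer size simply execute zero iterations,
and no launch-size-specific code is required.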

Introduce a simple attribute-driven pass to test the promotion
transformation, since there is currently no heuristic for deciding when
to apply it.

What is "op" here?