It introduces a pattern that swaps linalg.generic + tensor.pack to
tensor.pack + linalg.generic. It requires that all iterator types are
parallel and that the indexing map of the output operand is the
identity. Both restrictions can be relaxed in the future.
The user can decide whether the propagation should be applied or not by
passing a control function.
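
The two preconditions and the control function can be modeled outside
MLIR as follows. This is a minimal sketch in Python under stated
assumptions, not the pattern's implementation: GenericOp, is_identity,
and can_propagate_pack are hypothetical names, and an indexing map is
simplified to a tuple of loop-dimension positions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


# Hypothetical stand-in for a linalg.generic op; this is not an MLIR API.
@dataclass
class GenericOp:
    name: str
    iterator_types: List[str]     # e.g. ["parallel", "parallel"]
    output_map: Tuple[int, ...]   # result i of the map is loop dim output_map[i]


def is_identity(index_map: Tuple[int, ...], num_loops: int) -> bool:
    """True when the map sends (d0, ..., dN-1) to (d0, ..., dN-1)."""
    return tuple(index_map) == tuple(range(num_loops))


def can_propagate_pack(op: GenericOp,
                       control_fn: Callable[[GenericOp], bool]) -> bool:
    """Model of the checks described above; all three must hold."""
    num_loops = len(op.iterator_types)
    all_parallel = all(t == "parallel" for t in op.iterator_types)
    identity_output = is_identity(op.output_map, num_loops)
    # The user-supplied control function has the final say.
    return all_parallel and identity_output and control_fn(op)


elementwise = GenericOp("elementwise", ["parallel", "parallel"], (0, 1))
reduction = GenericOp("reduction", ["parallel", "reduction"], (0,))
transposed = GenericOp("transposed", ["parallel", "parallel"], (1, 0))

always = lambda op: True
never = lambda op: False
```

In this model only the elementwise op with the permissive control
function qualifies; the reduction fails the iterator-type check, the
transposed op fails the identity check, and `never` vetoes everything.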
This is lacking docs.