Currently, the insertion point for bubbleUpPackOpThroughElemGenericOp
is after the tensor.pack, which means that the new generic is created
right after the tensor.pack. This is inconvenient because we are moving
the position of the generic; the idea is to move pack/unpack around, not
linalg.generics. This PR changes the insertion point to preserve the
position of the generic.
Additionally, it restricts the pattern to fire only if the generic has a
single user (the tensor.pack), to avoid introducing recomputation.
[optional] Is it a heavy check? If so, maybe we should have a simpler check and bail out. We can maybe introduce an aggressive mode if this is needed.