This revision reworks the implementation of transform.fuse_into_containing_op so that it processes
each producer one use at a time.
Support is added for fusing a producer through a foreach_thread shared tensor argument. In that case, the
producer is tiled and fused inside the containing op, and the shared tensor argument is updated to the
producer's unique destination operand.
If no such unique destination operand can be found, the transform fails.
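
As a rough illustration of the shared tensor argument case, here is a small Python toy model. It is not the MLIR C++ implementation: the function names, the string encoding of SSA values, and the `dest_to_result` mapping are all hypothetical, and it assumes each destination operand of the producer is tied to exactly one producer result. It only sketches the unique-destination lookup and the failure mode described above.

```python
from typing import Dict, List, Optional


def unique_destination(dest_to_result: Dict[str, str], result: str) -> Optional[str]:
    """Return the single destination operand tied to `result`, or None.

    `dest_to_result` is a hypothetical stand-in for the producer's
    destination-operand-to-result tying.
    """
    matches = [dest for dest, res in dest_to_result.items() if res == result]
    return matches[0] if len(matches) == 1 else None


def fuse_through_shared_arg(
    shared_args: List[str], arg_index: int, dest_to_result: Dict[str, str]
) -> List[str]:
    """Model the update of one shared tensor argument of the containing op.

    The argument at `arg_index` is assumed to be a producer result. It is
    replaced by the producer's unique destination operand; if there is no
    such operand, the (modelled) transform fails.
    """
    result = shared_args[arg_index]
    dest = unique_destination(dest_to_result, result)
    if dest is None:
        raise ValueError(f"no unique destination operand for {result}")
    updated = list(shared_args)  # leave the caller's list untouched
    updated[arg_index] = dest
    return updated
```

For example, with a producer result `%fill` tied to the destination `%init`, `fuse_through_shared_arg(["%fill"], 0, {"%init": "%fill"})` yields a shared argument list containing `%init` instead of `%fill`. If two destinations map to the same result, or none does, the call raises.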