This revision significantly rewrites hoisting on tensors.
Previously, vector.transfer_read/write and tensor.extract/insert_slice would
be clumped together when looking for candidate pairs.
This significantly increased the complexity of the logic, and the transformation did not apply
to tensor.extract/insert_slice on its own.
The new implementation decouples the cases and starts to cast the problem
as generic matching of subset extract/insert pairs, which will be future-proof when
other such operation pairs are introduced.
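
To make the "matching subset extract/insert" framing concrete, here is a minimal toy sketch in Python. It is not the code in this revision: `SubsetOp`, `same_subset` and `find_matching_pairs` are hypothetical names, and the model assumes static offsets and sizes and a single tensor value per op. The point it illustrates is that each op family is matched only against itself, so tensor.extract/insert_slice pairs are found independently of vector.transfer_read/write pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SubsetOp:
    """Hypothetical stand-in for an op that reads or writes a subset of a tensor."""
    kind: str                 # "extract" or "insert"
    family: str               # e.g. "tensor.slice" or "vector.transfer"
    tensor: str               # name of the tensor value the subset refers to
    offsets: Tuple[int, ...]  # assumed static in this sketch
    sizes: Tuple[int, ...]    # assumed static in this sketch


def same_subset(a: SubsetOp, b: SubsetOp) -> bool:
    # Two ops match only if they are of the same family and address
    # exactly the same subset of the same tensor.
    return (a.family == b.family and a.tensor == b.tensor
            and a.offsets == b.offsets and a.sizes == b.sizes)


def find_matching_pairs(ops: List[SubsetOp]) -> List[Tuple[SubsetOp, SubsetOp]]:
    """Pair each extract with the first later, not yet paired insert on the same subset."""
    pairs = []
    used = set()  # indices of inserts that already belong to a pair
    for i, ext in enumerate(ops):
        if ext.kind != "extract":
            continue
        for j in range(i + 1, len(ops)):
            ins = ops[j]
            if j in used or ins.kind != "insert":
                continue
            if same_subset(ext, ins):
                pairs.append((ext, ins))
                used.add(j)
                break
    return pairs
```

A new family of subset ops would only need to be representable as a `SubsetOp` in this toy model; the pairing loop itself would stay unchanged.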
Lastly, the implementation makes a clear distinction between vector.transfer_read/write, for
which we allow bypassing disjoint subsets, and tensor.extract/insert_slice, for which we
do not yet allow it.
This can be extended in the future and unified once we have subset disjunction implemented more generally.
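
As an illustration of what a more general subset disjunction check could look like, the sketch below tests two hyperrectangular subsets for disjointness. It is an assumption-laden toy, not the analysis used in this revision: `are_disjoint` is a hypothetical helper, and it assumes static offsets and sizes, unit strides, and subsets of equal rank.

```python
from typing import Sequence


def are_disjoint(offsets_a: Sequence[int], sizes_a: Sequence[int],
                 offsets_b: Sequence[int], sizes_b: Sequence[int]) -> bool:
    """Return True if two unit-stride hyperrectangles share no element.

    The subsets are disjoint as soon as their half-open index ranges
    [offset, offset + size) do not overlap along at least one dimension.
    """
    for oa, sa, ob, sb in zip(offsets_a, sizes_a, offsets_b, sizes_b):
        if oa + sa <= ob or ob + sb <= oa:
            return True
    return False
```

With such a check, an access to a provably disjoint subset between a matching extract and insert could be bypassed instead of blocking the hoisting.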
The algorithm can be rewritten to be less of a fixed point with interspersed canonicalizations.
As a consequence, the test explicitly adds a canonicalization to clean up the IR and verify we end up in the same state.
That extra canonicalization revealed that one of the uses in one of the tests was dead, so we fix that test.
The other implementation won't work on buffers, right? I don't think we can retire that until we have an alternative.