Most convolution operations need explicit padding of the input to
ensure all accesses are in bounds. In such cases, a standalone pad
operation can add significant overhead. One way to reduce that
overhead is to fuse the pad operation with the producer of its
source.
A sequence
linalg.generic -> linalg.pad_tensor
can be replaced with
linalg.fill -> tensor.extract_slice -> linalg.generic -> tensor.insert_slice
if the linalg.generic has only parallel iterator types.
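
To illustrate why the two sequences agree, here is a minimal sketch in plain Python on 2-D nested lists. It is not the pattern's implementation. The names elementwise, pad_after_generic and fused_pad are hypothetical, and the 2*v + 1 body is only a stand-in for an arbitrary all-parallel linalg.generic. Low and high padding are assumed to be static, non-negative (rows, cols) pairs.

```python
def elementwise(x):
    """Stand-in for an all-parallel linalg.generic (hypothetical body)."""
    return [[2 * v + 1 for v in row] for row in x]


def pad_after_generic(x, low, high, pad_value):
    """Original sequence: linalg.generic -> linalg.pad_tensor."""
    y = elementwise(x)
    cols = len(y[0]) + low[1] + high[1]
    # Rows of pure padding above the data.
    out = [[pad_value] * cols for _ in range(low[0])]
    # Data rows with padding on the left and right.
    for row in y:
        out.append([pad_value] * low[1] + row + [pad_value] * high[1])
    # Rows of pure padding below the data.
    out.extend([pad_value] * cols for _ in range(high[0]))
    return out


def fused_pad(x, low, high, pad_value):
    """Rewritten sequence: fill -> extract_slice -> generic -> insert_slice."""
    rows, cols = len(x), len(x[0])
    # linalg.fill: a buffer of the padded shape holding only the pad value.
    filled = [[pad_value] * (cols + low[1] + high[1])
              for _ in range(rows + low[0] + high[0])]
    # tensor.extract_slice: the interior region the generic will produce.
    # In this model it only fixes the shape of the result.
    interior = [r[low[1]:low[1] + cols] for r in filled[low[0]:low[0] + rows]]
    # linalg.generic: every element is computed independently of the others.
    result = elementwise(x)
    assert len(result) == len(interior) and len(result[0]) == len(interior[0])
    # tensor.insert_slice: write the result back into the filled buffer.
    for i, row in enumerate(result):
        filled[low[0] + i][low[1]:low[1] + cols] = row
    return filled


x = [[1, 2], [3, 4]]
assert pad_after_generic(x, (1, 1), (1, 2), 0) == fused_pad(x, (1, 1), (1, 2), 0)
```

In this model the rewritten form writes the padding once through the fill and computes the generic directly into the interior, so no separate pass copies the generic's result into a padded buffer.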
..WithProducerGeneric.. to be clear?