linalg.pad_tensor ops frequently arise from cases like
convolution padding, where we want to pad a few zeros at the
boundary. Handwritten kernels handle this boundary case with
pre- and post-if statements that load the referenced scalars at
the boundary and compose them with zeros to form full vectors,
so the bulk of the original tensor's content still goes through
nicely vectorized loads/stores.
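For instance, vector.transfer_read already models this trick at
the IR level: a padding operand fills the out-of-bounds lanes of
a vector load, while in-bounds lanes come straight from memory.
A minimal sketch (illustrative only, not part of this commit):

  %zero = constant 0.0 : f32
  %v = vector.transfer_read %src[%i, %j], %zero
         : memref<?x?xf32>, vector<8xf32>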
Right now this is not possible with linalg, as the
linalg.pad_tensor op stands on its own without being tiled and
fused: when CodeGen'ing towards GPU, it forces a separate kernel,
which requires allocating a new buffer and copying over the
original tensor's content. This buffer allocation and data
copying are unnecessary and can be a major source of latency.
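For reference, such a standalone op looks like the following
(a minimal sketch; the pad amounts and padding value are made
up). Materialized on its own, it implies a fresh buffer plus a
copy of %src:

  %cst = constant 0.0 : f32
  %padded = linalg.pad_tensor %src low[1, 1] high[1, 1] {
  ^bb0(%i: index, %j: index):
    linalg.yield %cst : f32
  } : tensor<?x?xf32> to tensor<?x?xf32>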
This commit is a first step towards making linalg.pad_tensor
compose better with other linalg ops, to enable generating code
as optimized as handwritten kernels. linalg.pad_tensor isn't a
structured linalg op, so it needs specific handling.
I would avoid doing this. It is probably better to build the map
so that it can take either the SSA value or the constant
directly. Creating a std.constant only for the affine.apply to
be canonicalized away is just wasted work.
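A minimal before/after sketch of that suggestion (illustrative;
the map and operands are made up):

  // Wasteful: materialize a std.constant just so canonicalization
  // can later fold it into the affine.apply.
  %c3 = constant 3 : index
  %ub0 = affine.apply affine_map<(d0)[s0] -> (d0 + s0)>(%iv)[%c3]

  // Better: bake the known constant into the map up front; a
  // symbol is only needed when the operand is a genuine SSA value.
  %ub1 = affine.apply affine_map<(d0) -> (d0 + 3)>(%iv)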