This is an archive of the discontinued LLVM Phabricator instance.

[mlir][linalg] Support tiling and fusing linalg.pad_tensor
Abandoned · Public

Authored by antiagainst on May 27 2021, 5:17 AM.

Details

Summary

linalg.pad_tensor ops frequently arise from cases like convolution padding, where we want to pad a few zeros at the tensor's boundary. Handwritten kernels handle this boundary case with pre- and post-if statements that load the referenced scalars at the boundary and compose them with zeros to form full vectors, so the original tensor's interior can still go through nicely vectorized loads and stores.
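For illustration only (not part of the original summary), here is a minimal sketch of such a pad in circa-2021 MLIR syntax, assuming a hypothetical 1x14x14x16 feature map padded by one zero on each spatial edge ahead of a 3x3 convolution; the names and shapes are made up:

  %zero = constant 0.0 : f32
  %padded = linalg.pad_tensor %input low[0, 1, 1, 0] high[0, 1, 1, 0] {
  ^bb0(%i: index, %j: index, %k: index, %l: index):
    // Every out-of-bounds element of the result is filled with zero.
    linalg.yield %zero : f32
  } : tensor<1x14x14x16xf32> to tensor<1x16x16x16xf32>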

Right now this is not possible with linalg, because the linalg.pad_tensor op stands on its own without being tiled and fused: when CodeGen'ing towards GPU, it forces a separate kernel, which requires allocating a new buffer and copying over the original tensor's contents. This buffer allocation and data copying is unnecessary and can be a major source of latency.

This commit is a first step towards making linalg.pad_tensor compose better with other linalg ops, so that we can generate code closer to what handwritten kernels achieve. linalg.pad_tensor isn't a structured linalg op, so it needs dedicated handling.

Diff Detail

Event Timeline

antiagainst created this revision. May 27 2021, 5:17 AM
antiagainst requested review of this revision. May 27 2021, 5:17 AM

Thanks for attacking this, @antiagainst!
I haven't looked at the details yet, but the low/high padding composition looks like what I'd expect.

I have one question before digging deeper: it seems to me this is doing a few things at the same time.
In my mind there is a fundamental building block that rewrites subtensor(pad) into pad(subtensor) with the proper min/max computation gymnastics, and that building block is in turn used for fusion (and later for distribution and other places).
The imperfectly nested structure you currently write yourself as a special case of fusion would then happen automatically through loop hoisting.
Your fusion pattern would then just detect pad -> subtensor -> consumer and use the above building block.
I also expect similar subtensor + X combinations to appear with reshape and concat, so common infrastructure at that level would be useful too.

Have you tried this line of thought?
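For readers unfamiliar with the rewrite sketched in the comment above, the following is an editorial illustration (not from the original discussion) of subtensor(pad) turning into pad(subtensor), written in circa-2021 MLIR syntax with hand-picked static sizes; in general the new offsets, sizes, and low/high amounts require the min/max computations mentioned above:

  // Before: the subtensor is taken from the padded 12x12 result.
  %zero = constant 0.0 : f32
  %pad = linalg.pad_tensor %src low[2, 2] high[2, 2] {
  ^bb0(%i: index, %j: index):
    linalg.yield %zero : f32
  } : tensor<8x8xf32> to tensor<12x12xf32>
  %slice = subtensor %pad[0, 0] [6, 6] [1, 1] : tensor<12x12xf32> to tensor<6x6xf32>

  // After: slice the original source first, then pad only the slice.
  // For this static example the slice covers 2 rows/columns of low padding
  // plus source elements [0, 4), hence low[2, 2] and high[0, 0] below.
  %src_slice = subtensor %src[0, 0] [4, 4] [1, 1] : tensor<8x8xf32> to tensor<4x4xf32>
  %result = linalg.pad_tensor %src_slice low[2, 2] high[0, 0] {
  ^bb0(%i: index, %j: index):
    linalg.yield %zero : f32
  } : tensor<4x4xf32> to tensor<6x6xf32>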

Jing added a subscriber: Jing. Jun 1 2021, 12:00 PM
mravishankar requested changes to this revision. Jun 2 2021, 4:29 PM

Ok, an initial skim of this patch. I actually didn't follow the core logic. Maybe we can chat offline about this.

mlir/lib/Dialect/Linalg/Transforms/Fusion.cpp
338

I would avoid doing this. Instead, it is probably better to just create the map so that it can use either the value or the constant. Creating a std.constant only for the affine.apply to be canonicalized is just wasted work.
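As an editorial illustration of the point (not from the original review), assuming %low is some hypothetical index-typed value computed earlier: instead of emitting a constant that exists only so canonicalization can later fold it into the affine.apply, the known constant can be baked into the map up front:

  // Wasteful: materialize a constant only for canonicalization to fold away.
  %c2 = constant 2 : index
  %sum0 = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%low, %c2)

  // Better: fold the known constant into the map when building it.
  %sum1 = affine.apply affine_map<(d0) -> (d0 + 2)>(%low)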

365

I am having a hard time following the logic here. Maybe rephrase this?

This revision now requires changes to proceed. Jun 2 2021, 4:29 PM
nicolasvasilache resigned from this revision. Jul 28 2021, 2:23 AM

FYI, some of the work that @springerm did should subsume this now.

This functionality has been implemented in D105460 and ExtractSliceOfPadTensorSwapPattern. This revision can be closed.

antiagainst abandoned this revision. Nov 15 2021, 12:42 PM