
[mlir][linalg] Support tiling and fusing linalg.pad_tensor
Needs RevisionPublic

Authored by antiagainst on May 27 2021, 5:17 AM.

Details

Summary

linalg.pad_tensor ops frequently arise from cases like
convolution padding, where we want to pad a few zeros at
the tensor's boundary. Handwritten kernels handle this
boundary case with pre- and post-if statements that load
the referenced scalars at the boundary and compose them
with zeros to form vectors, so the bulk of the original
tensor's content can still go through nicely vectorized
loads and stores.
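
For reference, a minimal sketch of the op in question (the
value names, shapes, and zero padding value are invented for
illustration): it pads a 4x4 tensor with one extra row and
column of zeros at the high boundary.

  %cst = constant 0.0 : f32
  // The region yields the value used for every out-of-bounds position.
  %padded = linalg.pad_tensor %input low[0, 0] high[1, 1] {
  ^bb0(%i: index, %j: index):
    linalg.yield %cst : f32
  } : tensor<4x4xf32> to tensor<5x5xf32>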

Right now this is not possible with linalg, as the
linalg.pad_tensor op stands on its own without being
tiled and fused: when CodeGen'ing towards GPU, it forces
a separate kernel, which requires allocating a new buffer
and copying over the original tensor's content. This
buffer allocation and data copying is unnecessary and can
be a major source of latency.

This commit is a first step towards making linalg.pad_tensor
compose better with other linalg ops, to enable generating
more optimized code like the handwritten kernels described
above. linalg.pad_tensor isn't a structured linalg op, so it
needs specific handling.
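
To make the goal concrete, here is a hypothetical sketch of the
producer/consumer pair this patch targets (linalg.matmul is used
as a stand-in consumer; all names and shapes are invented).
Today the pad_tensor producer stays un-tiled, so its full result
must be materialized before the consumer runs; the aim is to
tile the consumer and pull the padding of each slice into the
tile loop.

  %cst = constant 0.0 : f32
  // Producer: pad the LHS up to the shape the consumer expects.
  %lhs_padded = linalg.pad_tensor %lhs low[0, 0] high[1, 1] {
  ^bb0(%i: index, %j: index):
    linalg.yield %cst : f32
  } : tensor<4x4xf32> to tensor<5x5xf32>
  // Consumer: without fusion, tiling this op leaves the pad above
  // as a separate computation (and, on GPU, a separate kernel plus
  // a buffer allocation and copy).
  %res = linalg.matmul ins(%lhs_padded, %rhs : tensor<5x5xf32>, tensor<5x8xf32>)
                       outs(%init : tensor<5x8xf32>) -> tensor<5x8xf32>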

Diff Detail

Unit Tests: Failed

Time   | Test
320 ms | x64 windows > LLVM.DebugInfo/X86::basic-block-sections-debug-loc-const-value-1.ll
         Script: -- : 'RUN: at line 1'; c:\ws\w6\llvm-project\premerge-checks\build\bin\llc.exe C:\ws\w6\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\basic-block-sections-debug-loc-const-value-1.ll --dwarf-version=4 --basic-block-sections=none -filetype=obj -o C:\ws\w6\llvm-project\premerge-checks\build\test\DebugInfo\X86\Output\basic-block-sections-debug-loc-const-value-1.ll.tmp
330 ms | x64 windows > LLVM.DebugInfo/X86::basic-block-sections-debug-loc-const-value-2.ll
         Script: -- : 'RUN: at line 1'; c:\ws\w6\llvm-project\premerge-checks\build\bin\llc.exe C:\ws\w6\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\basic-block-sections-debug-loc-const-value-2.ll --dwarf-version=4 --basic-block-sections=none -filetype=obj -o C:\ws\w6\llvm-project\premerge-checks\build\test\DebugInfo\X86\Output\basic-block-sections-debug-loc-const-value-2.ll.tmp
320 ms | x64 windows > LLVM.DebugInfo/X86::basic-block-sections-debug-loc-split-range.ll
         Script: -- : 'RUN: at line 1'; c:\ws\w6\llvm-project\premerge-checks\build\bin\llc.exe C:\ws\w6\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\basic-block-sections-debug-loc-split-range.ll --dwarf-version=4 --basic-block-sections=none -filetype=obj -o C:\ws\w6\llvm-project\premerge-checks\build\test\DebugInfo\X86\Output\basic-block-sections-debug-loc-split-range.ll.tmp
380 ms | x64 windows > LLVM.DebugInfo/X86::basic-block-sections-debug-loc.ll
         Script: -- : 'RUN: at line 5'; c:\ws\w6\llvm-project\premerge-checks\build\bin\llc.exe C:\ws\w6\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\basic-block-sections-debug-loc.ll --dwarf-version=4 --basic-block-sections=none -filetype=obj -o C:\ws\w6\llvm-project\premerge-checks\build\test\DebugInfo\X86\Output\basic-block-sections-debug-loc.ll.tmp
390 ms | x64 windows > LLVM.DebugInfo/X86::basic-block-sections-debug-loclist-1.ll
         Script: -- : 'RUN: at line 1'; c:\ws\w6\llvm-project\premerge-checks\build\bin\llc.exe C:\ws\w6\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\basic-block-sections-debug-loclist-1.ll --dwarf-version=4 --basic-block-sections=none -filetype=obj -o C:\ws\w6\llvm-project\premerge-checks\build\test\DebugInfo\X86\Output\basic-block-sections-debug-loclist-1.ll.tmp

View Full Test Results (9 Failed)

Event Timeline

antiagainst created this revision.May 27 2021, 5:17 AM
antiagainst requested review of this revision.May 27 2021, 5:17 AM

Thanks for attacking this @antiagainst !
I haven't looked at the details yet but the low/hi padding composition looks like what I'd expect.

I have one question before digging deeper: it seems to me this is doing a few things at the same time.
In my mind there is a fundamental building block that rewrites subtensor(pad) into pad(subtensor) with the proper min/max computation gymnastics, and that is in turn used for fusion (and later for distribution and other places).
Then the imperfectly nested structure that you currently write yourself as a special case of fusion would happen automatically via loop hoisting.
Your fusion pattern would then just detect pad -> subtensor -> consumer and use the above building block.
I also expect similar subtensor + X patterns to appear with reshape and concat, so common infrastructure at that level would be useful too.

Have you tried this line of thought?
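
For illustration, a rough sketch of the subtensor(pad) ->
pad(subtensor) building block described above, with invented
shapes (%t is a tensor<6x6xf32> input, %cst the f32 padding
value) and with the slice chosen so the min/max gymnastics
degenerate to constants:

  // Before: take a slice out of the already padded tensor.
  %padded = linalg.pad_tensor %t low[0, 0] high[2, 2] {
  ^bb0(%i: index, %j: index):
    linalg.yield %cst : f32
  } : tensor<6x6xf32> to tensor<8x8xf32>
  %slice = subtensor %padded[4, 4] [4, 4] [1, 1]
    : tensor<8x8xf32> to tensor<4x4xf32>

  // After: slice the original tensor first (clamped to its bounds),
  // then pad only that small slice up to the requested size.
  %small = subtensor %t[4, 4] [2, 2] [1, 1]
    : tensor<6x6xf32> to tensor<2x2xf32>
  %slice2 = linalg.pad_tensor %small low[0, 0] high[2, 2] {
  ^bb0(%i: index, %j: index):
    linalg.yield %cst : f32
  } : tensor<2x2xf32> to tensor<4x4xf32>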

Jing added a subscriber: Jing.Jun 1 2021, 12:00 PM
mravishankar requested changes to this revision.Jun 2 2021, 4:29 PM

Ok, a first skim of this patch. I actually didn't follow the core logic. Maybe we can chat offline about this.

mlir/lib/Dialect/Linalg/Transforms/Fusion.cpp
338

I would avoid doing this. Instead, it is probably better to just create the map so that it can use either the value or the constant directly. Creating a std.constant only for the affine.apply to then be canonicalized away is wasted work.
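
A sketch of the difference being pointed at, with invented values
(%iv stands for some loop induction variable): instead of
materializing a constant operand and relying on canonicalization
to clean it up, the known constant can be folded into the affine
map directly.

  // Wasteful: build a std.constant just so the affine.apply can
  // later be canonicalized away.
  %c4 = constant 4 : index
  %sum = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%iv, %c4)

  // Better: bake the known constant into the map up front.
  %sum_folded = affine.apply affine_map<(d0) -> (d0 + 4)>(%iv)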

365

I am having a hard time following the logic here. Maybe rephrase this?

This revision now requires changes to proceed.Jun 2 2021, 4:29 PM