Page MenuHomePhabricator

Generalized PatternMatch & InstSimplify
Needs ReviewPublic

Authored by simoll on Nov 25 2020, 3:01 AM.

Details

Summary
Summary

How can we make InstSimplify work with intrinsics?

The idea here is to pretend that the intrinsics were the actual instruction (eg an fadd) and run the existing InstSimplify code. We stop InstSimplify before it breaks code as soon as we are not sure anymore that the rewrite is compatible with the semantics of the intrinsics.

The tests in this patch show this for contrained fp and vp intrinsics.

This is a work-in-progress re-implementation of the generalized pattern match mechanism of D57504 . That patch also implements InstCombine (and not just simplify).

Background

InstSimplify only works for regular LLVM instructions. Yet, there are more and more intrinsics that mimic regular instructions with a twist.
For example:

  • @llvm.experimental.constrained.fadd allows custom rounding and fp exceptions - however, for default fp settings, it is just an fadd.
  • @llvm.vp.add is a vector-add with a mask and explicit vector length - however, the operation applied to each active lane is just an add.

InstSimplify and InstCombine specify a ton of peephole rewrites to optimize patterns with regular IR instructions.
We'd like to make those pattern-based rewrites work on intrinsics as well.

How?

InstSimplify always works the same: if a pattern matches, it replaces the match root with a pattern leaf (or a constant).
We do two things to make this work:
1.) We add layer of helper classes that let intrinsics pretend to be instructions.
2.) We add a MatcherContext that verifies that a specific pattern match is legal with intrinsics.

However, we want this not just for one kind of intrinsic but different classes (as shown above). To do that, we introduce the notion of a Trait - a Trait is a representation of that extra-property that makes the difference between an instruction and an intrinsic that just pretends to be one.

We define three different traits in this patch:

  • The CFPTrait works on constrained fp intrinsics. The MatcherContext<CFPTrait> verifies all pattern matches that use constrained fp intrinsics with default fp semantics (tonearest, no exceptions).
  • The VPTrait works on VP intrinsics and regular instructions. Eg, a first-class %x = add <8 x i32> %y, %z passes as an add as well as a llvm.vp.add.v8i32(%x, %y, %mask, %evl). Since the masked-out lanes in VP intrinsics deliver an undefined results all matching patterns are automatically legal.
  • The EmptyTrait does not pretend anything. Only first-class FAdd is a FAdd. There are no helper classes but an Instruction is really just an Instruction.
Remarks
  • We get constant-folding for VP intrinsics for free.
  • The constrained fp trait could be extended to non-strict fp exceptions (simplify only).
  • We will build on this framework also for InstCombine - this was also implemented in D57504 .
  • I am not a floating-point expert - i'd be thrilled to learn under which circumstances pattern rewrites that assume default fp semantics apply to other rounding modes.
Implementation Details

The MatcherContext<Trait> starts in an uninitialized state. When a PatternMatch.h pattern is in the process of being matched against a specific instruction, it calls the check(V)/accept(V) methods of the context on all operators in the pattern. As soon as the context returns false, the entire match fails.

The ExtInstruction<Trait>, ExtBinaryOperator<Trait> classes make up the intermediate layer of pretend-classes. The default implementation of those classes assumes that there is an underlying intrinsic class (Trait::Intrinsic).

Diff Detail

Event Timeline

simoll created this revision.Nov 25 2020, 3:01 AM
simoll requested review of this revision.Nov 25 2020, 3:01 AM
simoll edited the summary of this revision. (Show Details)
simoll updated this revision to Diff 307590.Nov 25 2020, 5:20 AM

NFC. Fixed formatting, tidiness.

nhaehnle added inline comments.
llvm/include/llvm/IR/Traits/Traits.h
183

I'm confused: How does this work? Shouldn't there be an isa<typename Trait::Intrinsic>(V) check? Actually, how does this even compile? It seems like a (V) is missing on the cast.

244

Which ones do you have in mind?

simoll added inline comments.Nov 26 2020, 12:32 AM
llvm/include/llvm/IR/Traits/Traits.h
183

Evidently this code is never instantiated or it would not compile.. i'll fix this.

244

I have no specific ext type in mind here. Generally speaking, it depends on the traits and the types they need to override. At the moment, we only define the Ext types required by the CFP and VP traits (and a few more that aren't instantiated..).
In a complete implementation for all foreseeable overrides, you'd need an "Ext" type for all types that are used in "PatternMatch.h". Eg, ExtPossiblyExactOperator,ExtOverflowingBinaryOperator, etc.

spatel added a subscriber: spatel.Dec 15 2020, 11:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 24 2022, 7:49 PM

First, I think this is a good idea and can eventually mitigate the general problem of intrinsics vs. instructions in other LLVM passes.

But I worry we'll end up with too many traits to emulate actual instructions' semantics and we'll go back to the discussions of how much an intrinsic is like an instruction.

As an example, for one type of transformation (say constant folding) an intrinsic-add "acts like an add" (precision), but another transformation (ex loop induction) it doesn't (wrapping semantics).

So, while we can come up with a list of traits that make this one particular match work, we may be faced with a combinatorial number of traits to generalise it to other transformations.

I don't have any particular case in mind, just the general feeling that we'll get stuck half-way through and have to keep a set of traits that can't be easily used by other passes.

But this is not a negative view, just perhaps a request for clarification: how much else did you look at to make sure this can extend to more intrinsics and passes?

Second, there are a lot of clang-format changes on unrelated code lines and it makes it hard to see what's just reformatted and what's really changed.

simoll marked an inline comment as done.Wed, Jun 1, 9:04 AM

First, I think this is a good idea and can eventually mitigate the general problem of intrinsics vs. instructions in other LLVM passes.

Thanks for chiming in!

But I worry we'll end up with too many traits to emulate actual instructions' semantics and we'll go back to the discussions of how much an intrinsic is like an instruction.

As an example, for one type of transformation (say constant folding) an intrinsic-add "acts like an add" (precision), but another transformation (ex loop induction) it doesn't (wrapping semantics).

So, while we can come up with a list of traits that make this one particular match work, we may be faced with a combinatorial number of traits to generalise it to other transformations.

I don't have any particular case in mind, just the general feeling that we'll get stuck half-way through and have to keep a set of traits that can't be easily used by other passes.

But this is not a negative view, just perhaps a request for clarification: how much else did you look at to make sure this can extend to more intrinsics and passes?

I did not look into the generality of the approach beyond InstSimplify/Combine. The traits, as defined right now, are really specific to that use case. InstSimplify/Combine is really where the demand is for the intrinsics as far as i am aware.

Other intrinsic types for InstCombine could be saturating int arithmetic, complex arithmetic and, maybe, matrix intrinsics.

You could try to generalize the trait approach for all kinds of analyses and transformations. I am not sure it's worthwhile though.
If you did, what could happen in the worst case, if specific traits are not general enough and we insisted on generalizing the approach to all passes? If you wrote one trait per set of intrinsics and pass that would get you somewhere in O(#passes x #intrinsic_types) in terms of number of trait classes - that's not really a combinatorial explosion. Even in this scenario, which we would not blindly walk in to, you are getting something in return: those passes start working on the intrinsics.

Second, there are a lot of clang-format changes on unrelated code lines and it makes it hard to see what's just reformatted and what's really changed.

D126783 should help with that.

simoll added a comment.Thu, Jun 2, 8:54 AM

Btw, no matter where you stand on D126889 - the patch lower-cases all InstructionSimplify function names - the diff shows just how many different passes actually rely on instruction simplification. Those passes could potentially benefit from generalized pattern matching (even if its limited to instsimplify/combine).

fhahn added a comment.Tue, Jun 7, 4:31 AM

It looks like unfortunately the patch doesn't apply on current main. Does it have any dependencies or does it just need a rebase?

It looks like unfortunately the patch doesn't apply on current main. Does it have any dependencies or does it just need a rebase?

It needs a rebase. Currently doing that.

kpn added a subscriber: kpn.Wed, Jun 8, 6:25 AM
simoll updated this revision to Diff 435586.Thu, Jun 9, 9:16 AM

Here is the rebased patch for now. Currently two failing tests. This could be instsimplify opportunities enabled by this patch on constrained fp (have to look into it more).

Failed Tests (2):
  LLVM :: Transforms/InstSimplify/constfold-constrained.ll
  LLVM :: Transforms/InstSimplify/floating-point-arithmetic-strictfp.ll

Some comments/thoughts inline.
In the meantime, upstream got new fp patterns that consider the fp environment. This should work just fine with this approach (see my comment on the consider function - we may even DCE fp-environment-aware-patterns for Traits that support fp but do not care about the fp env).

llvm/include/llvm/IR/Traits/Traits.h
268

Note that we can DCE pattern-match paths in the trait-instantiated pattern rewrites, if we make this function more transparent to the compiler.

Not all pattern rewrites make sense for all traits. Eg, the Constrained FP trait does not care about anything but fp arithmetic. If we turn consider into a switch over opcodes right in this header file, say, the compiler may have a chance to detect dead pattern matching paths in the code. The result would be that when eg simplifyAddInst is instantiated for the CFPTrait, the function would be almost empty (since all non-fp opcodes are rejected - and the compiler can (hopefully) discard the int arithmetic patterns for the CFPTrait).

llvm/lib/Analysis/InstructionSimplify.cpp
3694

Ignore. Stale code kept during rebase.

4259

Ignore. Stale code kept during rebase.

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
2253

I am not happy about this change. The PatternMatch abstractions should be entirely transparent to external code. External code would be code outside of PatternMatch or InstSimplify/Combine.

One way to get rid of this change in external code, is to understand that the issue only arises when "our" code calls into "external" code expecting there to be a match function with trait parameter - but never when "external" patterns call into "our" patterns because then everything will default to calling match functions without trait parameter.

llvm/test/Transforms/InstSimplify/fast-math-strictfp.ll
96

This test is actually illformed, according to the LangRef:

"If any FP operation in a function is constrained then they all must be constrained. This is required for correct LLVM IR. "

simoll added inline comments.Fri, Jun 10, 2:08 AM
llvm/test/Transforms/InstSimplify/fpadd_constrained.ll
57

This test is then illformed, too ;-)

nikic added a comment.Fri, Jun 10, 3:29 AM

This fails to build with:

lib/libLLVMInstCombine.a(InstructionCombining.cpp.o):InstructionCombining.cpp:function llvm::InstCombinerImpl::visitGEPOfGEP(llvm::GetElementPtrInst&, llvm::GEPOperator*): error: undefined reference to 'llvm::Value* llvm::simplifyAddInst(llvm::Value*, llvm::Value*, bool, bool, llvm::SimplifyQuery const&, llvm::MatcherContext&)'

Probably due to illegal use of inline. You need to use static inline if you don't add extern inline definitions. (Or possibly a missing explicit template instantiation, but I think that wouldn't be used by InstCombine with this patch.)

I'm generally very skeptical about this proposal. This adds a lot of additional complexity to the area of LLVM that likely already has the highest ratio between lines of code changed and review time / amount of tests. After adding tests for baseline behavior, 16 commuted variants, nowrap/FMF variations, vector splat, undef-splat and non-splat variations, and negative tests for all used conditions (including all pattern match conditions), I definitely would rather not add variation tests for VP and constrained FP intrinsics (with which I am not familiar) as well.

I guess one could make an argument that if one uses the generalized matchers, there is nothing that could possibly go wrong, and there is no need to actually test these pattern or understand their correctness. I'm doubtful that this is actually true though. It's almost certainly not true for InstCombine, but even in InstSimplify I expect that there would be some dangerous interaction with predicated div/rem opcodes. Presumably, a zero value in a masked out lane does not render the program undefined, and as such an InstSimplify fold to a poison value would be incorrect.

Another concern here is compile-time, but I won't be able to test that until there is an actually buildable patch.

A possibly stupid question from someone who has not followed the VP design: Why does the predication need to happen on every single instruction? Naively, I'd assume that predication is only important for instructions that have trapping behavior, which is basically just load/store and div/rem class intrinsics. For everything else, it should be fine to represent this as a normal vector add, and then a single masking operation at the end. One could then propagate that masking backwards in the backend. That should ensure that normal instructions optimize fine, while instructions where the masking is semantically relevant (like div/rem) do not blindly participate in optimizations.

So basically, vp.add(X, Y, M) can be decomposed into vp.mask(add(X, Y), M), where the vp.mask intrinsic can be pushed down and combined with other mask intrinsics. So vp.add(vp.add(X, Y, M), Z, M) becomes vp.mask(add(vp.mask(add(X, Y), M), Z), M) becomes vp.mask(vp.mask(add(add(X, Y), Z), M), M) becomes vp.mask(add(add(X, Y), Z), M), at which point add(add(X, Y), Z) optimizes as usual.

This is really on two separate things: First on generalized pattern matching, the second on vector predication (which is only one of the intrinsics sets benefiting from this).

This fails to build with:

lib/libLLVMInstCombine.a(InstructionCombining.cpp.o):InstructionCombining.cpp:function llvm::InstCombinerImpl::visitGEPOfGEP(llvm::GetElementPtrInst&, llvm::GEPOperator*): error: undefined reference to 'llvm::Value* llvm::simplifyAddInst(llvm::Value*, llvm::Value*, bool, bool, llvm::SimplifyQuery const&, llvm::MatcherContext&)'

Probably due to illegal use of inline. You need to use static inline if you don't add extern inline definitions. (Or possibly a missing explicit template instantiation, but I think that wouldn't be used by InstCombine with this patch.)

It compiles on my system. I am fixing this with the next patch update.

I'm generally very skeptical about this proposal. This adds a lot of additional complexity to the area of LLVM that likely already has the highest ratio between lines of code changed and review time / amount of tests. After adding tests for baseline behavior, 16 commuted variants, nowrap/FMF variations, vector splat, undef-splat and non-splat variations, and negative tests for all used conditions (including all pattern match conditions), I definitely would rather not add variation tests for VP and constrained FP intrinsics (with which I am not familiar) as well.

If we want to optimize all those new intrinsics, matrix, constrained, saturating, then we will have to add new logic to do so.
You can have the complexity elsewhere but you cannot get rid of it. I am advocating here for re-using the existing code instead of replicating the same or similar pattern-rewriting logic elsewhere, once for each intrinsic set.
Regarding coverage, we could auto-generate intrinsic tests from the instruction combiner tests. The test generators could also explicitly target issues as the ones you are describing next (masked out zero-lanes).

I guess one could make an argument that if one uses the generalized matchers, there is nothing that could possibly go wrong, and there is no need to actually test these pattern or understand their correctness. I'm doubtful that this is actually true though. It's almost certainly not true for InstCombine, but even in InstSimplify I expect that there would be some dangerous interaction with predicated div/rem opcodes. Presumably, a zero value in a masked out lane does not render the program undefined, and as such an InstSimplify fold to a poison value would be incorrect.

That's a fair point, the traits have to be carefully crafted as to not allow invalid rewrites. I see a clear strategy here: we start with very strict traits that bail pattern matching early as soon as we are stepping out of our comfort zone. Eg, we can make a trait bail around divisions initially to study the problem better and find the right trait logic to make them work.

Another concern here is compile-time, but I won't be able to test that until there is an actually buildable patch.

Agreed, also code size if you instantiate for too many traits. I see two approaches to mitigate these issues:

  • We can use virtual dispatch instead of template instantiation (or template-instantiate only twice: one for trait-less pattern matching and once again for virtual dispatch).
  • We can have a cmake variable that controls instantation and if your distribution does not care about constrained fp or vp, say, you just don't instantiate for it and won't see compile time or size increases. I was hinting in that direction with the EnabledTraits.def file.

A possibly stupid question from someone who has not followed the VP design: Why does the predication need to happen on every single instruction? Naively, I'd assume that predication is only important for instructions that have trapping behavior, which is basically just load/store and div/rem class intrinsics. For everything else, it should be fine to represent this as a normal vector add, and then a single masking operation at the end. One could then propagate that masking backwards in the backend. That should ensure that normal instructions optimize fine, while instructions where the masking is semantically relevant (like div/rem) do not blindly participate in optimizations.

So basically, vp.add(X, Y, M) can be decomposed into vp.mask(add(X, Y), M), where the vp.mask intrinsic can be pushed down and combined with other mask intrinsics. So vp.add(vp.add(X, Y, M), Z, M) becomes vp.mask(add(vp.mask(add(X, Y), M), Z), M) becomes vp.mask(vp.mask(add(add(X, Y), Z), M), M) becomes vp.mask(add(add(X, Y), Z), M), at which point add(add(X, Y), Z) optimizes as usual.

VP intrinsics exist primarily to have a way to express the dynamic vector length for architectures like RISC-V V extension and SX-Aurora (VE target).
Every single instruction can have a different vector length and that parameter has performance implications (latency).
There exist vector codes that exploit vector length switching for better performance (to blend lanes, to compress sparse vectors and then operate on the dense data, etc).
The mask does not strictly need to be there for every instruction - integer add/sub are examples for this. Some, such as sdiv/srem strictly need it. However, you can exploit the mask by using a compressed_store/expanding_load when a sparse vector register is spilled. Where the mask is not needed you can always pass in an all-ones mask to disable it.
The long term strategy with the VP proposal is to push the mask and evl parameter into something like operand bundles that you can simply tag on regular instructions. Intrinsics are just a bridge technology, if you will. Before we can do that, however, LLVM has to be capable of optimizing masked instructions (we cannot ignore predication for the same reason we cannot have too permissive traits). Making LLVM predication-ready is what prompted generalized pattern matching.

simoll updated this revision to Diff 439024.Wed, Jun 22, 7:54 AM
  • inline -> static inline
  • rebased
nikic added a comment.EditedWed, Jun 22, 9:17 AM

This still fails to link. It looks like explicit template instantiations are missing for simplifyAddInst (they are present for simplifyBinOp for example).

Edit: simplifyFAddInst is missing as well.

simoll updated this revision to Diff 439824.Fri, Jun 24, 11:09 AM
  • Fixed m_ExtractValue pattern (all tests passing now).
  • Add cmake variable to control which traits InstSimplify will be instantiated for (LLVM_OPTIMIZER_TRAITS_TO_INSTANTIATE .. eg VPTrait;CFPTrait).
  • Explicitly instantiate InstSimplify templates.

We can have a cmake variable that controls instantation and if your distribution does not care about constrained fp or vp, say, you just don't instantiate for it and won't see compile time or size increases. I was hinting in that direction with the EnabledTraits.def file.

I don't think this makes a lot sense. There's no way we can disable this from the distribution side if constrained FP and VP are core parts of LLVM (which at least the former is at this point, given that it is exposed by Clang). If you want to do this you'll also have to export variables for llvm-lit, so tests can be disabled based on which traits are enabled. I don't think we want to go there.

I'd really like to have a working patch (with enabled traits) so I can give this a basic compile-time evaluation at least.

We can have a cmake variable that controls instantation and if your distribution does not care about constrained fp or vp, say, you just don't instantiate for it and won't see compile time or size increases. I was hinting in that direction with the EnabledTraits.def file.

I don't think this makes a lot sense. There's no way we can disable this from the distribution side if constrained FP and VP are core parts of LLVM (which at least the former is at this point, given that it is exposed by Clang). If you want to do this you'll also have to export variables for llvm-lit, so tests can be disabled based on which traits are enabled. I don't think we want to go there.

I'd really like to have a working patch (with enabled traits) so I can give this a basic compile-time evaluation at least.

Well, What isn't working on your system? All templates should be instantiated now. Please share your build configuration because static/dylib/shared lib builds of LLVM work fine on my system.
The cmake variable (in the reference patch), let's you run compile time tests with different trait configurations: cmake -DLLVM_OPTIMIZER_TRAITS_TO_INSTANTIATE=all. Whether we actually want a cmake variable is a different story. However, there is precedent in the TARGETS_TO_BUILD cmake variable: Distributions can configure the enabled backends. Yet, there are target-specific intrinsics in every LLVM distribution. I wouldn't be surprised if some distributors removed those manually to make their builds smaller.