Page MenuHomePhabricator

[InstCombine] Add target-specific inst combining
ClosedPublic

Authored by Flakebi on Jun 12 2020, 3:08 AM.

Details

Summary

Targets can combine intrinsics in
TargetTransformInfo::instCombineIntrinsic.
This allows accessing target specific features and combining
instructions only if the target supports certain features.

Diff Detail

Event Timeline

Flakebi created this revision.Jun 12 2020, 3:08 AM
lebedev.ri added a subscriber: lebedev.ri.
lebedev.ri added inline comments.
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
3879

This opens a dangerous floodgates of instcombine not being target-independent canonicalization pass.

To add more context to this, the problem I am facing is that amdgpu image intrinsics are usually called with float arguments. However, on some subtargets/hardware generations it is possible to call them with half arguments.
If llvm is compiling for such a subtarget, it is beneficial to combine

%s32 = fpext half %s to float
call <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(…, float %s32, …)

into

call <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f16(…, half %s, …)

This combines instructions, so I think it belongs into the InstCombine pass. On the other hand, the f16 form of the intrinsics is not available on all targets, so this combination cannot be applied unconditionally but it needs to be gated depending on the target.

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
3879

That is the point of this change, to allow target-dependent combinations in TargetTransformInfo::instCombineIntrinsic.
Imo, all the target specific intrinsic combinations in InstCombineCalls.cpp (x86, amdgpu, etc.) can be moved to their respective target.

I don’t have a great overview of LLVM, so I might be wrong on this.

lebedev.ri added inline comments.Jun 12 2020, 6:23 AM
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
3879

Imo, all the target specific intrinsic combinations in InstCombineCalls.cpp (x86, amdgpu, etc.) can be moved to their respective target.

I agree with that, yes.

The problem i'm seeing is that even having TTI in the pass
"significantly" lowers the barrier of entry for then using
TTI to guard some generic transforms in the instcombine.

This combines instructions, so I think it belongs into the InstCombine pass. On the other hand, the f16 form of the intrinsics is not available on all targets, so this combination cannot be applied unconditionally but it needs to be gated depending on the target.

The fact that this pass recognizes target-specific intrinsics at all is widely regarded as a mistake:
http://lists.llvm.org/pipermail/llvm-dev/2016-July/102317.html

Target-specific transforms should look first at codegen combiners (SDAG or GlobalISel). If that's too late, consider a target-specific IR codegen pass (I think AMDGPU has a few examples of this already). If that's still too late, write a generic IR transform pass that accesses TTI?

This combines instructions, so I think it belongs into the InstCombine pass. On the other hand, the f16 form of the intrinsics is not available on all targets, so this combination cannot be applied unconditionally but it needs to be gated depending on the target.

The fact that this pass recognizes target-specific intrinsics at all is widely regarded as a mistake:
http://lists.llvm.org/pipermail/llvm-dev/2016-July/102317.html

Target-specific transforms should look first at codegen combiners (SDAG or GlobalISel). If that's too late, consider a target-specific IR codegen pass (I think AMDGPU has a few examples of this already). If that's still too late, write a generic IR transform pass that accesses TTI?

The problem with all of these suggestions is that they're likely technically-inferior solutions compared to sitting inside of InstCombine's fixed-point iteration scheme. Honestly, I think that the way we should ensure that InstCombine does not start using TTI to define a canonical form for non-target-specific intrinsics is via documentation and code review. InstCombine has long had logic to deal with target-specific intrinsics (in InstCombineCalls.cpp), and refactoring things so that this logic can live in each backend seems like an improvement to me.

nikic added a subscriber: nikic.Jun 13 2020, 4:11 AM
nikic added inline comments.Jun 13 2020, 5:49 AM
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
150

Actually implementing this would require us to export the InstCombiner class, which is part of InstCombineInternal.h. I don't think we would want to do this in its current form. This would require a larger refactoring to separate out the implementation and API portions of InstCombine.

foad added a subscriber: foad.Jun 16 2020, 6:34 AM

Summarizing the comments, the important points are

  1. Everyone agrees on moving target specific stuff out of Transforms/InstCombine into target specific folders
  2. Keep running the instruction combining in the InstCombine pass, so the fixed-point iteration works

The majority of target specific code is intrinsic combining, there is only one more amdgpu specific part in InstCombineSimplifyDemanded.cpp:SimplifyDemandedVectorElts. Unless someone has an idea on how to implement this in a more generic way, I’ll keep it like in the current diff, only combining intrinsics in TargetTransformInfo::instCombineIntrinsic.

Actually implementing this would require us to export the InstCombiner class, which is part of InstCombineInternal.h. I don't think we would want to do this in its current form. This would require a larger refactoring to separate out the implementation and API portions of InstCombine.

Good point, I’ll try to add that here in the next week.

Flakebi updated this revision to Diff 273054.Jun 24 2020, 8:52 AM

Moved most target specific InstCombine parts to their respective targets.
The largest left-over part in InstCombineCalls.cpp is the code shared between arm and aarch64. Is there a place where code for these targets is shared?

The gist of these changes is in the following files:

  • llvm/include/llvm/Analysis/TargetTransformInfo.h
  • llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
  • llvm/include/llvm/CodeGen/BasicTTIImpl.h
  • llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
  • llvm/lib/Analysis/TargetTransformInfo.cpp
  • llvm/lib/Transforms/InstCombine/InstCombineInternal.h
  • llvm/lib/Transforms/InstCombine/InstructionCombining.cpp

The rest of the changes are moving about 3000 lines out from InstCombine to the targets and slightly adjust them for the new interface, there should be no other changes in there.

As far as I know and I might be wrong, but TargetTransformInfo up til now has only provided information. It doesn't do any transforms itself. Is adding transforms to it the right thing to do?

As far as I know and I might be wrong, but TargetTransformInfo up til now has only provided information. It doesn't do any transforms itself. Is adding transforms to it the right thing to do?

This isn't strictly true. I recently added rewriteIntrinsicWithAddressSpace for example

As far as I know and I might be wrong, but TargetTransformInfo up til now has only provided information. It doesn't do any transforms itself. Is adding transforms to it the right thing to do?

This isn't strictly true. I recently added rewriteIntrinsicWithAddressSpace for example

I stand corrected then.

As far as I know and I might be wrong, but TargetTransformInfo up til now has only provided information. It doesn't do any transforms itself. Is adding transforms to it the right thing to do?

This isn't strictly true. I recently added rewriteIntrinsicWithAddressSpace for example

I stand corrected then.

This may be the only example though. I may have introduced something conceptually new without realizing it. The current use also doesn't exactly make the change. It does introduce new instructions, but the pass is still responsible for doing the replacement/delete of the old value

As far as I know and I might be wrong, but TargetTransformInfo up til now has only provided information. It doesn't do any transforms itself. Is adding transforms to it the right thing to do?

This isn't strictly true. I recently added rewriteIntrinsicWithAddressSpace for example

I stand corrected then.

This may be the only example though. I may have introduced something conceptually new without realizing it. The current use also doesn't exactly make the change. It does introduce new instructions, but the pass is still responsible for doing the replacement/delete of the old value

I guess it also modifies the original instruction in place in some cases

Flakebi updated this revision to Diff 273425.Jun 25 2020, 10:08 AM

Adjust failing clang test, TargetIRAnalysis is run earlier now

Flakebi updated this revision to Diff 273458.Jun 25 2020, 10:59 AM

Rebased, so the automatic builds can run

We've been handling target-specific intrinsics in InstCombine for a long time, and that's the place where they should naturally sit. This is a pretty clean refactoring in my opinion, I'm in favor. It's substantial enough as a change that it should probably receive a heads-up on llvm-dev, though.

I think an interface usable by InstructionSimplify would be helpful too, so I think that would be a separate thing from TTI

This combines instructions, so I think it belongs into the InstCombine pass. On the other hand, the f16 form of the intrinsics is not available on all targets, so this combination cannot be applied unconditionally but it needs to be gated depending on the target.

I don't think this is a great justification for doing anything here. You can always reverse the transform in isel on targets where it isn't supported; adding more IR patterns increases the potential for missed optimizations.

That said, I think moving the handling for target intrinsics into the target makes sense as a cleanup.

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
1445

Is there some way we can check that an intrinsic is actually target-specific, to discourage people from handling generic intrinsics in target-specific ways?

foad added a subscriber: bogner.Jun 30 2020, 1:14 AM
foad added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
1445
Flakebi updated this revision to Diff 274436.Jun 30 2020, 5:47 AM

Rebased and call target-specific combining only for target-specific intrinsics as suggested.
Add Function::isTargetIntrinsic() for this purpose.

lattner requested changes to this revision.Jun 30 2020, 1:24 PM
lattner added a subscriber: lattner.

This looks like a great direction, but please make sure to minimize public implementation details. We don't want the vast majority of instcombine to be visible outside of its library (it is hairy enough as it is :-)

llvm/include/llvm/Analysis/TargetTransformInfo.h
29

Can this be forward declared instead of #include'd?

llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
31

Please minimize #includes in general, thanks :)

47

I would really rather not make this be a public class - this is a very thick interface. Can this be cut down to something much smaller than the implementation details of InstCombine?

If you're curious for a pattern that could be followed, the MLIR AsmParser is a reasonable example. The parser is spread across a bunch of classes in the lib/ directory:
https://github.com/llvm/llvm-project/blob/master/mlir/lib/Parser/Parser.cpp

But then there is a much smaller public API exposed through a header:
https://github.com/llvm/llvm-project/blob/master/mlir/include/mlir/IR/OpImplementation.h#L229

This revision now requires changes to proceed.Jun 30 2020, 1:24 PM
nhaehnle added inline comments.Jul 1 2020, 7:08 AM
llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
47

I agree with the sentiment, but note @Flakebi has split up the InstCombiner class into InstCombiner and InstCombinerImpl classes, which addresses those concerns already as far as I'm concerned.

Looking through the new InstCombiner, aside from methods that are core to the workings of InstCombine (modifying instructions while keeping track of the Worklist) and methods for accessing the analyses, what's left is:

  • A bunch of static methods that should arguably just be global functions in a utils header somewhere.
  • CreateOverflowTuple and CreateNonTerminatorUnreachable

Moving those methods feels sensible, but is likely to touch a lot of code, so I think it would be better to do it in a separate commit.

Flakebi updated this revision to Diff 275617.Jul 6 2020, 2:30 AM

Rebased and removed a few includes as suggested.
Make the TargetTransformInfo a private member of InstCombiner because it should not be used in general inst combines.
Move CreateOverflowTuple out of InstCombiner and make CreateNonTerminatorUnreachable static.

I would really rather not make this be a public class - this is a very thick interface. Can this be cut down to something much smaller than the implementation details of InstCombine?

I agrees that keeping the public interface small is desirable and I tried to do that by splitting the class into InstCombiner – the internal, public interface – and InstCombinerImpl – the actual implementation of the pass.
As far as I understand it, LLVM_LIBRARY_VISIBILITY hides this class so it is not visible outside LLVM?

With this change, inst combining is split across several places, the general InstCombine and all the targets. They do similar things with the difference that the inst combining part inside the targets does only have access to the public InstCombiner interface.
As the target specific parts want to use the same helper methods, these helpers need to be in a public interface (public to the targets, not to LLVM users). The most prominent of these helpers is peekThroughBitcast.

Some of these helper functions are currently not used by targets, so they can be moved to a utils header if desired. In general, I think we want them to be shared, so that not every target has its own set of helpers.

Flakebi updated this revision to Diff 276983.Fri, Jul 10, 4:18 AM

Rebased (no conflicts this time).

Friendly ping for review.

nikic added inline comments.Fri, Jul 10, 9:57 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
526

For all three functions, the calling convention seems rather non-idiomatic for InstCombine. Rather than having an Instruction ** argument and bool result, is there any reason not to have an Instruction * return value, with nullptr indicating that the intrinsic couldn't be simplified?

528

const APInt &DemandedMask?

532

const APInt &DemandedElts?

Flakebi marked an inline comment as done.Fri, Jul 10, 12:22 PM
Flakebi added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
526

Yes, the function must have the option to return a nullptr and prevent that visitCallBase is called or other code is executed after instCombineIntrinsic.
So, somehow the caller must be able to see a difference between 'do nothing, just continue execution' and 'return this Instruction*', where the Instruction* can also be a nullptr.
The return type could be an optional<Instruction*>.

I’ll take a look at your other comments on Monday.

lattner resigned from this revision.Fri, Jul 10, 3:33 PM

Please don't consider me a blocker on this patch, thank you for pushing on it!

Flakebi marked an inline comment as done.Mon, Jul 13, 3:04 AM
Flakebi added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
528

I tried to change it it to to const APInt &DemandedMask but the x86 simplifyDemandedVectorEltsIntrinsic changes DemandedMask, so this function would have to copy it or take a non-const reference.
Looking more into it, SimplifyAndSetOp takes DemandedElts by value too.
An APInt consists of a uint64_t and an unsigned, so it should be 16 Byte in most cases. Only if the represented int is larger than 64 bit, it comes with an allocation. I guess copying should be fine.
If you think it should be a reference anyway, let me know and I’ll change it.

Flakebi updated this revision to Diff 278711.Fri, Jul 17, 3:40 AM

Rebased and added some docs.

Is there anything left that needs to be done before this can be pushed?

foad added inline comments.Fri, Jul 17, 4:46 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
552–555

Did you consider returning std::pair<bool,Instruction*>?

Flakebi updated this revision to Diff 278735.Fri, Jul 17, 5:34 AM

Here you go.

Change return types of TargetTransformInfo::instCombineIntrinsic and others to Optional<Instruction *> and Optional<Value *>.

dmgreen added inline comments.Tue, Jul 21, 2:39 AM
llvm/test/CodeGen/Thumb2/mve-intrinsics/predicates.ll
2

Please use the same triple as llc for any test with "mve" in the title.

Flakebi updated this revision to Diff 279463.Tue, Jul 21, 3:14 AM

Rebased and fix triple for Thumb2 tests as suggested.

nhaehnle accepted this revision.Tue, Jul 21, 10:41 AM

This has had a month of good review that has been addressed, I'd say it's good to go.

This revision is now accepted and ready to land.Tue, Jul 21, 10:41 AM
This revision was automatically updated to reflect the committed changes.

I have a multi-stage, auto-git-bisecting bot that has identifying this commit as the source of a regression on Fedora 32 (x86-64). This commit broke my first stage test (release, no asserts). Might a quick fix happen or do we need to revert this?

FAIL: Clang :: CodeGen/aarch64-bf16-ldst-intrinsics.c (7188 of 67650)
******************** TEST 'Clang :: CodeGen/aarch64-bf16-ldst-intrinsics.c' FAILED ********************
Script:
--
: 'RUN: at line 1';   /tmp/_update_lc/r/bin/clang -cc1 -internal-isystem /tmp/_update_lc/r/lib/clang/12.0.0/include -nostdsysteminc -triple aarch64-arm-none-eabi -target-feature +neon -target-feature +bf16   -O2 -emit-llvm /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c -o - | /tmp/_update_lc/r/bin/FileCheck /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c --check-prefixes=CHECK,CHECK64
: 'RUN: at line 3';   /tmp/_update_lc/r/bin/clang -cc1 -internal-isystem /tmp/_update_lc/r/lib/clang/12.0.0/include -nostdsysteminc -triple armv8.6a-arm-none-eabi -target-feature +neon -target-feature +bf16 -mfloat-abi hard   -O2 -emit-llvm /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c -o - | /tmp/_update_lc/r/bin/FileCheck /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c --check-prefixes=CHECK,CHECK32
--
Exit Code: 1

Command Output (stderr):
--
/home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c:14:13: error: CHECK32: expected string not found in input
// CHECK32: %1 = load <4 x bfloat>, <4 x bfloat>* %0, align 2
            ^
<stdin>:7:52: note: scanning from here
define arm_aapcs_vfpcc <4 x bfloat> @test_vld1_bf16(bfloat* readonly %ptr) local_unnamed_addr #0 {
                                                   ^
<stdin>:10:5: note: possible intended match here
 %vld1 = tail call <4 x bfloat> @llvm.arm.neon.vld1.v4bf16.p0i8(i8* %0, i32 2)
    ^
/home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c:23:13: error: CHECK32: expected string not found in input
// CHECK32: %1 = load <8 x bfloat>, <8 x bfloat>* %0, align 2
            ^
<stdin>:18:53: note: scanning from here
define arm_aapcs_vfpcc <8 x bfloat> @test_vld1q_bf16(bfloat* readonly %ptr) local_unnamed_addr #2 {
                                                    ^
<stdin>:21:5: note: possible intended match here
 %vld1 = tail call <8 x bfloat> @llvm.arm.neon.vld1.v8bf16.p0i8(i8* %0, i32 2)
    ^

Input file: <stdin>
Check file: /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c

-dump-input=help explains the following input dump.

Input was:
<<<<<<
            1: ; ModuleID = '/home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c'
            2: source_filename = "/home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c"
            3: target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
            4: target triple = "armv8.6a-arm-none-eabi"
            5:
            6: ; Function Attrs: nounwind readonly
            7: define arm_aapcs_vfpcc <4 x bfloat> @test_vld1_bf16(bfloat* readonly %ptr) local_unnamed_addr #0 {
check:14'0                                                        X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found
            8: entry:
check:14'0     ~~~~~~
            9:  %0 = bitcast bfloat* %ptr to i8*
check:14'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           10:  %vld1 = tail call <4 x bfloat> @llvm.arm.neon.vld1.v4bf16.p0i8(i8* %0, i32 2)
check:14'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:14'1         ?                                                                          possible intended match
           11:  ret <4 x bfloat> %vld1
check:14'0     ~~~~~~~~~~~~~~~~~~~~~~~
           12: }
check:14'0     ~
           13:
check:14'0     ~
           14: ; Function Attrs: argmemonly nounwind readonly
check:14'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           15: declare <4 x bfloat> @llvm.arm.neon.vld1.v4bf16.p0i8(i8*, i32) #1
check:14'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           16:
check:14'0     ~
           17: ; Function Attrs: nounwind readonly
check:14'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           18: define arm_aapcs_vfpcc <8 x bfloat> @test_vld1q_bf16(bfloat* readonly %ptr) local_unnamed_addr #2 {
check:14'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:23'0                                                         X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found
           19: entry:
check:23'0     ~~~~~~
           20:  %0 = bitcast bfloat* %ptr to i8*
check:23'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           21:  %vld1 = tail call <8 x bfloat> @llvm.arm.neon.vld1.v8bf16.p0i8(i8* %0, i32 2)
check:23'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:23'1         ?                                                                          possible intended match
           22:  ret <8 x bfloat> %vld1
check:23'0     ~~~~~~~~~~~~~~~~~~~~~~~
           23: }
check:23'0     ~
           24:
check:23'0     ~
           25: ; Function Attrs: argmemonly nounwind readonly
check:23'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           26: declare <8 x bfloat> @llvm.arm.neon.vld1.v8bf16.p0i8(i8*, i32) #1
check:23'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            .
            .
            .
>>>>>>

--

********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..
********************
Failed Tests (1):
  Clang :: CodeGen/aarch64-bf16-ldst-intrinsics.c


Testing Time: 71.60s
  Unsupported      : 10693
  Passed           : 56854
  Expectedly Failed:   102
  Failed           :     1

Thanks for the notification @davezarzycki, an auto-bisecting bot is cool!

This failure should be fixed in b99898c1e9c5d8bade1d898e84604d3241b0087c.