Targets can combine intrinsics in
TargetTransformInfo::instCombineIntrinsic.
This allows accessing target specific features and combining
instructions only if the target supports certain features.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Time | Test | |
---|---|---|
110 ms | Clang.CodeGen::Unknown Unit Message ("") | |
240 ms | Clang.CodeGen::Unknown Unit Message ("") |
Event Timeline
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp | ||
---|---|---|
3781 | This opens a dangerous floodgates of instcombine not being target-independent canonicalization pass. |
To add more context to this, the problem I am facing is that amdgpu image intrinsics are usually called with float arguments. However, on some subtargets/hardware generations it is possible to call them with half arguments.
If llvm is compiling for such a subtarget, it is beneficial to combine
%s32 = fpext half %s to float call <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f32(…, float %s32, …)
into
call <4 x float> @llvm.amdgcn.image.sample.2d.v4f32.f16(…, half %s, …)
This combines instructions, so I think it belongs into the InstCombine pass. On the other hand, the f16 form of the intrinsics is not available on all targets, so this combination cannot be applied unconditionally but it needs to be gated depending on the target.
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp | ||
---|---|---|
3781 | That is the point of this change, to allow target-dependent combinations in TargetTransformInfo::instCombineIntrinsic. I don’t have a great overview of LLVM, so I might be wrong on this. |
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp | ||
---|---|---|
3781 |
I agree with that, yes. The problem i'm seeing is that even having TTI in the pass |
The fact that this pass recognizes target-specific intrinsics at all is widely regarded as a mistake:
http://lists.llvm.org/pipermail/llvm-dev/2016-July/102317.html
Target-specific transforms should look first at codegen combiners (SDAG or GlobalISel). If that's too late, consider a target-specific IR codegen pass (I think AMDGPU has a few examples of this already). If that's still too late, write a generic IR transform pass that accesses TTI?
The problem with all of these suggestions is that they're likely technically-inferior solutions compared to sitting inside of InstCombine's fixed-point iteration scheme. Honestly, I think that the way we should ensure that InstCombine does not start using TTI to define a canonical form for non-target-specific intrinsics is via documentation and code review. InstCombine has long had logic to deal with target-specific intrinsics (in InstCombineCalls.cpp), and refactoring things so that this logic can live in each backend seems like an improvement to me.
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h | ||
---|---|---|
149 | Actually implementing this would require us to export the InstCombiner class, which is part of InstCombineInternal.h. I don't think we would want to do this in its current form. This would require a larger refactoring to separate out the implementation and API portions of InstCombine. |
Summarizing the comments, the important points are
- Everyone agrees on moving target specific stuff out of Transforms/InstCombine into target specific folders
- Keep running the instruction combining in the InstCombine pass, so the fixed-point iteration works
The majority of target specific code is intrinsic combining, there is only one more amdgpu specific part in InstCombineSimplifyDemanded.cpp:SimplifyDemandedVectorElts. Unless someone has an idea on how to implement this in a more generic way, I’ll keep it like in the current diff, only combining intrinsics in TargetTransformInfo::instCombineIntrinsic.
Actually implementing this would require us to export the InstCombiner class, which is part of InstCombineInternal.h. I don't think we would want to do this in its current form. This would require a larger refactoring to separate out the implementation and API portions of InstCombine.
Good point, I’ll try to add that here in the next week.
Moved most target specific InstCombine parts to their respective targets.
The largest left-over part in InstCombineCalls.cpp is the code shared between arm and aarch64. Is there a place where code for these targets is shared?
The gist of these changes is in the following files:
- llvm/include/llvm/Analysis/TargetTransformInfo.h
- llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
- llvm/include/llvm/CodeGen/BasicTTIImpl.h
- llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
- llvm/lib/Analysis/TargetTransformInfo.cpp
- llvm/lib/Transforms/InstCombine/InstCombineInternal.h
- llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
The rest of the changes are moving about 3000 lines out from InstCombine to the targets and slightly adjust them for the new interface, there should be no other changes in there.
As far as I know and I might be wrong, but TargetTransformInfo up til now has only provided information. It doesn't do any transforms itself. Is adding transforms to it the right thing to do?
This isn't strictly true. I recently added rewriteIntrinsicWithAddressSpace for example
This may be the only example though. I may have introduced something conceptually new without realizing it. The current use also doesn't exactly make the change. It does introduce new instructions, but the pass is still responsible for doing the replacement/delete of the old value
We've been handling target-specific intrinsics in InstCombine for a long time, and that's the place where they should naturally sit. This is a pretty clean refactoring in my opinion, I'm in favor. It's substantial enough as a change that it should probably receive a heads-up on llvm-dev, though.
I think an interface usable by InstructionSimplify would be helpful too, so I think that would be a separate thing from TTI
This combines instructions, so I think it belongs into the InstCombine pass. On the other hand, the f16 form of the intrinsics is not available on all targets, so this combination cannot be applied unconditionally but it needs to be gated depending on the target.
I don't think this is a great justification for doing anything here. You can always reverse the transform in isel on targets where it isn't supported; adding more IR patterns increases the potential for missed optimizations.
That said, I think moving the handling for target intrinsics into the target makes sense as a cleanup.
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp | ||
---|---|---|
1444 ↗ | (On Diff #273458) | Is there some way we can check that an intrinsic is actually target-specific, to discourage people from handling generic intrinsics in target-specific ways? |
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp | ||
---|---|---|
1444 ↗ | (On Diff #273458) | That was the intent of @bogner's rG92a8c6112c6571112e8b622bfddc7e4d1685a6fe. |
Rebased and call target-specific combining only for target-specific intrinsics as suggested.
Add Function::isTargetIntrinsic() for this purpose.
This looks like a great direction, but please make sure to minimize public implementation details. We don't want the vast majority of instcombine to be visible outside of its library (it is hairy enough as it is :-)
llvm/include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
29 | Can this be forward declared instead of #include'd? | |
llvm/include/llvm/Transforms/InstCombine/InstCombiner.h | ||
30 ↗ | (On Diff #274436) | Please minimize #includes in general, thanks :) |
46 ↗ | (On Diff #274436) | I would really rather not make this be a public class - this is a very thick interface. Can this be cut down to something much smaller than the implementation details of InstCombine? If you're curious for a pattern that could be followed, the MLIR AsmParser is a reasonable example. The parser is spread across a bunch of classes in the lib/ directory: But then there is a much smaller public API exposed through a header: |
llvm/include/llvm/Transforms/InstCombine/InstCombiner.h | ||
---|---|---|
46 ↗ | (On Diff #274436) | I agree with the sentiment, but note @Flakebi has split up the InstCombiner class into InstCombiner and InstCombinerImpl classes, which addresses those concerns already as far as I'm concerned. Looking through the new InstCombiner, aside from methods that are core to the workings of InstCombine (modifying instructions while keeping track of the Worklist) and methods for accessing the analyses, what's left is:
Moving those methods feels sensible, but is likely to touch a lot of code, so I think it would be better to do it in a separate commit. |
Rebased and removed a few includes as suggested.
Make the TargetTransformInfo a private member of InstCombiner because it should not be used in general inst combines.
Move CreateOverflowTuple out of InstCombiner and make CreateNonTerminatorUnreachable static.
I would really rather not make this be a public class - this is a very thick interface. Can this be cut down to something much smaller than the implementation details of InstCombine?
I agrees that keeping the public interface small is desirable and I tried to do that by splitting the class into InstCombiner – the internal, public interface – and InstCombinerImpl – the actual implementation of the pass.
As far as I understand it, LLVM_LIBRARY_VISIBILITY hides this class so it is not visible outside LLVM?
With this change, inst combining is split across several places, the general InstCombine and all the targets. They do similar things with the difference that the inst combining part inside the targets does only have access to the public InstCombiner interface.
As the target specific parts want to use the same helper methods, these helpers need to be in a public interface (public to the targets, not to LLVM users). The most prominent of these helpers is peekThroughBitcast.
Some of these helper functions are currently not used by targets, so they can be moved to a utils header if desired. In general, I think we want them to be shared, so that not every target has its own set of helpers.
llvm/include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
537 | For all three functions, the calling convention seems rather non-idiomatic for InstCombine. Rather than having an Instruction ** argument and bool result, is there any reason not to have an Instruction * return value, with nullptr indicating that the intrinsic couldn't be simplified? | |
539 | const APInt &DemandedMask? | |
543 | const APInt &DemandedElts? |
llvm/include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
537 | Yes, the function must have the option to return a nullptr and prevent that visitCallBase is called or other code is executed after instCombineIntrinsic. I’ll take a look at your other comments on Monday. |
llvm/include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
539 | I tried to change it it to to const APInt &DemandedMask but the x86 simplifyDemandedVectorEltsIntrinsic changes DemandedMask, so this function would have to copy it or take a non-const reference. |
Rebased and added some docs.
Is there anything left that needs to be done before this can be pushed?
llvm/include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
541–544 | Did you consider returning std::pair<bool,Instruction*>? |
Here you go.
Change return types of TargetTransformInfo::instCombineIntrinsic and others to Optional<Instruction *> and Optional<Value *>.
llvm/test/CodeGen/Thumb2/mve-intrinsics/predicates.ll | ||
---|---|---|
2 ↗ | (On Diff #278735) | Please use the same triple as llc for any test with "mve" in the title. |
This has had a month of good review that has been addressed, I'd say it's good to go.
I have a multi-stage, auto-git-bisecting bot that has identifying this commit as the source of a regression on Fedora 32 (x86-64). This commit broke my first stage test (release, no asserts). Might a quick fix happen or do we need to revert this?
FAIL: Clang :: CodeGen/aarch64-bf16-ldst-intrinsics.c (7188 of 67650) ******************** TEST 'Clang :: CodeGen/aarch64-bf16-ldst-intrinsics.c' FAILED ******************** Script: -- : 'RUN: at line 1'; /tmp/_update_lc/r/bin/clang -cc1 -internal-isystem /tmp/_update_lc/r/lib/clang/12.0.0/include -nostdsysteminc -triple aarch64-arm-none-eabi -target-feature +neon -target-feature +bf16 -O2 -emit-llvm /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c -o - | /tmp/_update_lc/r/bin/FileCheck /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c --check-prefixes=CHECK,CHECK64 : 'RUN: at line 3'; /tmp/_update_lc/r/bin/clang -cc1 -internal-isystem /tmp/_update_lc/r/lib/clang/12.0.0/include -nostdsysteminc -triple armv8.6a-arm-none-eabi -target-feature +neon -target-feature +bf16 -mfloat-abi hard -O2 -emit-llvm /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c -o - | /tmp/_update_lc/r/bin/FileCheck /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c --check-prefixes=CHECK,CHECK32 -- Exit Code: 1 Command Output (stderr): -- /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c:14:13: error: CHECK32: expected string not found in input // CHECK32: %1 = load <4 x bfloat>, <4 x bfloat>* %0, align 2 ^ <stdin>:7:52: note: scanning from here define arm_aapcs_vfpcc <4 x bfloat> @test_vld1_bf16(bfloat* readonly %ptr) local_unnamed_addr #0 { ^ <stdin>:10:5: note: possible intended match here %vld1 = tail call <4 x bfloat> @llvm.arm.neon.vld1.v4bf16.p0i8(i8* %0, i32 2) ^ /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c:23:13: error: CHECK32: expected string not found in input // CHECK32: %1 = load <8 x bfloat>, <8 x bfloat>* %0, align 2 ^ <stdin>:18:53: note: scanning from here define arm_aapcs_vfpcc <8 x bfloat> @test_vld1q_bf16(bfloat* readonly %ptr) local_unnamed_addr #2 { ^ <stdin>:21:5: note: possible intended match here %vld1 = tail call <8 x bfloat> @llvm.arm.neon.vld1.v8bf16.p0i8(i8* %0, i32 2) ^ Input file: <stdin> Check file: /home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c -dump-input=help explains the following input dump. Input was: <<<<<< 1: ; ModuleID = '/home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c' 2: source_filename = "/home/dave/ro_s/lp/clang/test/CodeGen/aarch64-bf16-ldst-intrinsics.c" 3: target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64" 4: target triple = "armv8.6a-arm-none-eabi" 5: 6: ; Function Attrs: nounwind readonly 7: define arm_aapcs_vfpcc <4 x bfloat> @test_vld1_bf16(bfloat* readonly %ptr) local_unnamed_addr #0 { check:14'0 X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found 8: entry: check:14'0 ~~~~~~ 9: %0 = bitcast bfloat* %ptr to i8* check:14'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 10: %vld1 = tail call <4 x bfloat> @llvm.arm.neon.vld1.v4bf16.p0i8(i8* %0, i32 2) check:14'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ check:14'1 ? possible intended match 11: ret <4 x bfloat> %vld1 check:14'0 ~~~~~~~~~~~~~~~~~~~~~~~ 12: } check:14'0 ~ 13: check:14'0 ~ 14: ; Function Attrs: argmemonly nounwind readonly check:14'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 15: declare <4 x bfloat> @llvm.arm.neon.vld1.v4bf16.p0i8(i8*, i32) #1 check:14'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 16: check:14'0 ~ 17: ; Function Attrs: nounwind readonly check:14'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 18: define arm_aapcs_vfpcc <8 x bfloat> @test_vld1q_bf16(bfloat* readonly %ptr) local_unnamed_addr #2 { check:14'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ check:23'0 X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found 19: entry: check:23'0 ~~~~~~ 20: %0 = bitcast bfloat* %ptr to i8* check:23'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 21: %vld1 = tail call <8 x bfloat> @llvm.arm.neon.vld1.v8bf16.p0i8(i8* %0, i32 2) check:23'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ check:23'1 ? possible intended match 22: ret <8 x bfloat> %vld1 check:23'0 ~~~~~~~~~~~~~~~~~~~~~~~ 23: } check:23'0 ~ 24: check:23'0 ~ 25: ; Function Attrs: argmemonly nounwind readonly check:23'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 26: declare <8 x bfloat> @llvm.arm.neon.vld1.v8bf16.p0i8(i8*, i32) #1 check:23'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ . . . >>>>>> -- ******************** Testing: 0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90.. ******************** Failed Tests (1): Clang :: CodeGen/aarch64-bf16-ldst-intrinsics.c Testing Time: 71.60s Unsupported : 10693 Passed : 56854 Expectedly Failed: 102 Failed : 1
Thanks for the notification @davezarzycki, an auto-bisecting bot is cool!
This failure should be fixed in b99898c1e9c5d8bade1d898e84604d3241b0087c.
Can this be forward declared instead of #include'd?