Previously in D126043, the transformation was added and guarded by an option.
This commit attempts to create an TTI and enable it for the RISC-V backend.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Benchmark results of Spec2k6 on FPGA with our downstream compiler under the patch applied show difference larger than just fluctuation in the following benchmarks and little difference for the others.
- 2.6% runtime improvement in perlbench
- 1.5% runtime improvement in hmmer
- 1.3% runtime regression in sjeng
Could you elaborate why a target-specific option is needed? Shouldn't it be beneficial for all targets and be enabled independently?
Enabling in all targets will require a lot of changes in test cases across targets so I think I should divide and conquer here.
Oh I see, there are ~28 LSR tests failing and ~50 codegen tests (with AArch64, ARM & X86 backends).
Are you confident that the patch is working as expected in all cases?
My main worry is that by not updating all tests we may miss bugs in the code. I have no idea if all the assembly changes in the codegen tests for RISCV are beneficial, but I spot-checked the impact on some other tests and both llvm/test/CodeGen/Thumb2/LowOverheadLoops/vcmp-vpst-combination.ll and llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-fabs.ll seem to regress with AllowDropSolutionIfLessProfitable set to `true. IIUC this might indicate that the cost estimate may not be working as expected in all cases.
@fhahn Thank you for checking this in the Arm backend.
I think this transformation makes sense for all targets and regression should come from an insufficient cost model. The fact that improvement in RISC-V is observed and Arm is producing regressed loops supports the TTI approach.
The regressed result [3] of CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll shows that the {vector contiguous load/store + post-increment instruction} vldrw.u32 and vstrw.32 was not leveraged efficiently, which is why there are two more add.w instructions for the pointers. Observing on the IR that produced the regressed loop [0], I would say that the lowering does not successfully recognize the pattern of vector load/store instructions using values of the gep instruction that is indexed by the primary IV. The cost model logs [1] make sense to me since address mode CAN be folded and its the codegen's reponsibility to recognize the pattern. The original IR that is generated after LSR is shown below [2] and I think [0] is capable of producing the same codegen with some additional pattern recognition.
I can create another patch to enable it in Arm so we can get attention from guys in the Arm backend, but at the same time I think the regression here should not be blocking the landing of this particular patch which only affects RISC-V.
[0] LLVM IR after LSR with -lsr-drop-solution enabled for CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll
*** Code after LSR *** define arm_aapcs_vfpcc void @fabs(float* noalias nocapture readonly %pSrcA, float* noalias nocapture %pDst, i32 %blockSize) #0 { entry: %cmp3 = icmp eq i32 %blockSize, 0 br i1 %cmp3, label %while.end, label %vector.ph vector.ph: ; preds = %entry %n.rnd.up = add i32 %blockSize, 3 %n.vec = and i32 %n.rnd.up, -4 br label %vector.body vector.body: ; preds = %vector.body, %vector.ph %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ] %next.gep = getelementptr float, float* %pDst, i32 %index %next.gep13 = getelementptr float, float* %pSrcA, i32 %index %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %blockSize) %0 = bitcast float* %next.gep13 to <4 x float>* %wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %0, i32 4, <4 x i1> %active.lane.mask, <4 x float> undef) %1 = call fast <4 x float> @llvm.fabs.v4f32(<4 x float> %wide.masked.load) %2 = bitcast float* %next.gep to <4 x float>* call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %1, <4 x float>* %2, i32 4, <4 x i1> %active.lane.mask) %index.next = add i32 %index, 4 %3 = icmp eq i32 %index.next, %n.vec br i1 %3, label %while.end.loopexit, label %vector.body while.end.loopexit: ; preds = %vector.body br label %while.end while.end: ; preds = %while.end.loopexit, %entry ret void }
[1] Output log on LSR proposed solution and baseline solution
The chosen solution requires 1 instruction 5 regs, with addrec cost 1, plus 8 setup cost: LSR Use: Kind=ICmpZero, Offsets={0}, widest fixup type: i32 reg({(4 * ((3 + %blockSize) /u 4))<nuw>,+,-4}<%vector.body>) LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32 reg({0,+,4}<%vector.body>) LSR Use: Kind=Address of <4 x float> in addrspace(0), Offsets={0}, widest fixup type: <4 x float>* reg({%pSrcA,+,16}<%vector.body>) LSR Use: Kind=Address of <4 x float> in addrspace(0), Offsets={0}, widest fixup type: <4 x float>* reg({%pDst,+,16}<%vector.body>) LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32 reg(%blockSize) The baseline solution requires 1 instruction 4 regs, with addrec cost 1, plus 7 setup cost lsr-drop-solution: 0 Baseline is more profitable than chosen solution, add option 'lsr-drop-solution' to drop LSR solution.
[2] LLVM IR after LSR without -lsr-drop-solution enabled for CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll
*** Code after LSR *** define arm_aapcs_vfpcc void @fabs(float* noalias nocapture readonly %pSrcA, float* noalias nocapture %pDst, i32 %blockSize) #0 { entry: %cmp3 = icmp eq i32 %blockSize, 0 br i1 %cmp3, label %while.end, label %vector.ph vector.ph: ; preds = %entry %n.rnd.up = add i32 %blockSize, 3 %n.vec = and i32 %n.rnd.up, -4 br label %vector.body vector.body: ; preds = %vector.body, %vector.ph %lsr.iv3 = phi float* [ %scevgep4, %vector.body ], [ %pDst, %vector.ph ] %lsr.iv1 = phi float* [ %scevgep, %vector.body ], [ %pSrcA, %vector.ph ] %lsr.iv = phi i32 [ %lsr.iv.next, %vector.body ], [ %n.vec, %vector.ph ] %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ] %lsr.iv12 = bitcast float* %lsr.iv1 to <4 x float>* %lsr.iv35 = bitcast float* %lsr.iv3 to <4 x float>* %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %blockSize) %wide.masked.load = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* %lsr.iv12, i32 4, <4 x i1> %active.lane.mask, <4 x float> undef) %0 = call fast <4 x float> @llvm.fabs.v4f32(<4 x float> %wide.masked.load) call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %0, <4 x float>* %lsr.iv35, i32 4, <4 x i1> %active.lane.mask) %index.next = add i32 %index, 4 %lsr.iv.next = add i32 %lsr.iv, -4 %scevgep = getelementptr float, float* %lsr.iv1, i32 4 %scevgep4 = getelementptr float, float* %lsr.iv3, i32 4 %1 = icmp eq i32 %lsr.iv.next, 0 br i1 %1, label %while.end.loopexit, label %vector.body while.end.loopexit: ; preds = %vector.body br label %while.end while.end: ; preds = %while.end.loopexit, %entry ret void }
[3] Diff when -lsr-drop-solution is enabled for llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-fabs.ll
git diff ../llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-fabs.ll diff --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-fabs.ll b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-fabs.ll index 66216022d647..48f4d5355599 100644 --- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-fabs.ll +++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-fabs.ll @@ -4,21 +4,25 @@ define arm_aapcs_vfpcc void @fabs(float* noalias nocapture readonly %pSrcA, float* noalias nocapture %pDst, i32 %blockSize) { ; CHECK-LABEL: fabs: ; CHECK: @ %bb.0: @ %entry -; CHECK-NEXT: .save {r7, lr} -; CHECK-NEXT: push {r7, lr} +; CHECK-NEXT: .save {r4, lr} +; CHECK-NEXT: push {r4, lr} ; CHECK-NEXT: cmp r2, #0 ; CHECK-NEXT: it eq -; CHECK-NEXT: popeq {r7, pc} +; CHECK-NEXT: popeq {r4, pc} ; CHECK-NEXT: .LBB0_1: @ %vector.ph +; CHECK-NEXT: movs r3, #0 ; CHECK-NEXT: dlstp.32 lr, r2 ; CHECK-NEXT: .LBB0_2: @ %vector.body ; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1 -; CHECK-NEXT: vldrw.u32 q0, [r0], #16 +; CHECK-NEXT: add.w r4, r0, r3, lsl #2 +; CHECK-NEXT: vldrw.u32 q0, [r4] +; CHECK-NEXT: add.w r12, r1, r3, lsl #2 +; CHECK-NEXT: adds r3, #4 ; CHECK-NEXT: vabs.f32 q0, q0 -; CHECK-NEXT: vstrw.32 q0, [r1], #16 +; CHECK-NEXT: vstrw.32 q0, [r12] ; CHECK-NEXT: letp lr, .LBB0_2 ; CHECK-NEXT: @ %bb.3: @ %while.end -; CHECK-NEXT: pop {r7, pc} +; CHECK-NEXT: pop {r4, pc} entry: %cmp3 = icmp eq i32 %blockSize, 0 br i1 %cmp3, label %while.end, label %vector.ph
I just checked some random failing tests. Another interesting one is llvm/test/CodeGen/X86/2007-03-15-GEP-Idx-Sink.ll which also seems to regress. My main point is that I think we should try to avoid fragmentation between backends for generic features such as this one. Having this enabled by default and ironing out the remaining test issues will be much more beneficial for the overall LLVM project. By making sure is is enabled on heavily used architectures like X86 and AArch64 we also ensure that it gets as much testing as possible.
IMO it may be fine to gradually enable it, but I think we first need to understand what the issues with the other tests are. My concern is that after only enabling it for RISCV it will remain enabled for RISCV only
The regressed result [3] of CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll shows that the {vector contiguous load/store + post-increment instruction} vldrw.u32 and vstrw.32 was not leveraged efficiently, which is why there are two more add.w instructions for the pointers.
I am not too familiar with Thumb2, @dmgreen any thoughts on this?
MVE relies pretty heavily on AMK_PostIndexed from getPreferredAddressingMode. I could easily imagine something could be going wrong there. But I don't see this as profitable for Thumb1 targets or AArch64 either, when ran across a wide set of benchmarks. We certainly know that LSR is important for performance, and there are a number of known issues where it is not optimal.
This is my brief understanding of how LSR works:
- Create a bunch of little formula that look useful (isLegalUse) and can be combined in different ways for loop varying accesses.
- Realise that there are far too many formula combinations and start filtering them out based on heuristics of which ones look unprofitable.
- "Solve" by considering all the remaining formula for which gives the best cost.
- Use this best solution.
The creation and filtering isn't unimportant for profitability. The algorithm as a whole is presumably assuming that certain formula are already ruled out when it comes to costing them, at least undef MVE. Even if they are the existing formula.
The best cost is by default based on the Number of Registers used, followed by AddRecCost, Muls, Adds, Imms and SetupCost. X86 and some other architectures also consider Number of Instructions. The NumRegs often isn't the most important issue in a loop. In the past I had tried altering the cost on Arm/AArch64 to not consider NumRegs as the most important factor, but the performance wasn't good enough to justify it. LSR is pretty fragmented already though between different backends as a result.
Ideally LSR shouldn't need patched like this, even if I'm not against it for other architectures. It sounds like it should be sensible enough. But why isn't the most profitable formula already being considered in the Solve? Is it never generated, or is it filtered out? I don't know RISCV assembly very well. A lot of the examples look like they have more instructions in the loop to me.
@dmgreen Thank you for giving this a look. My understanding to LSR matches yours. The filtering happens after all interesting variations were generated, and the filtering does not necessarily preserves the original ones, which motivates this patch of considering the LSR derived solution to the original existing one. My main motivation was from RISCV/lsr-drop-solution.ll, which the original LSR suggestion does make a worse transformation than the original unchanged IR. This patch is an amendment that comes from the fallacy in filtering.
On the other hand (since it looks like you also have given a decent trace to the code), the current LLVM LSR implementation is not documented anywhere, nor cited from an existing paper. As mentioned, incorrect pruning (filtering) will have us fall into some local minimum. Has anyone every questioned the validity of the current implemented filtering heuristics and considers a different approach?
Sam Parker implemented a lsr-complexity-limit, which can be set much higher to perform less filtering. The complexity needs to be kept under control though if just for compile time, and it is very easy to go over that with unrolled loops or more complex cases. We should be getting the simple cases correct though, and my view is that we could adjust the filtering to be more profitable. It is just a set of heuristics, after all.
I see some intrinsics that I guess are loads and stores. Have you thought about implementing getTgtMemIntrinsic for them? I'm not sure it will help, but the example doesn't look very complex. If it is filtering the wrong solutions I would look into why and if it is possible to prevent it. After all, we could always start with the "bad" solution and then never see the "good" alternative because of it. I'm not really against this patch though, so long as it doesn't get enabled for targets where it is obviously worse. It would be better if the "cost" could be relied upon but that doesn't seem to currently be possible in all cases.
Another case is mentioned that can be resolved by this enabling drop solution. https://github.com/llvm/llvm-project/issues/59366
@dmgreen I would to look deeper into the pass, but my current priorities right now is away from this pass, I will keep this in mind though and I will be comfortable being pinged 3 months from now.
If there is no further objection here, I would like to land this. @fhahn, @dmgreen
I have no objections for anything RISCV related, so long as others who deal with that architecture more agree with it.
Aborting this revision. As mentioned by Florian the righteous target should be to have drop solution be available for ALL targets.