There is a regression in AArch64 after commit https://reviews.llvm.org/rL317368.
Some blocks are considered worthwhile for merging conditional stores, but these merges lead to suboptimal code (PR43205: https://bugs.llvm.org/show_bug.cgi?id=43205).
So additional cost needs to be added for instructions that cannot be represented by a single conditional instruction.
For example:
an add nsw i32 %tmp, 1 can be represented as a cset,
while an or will be split into csel and orr.
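To illustrate the kind of accounting I have in mind, here is a minimal stand-alone sketch; the helper name extraCostWhenPredicated and the opcode list are my own illustration, not part of the patch or of any existing LLVM interface:

#include "llvm/IR/Constants.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Hypothetical helper: extra AArch64 instructions an IR instruction needs
// once its result has to be chosen by a condition.  An increment can fold
// into a single cset/cinc-style instruction, while e.g. an 'or' needs the
// plain orr *plus* a csel to pick between the old and new value.
static unsigned extraCostWhenPredicated(const Instruction &I) {
  switch (I.getOpcode()) {
  case Instruction::Add:
  case Instruction::Sub:
    if (auto *C = dyn_cast<ConstantInt>(I.getOperand(1)))
      if (C->isOne())
        return 0; // representable as one conditional instruction
    return 1;
  case Instruction::Or:
  case Instruction::And:
  case Instruction::Xor:
    return 1;     // logic op + csel
  default:
    return 1;
  }
}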
Details
Reviewers: efriedma, lebedev.ri
Event Timeline
Maybe the cause of this regression is not here and I have been working in the wrong direction. If so, I would really appreciate any clarification.
I'm trying to understand the issue you're seeing... I guess it comes down to something like the following?
Before:
  tst x18, x1
  b.eq .LBB0_10
.LBB0_9:
  orr x16, x16, x18
  add w0, w0, #1
  str xzr, [x13, #56]
.LBB0_10:
  cbz x11, .LBB0_7
After:
  and x2, x18, x1
  orr x3, x16, x18
  tst x18, x1
  orr x2, x2, x11
  cinc w0, w0, ne
  csel x16, x16, x3, eq
  cbz x2, .LBB0_7
I agree, five instructions is probably too many to speculate to eliminate a store. But the patch doesn't really reflect the actual costs here, which largely have to do with the PHI->select transform rather than the actual arithmetic instructions.
Here's the relevant IR after the store merging transform:
if.then:                                  ; preds = %for.body12
  %or = or i64 %res_in7.046, %bit.044
  %inc = add nsw i32 %retval1.243, 1
  br label %if.end

if.end:                                   ; preds = %if.then, %for.body12
  %retval1.3 = phi i32 [ %inc, %if.then ], [ %retval1.243, %for.body12 ]
  %res_in7.1 = phi i64 [ %or, %if.then ], [ %res_in7.046, %for.body12 ]
  br i1 %tobool14, label %for.inc, label %if.then15

if.then15:                                ; preds = %if.end
  br label %for.inc

for.inc:                                  ; preds = %if.then15, %if.end
  %simplifycfg.merge = phi i8* [ null, %if.then15 ], [ null, %if.end ]
  %4 = xor i1 %tobool, true
  %5 = xor i1 %tobool14, true
  %6 = or i1 %4, %5
  br i1 %6, label %7, label %8

7:                                        ; preds = %for.inc
  store i8* %simplifycfg.merge, i8** %proc, align 8
  br label %8

8:                                        ; preds = %for.inc, %7
  %inc18 = add nuw nsw i32 %j.047, 1
  %shl = shl i64 %bit.044, 1
  %exitcond = icmp eq i32 %inc18, 64
  br i1 %exitcond, label %for.cond.cleanup11, label %for.body12
Some of the branches collapse; at this point, we've basically traded a store for a branch which is likely predictable, assuming we turn the "or i1" back into a branch. That's maybe okay, I guess? But then we decide "if.then" is small enough to "predicate" it at the IR level, and we never reverse that decision when it turns out that doesn't simplify anything.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
  179: This makes no sense; integer logic should be cheap.

llvm/test/Transforms/SimplifyCFG/AArch64/check-instr-cost-for-folding.ll
  2 (On Diff #219108): This testcase is way too complicated given the change you're making.
Thank you for the detailed answer.
But then we decide "if.then" is small enough to "predicate" it at the IR level, and we never reverse that decision when it turns out that doesn't simplify anything.
Yes, I totally agree with you.
I agree, five instructions is probably too many to speculate to eliminate a store. But the patch doesn't really reflect the actual costs here, which largely have to do with the PHI->select transform rather than the actual arithmetic instructions.
Two phi nodes are replaced with select instructions in static bool FoldTwoEntryPHINode (which is invoked after mergeConditionalStoreToAddress):
while (PHINode *PN = dyn_cast<PHINode>(BB->begin())) {
  // Change the PHI node into a select instruction.
  Value *TrueVal = PN->getIncomingValue(PN->getIncomingBlock(0) == IfFalse);
  Value *FalseVal = PN->getIncomingValue(PN->getIncomingBlock(0) == IfTrue);
  Value *Sel = Builder.CreateSelect(IfCond, TrueVal, FalseVal, "", InsertPt);
  PN->replaceAllUsesWith(Sel);
  Sel->takeName(PN);
  PN->eraseFromParent();
}
So do we need to add some additional analysis here? That is, should FoldTwoEntryPHINode itself check whether this folding is actually worthwhile?
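Something along these lines is what I have in mind; this is only a sketch written against my understanding of the TTI interface (the function name and the threshold parameter are placeholders, not the real FoldTwoEntryPHINode code):

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/BasicBlock.h"
using namespace llvm;

// Hypothetical profitability pre-check: refuse to turn the two-entry PHIs
// into selects when speculating the whole 'if' block would exceed the
// phi-node-folding budget.
static bool foldingLooksProfitable(BasicBlock *IfBlock, unsigned Threshold,
                                   const TargetTransformInfo &TTI) {
  unsigned Cost = 0;
  for (Instruction &I : *IfBlock) {
    if (I.isTerminator())
      continue;
    Cost += TTI.getUserCost(&I); // cost query written from memory
    if (Cost > Threshold * TargetTransformInfo::TCC_Basic)
      return false;
  }
  return true;
}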
This needs a testcase (check-instr-cost-for-folding.ll isn't it).
Can you see if D67315 already solves whatever issue you are seeing?
Simplify test case. Relevant C code is:
typedef struct {
  int *dummy;
  void *proc;
} ptr_wrapper;

ptr_wrapper *fds;

int do_select(const unsigned long in, unsigned long bit,
              const unsigned long mask) {
  int retval = 0;
  unsigned long res_in = 0;
  for (; bit != 0; bit <<= 1) {
    if (in & bit) {
      res_in |= bit;
      retval++;
      fds->proc = NULL;
    }
    if (mask & 0x1) {
      fds->proc = NULL;
    }
  }
  return retval + res_in;
}
Rework the patch. Allow the store merge, but add an additional check to FoldTwoEntryPHINode: before hoisting all instructions from IfBlock0 and IfBlock1 into DomBlock, we check the number of instructions already in DomBlock.
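Roughly the shape of that check, as a self-contained sketch (the helper name and the idea of skipping debug intrinsics are mine; the actual patch may count differently):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/IntrinsicInst.h"
using namespace llvm;

// Count the instructions already sitting in DomBlock, skipping debug
// intrinsics and the terminator, so FoldTwoEntryPHINode can bail out
// before hoisting even more work into a block that is already large.
static unsigned countExistingInstructions(const BasicBlock &DomBlock) {
  unsigned N = 0;
  for (const Instruction &I : DomBlock)
    if (!isa<DbgInfoIntrinsic>(I) && !I.isTerminator())
      ++N;
  return N;
}

The caller would then compare this count against some threshold before performing the hoist.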
Please upload patches with full context (-U99999)
Patches should be against trunk, not against the previous patch.
I need to change my patch entirely; it needs discussion on the mailing list, and that will take time, so for the moment I will close this revision and create another one later.
Thanks to all.
I really wonder if the SimplifyCFG behavior is correct and you want to instead be tuning the undo transform in the backend.
Yes, I think it is possible that this code should instead be changed later in the backend. It needs more investigation.
I changed my patch: now, before merging blocks, it counts the instructions already in DomBB,
just to be sure that this block is not already too big.
Maybe instead of reusing the phi-node-folding threshold we need some other, separate threshold (see the sketch after this comment).
It is also still not clear whether we should optimize this in the middle end or in the backend;
maybe the backend is the more appropriate place for it. @dmgreen please feel free to shed some light on this problem.
P.S.: In this change I also simplified the testcase.
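If a separate knob turns out to be the right answer, it would presumably be defined like the existing -phi-node-folding-threshold option; the flag name and default below are made-up placeholders:

#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Hypothetical separate threshold limiting how many instructions the
// dominating block may already contain before we give up on folding the
// two-entry PHI.  Flag name and default value are placeholders.
static cl::opt<unsigned> DomBlockFoldingThreshold(
    "dom-block-folding-threshold", cl::Hidden, cl::init(8),
    cl::desc("Maximum number of instructions already in the dominating "
             "block for two-entry PHI folding to be considered"));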
To recall what exactly I am trying to eliminate:
clang merges basic blocks that are too big, and this leads to a performance regression.
Before merging:
  tst x18, x1
  b.eq .LBB0_10
.LBB0_9:
  orr x16, x16, x18
  add w0, w0, #1
  str xzr, [x13, #56]
.LBB0_10:
  cbz x11, .LBB0_7
After merging:
  and x2, x18, x1
  orr x3, x16, x18
  tst x18, x1
  orr x2, x2, x11
  cinc w0, w0, ne
  csel x16, x16, x3, eq
  cbz x2, .LBB0_7
A similar picture can be seen in the test case:
Before merging:
for.body.lr.ph:                           ; preds = %entry
  %0 = load %struct.ptr_wrapper*, %struct.ptr_wrapper** @g_wrapper, align 8
  %proc = getelementptr inbounds %struct.ptr_wrapper, %struct.ptr_wrapper* %0, i64 0, i32 1
  %and2 = and i64 %mask, 1
  %tobool3 = icmp eq i64 %and2, 0
  %and = and i64 %bit, %in
  %tobool = icmp eq i64 %and, 0
  br i1 %tobool, label %if.end, label %if.then

if.then:                                  ; preds = %for.body.lr.ph
  %or = or i64 0, %bit
  %inc = add nsw i32 0, 1
  br label %if.end

if.end:                                   ; preds = %if.then, %for.body.lr.ph
  %retval1.1 = phi i32 [ %inc, %if.then ], [ 0, %for.body.lr.ph ]
  %res_in.1 = phi i64 [ %or, %if.then ], [ 0, %for.body.lr.ph ]
  %1 = xor i1 %tobool, true
  %2 = xor i1 %tobool3, true
  %3 = or i1 %1, %2
  store i8* null, i8** %proc, align 8
  %shl = shl i64 %bit, 1
  %cmp = icmp eq i64 %shl, 0
  br label %for.end
After merging:
for.body.lr.ph:                           ; preds = %entry
  %0 = load %struct.ptr_wrapper*, %struct.ptr_wrapper** @g_wrapper, align 8
  %proc = getelementptr inbounds %struct.ptr_wrapper, %struct.ptr_wrapper* %0, i64 0, i32 1
  %and2 = and i64 %mask, 1
  %tobool3 = icmp eq i64 %and2, 0
  %and = and i64 %bit, %in
  %tobool = icmp eq i64 %and, 0
  %or = or i64 0, %bit
  %inc = add nsw i32 0, 1
  %retval1.1 = select i1 %tobool, i32 0, i32 %inc
  %res_in.1 = select i1 %tobool, i64 0, i64 %or
  %1 = xor i1 %tobool, true
  %2 = xor i1 %tobool3, true
  %3 = or i1 %1, %2
  store i8* null, i8** %proc, align 8
  %shl = shl i64 %bit, 1
  %cmp = icmp eq i64 %shl, 0
  br label %for.end
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
  2429: Not clang-formatted

llvm/test/Transforms/SimplifyCFG/AArch64/check-instr-cost-for-folding.ll
  25–29 (On Diff #230235): I think this testcase is overreduced, which makes it impossible to actually guess what the fix should be.
Fix formatting in SimplifyCFG.cpp and rework the testcase (I reduced it more carefully this time to avoid overreduction).
@lebedev.ri, please take a look.
On X86, branch misprediction costs roughly 10–20 cycles.
Is it dirt cheap on AArch64? (any number?)
Perhaps we need to redefine the threshold in terms of branch misprediction cost?
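To make that concrete: something like the comparison below, where both the misprediction penalty and the misprediction rate are assumed inputs (neither is exposed through TTI today, as far as I know):

// Speculation pays off when always executing the speculated instructions
// is cheaper than the branch's expected misprediction cost.
// SpeculatedCost is in the same abstract units as the current threshold;
// MispredictPenalty (~10-20 cycles on many cores) and MispredictRate are
// hypothetical inputs.
static bool speculationBeatsBranch(unsigned SpeculatedCost,
                                   unsigned MispredictPenalty,
                                   double MispredictRate) {
  return SpeculatedCost < MispredictPenalty * MispredictRate;
}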
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
  2421–2424: So in other words, we'd only perform the fold if the preceding block is not larger than …
Very interesting; I didn't measure misprediction cycles, only execution time.
Is it dirt cheap on AArch64? (any number?)
Excuse me, I didn't get the question :) Are you asking about the number of branch misprediction cycles?
Perhaps we need to redefine the threshold in terms of branch misprediction cost?
That actually sounds very promising, but at the moment I do not know where to get this cost :) Is it supposed to be in the processor description (e.g. for the Cortex-A75: https://developer.arm.com/docs/100403/0301)?
And yet another thought: maybe we could just compare execution latency/throughput for the merged and non-merged variants and choose the variant with the smaller total?
For example (all latency/throughput data is taken from https://developer.arm.com/docs/101398/0200/arm-cortex-a75-software-optimization-guide-v20, the code is taken from https://bugs.llvm.org/show_bug.cgi?id=43205):
Non-merged variant    | Execution Latency | Execution Throughput
---------------------------------------------------------------------
tst x18, x1           | 1                 | 2
b.eq .LBB0_10         | 1                 | 1
.LBB0_9:              |                   |
orr x16, x16, x18     | 1                 | 2
add w0, w0, #1        | 1                 | 2
str xzr, [x13, #56]   | 1                 | 1
.LBB0_10:             |                   |
cbz x11, .LBB0_7      | 1                 | 1
str xzr, [x13,#56]    | 1                 | 1
---------------------------------------------------------------------
Total:                | 7                 | 10
Merged variant        | Execution Latency | Execution Throughput
---------------------------------------------------------------------
and x3, x2, x1        | 1                 | 2
tst x2, x1            | 1                 | 2
orr x5, x11, x3       | 1                 | 2
cinc w0, w0, ne       | 1                 | 2
csel x3, xzr, x2, eq  | 1                 | 2
cbz x5, .LBB0_7       | 1                 | 1
str xzr, [x13,#56]    | 1                 | 1
orr x16, x16, x3      | 1                 | 2
---------------------------------------------------------------------
Total:                | 8                 | 14
In the case above the non-merged variant is better.
Is this a valid approach?
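Mechanically, what I mean is just summing the per-instruction numbers from the optimization guide for each candidate sequence and keeping the cheaper one; a tiny sketch (nothing here is an existing LLVM interface):

#include <vector>

// Per-instruction numbers taken from the core's software optimization
// guide, as in the tables above.
struct InstCost {
  unsigned Latency;
  unsigned Throughput;
};

static unsigned totalLatency(const std::vector<InstCost> &Seq) {
  unsigned Sum = 0;
  for (const InstCost &C : Seq)
    Sum += C.Latency;
  return Sum;
}

// For the tables above this gives 7 (non-merged) vs. 8 (merged), so the
// non-merged sequence would be kept on this metric.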
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
  2421–2424: Yes, agree, using this comparison is a hand-wave as it is :)

    if (Cost > PHINodeFoldingThreshold * TargetTransformInfo::TCC_Basic) {

  So we need a separate threshold here. I will change it today.
I have added a check for the additional instructions which will be generated after select lowering (in the lambda IsWorthwhile in mergeConditionalStoreToAddress); a sketch of the idea follows below.
Update test.
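The shape of the added accounting, as a rough sketch rather than the exact code (it reuses the illustrative extraCostWhenPredicated helper sketched earlier in this thread; the function name and the Budget parameter are placeholders):

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/BasicBlock.h"
using namespace llvm;

// Illustrative helper from the earlier sketch, defined elsewhere.
static unsigned extraCostWhenPredicated(const Instruction &I);

// Sketch of the extra check in the IsWorthwhile lambda of
// mergeConditionalStoreToAddress: besides the plain per-instruction cost,
// charge for instructions that select lowering will split into more than
// one AArch64 instruction.
static bool worthMergingStores(BasicBlock *BB, unsigned Budget,
                               const TargetTransformInfo &TTI) {
  unsigned Cost = 0;
  for (Instruction &I : *BB) {
    if (I.isTerminator())
      continue;
    Cost += TTI.getUserCost(&I);
    Cost += extraCostWhenPredicated(I) * TargetTransformInfo::TCC_Basic;
    if (Cost > Budget)
      return false;
  }
  return true;
}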
I think it could be useful if I add my small runtime benchmark results here.
According to this benchmark, the patched version is about 2x better than the original.
- Compiled with flags: clang -target aarch64-linux-android -O2
- Device with Cortex-A76
To = execution time of original version (average over 1000 iterations)
Tp = execution time of patched version (average over 1000 iterations)

                           To / Tp
Match all branches.        2.11
Match no branches.         2.12
Match IN branch only.      2.12
Match OUT branch only.     2.11
Match EX branch only.      2.07
Match IN & OUT branches.   2.13
Match IN & EX branches.    2.15
Match OUT & EX branches.   2.08
I'm sorry.
I'm sure the problem this is trying to address is real, but I don't have any suggestions for how it should be fixed.
I somewhat think that the current branch->select logic is still too weak, and this problem should be fixed in the backend.