This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SimplifyCFG] Add additional cost for instructions in mergeConditionalStoreToAddress
AbandonedPublic

Authored by kpdev42 on Sep 6 2019, 8:37 AM.

Details

Summary

There is a regression in AArch64 after commit https://reviews.llvm.org/rL317368.
Some blocks are considered worthwile to merge conditional stores, but these merges lead to unoptimal code (bug https://bugs.llvm.org/show_bug.cgi?id=43205 PR#43205 )
So it is needed to add some additional cost to instructions, if these instructions cannot be represented by 1 conditional instructions.
For example:
add nsw i32 %tmp 1 can be represented as cset
or will be split in csel and orr

Diff Detail

Event Timeline

kpdev42 created this revision.Sep 6 2019, 8:37 AM

Maybe the reason of this regression is not here and I worked in wrong direction. If so - I will really appreciate any clarifications about it

I'm trying to understand the issue you're seeing... I guess it comes down to something like the following?

Before:

        tst     x18, x1
        b.eq    .LBB0_10
.LBB0_9:
        orr     x16, x16, x18
        add     w0, w0, #1
        str     xzr, [x13, #56]
.LBB0_10:
        cbz     x11, .LBB0_7

After:

and     x2, x18, x1
orr     x3, x16, x18
tst     x18, x1
orr     x2, x2, x11
cinc    w0, w0, ne
csel    x16, x16, x3, eq
cbz     x2, .LBB0_7

I agree, five instructions is probably too many to speculate to eliminate a store. But the patch doesn't really reflect the actual costs here, which largely have to do with the PHI->select transform rather than the actual arithmetic instructions.

Here's the relevant IR after the store merging transform:

if.then:                                          ; preds = %for.body12
  %or = or i64 %res_in7.046, %bit.044
  %inc = add nsw i32 %retval1.243, 1
  br label %if.end

if.end:                                           ; preds = %if.then, %for.body12
  %retval1.3 = phi i32 [ %inc, %if.then ], [ %retval1.243, %for.body12 ]
  %res_in7.1 = phi i64 [ %or, %if.then ], [ %res_in7.046, %for.body12 ]
  br i1 %tobool14, label %for.inc, label %if.then15

if.then15:                                        ; preds = %if.end
  br label %for.inc

for.inc:                                          ; preds = %if.then15, %if.end
  %simplifycfg.merge = phi i8* [ null, %if.then15 ], [ null, %if.end ]
  %4 = xor i1 %tobool, true
  %5 = xor i1 %tobool14, true
  %6 = or i1 %4, %5
  br i1 %6, label %7, label %8

7:                                                ; preds = %for.inc
  store i8* %simplifycfg.merge, i8** %proc, align 8
  br label %8

8:                                                ; preds = %for.inc, %7
  %inc18 = add nuw nsw i32 %j.047, 1
  %shl = shl i64 %bit.044, 1
  %exitcond = icmp eq i32 %inc18, 64
  br i1 %exitcond, label %for.cond.cleanup11, label %for.body12

Some of the branches collapse; at this point, we've basically traded a store for a branch which is likely predictable, assuming we turn the "or i1" back into a branch. That's maybe okay, I guess? But then we decide "if.then" is small enough to "predicate" it at the IR level, and we never reverse that decision when it turns out that doesn't simplify anything.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
179

This makes no sense; integer logic should be cheap.

llvm/test/Transforms/SimplifyCFG/AArch64/check-instr-cost-for-folding.ll
2

This testcase is way too complicated given the change you're making.

Thank you for detailed answer.

But then we decide "if.then" is small enough to "predicate" it at the IR level, and we never reverse that decision when it turns out that doesn't simplify anything.

Yes, I totally agree with you.

I agree, five instructions is probably too many to speculate to eliminate a store. But the patch doesn't really reflect the actual costs here, which largely have to do with the PHI->select transform rather than the actual arithmetic instructions.

Two phi nodes are replaced with select instructions in static bool FoldTwoEntryPHINode (which is invoked after mergeConditionalStoreToAddress)

while (PHINode *PN = dyn_cast<PHINode>(BB->begin())) {
  // Change the PHI node into a select instruction.
  Value *TrueVal = PN->getIncomingValue(PN->getIncomingBlock(0) == IfFalse);
  Value *FalseVal = PN->getIncomingValue(PN->getIncomingBlock(0) == IfTrue);

  Value *Sel = Builder.CreateSelect(IfCond, TrueVal, FalseVal, "", InsertPt);
  PN->replaceAllUsesWith(Sel);
  Sel->takeName(PN);
  PN->eraseFromParent();
}

So it is needed to add some additional analysis here? I mean - FoldTwoEntryPHINode should make some checks about expediency of this folding?

This needs a testcase (check-instr-cost-for-folding.ll isn't it.)
Can you see if D67315 already solves whatever issue you are seeing?

I checked D67315 , unfortunately it does not help, generated code is the same.

kpdev42 updated this revision to Diff 219480.EditedSep 10 2019, 12:03 AM
kpdev42 edited reviewers, added: lebedev.ri; removed: craig.topper, sanjoy.

Simplify test case. Relevant C code is:

typedef struct {
  int * dummy;
  void *proc;
} ptr_wrapper;

ptr_wrapper * fds;

int do_select(const unsigned long in, 
              unsigned long bit,
              const unsigned long mask)
{
  int retval = 0;
  unsigned long res_in = 0;

  for( ; bit != 0; bit <<= 1)
  {
  	if (in & bit) {
  		res_in |= bit;
  		retval++;
  		fds->proc = NULL;
  	}
  	
  	if (mask & 0x1) {
  		fds->proc = NULL;
  	}
  }

  return retval + res_in;
}
kpdev42 updated this revision to Diff 219564.Sep 10 2019, 10:05 AM

Rework patch. Allow merge store but bring additional check to FoldTwoEntryPHINode: before hoisting all instructions from IfBlock0 and IfBlock1 to DomBlock we check number of instructions in this DomBlock.

Please upload patches with full context (-U99999)
Patches should be as compared to the trunk, not previous patch.

lebedev.ri requested changes to this revision.Sep 11 2019, 3:57 AM
This revision now requires changes to proceed.Sep 11 2019, 3:57 AM

Yes, sure, will upload it in 24 hours, thank you!

kpdev42 abandoned this revision.Sep 12 2019, 6:55 AM

I need to change my patch entirely, it needs discussion in mailing list and it will take time, so at the moment I will close this revision and after some time create another one.
Thanks to all.

I really wonder if the SimplifyCFG behavior is correct and you want to instead be tuning the undo transform in backend..

Yes, I think it is possible, that this code should be changed later in backend. It needs more investigation )

kpdev42 reclaimed this revision.Nov 20 2019, 4:51 AM
This revision now requires changes to proceed.Nov 20 2019, 4:51 AM
kpdev42 updated this revision to Diff 230235.Nov 20 2019, 5:16 AM

I changed my patch, now before merging blocks it calculates existing instructions in DomBB,
just to be sure that this block is not too big already
Maybe instead of using phi-node-folding threshold it is needed to use some other separate threshold.

And also it is still not clear, if we need to optimize this in a middle end or in a backend.
Maybe backend will be more proper place for it. @dmgreen please feel free to bring a light to this problem

P.S.: In this change I also simplified a testcase


Recall, what exactly I trying to eliminate:
clang merges too big basic blocks and it leads to performance regression

Before merging:

        tst     x18, x1
        b.eq    .LBB0_10
.LBB0_9:
        orr     x16, x16, x18
        add     w0, w0, #1
        str     xzr, [x13, #56]
.LBB0_10:
        cbz     x11, .LBB0_7

After merging

and     x2, x18, x1
orr     x3, x16, x18
tst     x18, x1
orr     x2, x2, x11
cinc    w0, w0, ne
csel    x16, x16, x3, eq
cbz     x2, .LBB0_7

Similar picture in test case:
Before merging:

for.body.lr.ph:                                   ; preds = %entry
  %0 = load %struct.ptr_wrapper*, %struct.ptr_wrapper** @g_wrapper, align 8
  %proc = getelementptr inbounds %struct.ptr_wrapper, %struct.ptr_wrapper* %0, i64 0, i32 1
  %and2 = and i64 %mask, 1
  %tobool3 = icmp eq i64 %and2, 0
  %and = and i64 %bit, %in
  %tobool = icmp eq i64 %and, 0
  br i1 %tobool, label %if.end, label %if.then

if.then:                                          ; preds = %for.body.lr.ph
  %or = or i64 0, %bit
  %inc = add nsw i32 0, 1
  br label %if.end

if.end:                                           ; preds = %if.then, %for.body.lr.ph
  %retval1.1 = phi i32 [ %inc, %if.then ], [ 0, %for.body.lr.ph ]
  %res_in.1 = phi i64 [ %or, %if.then ], [ 0, %for.body.lr.ph ]
  %1 = xor i1 %tobool, true
  %2 = xor i1 %tobool3, true
  %3 = or i1 %1, %2
  store i8* null, i8** %proc, align 8
  %shl = shl i64 %bit, 1
  %cmp = icmp eq i64 %shl, 0
  br label %for.end

After merging:

for.body.lr.ph:                                   ; preds = %entry
  %0 = load %struct.ptr_wrapper*, %struct.ptr_wrapper** @g_wrapper, align 8
  %proc = getelementptr inbounds %struct.ptr_wrapper, %struct.ptr_wrapper* %0, i64 0, i32 1
  %and2 = and i64 %mask, 1
  %tobool3 = icmp eq i64 %and2, 0
  %and = and i64 %bit, %in
  %tobool = icmp eq i64 %and, 0
  %or = or i64 0, %bit
  %inc = add nsw i32 0, 1
  %retval1.1 = select i1 %tobool, i32 0, i32 %inc
  %res_in.1 = select i1 %tobool, i64 0, i64 %or
  %1 = xor i1 %tobool, true
  %2 = xor i1 %tobool3, true
  %3 = or i1 %1, %2
  store i8* null, i8** %proc, align 8
  %shl = shl i64 %bit, 1
  %cmp = icmp eq i64 %shl, 0
  br label %for.end

lebedev.ri added inline comments.Nov 20 2019, 5:33 AM
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
2428

Not clang-formatted

llvm/test/Transforms/SimplifyCFG/AArch64/check-instr-cost-for-folding.ll
26–30

I think this testcase is overreduced, which makes it impossible to actually guess what the fix should be.
https://godbolt.org/z/sxmNMd

kpdev42 updated this revision to Diff 230616.Nov 22 2019, 2:27 AM

Fix formatting in SimplifyCFG.cpp and remaster testcase (I reduced it more accurately to avoid overreducing)
Please @lebedev.ri take a look

On X86 branch misprediction is ~10..~20 cycles,
Is it dirt cheap on AArch64? (any number?)
Perhaps we need to redefine the threshold in terms of branch misprediction cost?

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
2420–2423

So in other words we'd only perform the fold only if the preceding block is not larger than
what we'd add via folding. In other words if we are okay flatteing 4-instruction 2-entry PHI,
the dominating BB must contain less than 4 instructions.
That seems awfully hand-wavy to me, i'm afraid :(
It will make the fold not happen in all the cases i'm aware of.

On X86 branch misprediction is ~10..~20 cycles,

Very interesting, I didn't measure cycles of misprediction. Only time of execution.

Is it dirt cheap on AArch64? (any number?)

Excuse me, I didn't get the question :) You asked about number of branch misprediction cycles?

Perhaps we need to redefine the threshold in terms of branch misprediction cost?

It is actually sounds very promising. But at the moment I do not know where to get this cost :) Is it supposed to be in processor description (e.g. for Cortex-75 -https://developer.arm.com/docs/100403/0301 ) ?

And yet another thought: maybe we will just compare execution latency / throughput for merged and non-merged variants and will choose the variant with the smallest total value?

For example (all data about latency/throughput is taken from https://developer.arm.com/docs/101398/0200/arm-cortex-a75-software-optimization-guide-v20 / code taken from https://bugs.llvm.org/show_bug.cgi?id=43205 ):

Non-merged variant
                            | Execution Latency | Execution Throughput
---------------------------------------------------------------------
tst     x18, x1             | 1                 | 2
b.eq    .LBB0_10            | 1                 | 1
.LBB0_9:                    |                   | 
orr     x16, x16, x18       | 1                 | 2
add     w0, w0, #1          | 1                 | 2
str     xzr, [x13, #56]     | 1                 | 1
.LBB0_10:                   |                   | 
cbz     x11, .LBB0_7        | 1                 | 1
str	xzr, [x13,#56]          | 1                 | 1
---------------------------------------------------------------------
Total:                      | 7                 | 10
Merged variant
                            | Execution Latency | Execution Throughput
---------------------------------------------------------------------
and	x3, x2, x1              | 1                 | 2
tst	x2, x1                  | 1                 | 2
orr	x5, x11, x3             | 1                 | 2
cinc	w0, w0, ne          | 1                 | 2
csel	x3, xzr, x2, eq     | 1                 | 2
cbz	x5, .LBB0_7             | 1                 | 1
str	xzr, [x13,#56]          | 1                 | 1
orr	x16, x16, x3            | 1                 | 2
---------------------------------------------------------------------
Total:                      | 8                 | 14

In case above non-merged variant is better.

Is it valid approach?

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
2420–2423

Yes, agree, using this comparison is a hand-wave as it is :)

if (Cost > PHINodeFoldingThreshold * TargetTransformInfo::TCC_Basic) {

So we need a separate threshold here. I will change it today

kpdev42 updated this revision to Diff 234845.EditedDec 20 2019, 2:53 AM

I have added check for additional instructions which will be generated after select lowering (In lambda IsWorthwile in mergeConditionalStoreToAddress)
Update test.

I think it could be useful, if I add my small runtime benchmark here:

According to this benchmark patched version better than original for about 2 times

  • Compiled with flags: clang -target aarch64-linux-android -O2
  • Device with CortexA76
To = execution time of original version (average for 1000 iterations)
Tp = execution time of patched version (average for 1000 iterations)

                          To / Tp 
Match all branches.       2.11
Match no branches.        2.12
Match IN branch only.     2.12
Match OUT branch only.    2.11
Match EX branch only.     2.07
Match IN & OUT branches.  2.13
Match IN & EX branches.   2.15
Match OUT & EX branches.  2.08
lebedev.ri resigned from this revision.Jan 25 2020, 3:24 AM

I'm sorry.
I'm sure the problem this is trying to address is real, but i don't have any suggestions how it should be fixed.
I somewhat think that the current branch->select logic is still too weak, and this problem should be fixed in backend.

kpdev42 marked 2 inline comments as done.Oct 25 2021, 1:54 AM
kpdev42 abandoned this revision.Dec 23 2022, 5:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 23 2022, 5:34 AM