This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Coalesce Copy Zero during instruction selection
ClosedPublic

Authored by haicheng on Jul 31 2017, 12:31 PM.

Details

Summary

Now, move constant zero is lowered into two MIRs after instruction selection

v1 = copy wzr/xzr
v2 = copy v1

These two copies are coalesced in a later pass.

One problem of this is in Machine-Sink pass which runs before the copy propogation pass. Machine-sink can break a critical edge if at least two cheap MIRs can be sinked to that path. Thus, we may have a MBB which has only one mov wzr/xzr instruction. This can make block placement difficult to do the layout. For example, the test case below, copy-zero-reg.ll, has a loop unrolled by two. Sinking the mov wzr/xzr makes it impossible to find a fallthrough for every MBB and the currently generated code has a block looks like this

// BB#1:
	mov	 w9, wzr
	cbnz	w8, .LBB0_5
	b	.LBB0_6

This patch coalesce two COPYs during instruction selection. Below is the performance impacted by this patch

spec2000/vpr+1.4%
spec2006/libquantum+4.9%
spec2006/perlbench+1.2%
spec2017/blender+3.5%
spec2017/deepsjeng-1.1%

Diff Detail

Repository
rL LLVM

Event Timeline

haicheng created this revision.Jul 31 2017, 12:31 PM
haicheng edited the summary of this revision. (Show Details)Jul 31 2017, 12:31 PM
haicheng edited the summary of this revision. (Show Details)
haicheng edited the summary of this revision. (Show Details)Jul 31 2017, 4:12 PM
This revision was automatically updated to reflect the committed changes.
haicheng reopened this revision.Aug 1 2017, 5:43 PM

Sorry. This patch is not committed. I made a mistake when committing another patch D36174.

haicheng updated this revision to Diff 109257.Aug 1 2017, 5:50 PM

Upload the correct patch. Please take a look. Sorry for the confusion.

gberry added inline comments.Aug 1 2017, 6:18 PM
lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2768

Typo: 'conatants' -> 'constants'

test/CodeGen/AArch64/arm64-addr-type-promotion.ll
31

Is this a regression?

haicheng added inline comments.Aug 1 2017, 7:31 PM
test/CodeGen/AArch64/arm64-addr-type-promotion.ll
31

This corresponds to block if.end25. Do you think it is better to create a new block to sink mov wzr here? I think neither sinking or not is a clear win since it depends on which path the control flow takes. Also, the cost of mov wzr can be as cheap as zero. So, I am bias to keep the current CFG.

gberry added inline comments.Aug 2 2017, 1:15 PM
lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2767

Would it not make sense to replace any use of constant 0 with wzr/xzr when it is legal?

test/CodeGen/AArch64/arm64-addr-type-promotion.ll
31

That seems fine, it just wasn't clear from the CHECK diffs if this was a new 'mov'

haicheng added inline comments.Aug 2 2017, 1:43 PM
lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2767

I agree with you. I started with replacing any constant 0 with wzr/xzr, but triggered a lot of assertions. For example, wzr/xzr is not expected to appear as the condition of a conditional branch. Then, I narrowed down to CopyToReg only, but still triggered some assertions when copying wzr/xzr to another physical register. Now, I narrow down to the most common situation, copying constant zero to a virtual reg, no assertion is trigged and performance looks good.

haicheng added inline comments.Aug 9 2017, 12:18 PM
lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2767

And I check if all the uses are copies to virtual regs and call RAUW to replace because SDUse::set() is a private method. Do you think it is worthwhile to change the API to be public? It does not make big difference to the performance since most of the time constant 0 has only one use.

haicheng updated this revision to Diff 110454.Aug 9 2017, 12:30 PM

Fix a typo. Please take a look. Thank you.

gberry added inline comments.Aug 9 2017, 2:10 PM
lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2767

Yeah, that's why I said "when it is legal" :)
It made be too much of a pain to determine legality here, but what about handling cases where only some of the uses are virt reg copies (and just using xzr/wzr directly in those cases)?

2767

No, I'm not suggesting you change SDUse (that is private for a reason). If you want to do this experiment you would need to replace the users themselves with new nodes that read XZR/WZR directly.

I think either approach is okay here, but you should probably wait for someone else to approve it.

haicheng added inline comments.Aug 11 2017, 7:21 PM
lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2767

I tried and it seems I cannot replace the users themselves with new nodes that read XZR/WZR directly. The new node I can create here is CopyFromReg and its type is MVT::i32/i64. The user is CopyToReg and its type is MVT::Other. The types do not match and I cannot do the replacement. Another potential issues is that the users can be used in the other MBB because it is lowered from a PHINode, but use_iterator seems not include this usage.

I think I would stick to the current approach which just replace #0 with XZR/WZR because it is simple and conservative but covers the most common situations. If I miss any coalescing opportunities, they would be coalesced anyway in the later pass.

gberry added inline comments.Aug 14 2017, 11:38 AM
lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2767

I'm not sure I fully understand the problem you're running into trying to do this forward substitution on only some uses. If you wanted to try that you would leave this node alone and instead replace the use nodes that are vreg COPYs that are reading the COPY of XZR/WZR.

gberry edited edge metadata.Aug 17 2017, 10:38 AM

Something like the below is what I meant by transforming the uses instead. The FIXME comment needs to be addressed, but I think it should do what you want and catch the cases where not all uses are vreg COPYs:

In AArch64DAGToDAGISel::Select, instead of chaning the Constant case, add the following:

case ISD::CopyToReg: {
  // Special case for copy of zero to avoid a double copy.
  // FIXME: check dst is virt reg and regclass is okay
  SDNode *CopyVal = Node->getOperand(2).getNode();
  if (ConstantSDNode *CopyValConst = dyn_cast<ConstantSDNode>(CopyVal))
    if (CopyValConst->isNullValue()) {
      unsigned ZeroReg;
      EVT ZeroVT = CopyValConst->getValueType(0);
      if (ZeroVT == MVT::i32)
        ZeroReg = AArch64::WZR;
      else if (ZeroVT == MVT::i64)
        ZeroReg = AArch64::XZR;
      else
        break;

      SDValue ZeroRegVal = CurDAG->getRegister(ZeroReg, ZeroVT);
      SDValue New = CurDAG->getNode(ISD::CopyToReg, SDLoc(Node), MVT::Other,
                                    Node->getOperand(0), Node->getOperand(1),
                                    ZeroRegVal);
      ReplaceNode(Node, New.getNode());
      return;
    }
  break;
}
haicheng updated this revision to Diff 115633.Sep 18 2017, 6:34 AM

Use Geoff's code. Please take a look.

haicheng updated this revision to Diff 115884.Sep 19 2017, 12:24 PM

Bail out if the copy is used with glue. Please take a look.

Kindly Ping (#2)

gberry added a comment.Oct 4 2017, 8:08 AM

Someone else should probably approve this since I wrote some of the code.

lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2794

Style nit: you can reduce the level of indentation below by inverting these if tests and doing an early break after each one.

2800

This could use a comment explaining why it is being rejected.

test/CodeGen/AArch64/copy-zero-reg.ll
4

This description seems a little vague. Can you spell out which block you don't want the mov to appear in? Also, it seems like you might want a negative CHECK-NOT below to make sure there isn't another mov?

haicheng updated this revision to Diff 118145.Oct 7 2017, 6:03 PM
haicheng marked 2 inline comments as done.

Add the support of "glue".

haicheng marked an inline comment as not done.Oct 9 2017, 11:17 AM
haicheng marked an inline comment as done.Oct 20 2017, 5:49 AM

This looks pretty good to me (with one minor comment), but someone else should probably approve it.

lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2815

I think you can get rid of this if/else by re-writing as:

SDValue New = CurDAG->getNode(ISD::CopyToReg, SDLoc(Node), Node->getVTList(), makeArrayRef(Ops, NumOperands));
haicheng updated this revision to Diff 120601.Oct 27 2017, 7:29 AM

Address Geoff's comments. Thank you.

Kindly Ping

Kindly Ping (#2)

rengolin accepted this revision.Nov 10 2017, 10:31 PM

Piggyback on @gberry's review, I can't see anything wrong with it, LGTM. Thanks!

This revision is now accepted and ready to land.Nov 10 2017, 10:31 PM
This revision was automatically updated to reflect the committed changes.
SirishP reopened this revision.EditedJun 20 2018, 1:43 PM
SirishP added a subscriber: SirishP.

I am planning to revert this change. This works with tests like test/CodeGen/AArch64/copy-zero-reg.ll. However, if there are multiple branches, this patch degrades in performance due to large number of mov instructions in each fall through. Here's an example of test case, where it degrades in performance.

target triple = "aarch64-none-linux-gnu"

%struct.foo = type {i8* }

; Function Attrs: nounwind
define void @func(i32* nocapture readonly %initval, i32* nocapture readonly %compptr, i16* nocapture readonly %coef_block) local_unnamed_addr {
entry:
  br label %for.body

for.body:                                         ; preds = %for.inc, %entry
  %inptr.0106 = phi i16* [ %coef_block, %entry ], [ %inptr.1, %for.inc ]
  %ctr.0105 = phi i32 [ 8, %entry ], [ %dec, %for.inc ]
  %qptr.0104 = phi i32* [ %compptr, %entry ], [ %qptr.1, %for.inc ]
  %wsptr.0103 = phi i32* [ %initval, %entry ], [ %wsptr.1, %for.inc ]
  %arrayidx = getelementptr inbounds i16, i16* %inptr.0106, i64 8
  %0 = load i16, i16* %arrayidx, align 2
  %arrayidx3 = getelementptr inbounds i16, i16* %inptr.0106, i64 16
  %1 = load i16, i16* %arrayidx3, align 2
  %2 = or i16 %1, %0
  %3 = icmp eq i16 %2, 0
  br i1 %3, label %land.lhs.true7, label %if.end

land.lhs.true7:                                   ; preds = %for.body
  %arrayidx8 = getelementptr inbounds i16, i16* %inptr.0106, i64 24
  %true7 = load i16, i16* %arrayidx8, align 2
  %cmp10 = icmp eq i16 %true7, 0
  br i1 %cmp10, label %land.lhs.true12, label %if.end

land.lhs.true12:                                  ; preds = %land.lhs.true7
  %arrayidx13 = getelementptr inbounds i16, i16* %inptr.0106, i64 32
  %true12 = load i16, i16* %arrayidx13, align 2
  %cmp15 = icmp eq i16 %true12, 0
  br i1 %cmp15, label %land.lhs.true17, label %if.end

land.lhs.true17:                                  ; preds = %land.lhs.true12
  %arrayidx18 = getelementptr inbounds i16, i16* %inptr.0106, i64 40
  %true17 = load i16, i16* %arrayidx18, align 2
  %cmp20 = icmp eq i16 %true17, 0
  br i1 %cmp20, label %land.lhs.true22, label %if.end

land.lhs.true22:                                  ; preds = %land.lhs.true17
  %arrayidx23 = getelementptr inbounds i16, i16* %inptr.0106, i64 48
  %true22 = load i16, i16* %arrayidx23, align 2
  %cmp25 = icmp eq i16 %true22, 0
  br i1 %cmp25, label %land.lhs.true27, label %if.end

land.lhs.true27:                                  ; preds = %land.lhs.true22
  %arrayidx28 = getelementptr inbounds i16, i16* %inptr.0106, i64 56
  %true27 = load i16, i16* %arrayidx28, align 2
  %cmp30 = icmp eq i16 %true27, 0
  br i1 %cmp30, label %if.then, label %if.end

if.then:                                          ; preds = %land.lhs.true27
  %4 = load i16, i16* %inptr.0106, align 2
  %conv33 = sext i16 %4 to i32
  %5 = load i32, i32* %qptr.0104, align 4
  %mul = shl nsw i32 %conv33, 2
  %shl = mul i32 %mul, %5
  store i32 %shl, i32* %wsptr.0103, align 4
  br label %for.inc

if.end:                                           ; preds = %land.lhs.true27, %land.lhs.true22, %land.lhs.true17, %land.lhs.true12, %land.lhs.true7, %for.body
  %6 = phi i16 [ 0, %land.lhs.true27 ], [ 0, %land.lhs.true22 ], [ 0, %land.lhs.true17 ], [ 0, %land.lhs.true12 ], [ 0, %land.lhs.true7 ], [ %1, %for.body ]
  %conv40 = sext i16 %6 to i32
  %arrayidx41 = getelementptr inbounds i32, i32* %qptr.0104, i64 16
  %7 = load i32, i32* %arrayidx41, align 4
  %mul42 = mul nsw i32 %7, %conv40
  store i32 %mul42, i32* %wsptr.0103, align 4
  br label %for.inc

for.inc:                                          ; preds = %if.end, %if.then
  %conv54.sink = phi i32 [ %mul42, %if.end ], [ %shl, %if.then ]
  %arrayidx55 = getelementptr inbounds i32, i32* %wsptr.0103, i64 56
  store i32 %conv54.sink, i32* %arrayidx55, align 4
  %wsptr.1 = getelementptr inbounds i32, i32* %wsptr.0103, i64 1
  %qptr.1 = getelementptr inbounds i32, i32* %qptr.0104, i64 1
  %inptr.1 = getelementptr inbounds i16, i16* %inptr.0106, i64 1
  %dec = add nsw i32 %ctr.0105, -1
  %cmp = icmp ugt i32 %ctr.0105, 1
  br i1 %cmp, label %for.body, label %for.end

for.end:                                          ; preds = %for.inc
  ret void
}

This is slightly smaller version of libjpeg code.

The intention of this patch is to remove a BB that has one mov instruction. In order to do that, this patch pessmizes MachineSinking by introducing a copy, such that mov instruction is NOT moved to the BB. Optimization downstream gets rid of the BB with only mov instruction. This works well if we have only one fall through branch as there is only one "extra" mov instruction, like in copy-reg-zero.ll

If we have multiple fall throughs like in above test case, we will have a lot of redundant movs. In such a case, it's better to have this BB which has one mov instruction.

This is causing degradation in jpeg, fft and other codebases. I believe if we want to remove a BB with only one mov instruction, we should not pessimize Machine Sinking at all, and find some other solution.

This revision is now accepted and ready to land.Jun 20 2018, 1:43 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptOct 7 2019, 4:46 AM
Herald added a subscriber: hiraditya. · View Herald Transcript