This is an archive of the discontinued LLVM Phabricator instance.

[PPC] Better codegen for AND, ANY_EXT, SRL sequence
ClosedPublic

Authored by amehsan on Sep 26 2016, 9:35 AM.

Details

Reviewers

kbarton
hfinkel
nemanjai

Summary

This fixes the first issue exposed by PR30483. (See comment 1 in the IR).

First I tried to fix a more general problem in DAG combine: I wanted to handle all sequences of shifts and ANDs with an extension in the middle. My solution was to move the extension instruction to the beginning of the sequence, so that the shifts and ANDs could be merged during isel. That approach did not work because this particular sequence is created in the target-independent DAG combine, in DAGCombiner::visitSRL.

Instead, I decided to handle this particular sequence in isel. If we encounter similar issues, we can think about a more general solution.
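The fold described above can be checked outside of LLVM. The following is a minimal sketch, not the patch itself: `rldicl`, `trailingOnes`, `reference`, `folded`, and `foldMatchesReference` are hypothetical helper names, and `junk` models the unspecified upper half that any_extend and IMPLICIT_DEF leave behind. The check also assumes that the mask only selects bits that come from the 32-bit input (shift amount plus mask width at most 32); that is an assumption of this sketch, not a condition stated in the review.

```cpp
#include <cassert>
#include <cstdint>

// Software model of PowerPC rldicl: rotate the 64-bit value left by sh,
// then clear the mb most-significant bits. sh and mb are in [0, 63].
inline uint64_t rldicl(uint64_t x, unsigned sh, unsigned mb) {
  uint64_t rot = (sh == 0) ? x : ((x << sh) | (x >> (64 - sh)));
  return rot & (~0ULL >> mb);
}

// Count of consecutive one bits starting at bit 0. Hypothetical stand-in for
// llvm::countTrailingOnes so the sketch has no LLVM dependency.
inline unsigned trailingOnes(uint64_t v) {
  unsigned n = 0;
  while (v & 1) {
    ++n;
    v >>= 1;
  }
  return n;
}

// Reference semantics: and(any_extend(srl(x, n)), mask).
// any_extend leaves the upper 32 bits unspecified; 'junk' fills them.
inline uint64_t reference(uint32_t x, unsigned n, uint64_t mask, uint32_t junk) {
  uint64_t ext = (uint64_t(junk) << 32) | (x >> n);
  return ext & mask;
}

// Folded form: x sits in the low half of a 64-bit register whose upper half
// is undefined, and a single rldicl does both the shift and the mask.
inline uint64_t folded(uint32_t x, unsigned n, uint64_t mask, uint32_t junk) {
  unsigned mb = 64 - trailingOnes(mask); // mask must be a non-empty run of low ones
  unsigned sh = (64 - n) & 63;           // right shift by n == left rotate by 64-n
  uint64_t reg = (uint64_t(junk) << 32) | x;
  return rldicl(reg, sh, mb);
}

// Compares both forms for shift amounts n in [1, 31] and mask widths k with
// n + k <= 32 (the assumption stated in the lead-in).
inline bool foldMatchesReference() {
  const uint32_t xs[] = {0u, 8u, 0x12345678u, 0xFFFFFFFFu};
  const uint32_t junks[] = {0u, 0xDEADBEEFu, 0xFFFFFFFFu};
  for (uint32_t x : xs)
    for (uint32_t junk : junks)
      for (unsigned n = 1; n < 32; ++n)
        for (unsigned k = 1; n + k <= 32; ++k) {
          uint64_t mask = (1ULL << k) - 1;
          if (reference(x, n, mask, junk) != folded(x, n, mask, junk))
            return false;
        }
  return true;
}
```

With a shift by 3 and a mask of 1, this gives SH = 64 - 3 = 61 and MB = 64 - 1 = 63, which are the operands the test's CHECK line expects.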

Diff Detail

Event Timeline

amehsan updated this revision to Diff 72504. Sep 26 2016, 9:35 AM

amehsan retitled this revision to [PPC] Better codegen for AND, ANY_EXT, SRL sequence.

amehsan updated this object.

amehsan added reviewers: hfinkel, kbarton, nemanjai.

amehsan added subscribers: llvm-commits, Carrot, echristo.

Herald added a subscriber: nemanjai. Sep 26 2016, 9:35 AM

amehsan added inline comments. Sep 26 2016, 9:36 AM

test/CodeGen/PowerPC/anyext_srl.ll
24	I should probably remove this line :)

amehsan added inline comments. Sep 26 2016, 10:15 AM

lib/Target/PowerPC/PPCISelDAGToDAG.cpp
2639	I am not sure if this is always legal. Will check that.

hfinkel added a comment.

Alternatively, we might teach the BitPermutationSelector to look through extends, which would be more general. Had you looked at that?

lib/Target/PowerPC/PPCISelDAGToDAG.cpp
2647	We shouldn't speculatively create new nodes if we can avoid it.
test/CodeGen/PowerPC/anyext_srl.ll
1	You need an architecture or a triple here too.
24	Yes, although if you keep the cpu attribute here, you don't need it on the command line.

amehsan added inline comments. Sep 26 2016, 10:46 AM

lib/Target/PowerPC/PPCISelDAGToDAG.cpp
2647	Sorry, I am not sure I understand this comment. What is speculative here? I have an i32 and want to convert it to i64. I tried a couple of different options, and this sequence was the only one that worked. It appeared in a small kernel that I wrote, which included a similar conversion.

hfinkel added inline comments. Sep 26 2016, 10:54 AM

lib/Target/PowerPC/PPCISelDAGToDAG.cpp
2647	No, I mean that you're calling getMachineNode here to generate new SDAG nodes; are you sure that when you do this one of the conditions below will match and these will never just end up being garbage collected?

amehsan added inline comments. Sep 26 2016, 11:00 AM

lib/Target/PowerPC/PPCISelDAGToDAG.cpp
2647	The conditions below and the ones that reach this line of code are mutually exclusive. Note that this basic block ends at line 2650. The next condition will not be satisfied (we have already established that Val.getOpcode() == ISD::ANY_EXTEND, and the next condition requires Val.getOpcode() == ISD::SRL), so we go straight to line 2665 and generate code.

amehsan added a comment.

In D24924#552610, @hfinkel wrote:

Alternatively, we might teach the BitPermutationSelector to look through extends, which would be more general. Had you looked at that?

There is potentially a more general solution as well, which I will call "bubbling up" any_ext and zero_ext. For example, consider the following selection DAG:

Optimized type-legalized selection DAG: BB#0 '_Z3fooRK3PB2S1_:entry'
SelectionDAG has 17 nodes:
  t0: ch = EntryToken
              t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
            t37: i32,ch = load<LD1[%arrayidx.i6](align=8), anyext from i8> t0, t2, undef:i64
              t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
            t42: i32,ch = load<LD1[%arrayidx.i37](align=8), anyext from i8> t0, t4, undef:i64
          t54: i32 = xor t37, t42
        t55: i32 = srl t54, Constant:i64<3>
      t56: i64 = any_extend t55
    t51: i64 = and t56, Constant:i64<1>
  t20: ch,glue = CopyToReg t0, Register:i64 %X3, t51
  t21: ch = PPCISD::RET_FLAG t20, Register:i64 %X3, t20:1

We can first push any_ext before srl (this is what I did in my first attempt), but we don't need to stop there. We can then push it before xor and merge it into the load instructions that feed the xor. Given that zero extension is free for PPC loads, this is easier than applying a similar idea to sign_extend.
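The "bubbling up" idea can be illustrated with plain integer arithmetic. This is a minimal sketch under stated assumptions, not LLVM code: `extendLate`, `extendAtLoads`, and `bubblingPreservesResult` are hypothetical names, the loads are modeled as zero-extending byte loads, and `junk` stands for the unspecified upper half that any_extend produces.

```cpp
#include <cassert>
#include <cstdint>

// Shape of the DAG above: xor and srl in i32, any_extend to i64, then and 1.
inline uint64_t extendLate(uint8_t a, uint8_t b, uint32_t junk) {
  uint32_t x = uint32_t(a) ^ uint32_t(b);  // t54: xor of the two loaded bytes
  uint32_t s = x >> 3;                     // t55: srl by 3
  uint64_t e = (uint64_t(junk) << 32) | s; // t56: any_extend, upper half arbitrary
  return e & 1;                            // t51: and with 1
}

// After bubbling the extension up to the loads: each byte is zero-extended
// straight to 64 bits, so xor, shift, and mask all operate on i64.
inline uint64_t extendAtLoads(uint8_t a, uint8_t b) {
  uint64_t x = uint64_t(a) ^ uint64_t(b);
  return (x >> 3) & 1;
}

// Exhaustively compares both forms over all byte pairs and a few junk values.
inline bool bubblingPreservesResult() {
  const uint32_t junks[] = {0u, 0xDEADBEEFu, 0xFFFFFFFFu};
  for (unsigned a = 0; a < 256; ++a)
    for (unsigned b = 0; b < 256; ++b)
      for (uint32_t junk : junks)
        if (extendLate(uint8_t(a), uint8_t(b), junk) !=
            extendAtLoads(uint8_t(a), uint8_t(b)))
          return false;
  return true;
}
```

Once every operation is on i64, the shift and the mask are adjacent, which is the form the existing rldicl folding in isel already handles.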

I also experimented a little with BitPermutationSelector. I added the following code to BitPermutationSelector::getValueBits:

+    case ISD::ANY_EXTEND: {
+      // Width of the narrower value being extended.
+      auto Size = V.getOperand(0).getNode()->getValueType(0).getSizeInBits();
+      DEBUG(dbgs() << "LOC B1\n" << NumBits << "\n" << Size << "\n");
+      const SmallVector<ValueBit, 64> *InnerBits;
+      std::tie(Interesting, InnerBits) = getValueBits(V.getOperand(0), Size);
+      // The low bits are exactly the operand's bits.
+      for (unsigned i = 0; i < Size; ++i)
+        Bits[i] = (*InnerBits)[i];
+      // any_extend leaves the high bits unspecified; treat them as zero.
+      for (unsigned i = Size; i < NumBits; ++i)
+        Bits[i] = ValueBit(ValueBit::ConstZero);
+      return std::make_pair(Interesting, &Bits);
+    }

This works, but codegen does not currently generate correct code for i32-to-i64 conversions, so more work is needed. As we discussed, this pattern may be special because it is generated in target-independent codegen, so a more general solution may not be needed. I will add a note to our README files with a pointer to this comment, so that if we need better handling of these opcodes in the future, we can follow one of these ideas.
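The per-bit bookkeeping in the experiment above can be modeled without LLVM. This is a toy sketch only: `ToyBit`, `bitsOf`, and `lookThroughAnyExtend` are hypothetical names and do not reproduce the real ValueBit class or getValueBits signature. It shows the one idea in the snippet: copy the operand's bits for the low part, and pick constant zero for the high part, which is a legal choice because any_extend leaves those bits unspecified.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy provenance record for a single bit of a value (hypothetical type).
struct ToyBit {
  enum Kind { ConstZero, Variable };
  Kind kind;
  int value;      // id of the producing value; -1 for constants
  unsigned index; // bit position inside that value
};

// Bits of an opaque value: bit i comes from bit i of value 'valueId'.
inline std::vector<ToyBit> bitsOf(int valueId, unsigned width) {
  std::vector<ToyBit> bits;
  bits.reserve(width);
  for (unsigned i = 0; i < width; ++i)
    bits.push_back(ToyBit{ToyBit::Variable, valueId, i});
  return bits;
}

// Look through an any_extend to numBits: keep the operand's bits in the low
// positions and mark every added high bit as constant zero.
inline std::vector<ToyBit> lookThroughAnyExtend(const std::vector<ToyBit> &inner,
                                                unsigned numBits) {
  assert(numBits >= inner.size() && "an extend cannot narrow the value");
  std::vector<ToyBit> bits(inner);
  bits.resize(numBits, ToyBit{ToyBit::ConstZero, -1, 0});
  return bits;
}
```

For a 32-bit operand extended to 64 bits, positions 0 through 31 still point at the original value and positions 32 through 63 are constant zero, so a later mask or rotate can be reasoned about bit by bit.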

amehsan updated this revision to Diff 72859. Sep 28 2016, 10:36 AM

amehsan edited edge metadata.

LGTM

This revision is now accepted and ready to land. Oct 20 2016, 1:09 PM

Committed 284983

Revision Contents

Path	Size
lib/Target/PowerPC/PPCISelDAGToDAG.cpp	13 lines
test/CodeGen/PowerPC/anyext_srl.ll	29 lines

Diff 72504

lib/Target/PowerPC/PPCISelDAGToDAG.cpp

   case ISD::AND: {
   ...
     }
     // If this is a 64-bit zero-extension mask, emit rldicl.
     if (isInt64Immediate(N->getOperand(1).getNode(), Imm64) &&
         isMask_64(Imm64)) {
       SDValue Val = N->getOperand(0);
       MB = 64 - countTrailingOnes(Imm64);
       SH = 0;

+      auto Op0 = Val.getOperand(0);
+      if (Val.getOpcode() == ISD::ANY_EXTEND && Op0.getOpcode() == ISD::SRL &&
+          isInt32Immediate(Op0.getOperand(1).getNode(), Imm) && Imm <= MB) {
+
+        auto ResultType = Val.getNode()->getValueType(0);
+        auto ImDef = CurDAG->getMachineNode(PPC::IMPLICIT_DEF, dl, ResultType);
+        SDValue IDVal (ImDef, 0);
+
+        Val = SDValue(CurDAG->getMachineNode(PPC::INSERT_SUBREG, dl, ResultType,
+                      IDVal, Op0.getOperand(0), getI32Imm(1, dl)), 0);
+        SH = 64 - Imm;
+      }
+
       // If the operand is a logical right shift, we can fold it into this
       // instruction: rldicl(rldicl(x, 64-n, n), 0, mb) -> rldicl(x, 64-n, mb)
       // for n <= mb. The right shift is really a left rotate followed by a
       // mask, and this mask is a more-restrictive sub-mask of the mask implied
       // by the shift.
       if (Val.getOpcode() == ISD::SRL &&
           isInt32Immediate(Val.getOperand(1).getNode(), Imm) && Imm <= MB) {
         assert(Imm < 64 && "Illegal shift amount");
   ...
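The identity quoted in the existing comment, rldicl(rldicl(x, 64-n, n), 0, mb) -> rldicl(x, 64-n, mb) for n <= mb, can be verified with a small software model. This is a minimal sketch: `rldicl` here is a hypothetical C++ helper that models the instruction's rotate-then-mask behaviour, and `shiftFoldsIntoMask` is a hypothetical checker, not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Software model of PowerPC rldicl: rotate left by sh, then clear the mb
// most-significant bits. sh and mb are in [0, 63].
inline uint64_t rldicl(uint64_t x, unsigned sh, unsigned mb) {
  uint64_t rot = (sh == 0) ? x : ((x << sh) | (x >> (64 - sh)));
  return rot & (~0ULL >> mb);
}

// Checks, for shift amounts n in [1, 63] and every mb with n <= mb, that
//   1. a logical right shift by n equals rldicl(x, 64-n, n), and
//   2. applying the zero-extension mask afterwards equals one rldicl.
inline bool shiftFoldsIntoMask() {
  const uint64_t xs[] = {0ULL, 1ULL, 0x0123456789ABCDEFULL, ~0ULL};
  for (uint64_t x : xs)
    for (unsigned n = 1; n < 64; ++n) {
      if (rldicl(x, 64 - n, n) != (x >> n))
        return false;
      for (unsigned mb = n; mb < 64; ++mb) {
        uint64_t twoStep = rldicl(rldicl(x, 64 - n, n), 0, mb);
        uint64_t oneStep = rldicl(x, 64 - n, mb);
        if (twoStep != oneStep)
          return false;
      }
    }
  return true;
}
```

The n <= mb guard matters: with n = 4 and mb = 0, the single rldicl keeps the bits that the rotate wrapped around into the top, while the two-step form has already cleared them.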

test/CodeGen/PowerPC/anyext_srl.ll

This file was added.

; RUN: llc -verify-machineinstrs -mcpu=pwr8 < %s | FileCheck %s

%class.PB2 = type { [1 x i32], %class.PB1* }
%class.PB1 = type { [1 x i32], i64, i64, i32 }

; Function Attrs: norecurse nounwind readonly
define zeroext i1 @foo(%class.PB2* nocapture readonly dereferenceable(16) %s_a, %class.PB2* nocapture readonly dereferenceable(16) %s_b) local_unnamed_addr #0 {
entry:
  %arrayidx.i6 = bitcast %class.PB2* %s_a to i32*
  %0 = load i32, i32* %arrayidx.i6, align 8, !tbaa !1
  %and.i = and i32 %0, 8
  %cmp.i = icmp ne i32 %and.i, 0
  %arrayidx.i37 = bitcast %class.PB2* %s_b to i32*
  %1 = load i32, i32* %arrayidx.i37, align 8, !tbaa !1
  %and.i4 = and i32 %1, 8
  %cmp.i5 = icmp ne i32 %and.i4, 0
  %cmp = xor i1 %cmp.i, %cmp.i5
  ret i1 %cmp
; CHECK-LABEL: @foo
; CHECK: rldicl {{[0-9]+}}, {{[0-9]+}}, 61, 63

}

!0 = !{!"clang version 4.0.0 (http://llvm.org/git/clang.git 7981b20f318488a10e7c0c8e0f0ca502e02e74cd) (http://llvm.org/git/llvm.git 3b621275428532a32a2806585282fa025af2d241)"}
!1 = !{!2, !2, i64 0}
!2 = !{!"int", !3, i64 0}
!3 = !{!"omnipotent char", !4, i64 0}
!4 = !{!"Simple C++ TBAA"}
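For readers who want to relate the test's IR to the xor, srl by 3, and 1 shape seen in the discussion's DAG, the two forms can be compared directly. This is a minimal sketch with hypothetical names (`fooAsWritten`, `fooAsBitExtract`, `formsAgree`); it models only the arithmetic, not the loads or the calling convention.

```cpp
#include <cassert>
#include <cstdint>

// Direct transcription of the IR in @foo: test bit 3 of each word and
// xor the two i1 results.
inline bool fooAsWritten(uint32_t a, uint32_t b) {
  bool cmpA = (a & 8u) != 0;
  bool cmpB = (b & 8u) != 0;
  return cmpA != cmpB; // xor on i1 is inequality
}

// The combined form: xor the words first, then extract bit 3.
inline uint64_t fooAsBitExtract(uint32_t a, uint32_t b) {
  return uint64_t((a ^ b) >> 3) & 1;
}

// Compares both forms over all pairs of low bytes plus a few wide values.
inline bool formsAgree() {
  for (uint32_t a = 0; a < 256; ++a)
    for (uint32_t b = 0; b < 256; ++b)
      if (uint64_t(fooAsWritten(a, b)) != fooAsBitExtract(a, b))
        return false;
  const uint32_t wide[] = {0xFFFFFFFFu, 0x80000008u, 0x12345678u};
  for (uint32_t a : wide)
    for (uint32_t b : wide)
      if (uint64_t(fooAsWritten(a, b)) != fooAsBitExtract(a, b))
        return false;
  return true;
}
```

Extracting bit 3 and zero-extending the result is what a single rldicl with operands 61 and 63 does, which is why the CHECK line looks for that instruction.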