and (shl/srl/sra, x, c), mask --> shl (srl/sra, x, c1), c2
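For readers skimming the review, here is a minimal stand-alone sketch (not taken from the patch; the constants and variable names are invented for illustration) of the bit-level identity the title describes, including the srl-side condition that comes up later in the review:

```cpp
// Not part of the patch: "(x >> C) & Mask", where Mask is a contiguous run of
// MaskLen ones starting at bit LowZBits, can be re-expressed as
// "(x >> (C + LowZBits)) << LowZBits" when the srl already clears every bit
// above the mask, i.e. when BitWidth == LowZBits + MaskLen + ShiftC.
#include <cassert>
#include <cstdint>

int main() {
  constexpr unsigned BitWidth = 64;
  constexpr unsigned ShiftC = 3;    // srl amount in "and (srl x, c), mask"
  constexpr unsigned LowZBits = 1;  // trailing zero bits of the mask
  constexpr unsigned MaskLen = 60;  // number of contiguous ones in the mask
  static_assert(BitWidth == LowZBits + MaskLen + ShiftC, "srl-side condition");

  constexpr uint64_t Mask = ((uint64_t{1} << MaskLen) - 1) << LowZBits;
  const uint64_t X = 0x123456789abcdef0ULL;

  const uint64_t AndOfShift = (X >> ShiftC) & Mask;                     // original form
  const uint64_t ShlOfShift = (X >> (ShiftC + LowZBits)) << LowZBits;   // rewritten form
  assert(AndOfShift == ShlOfShift);
  return 0;
}
```

As the discussion below indicates, the outer `shl` can then be folded into the using ALU instruction as a shifted-register operand, which is why `AArch64DAGToDAGISel::SelectShiftedRegister` is involved.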
Event Timeline
llvm/test/CodeGen/AArch64/shiftregister-from-and.ll

| Line | Comment |
|---|---|
| 115 | This instruction looks wrong, with the shift being more than 32 for w regs. |
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp

| Line(s) | Comment |
|---|---|
| 632 | nit: use `uint64_t` (the same type as the return value of `getZExtValue()`) to avoid warnings. |
| 647–671 | nit: to ensure that UBFM/SBFM is used as a right shift (essentially UBFX) here, add `assert(BitWidth - 1 >= NewShiftC && "error message")`. |
| 654–655 | Relatedly, with 1) `N` as `and (srl x, c), mask` and 2) `LowZBits == 0`, `N` should be selected as UBFX by `isBitfieldExtractOp` before the tablegen pattern (where `AArch64DAGToDAGISel::SelectShiftedRegister` gets called) applies. Similarly, with `and (shl x, c), mask` and `LowZBits == 0`, `N` is a UBFIZ. So I think this shouldn't be seen in practice, but an early exit (not an assertion) looks okay. |
| 657–658 | For correctness, a similar check is needed when `LHSOpcode` is `ISD::SRL`; for srl, the condition `BitWidth == LowZBits + MaskLen + ShiftAmtC` should be sufficient. For example, https://gcc.godbolt.org/z/YvGG3Pov3 shows an IR at trunk; with the current patch the optimized code (`lsr x8, x0, #3; add x0, x1, x8, lsl #1; ret`) changes the result, which is not correct. See the sketch after this table. |
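To make the correctness concern in the 657–658 comment concrete, here is a hedged stand-alone sketch (not derived from the godbolt reproducer; the mask and shift constants are invented) showing how the srl rewrite goes wrong when `BitWidth == LowZBits + MaskLen + ShiftAmtC` does not hold:

```cpp
// Not from the patch: a hypothetical counterexample. The mask here is narrower
// than BitWidth - LowZBits - ShiftC, so the rewritten form keeps high bits
// that the original `and` clears.
#include <cstdint>
#include <cstdio>

int main() {
  constexpr unsigned ShiftC = 3;   // srl amount
  constexpr unsigned LowZBits = 1; // trailing zero bits of the mask
  constexpr unsigned MaskLen = 8;  // 1 + 8 + 3 != 64, so the rewrite is unsafe
  constexpr uint64_t Mask = ((uint64_t{1} << MaskLen) - 1) << LowZBits;

  const uint64_t X = ~uint64_t{0}; // all ones makes the dropped masking visible
  const uint64_t Original = (X >> ShiftC) & Mask;                     // keeps bits 1..8  -> 0x1fe
  const uint64_t Rewritten = (X >> (ShiftC + LowZBits)) << LowZBits;  // keeps bits 1..60 -> 0x1ffffffffffffffe
  std::printf("original  = %#llx\nrewritten = %#llx\n",
              (unsigned long long)Original, (unsigned long long)Rewritten);
  return 0;
}
```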
llvm/test/CodeGen/AArch64/shiftregister-from-and.ll

| Line(s) | Comment |
|---|---|
| 138–140 | nit: i16 will be legalized to i32 [1], and the patch optimizes `shiftedreg_from_and_negative_type` (trunk: https://gcc.godbolt.org/z/cnac5989e), so this doesn't seem to be a negative test. Consider using a vector type instead, like https://gcc.godbolt.org/z/T4szTsEav. [1] `llc -O3 -mtriple=aarch64 -debug -print-after-all test.ll` gives debug logs, and https://gist.github.com/minglotus/763173f099bec76c720ec9ce7c181471 shows the legalization step. |
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp

| Line(s) | Comment |
|---|---|
| 690–695 | Selecting `and (shl/srl/sra, x, c), mask` into `shl (srl/sra, x, c1), c2` (and folding the shl) removes one instruction, but it could change which pipeline the instruction uses, and hence its latency and throughput. For example, arithmetic operations with a shifted operand may use the M pipeline [1] on Neoverse N1, and the M pipeline is used for all ALU operations with an immediate-shifted operand on Cortex-A57. I wonder if improvements are seen in some benchmarks from this patch? [1] On Neoverse N1, the M pipeline is used for {ADDS, SUBS} with "Arithmetic, LSR/ASR/ROR shift or LSL shift > 4". |
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp

| Line(s) | Comment |
|---|---|
| 690–695 | I agree that it may not give any timing improvement except for lsl < 4, and I don't believe this patch will show an improvement in any real benchmarks either. Note that the current shifted-register select function also does not take the issue you mention into account. If you are really worried about it I can limit the lsl constant to < 4, but personally I would rather not do that. |
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp

| Line(s) | Comment |
|---|---|
| 690–695 | Also, for a high-end CPU it's really hard to model this with execution ports alone. Two instructions use two ROB entries, two physical registers, more front-end decode bandwidth, and more code size. At the very least we should not weigh such details when the patterns have the same latency and only some targets may differ in throughput. If we really want to account for these microarchitectural trade-offs, the better approach is to always do the combine here and, if needed, split it back apart in the machine combiner, where we already have a detailed scheduling model. |
Performance can depend on many factors, and often fewer instructions is a win in itself. Can you separate out the tests and rebase on top of them, so that this patch just shows the differences?
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp

| Line(s) | Comment |
|---|---|
| 690–695 | Thanks for the discussion! It makes sense to me to consider rewriting an ALU operation with a shifted operand into two simpler instructions in the machine-combiner pass, and that fewer instructions is a win in itself (regarding David's comment). To provide some context on the original ask: there are motivating test cases (microbenchmarks) that want the opposite (i.e., splitting an ALU operation with a shifted operand into two simpler instructions); a manual hack using dummy inline assembly (`asm volatile("" : "+r" (variable));`) shows speedups on Neoverse N1 (which has a single M pipeline that can become a bottleneck), and I was thinking about the machine combiner as one place to implement the split (with some critical-path analysis?) as well. |
I'm not too concerned about having fewer instructions versus using the M pipeline, since I also concur that fewer instructions makes sense in general (as you and David pointed out), so this transformation looks good to me. Also note that on some AArch64 processors (e.g., Neoverse N1), a logical instruction with an immediate-shifted operand is a net win compared with two instructions, and some AArch64 processors have more than one M pipeline (e.g., Neoverse N2), so there is less chance of the M pipeline becoming a bottleneck.

Tests on microbenchmarks (llvm-test-suite, etc.) might give us more confidence (though the numbers depend on the specific processor), but I don't strongly prefer a performance test.
From what I can tell the code is OK. LGTM.
This doesn't come up often, but I do see it triggering. I'm not sure what the motivating case is but it seemed OK for performance in the tests I ran.
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp

| Line(s) | Comment |
|---|---|
| 633–634 | If RHS isn't used anywhere else: |
| 658 | Some more comments explaining these conditions would be good. |