This is an archive of the discontinued LLVM Phabricator instance.


Differential D135102

[AArch64] Compare BFI and ORR with left-shifted operand for OR instruction selection.
Closed, Public

Authored by mingmingl on Oct 3 2022, 1:52 PM.


Details

Reviewers

dmgreen
efriedma
fhahn

Commits

rGf62d8a1a5044: [AArch64] Compare BFI and ORR with left-shifted operand for OR instruction…

Summary

Before this patch:

For r = or op0, op1, tryBitfieldInsertOpFromOr combines it into a BFI when:
1. one of the two operands is a bit-field-positioning or bit-field-extraction op; and
2. the bits from the two operands don't overlap.
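The two conditions can be checked on concrete integers. The sketch below is not LLVM code: lowMask, bfi and noCommonBits are hypothetical helpers, with bfi modelling the architectural behaviour of the 64-bit BFI instruction (insert the low width bits of one register into another at bit lsb). When the positioned operand only has bits inside the field and the other operand has zeros there, OR and BFI produce the same value.

```cpp
#include <cassert>
#include <cstdint>

// Mask with the low `width` bits set (1 <= width <= 64).
inline uint64_t lowMask(unsigned width) {
  return width >= 64 ? ~0ULL : ((1ULL << width) - 1);
}

// Hypothetical model of "BFI Xd, Xn, #lsb, #width": insert the low `width`
// bits of `src` into `dst` at bit `lsb`; all other bits of `dst` are kept.
inline uint64_t bfi(uint64_t dst, uint64_t src, unsigned lsb, unsigned width) {
  assert(width >= 1 && lsb + width <= 64);
  uint64_t field = lowMask(width) << lsb;
  return (dst & ~field) | ((src & lowMask(width)) << lsb);
}

// Condition 2 on concrete values: the two OR operands share no set bits.
// (The real combine reasons about known/useful bits, not concrete values.)
inline bool noCommonBits(uint64_t a, uint64_t b) { return (a & b) == 0; }
```

For instance, the lsr_bfi test in aarch64-lsr-bfi.ll computes (a & ~0xf0) | ((a >> 16) & 0xf0); its second operand positions bits [20, 24) of a at bit 4, so the OR equals bfi(a, a >> 20, 4, 4) (modelled here in 64 bits), which is the lsr/ubfx + bfi pair in that test's CHECK lines.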

After this patch:

Right before the OR is combined into a BFI, the patch evaluates whether an ORR with a left-shifted operand is better.

A motivating example (https://godbolt.org/z/rnMrzs5vn; added as the test case test_orr_not_bfi in CodeGen/AArch64/bitfield-insert.ll):

For IR:

define i64 @test_orr_not_bfxil(i64 %0) {
  %2 = and i64 %0, 127
  %3 = lshr i64 %0, 1
  %4 = and i64 %3, 16256
  %5 = or i64 %4, %2
  ret i64 %5
}

Before:

lsr     x8, x0, #1
and     x8, x8, #0x3f80
bfxil   x8, x0, #0, #7

After:

ubfx x8, x0, #8, #7
and x9, x0, #0x7f
orr x0, x9, x8, lsl #7
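A quick way to confirm that the two sequences are equivalent is to model each instruction on uint64_t. This is a sketch under stated assumptions: ubfx and bfxil are hypothetical helpers following the architectural definitions of those instructions, and the "Before" value is taken from x8 (the final move into x0 is not part of the listing).

```cpp
#include <cassert>
#include <cstdint>

// Mask with the low `width` bits set (1 <= width <= 64).
static uint64_t lowMask(unsigned width) {
  return width >= 64 ? ~0ULL : ((1ULL << width) - 1);
}

// UBFX Xd, Xn, #lsb, #width: bits [lsb, lsb+width) of Xn, zero-extended.
uint64_t ubfx(uint64_t xn, unsigned lsb, unsigned width) {
  return (xn >> lsb) & lowMask(width);
}

// BFXIL Xd, Xn, #lsb, #width: replace the low `width` bits of Xd with
// bits [lsb, lsb+width) of Xn; the other bits of Xd are preserved.
uint64_t bfxil(uint64_t xd, uint64_t xn, unsigned lsb, unsigned width) {
  return (xd & ~lowMask(width)) | ubfx(xn, lsb, width);
}

// The IR from the summary: (x & 127) | ((x >> 1) & 16256).
uint64_t irReference(uint64_t x) { return (x & 127) | ((x >> 1) & 16256); }

// Before: lsr x8, x0, #1; and x8, x8, #0x3f80; bfxil x8, x0, #0, #7
uint64_t beforeSeq(uint64_t x0) {
  uint64_t x8 = x0 >> 1;
  x8 &= 0x3f80;
  return bfxil(x8, x0, 0, 7);
}

// After: ubfx x8, x0, #8, #7; and x9, x0, #0x7f; orr x0, x9, x8, lsl #7
uint64_t afterSeq(uint64_t x0) {
  uint64_t x8 = ubfx(x0, 8, 7);
  uint64_t x9 = x0 & 0x7f;
  return x9 | (x8 << 7);
}
```

Both sequences place bits [8, 15) of x0 at bits [7, 14) and keep bits [0, 7) of x0 in place; they only differ in which instructions do the work.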

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mingmingl created this revision. (Oct 3 2022, 1:52 PM)

Herald added a project: Restricted Project. (Oct 3 2022, 1:52 PM)

Herald added subscribers: hiraditya, kristof.beyls.

mingmingl requested review of this revision. (Oct 3 2022, 1:52 PM)

Herald added a project: Restricted Project. (Oct 3 2022, 1:52 PM)

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B190059: Diff 464803. (Oct 3 2022, 1:53 PM)

Fix code style issues (nested if, etc.)

mingmingl added reviewers: dmgreen, efriedma, fhahn. (Oct 4 2022, 12:16 AM)

Harbormaster completed remote builds in B190126: Diff 464896. (Oct 4 2022, 12:16 AM)

mingmingl edited the summary of this revision. (Oct 4 2022, 12:16 AM)

Actually, the current implementation causes an infinite loop for 'test_nouseful_strb' in 'llvm/test/CodeGen/AArch64/bitfield-insert.ll' (check-llvm hangs at 99%). Will fix that. Sorry about it.

An update from trying to figure out the cause of the infinite loop:

How the infinite loop takes place (with the diff as shown):
1. By rewriting AND(val, shifted-mask) to shl(and(srl(val,N), mask), N), the patch creates two more SDNodes in the DAG combiner; the added nodes can easily interact badly with existing combine logic, causing a repeated cycle of {node expansion (as this patch does), node combination (existing logic)}. A two-line LLVM IR example is attached in [1] to illustrate this.
2. The infinite loop could be avoided by adding the lines in [2] (on top of the current patch), but it's hard to prove that rewriting one DAG node into three (AND(val, shifted-mask) to shl(and(srl(val,N), mask), N)) is not fragile (i.e., that it won't interact badly with future combine logic).
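The expansion in item 1 relies on the identity AND(val, ShiftedMask) == shl(and(srl(val, N), ShiftedMask >> N), N), with N the number of trailing zeros of the mask, while the existing fold turns the shift pair back into a single mask. The sketch below uses plain integers and a hypothetical two-state model (not SelectionDAG code) to check the identity and to show why a combiner that applies both rewrites never reaches a fixed point; FoldEnabled = false roughly corresponds to the shouldFoldConstantShiftPairToMask change in [2].

```cpp
#include <cassert>
#include <cstdint>

// Number of trailing zero bits; Mask must be non-zero.
unsigned trailingZeros(uint64_t Mask) {
  assert(Mask != 0);
  unsigned N = 0;
  while ((Mask & 1) == 0) {
    Mask >>= 1;
    ++N;
  }
  return N;
}

// The expanded form: shl(and(srl(Val, N), ShiftedMask >> N), N).
// (The identity holds for any non-zero mask; the patch used shifted masks.)
uint64_t expandedForm(uint64_t Val, uint64_t ShiftedMask) {
  unsigned N = trailingZeros(ShiftedMask);
  return ((Val >> N) & (ShiftedMask >> N)) << N;
}

// Hypothetical model of the two shapes the DAG node can take.
enum class Form { AndWithShiftedMask, ShlAndSrl };

// Runs the two rewrites until nothing changes or MaxSteps is hit, and
// returns how many changing rewrites were performed.
unsigned runCombiner(Form F, bool FoldEnabled, unsigned MaxSteps) {
  unsigned Steps = 0;
  while (Steps < MaxSteps) {
    if (F == Form::AndWithShiftedMask)
      F = Form::ShlAndSrl;           // the expansion added by this patch
    else if (FoldEnabled)
      F = Form::AndWithShiftedMask;  // existing fold back into one mask
    else
      break;                         // no rule applies: fixed point
    ++Steps;
  }
  return Steps;
}
```

With both rewrites enabled, runCombiner always stops at MaxSteps, which is the hang seen in check-llvm; with the fold disabled, it settles after a single rewrite.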

The motivating test case (including the 5-line C++ source, the current codegen, and the optimal codegen) is https://godbolt.org/z/h96b1sGco
- The over-eager BFM usage comes from AArch64DAGToDAGISel::Select. There is no DAG node for the BFM instruction, so BFM selection happens after the DAG combiner and is written in C++ to see through bit-simplification opportunities between the two operands.
- "Bit-simplification opportunities" means that getUsefulBits scans the uses of ISD::OR (with a limited recursion depth) to shrink the set of useful bits; if bits can be proved unused by the users, using BFM eliminates DAG nodes and thereby reduces the number of instructions.
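As a rough illustration of the idea (not of getUsefulBits itself, which works on already-selected machine nodes and handles many more opcodes), the sketch below computes which bits of a value its users can observe, for two hypothetical user kinds, and uses that to decide whether one OR operand is dead. All names here are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified kinds of users of a value.
enum class UserKind { AndImm, LsrImm, Unknown };
struct User {
  UserKind Kind;
  uint64_t Imm; // AND mask, or LSR shift amount
};

// Bits of the value that a single user can observe.
uint64_t usefulBitsFor(const User &U) {
  switch (U.Kind) {
  case UserKind::AndImm:
    return U.Imm;                               // only bits kept by the mask
  case UserKind::LsrImm:
    return U.Imm >= 64 ? 0 : (~0ULL << U.Imm);  // low bits are shifted out
  case UserKind::Unknown:
    break;
  }
  return ~0ULL;                                 // be conservative
}

// A bit is useful if any user observes it.
uint64_t usefulBits(const User *Users, std::size_t NumUsers) {
  uint64_t Bits = 0;
  for (std::size_t I = 0; I < NumUsers; ++I)
    Bits |= usefulBitsFor(Users[I]);
  return Bits;
}

// For V = or(A, B): if B can only set bits that no user observes, B (and
// any nodes that only feed B) can be dropped.
bool orOperandIsDead(uint64_t MaxBitsOfOperand, uint64_t Useful) {
  return (MaxBitsOfOperand & Useful) == 0;
}
```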

What I learnt from 1 and 2:

Bit-simplification opportunities from BFM should be retained, since in the best cases they eliminate instructions.
Instruction selection needs to be enhanced to choose between ORR (with a shifted register) and BFM (for the motivating test case in 2). One way to do this is to introduce a SelectionDAG node for the BFM instruction, let it go through the DAG combiner for evaluation, and rewrite getUsefulBits (which currently relies on machine opcodes, i.e., on the users of an SDNode already being selected) so that it can analyze useful bits before SDNodes are selected.

To let ISel choose between ORR and BFM, I'm planning to add a SelectionDAG node for BFM (and probably UBFM, since UBFM helps see through useful bits). I'm going to make some revisions and send them out as stacked diffs.

[1] https://godbolt.org/z/qexqzYx1W (an AND with a shifted-mask operand is rewritten by this patch, then combined back into an AND with a shifted mask, as allowed by shouldFoldConstantShiftPairToMask, causing an infinite loop)

[2]

diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index fc1893b6d61d..a2d4fe280134 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -14220,6 +14220,13 @@ bool AArch64TargetLowering::shouldFoldConstantShiftPairToMask(
     return (!C1 || !C2 || C1->getZExtValue() >= C2->getZExtValue());
   }
 
+  // 1. It's known that N is a shl or srl
+  // 2. If N has one use, and that use could fold shift, return false
+  // FIXME: Extend this from ORR to other ops
+  if(N->hasNUsesOfValue(1, 0) && N->use_begin()->getOpcode() == ISD::OR) {
+    return false;
+  }
+
   return true;
 }

khchen added a subscriber: khchen. (Oct 12 2022, 7:52 AM)

mingmingl mentioned this in D135844: [AArch64][2/4]Regard (shl val, N) as a potential bit-field-positioning op regardless of the number of uses. (Oct 18 2022, 11:08 AM)

To generate ORR with a shifted operand (instead of BFI) when ORR is better, this patch does the comparison inside tryBitfieldInsertOpFromOr (i.e., after the DAG combiner, alongside the C++-based instruction selection).

This patch optimizes many existing test cases (as shown by updated tests), but introduces one regression.

Before this patch, a generated BFM conveys which bits of Rd and Rm are used, respectively (see getUsefulBitsFromBFM).
After this patch, ORRWrs op0, op1, lsl #imm (with a shifted register) is generated instead of BFM in some cases; however, the bit-field usage information about op0 is not preserved, and there isn't a way to express it in class SDNode without adding a specialized subclass that derives from SDNode (as LoadSDNode does).
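To make the lost information concrete, here is a simplified model of how the useful bits of the result propagate to the operands. The helpers are hypothetical and are not getUsefulBitsFromBFM itself, which works on the BFM immr/imms encoding. Through a BFI, the field bits of the destination operand are provably unused; through an ORR with a shifted operand, every useful bit of op0 stays useful.

```cpp
#include <cassert>
#include <cstdint>

// Mask with the low `Width` bits set (1 <= Width <= 64).
inline uint64_t lowMask(unsigned Width) {
  return Width >= 64 ? ~0ULL : ((1ULL << Width) - 1);
}

struct OperandUsefulBits {
  uint64_t Dst; // operand that supplies the bits outside the field / op0
  uint64_t Src; // operand inserted into the field / op1
};

// BFI Dst, Src, #Lsb, #Width: result bits inside the field come from the low
// Width bits of Src; every other result bit comes from Dst.
OperandUsefulBits usefulBitsThroughBFI(uint64_t ResultUseful, unsigned Lsb,
                                       unsigned Width) {
  assert(Width >= 1 && Lsb + Width <= 64);
  uint64_t Field = lowMask(Width) << Lsb;
  return {ResultUseful & ~Field, (ResultUseful & Field) >> Lsb};
}

// ORR op0, op1, lsl #Shift: any useful result bit may come from op0, and from
// op1 after the shift, so no bit of op0 can be ruled out.
OperandUsefulBits usefulBitsThroughORRShifted(uint64_t ResultUseful,
                                              unsigned Shift) {
  assert(Shift < 64);
  return {ResultUseful, ResultUseful >> Shift};
}
```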

A side question: is it typical to convey metadata (e.g., that op0 and op1 in ORR op0, op1 contribute non-overlapping bits) in the SDNode class?

Two alternative options:

1. Introduce a DAG node for BFM, so as to compare BFM and ORR in the DAG combiner.
   - The drawback of generating BFM earlier (i.e., in the DAG combiner) is that all other bit-field processing nodes (AND, SHL, etc.) need to be taught to combine with BFM. In other words, introducing a BFM DAG node without regressing existing combines requires a lot of work.
   - I had a local patch that actually adds the BFM node, and missed combinations of BFM with existing nodes showed up.
2. Do the transformation in the aarch64-mi-peephole-opt pass.
   - Since aarch64-mi-peephole-opt optimizes based on ISel output, it's a net optimization (compared with the current patch, i.e., there's no drawback of lost information).
   - However, the implementation inside aarch64-mi-peephole-opt would handle MachineInstrs (and machine opcodes, i.e., ISD::AND fleshes out in many forms, like ANDWrs, ANDWrr, etc.), and cannot reuse helper functions from the ISel pass.

I think alternative #2 (inside aarch64-mi-peephole-opt) is better than alternative #1 (which could be a can of worms due to missed combinations between BFM and existing AND/OR nodes), and in some sense better than the current patch (at the cost of more implementation work).

Feedback/thoughts on where (peephole or the current patch) to pursue this optimization would be appreciated!

Harbormaster completed remote builds in B195369: Diff 472121. (Oct 31 2022, 4:27 PM)

Hmm. I had not considered an ORR with shift to be cheaper than a BFM before. From what I can tell it doesn't seem to be universal across all CPUs, but it does look like it will be faster or equal.

A side question: is it typical to convey metadata (e.g., that op0 and op1 in ORR op0, op1 contribute non-overlapping bits) in the SDNode class?

That sounds like it would usually be calculated with KnownBits, like in haveNoCommonBitsSet. Unfortunately post-isel the amount of information we can extract is much less than from the generic DAG nodes.
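A minimal sketch of the known-bits approach, assuming a hypothetical 64-bit stand-in for llvm::KnownBits (Zero marks bits known to be 0, One marks bits known to be 1). Roughly, the haveNoCommonBitsSet check asks whether the known-zero masks of the two operands together cover every bit position; the sketch applies that test to operands shaped like the motivating example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for llvm::KnownBits on 64-bit values.
struct Known64 {
  uint64_t Zero = 0; // 1 where the bit is known to be 0
  uint64_t One = 0;  // 1 where the bit is known to be 1
};

// Known bits of (x & Imm) for an unknown x: bits cleared in Imm are known 0.
Known64 knownForAndImm(uint64_t Imm) { return {~Imm, 0}; }

// Known bits of (v << Amt): shifted-in low bits are known 0.
Known64 knownForShl(Known64 K, unsigned Amt) {
  assert(Amt < 64);
  uint64_t LowZeros = Amt == 0 ? 0 : ((1ULL << Amt) - 1);
  return {(K.Zero << Amt) | LowZeros, K.One << Amt};
}

// No set bit in common if, at every position, one side is known to be 0.
bool noCommonBitsSet(Known64 A, Known64 B) {
  return (A.Zero | B.Zero) == ~0ULL;
}
```

As noted in the thread, this kind of reasoning needs the generic DAG nodes; once the operands are machine nodes, the equivalent of knownForAndImm is no longer available out of the box.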

[About the two/three options]

Is the motivating pattern just the one from the commit message, or any bfm that could be a shifted orr? aarch64-mi-peephole-opt is an option - we always run into problems implementing things there but if it is easier to write that is always an option. (The machine combiner too, if scheduling info is useful). Larger patterns might be more difficult though. The (existing) code in DAG2DAG doesn't feel like it scales super well. But equally like you say adding ISel nodes has downsides. What would this look like from GlobalISel? How much code would need to be added to make aarch64-mi-peephole-opt work?

Are the only regressions on uaddo_v4i1 and umulo_v4i1? I'm not against ignoring those, if they are just overflowing nodes on i1 types being awkwardly expanded and it doesn't come up in other places.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
Line 2835 (On Diff #472121): You say #7 here, but #8 elsewhere, including the test. I think 7 is correct for this example.

Fix a bug in the width computation (the commit message and patch summary are updated as well), and adjust whitespace in a few test cases (mainly those not updated by utils/update_llc_test_checks.py) to remove irrelevant diffs.

In D135102#3899360, @dmgreen wrote:

Hmm. I had not considered an ORR with shift to be cheaper than a BFM before. From what I can tell it doesn't seem to be universal across all CPUs, but it does look like it will be faster or equal.

Yes, ORR with shift can be as fast as BFM (e.g., Cortex-A57) or faster (e.g., Neoverse N1, Cortex-A77).

A side question: is it typical to convey metadata (e.g., that op0 and op1 in ORR op0, op1 contribute non-overlapping bits) in the SDNode class?

That sounds like it would usually be calculated with KnownBits, like in haveNoCommonBitsSet. Unfortunately post-isel the amount of information we can extract is much less than from the generic DAG nodes.

Yes, SelectionDAG::computeKnownBits handles generic DAG nodes but not machine opcodes (nodes with NodeType < 0); as a result, computing something similar to haveNoCommonBitsSet won't work out of the box.

[About the two/three options]

Is the motivating pattern just the one from the commit message, or any bfm that could be a shifted orr?

Yes, the motivating test case is just the one from the commit message. The other updated tests are affected by other lines of the patch; they actually look simpler, and were added to show that ORR-not-BFM is a more general problem to solve.

aarch64-mi-peephole-opt is an option - we always run into problems implementing things there but if it is easier to write that is always an option. (The machine combiner too, if scheduling info is useful). Larger patterns might be more difficult though. The (existing) code in DAG2DAG doesn't feel like it scales super well. But equally like you say adding ISel nodes has downsides. What would this look like from GlobalISel? How much code would need to be added to make aarch64-mi-peephole-opt work?

BFM is not used by GlobalISel here (it generates four instructions, https://godbolt.org/z/MMvMe34zv), so a BFM pattern matcher (inside the peephole pass or the machine combiner) won't optimize GlobalISel output in this case.

Regarding the amount of work in the peephole pass (or the machine combiner), I don't have a demo at hand, but it should be within a few hundred lines (not thousands) for just the motivating test case.

However, those passes lack the context that this BFI came from an ISD::OR; rebuilding that context (i.e., establishing that it's correct to convert the BFI back into an ORR) and fixing the other tests affected by this patch would take more implementation work.

Are the only regressions on uaddo_v4i1 and umulo_v4i1? I'm not against ignoring those, if they are just overflowing nodes on i1 types being awkwardly expanded and it doesn't come up in other places.

Among the affected test cases, only uaddo_v4i1 and umulo_v4i1 regressed. More generally, the useful-bits information (available from BFM but lost with ORR) allows one AND node on Dst to be simplified away (as shown in the code link) when that AND zeroes exactly the bits that are going to be inserted from Src. In this sense, other cases (not only type-extended small integers) might show up.
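The mechanism behind this regression can be sketched on integers with hypothetical helpers that model the architectural semantics: because BFI overwrites the field, an AND on Dst that clears exactly the field is redundant, whereas the equivalent ORR-with-shift form needs that AND to stay.

```cpp
#include <cassert>
#include <cstdint>

// Mask with the low `Width` bits set (1 <= Width <= 64).
inline uint64_t lowMask(unsigned Width) {
  return Width >= 64 ? ~0ULL : ((1ULL << Width) - 1);
}

// Model of "BFI Dst, Src, #Lsb, #Width": overwrites the field in Dst.
inline uint64_t bfi(uint64_t Dst, uint64_t Src, unsigned Lsb, unsigned Width) {
  assert(Width >= 1 && Lsb + Width <= 64);
  uint64_t Field = lowMask(Width) << Lsb;
  return (Dst & ~Field) | ((Src & lowMask(Width)) << Lsb);
}

// Model of "ORR Dst, Dst, Src, lsl #Lsb" with Src pre-masked to Width bits.
// ORR cannot clear bits, so it only matches BFI if the field in Dst is
// already zero (for example, because an AND cleared it).
inline uint64_t orrShifted(uint64_t Dst, uint64_t Src, unsigned Lsb,
                           unsigned Width) {
  assert(Width >= 1 && Lsb + Width <= 64);
  return Dst | ((Src & lowMask(Width)) << Lsb);
}
```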

Maybe I could write a working demo in the peephole pass or the machine combiner for the motivating case as a start? For the rest of the tests, I could file GitHub issues to track them.

mingmingl marked an inline comment as done. (Nov 2 2022, 12:50 AM)

mingmingl added inline comments.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
Line 2835 (On Diff #472121): Thanks for the good catch! Fixed it (I had overlooked that the width is 'imms - immr + 1' for UBFX).
llvm/test/CodeGen/AArch64/trunc-to-tbl.ll
Line 238 (On Diff #472519): This test case is generated by `utils/update_llc_test_checks.py`, but for some reason whitespace changes cause a larger diff than expected. I'm going to run the auto-updater on a clean branch and see whether the whitespace diff also appears without this patch.
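For reference, UBFX is an alias of UBFM: UBFX Xd, Xn, #lsb, #width encodes immr = lsb and imms = lsb + width - 1, so the width is recovered as imms - immr + 1. The helpers below are hypothetical (not the AArch64ISelDAGToDAG code) and check this relationship on the motivating example's ubfx x8, x0, #8, #7.

```cpp
#include <cassert>

// Immediates of the UBFM instruction that a UBFX alias encodes to.
struct UbfmImms {
  unsigned Immr;
  unsigned Imms;
};

// UBFX Rd, Rn, #Lsb, #Width  ==  UBFM Rd, Rn, #Lsb, #(Lsb + Width - 1)
UbfmImms ubfxToUbfm(unsigned Lsb, unsigned Width) {
  assert(Width >= 1);
  return {Lsb, Lsb + Width - 1};
}

// Width of the extracted field; only valid for the UBFX form (Imms >= Immr).
unsigned ubfxWidth(UbfmImms I) {
  assert(I.Imms >= I.Immr && "not the UBFX form of UBFM");
  return I.Imms - I.Immr + 1;
}
```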

Harbormaster completed remote builds in B195635: Diff 472519. (Nov 2 2022, 1:52 AM)

I read through the code. I'm not the biggest expert on this DAGToDAG code, but what is here seems sensible to me. All the tests look OK too.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
Line 2872 (On Diff #472519): LLVM usually adds a message to every assert. It can help distinguish them, especially when the condition is fairly generic.
llvm/test/CodeGen/AArch64/trunc-to-tbl.ll
Line 238 (On Diff #472519): Feel free to regenerate the files that need it and check those in to reduce the differences here.

mingmingl mentioned this in D137296: [NFC][AArch64]Precommit test for D135102. (Nov 2 2022, 3:46 PM)

mingmingl added a parent revision: D137296: [NFC][AArch64]Precommit test for D135102. (Nov 2 2022, 3:47 PM)

resolve comments.

Thanks for reviews! PTAL.

llvm/test/CodeGen/AArch64/trunc-to-tbl.ll
Line 238 (On Diff #472519): Thanks! Done in https://reviews.llvm.org/D137296

Harbormaster completed remote builds in B195834: Diff 472802. (Nov 2 2022, 6:24 PM)

This update only runs git clang-format HEAD~1 (no functional change); without it, the pre-merge checks fail with "ERROR git-clang-format returned an non-zero exit code 1".

Harbormaster completed remote builds in B195867: Diff 472846. (Nov 2 2022, 11:34 PM)

mingmingl mentioned this in rG5d7fdf67f622: [NFC][AArch64]Precommit test for D135102. (Nov 3 2022, 10:50 AM)

rebase after D137296

mingmingl removed a parent revision: D137296: [NFC][AArch64]Precommit test for D135102. (Nov 3 2022, 11:24 AM)

Harbormaster completed remote builds in B195966: Diff 472988. (Nov 3 2022, 12:04 PM)

Thanks. LGTM

This revision is now accepted and ready to land. (Nov 3 2022, 12:11 PM)

Thanks for the reviews! Going to submit this and implement the FIXME (for SRL) in follow-up patches.

Closed by commit rGf62d8a1a5044: [AArch64] Compare BFI and ORR with left-shifted operand for OR instruction… (authored by mingmingl). (Nov 3 2022, 12:32 PM)

This revision was automatically updated to reflect the committed changes.

mingmingl added a commit: rGf62d8a1a5044: [AArch64] Compare BFI and ORR with left-shifted operand for OR instruction….

Revision Contents

Path (size):
- llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (88 lines)
- llvm/test/CodeGen/AArch64/aarch64-lsr-bfi.ll (20 lines)
- llvm/test/CodeGen/AArch64/logical_shifted_reg.ll (18 lines)
- llvm/test/CodeGen/AArch64/unfold-masked-merge-scalar-constmask-innerouter.ll (12 lines)

Diff 464803

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp


[15,349 unchanged lines not shown] (context: if (N->getOpcode() == ISD::AND) {)
CCmp = DAG.getNode(AArch64ISD::CCMP, DL, MVT_CC, Cmp1.getOperand(0),		CCmp = DAG.getNode(AArch64ISD::CCMP, DL, MVT_CC, Cmp1.getOperand(0),
Cmp1.getOperand(1), NZCVOp, Condition, Cmp0);		Cmp1.getOperand(1), NZCVOp, Condition, Cmp0);
}		}
return DAG.getNode(AArch64ISD::CSEL, DL, VT, CSel0.getOperand(0),		return DAG.getNode(AArch64ISD::CSEL, DL, VT, CSel0.getOperand(0),
CSel0.getOperand(1), DAG.getConstant(CC1, DL, MVT::i32),		CSel0.getOperand(1), DAG.getConstant(CC1, DL, MVT::i32),
CCmp);		CCmp);
}		}

		/// isIntImmediate - This method tests to see if the node is a constant
		/// operand. If so Imm will receive the 32-bit value.
		static bool isIntImmediate(const SDNode *N, uint64_t &Imm) {
		  if (const ConstantSDNode *C = dyn_cast<const ConstantSDNode>(N)) {
		    Imm = C->getZExtValue();
		    return true;
		  }
		  return false;
		}

		// isOpcWithIntImmediate - This method tests to see if the node is a specific
		// opcode and that it has a immediate integer right operand.
		// If so Imm will receive the 32 bit value.
		static bool isOpcWithIntImmediate(const SDNode *N, unsigned Opc,
		                                  uint64_t &Imm) {
		  return N->getOpcode() == Opc &&
		         isIntImmediate(N->getOperand(1).getNode(), Imm);
		}

		static bool isShiftedMask(uint64_t Mask, EVT VT) {
		  assert(VT == MVT::i32 || VT == MVT::i64);
		  if (VT == MVT::i32)
		    return isShiftedMask_32(Mask);
		  return isShiftedMask_64(Mask);
		}

		static SDValue tryCombineToORWithShift(SDNode* N, TargetLowering::DAGCombinerInfo& DCI) {
		  assert (N->getOpcode() == ISD::OR && "N must be an OR operation to call this function");

		  EVT VT = N->getValueType(0);
		  SelectionDAG &DAG = DCI.DAG;
		  SDLoc DL(N);

		  // Bail out when value type is not one of {i32, i64}, since AArch64 ORR with shifted register is only available for i32 and i64.
		  if (VT != MVT::i32 && VT != MVT::i64)
		    return SDValue();

		  auto isAndOperandWithShiftedMask = [](SDValue V, EVT VT, uint64_t& AndImm) -> bool {
		    if (!isOpcWithIntImmediate(V.getNode(), ISD::AND, AndImm))
		      return false;

		    return isShiftedMask(AndImm, VT);
		  };

		  auto getAndOperandWithShiftedMaskIndex = [isAndOperandWithShiftedMask](SDNode* N, EVT VT, uint64_t& AndImm) -> int {
		    SDValue LHS = N->getOperand(0);
		    SDValue RHS = N->getOperand(1);

		    if (isAndOperandWithShiftedMask(LHS, VT, AndImm))
		      return 0;

		    if (isAndOperandWithShiftedMask(RHS, VT, AndImm))
		      return 1;

		    return -1;
		  };

		  uint64_t AndImm = 0;

		  // Bail out if neither operand is AND(VAL, ShiftedMask)
		  if (int OperandIndex = getAndOperandWithShiftedMaskIndex(N, VT, AndImm); OperandIndex >= 0) {
		    assert(OperandIndex < 2 && "OperandIndex should be 0 or 1");
		    SDValue AndOperandWithShiftedMask = N->getOperand(OperandIndex);
		    SDValue OtherOperand = N->getOperand(1 - OperandIndex);

		    // Proceed when AND has one use (the OR node).
		    if (AndOperandWithShiftedMask.hasOneUse()) {
		      assert (isShiftedMask(AndImm, VT) && "AndImm should be a shifted mask");

		      const unsigned ShiftAmount = countTrailingZeros(AndImm);

		      if (ShiftAmount > 0) {
		        SDValue AndOperandWithMask = DAG.getNode(ISD::AND, DL, VT,
		            DAG.getNode(ISD::SRL, DL, VT, AndOperandWithShiftedMask.getOperand(0), DAG.getConstant(ShiftAmount, DL, VT)),
		            DAG.getConstant(AndImm >> ShiftAmount, DL, VT));

		        return DAG.getNode(ISD::OR, DL, VT, DAG.getNode(ISD::SHL, DL, VT, AndOperandWithMask, DAG.getConstant(ShiftAmount, DL, VT)), OtherOperand);
		      }
		    }
		  }


		  return SDValue();
		}

static SDValue performORCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI,		static SDValue performORCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI,
const AArch64Subtarget *Subtarget) {		const AArch64Subtarget *Subtarget) {
SelectionDAG &DAG = DCI.DAG;		SelectionDAG &DAG = DCI.DAG;
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);

if (SDValue R = performANDORCSELCombine(N, DAG))		if (SDValue R = performANDORCSELCombine(N, DAG))
return R;		return R;

if (!DAG.getTargetLoweringInfo().isTypeLegal(VT))		if (!DAG.getTargetLoweringInfo().isTypeLegal(VT))
return SDValue();		return SDValue();

// Attempt to form an EXTR from (or (shl VAL1, #N), (srl VAL2, #RegWidth-N))		// Attempt to form an EXTR from (or (shl VAL1, #N), (srl VAL2, #RegWidth-N))
if (SDValue Res = tryCombineToEXTR(N, DCI))		if (SDValue Res = tryCombineToEXTR(N, DCI))
return Res;		return Res;

		if (SDValue Res = tryCombineToORWithShift(N, DCI))
		return Res;

if (SDValue Res = tryCombineToBSL(N, DCI))		if (SDValue Res = tryCombineToBSL(N, DCI))
return Res;		return Res;

return SDValue();		return SDValue();
}		}

static bool isConstantSplatVectorMaskForType(SDNode *N, EVT MemVT) {		static bool isConstantSplatVectorMaskForType(SDNode *N, EVT MemVT) {
if (!MemVT.getVectorElementType().isSimple())		if (!MemVT.getVectorElementType().isSimple())
[7,252 unchanged lines not shown]

llvm/test/CodeGen/AArch64/aarch64-lsr-bfi.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=aarch64-none-linux-gnu < %s -o -| FileCheck %s			; RUN: llc -mtriple=aarch64-none-linux-gnu < %s -o -| FileCheck %s

	define i32 @lsr_bfi(i32 %a) {			define i32 @lsr_bfi(i32 %a) {
	; CHECK-LABEL: lsr_bfi:			; CHECK-LABEL: lsr_bfi:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: lsr w8, w0, #20			; CHECK-NEXT: ubfx w8, w0, #20, #4
	; CHECK-NEXT: bfi w0, w8, #4, #4			; CHECK-NEXT: bfi w0, w8, #4, #4
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%and1 = and i32 %a, -241			%and1 = and i32 %a, -241
	%1 = lshr i32 %a, 16			%1 = lshr i32 %a, 16
	%shl = and i32 %1, 240			%shl = and i32 %1, 240
	%or = or i32 %shl, %and1			%or = or i32 %shl, %and1
	ret i32 %or			ret i32 %or
	}			}

	define i32 @negative_lsr_bfi0(i32 %a) {			define i32 @negative_lsr_bfi0(i32 %a) {
	; CHECK-LABEL: negative_lsr_bfi0:			; CHECK-LABEL: negative_lsr_bfi0:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: and w0, w0, #0xffffff0f			; CHECK-NEXT: and w0, w0, #0xffffff0f
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%and1 = and i32 %a, -241			%and1 = and i32 %a, -241
	%1 = lshr i32 %a, 28			%1 = lshr i32 %a, 28
	%shl = and i32 %1, 240			%shl = and i32 %1, 240
	%or = or i32 %shl, %and1			%or = or i32 %shl, %and1
	ret i32 %or			ret i32 %or
	}			}

	define i32 @negative_lsr_bfi1(i32 %a) {			define i32 @negative_lsr_bfi1(i32 %a) {
	; CHECK-LABEL: negative_lsr_bfi1:			; CHECK-LABEL: negative_lsr_bfi1:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: lsr w8, w0, #16			; CHECK-NEXT: ubfx w8, w0, #20, #4
	; CHECK-NEXT: lsr w9, w8, #4			; CHECK-NEXT: mov w9, w0
	; CHECK-NEXT: bfi w0, w9, #4, #4			; CHECK-NEXT: bfi w9, w8, #4, #4
	; CHECK-NEXT: add w0, w0, w8			; CHECK-NEXT: add w0, w9, w0, lsr #16
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%and1 = and i32 %a, -241			%and1 = and i32 %a, -241
	%1 = lshr i32 %a, 16			%1 = lshr i32 %a, 16
	%shl = and i32 %1, 240			%shl = and i32 %1, 240
	%or = or i32 %shl, %and1			%or = or i32 %shl, %and1
	%add = add i32 %or, %1			%add = add i32 %or, %1
	ret i32 %add			ret i32 %add
	}			}

	define i64 @lsr_bfix(i64 %a) {			define i64 @lsr_bfix(i64 %a) {
	; CHECK-LABEL: lsr_bfix:			; CHECK-LABEL: lsr_bfix:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: lsr x8, x0, #20			; CHECK-NEXT: ubfx x8, x0, #20, #4
	; CHECK-NEXT: bfi x0, x8, #4, #4			; CHECK-NEXT: bfi x0, x8, #4, #4
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%and1 = and i64 %a, -241			%and1 = and i64 %a, -241
	%1 = lshr i64 %a, 16			%1 = lshr i64 %a, 16
	%shl = and i64 %1, 240			%shl = and i64 %1, 240
	%or = or i64 %shl, %and1			%or = or i64 %shl, %and1
	ret i64 %or			ret i64 %or
	}			}

	define i64 @negative_lsr_bfix0(i64 %a) {			define i64 @negative_lsr_bfix0(i64 %a) {
	; CHECK-LABEL: negative_lsr_bfix0:			; CHECK-LABEL: negative_lsr_bfix0:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: and x0, x0, #0xffffffffffffff0f			; CHECK-NEXT: and x0, x0, #0xffffffffffffff0f
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%and1 = and i64 %a, -241			%and1 = and i64 %a, -241
	%1 = lshr i64 %a, 60			%1 = lshr i64 %a, 60
	%shl = and i64 %1, 240			%shl = and i64 %1, 240
	%or = or i64 %shl, %and1			%or = or i64 %shl, %and1
	ret i64 %or			ret i64 %or
	}			}

	define i64 @negative_lsr_bfix1(i64 %a) {			define i64 @negative_lsr_bfix1(i64 %a) {
	; CHECK-LABEL: negative_lsr_bfix1:			; CHECK-LABEL: negative_lsr_bfix1:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: lsr x8, x0, #16			; CHECK-NEXT: ubfx x8, x0, #20, #4
	; CHECK-NEXT: lsr x9, x8, #4			; CHECK-NEXT: mov x9, x0
	; CHECK-NEXT: bfi x0, x9, #4, #4			; CHECK-NEXT: bfi x9, x8, #4, #4
	; CHECK-NEXT: add x0, x0, x8			; CHECK-NEXT: add x0, x9, x0, lsr #16
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%and1 = and i64 %a, -241			%and1 = and i64 %a, -241
	%1 = lshr i64 %a, 16			%1 = lshr i64 %a, 16
	%shl = and i64 %1, 240			%shl = and i64 %1, 240
	%or = or i64 %shl, %and1			%or = or i64 %shl, %and1
	%add = add i64 %or, %1			%add = add i64 %or, %1
	ret i64 %add			ret i64 %add
	}			}

llvm/test/CodeGen/AArch64/logical_shifted_reg.ll

[286 unchanged lines not shown] (context: other_exit:)
ret void		ret void
ret:		ret:
ret void		ret void
}		}

define i64 @i64_or_lhs_bitfield_positioning(i64 %tmp1, i64 %tmp2) {		define i64 @i64_or_lhs_bitfield_positioning(i64 %tmp1, i64 %tmp2) {
; CHECK-LABEL: i64_or_lhs_bitfield_positioning:		; CHECK-LABEL: i64_or_lhs_bitfield_positioning:
; CHECK: // %bb.0: // %entry		; CHECK: // %bb.0: // %entry
; CHECK-NEXT: lsl w8, w1, #7		; CHECK-NEXT: and x8, x1, #0x7f
; CHECK-NEXT: and x8, x8, #0x3f80		; CHECK-NEXT: orr x0, x0, x8, lsl #7
; CHECK-NEXT: orr x0, x8, x0
; CHECK-NEXT: ret		; CHECK-NEXT: ret
entry:		entry:
%and = shl i64 %tmp2, 7		%and = shl i64 %tmp2, 7
%shl = and i64 %and, 16256 ; 0x3f80		%shl = and i64 %and, 16256 ; 0x3f80
%or = or i64 %shl, %tmp1		%or = or i64 %shl, %tmp1
ret i64 %or		ret i64 %or
}		}

define i64 @i64_or_rhs_bitfield_positioning(i64 %tmp1, i64 %tmp2) {		define i64 @i64_or_rhs_bitfield_positioning(i64 %tmp1, i64 %tmp2) {
; CHECK-LABEL: i64_or_rhs_bitfield_positioning:		; CHECK-LABEL: i64_or_rhs_bitfield_positioning:
; CHECK: // %bb.0: // %entry		; CHECK: // %bb.0: // %entry
; CHECK-NEXT: lsl w8, w1, #7		; CHECK-NEXT: and x8, x1, #0x7f
; CHECK-NEXT: and x8, x8, #0x3f80		; CHECK-NEXT: orr x0, x0, x8, lsl #7
; CHECK-NEXT: orr x0, x0, x8
; CHECK-NEXT: ret		; CHECK-NEXT: ret
entry:		entry:
%and = shl i64 %tmp2, 7		%and = shl i64 %tmp2, 7
%shl = and i64 %and, 16256 ; 0x3f80		%shl = and i64 %and, 16256 ; 0x3f80
%or = or i64 %tmp1, %shl		%or = or i64 %tmp1, %shl
ret i64 %or		ret i64 %or
}		}

define i32 @i32_or_lhs_bitfield_positioning(i32 %tmp1, i32 %tmp2) {		define i32 @i32_or_lhs_bitfield_positioning(i32 %tmp1, i32 %tmp2) {
; CHECK-LABEL: i32_or_lhs_bitfield_positioning:		; CHECK-LABEL: i32_or_lhs_bitfield_positioning:
; CHECK: // %bb.0: // %entry		; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ubfiz w8, w1, #7, #7		; CHECK-NEXT: and w8, w1, #0x7f
; CHECK-NEXT: orr w0, w8, w0		; CHECK-NEXT: orr w0, w0, w8, lsl #7
; CHECK-NEXT: ret		; CHECK-NEXT: ret
entry:		entry:
%and = shl i32 %tmp2, 7		%and = shl i32 %tmp2, 7
%shl = and i32 %and, 16256 ; 0x3f80		%shl = and i32 %and, 16256 ; 0x3f80
%or = or i32 %shl, %tmp1		%or = or i32 %shl, %tmp1
ret i32 %or		ret i32 %or
}		}

define i32 @i32_or_rhs_bitfield_positioning(i32 %tmp1, i32 %tmp2) {		define i32 @i32_or_rhs_bitfield_positioning(i32 %tmp1, i32 %tmp2) {
; CHECK-LABEL: i32_or_rhs_bitfield_positioning:		; CHECK-LABEL: i32_or_rhs_bitfield_positioning:
; CHECK: // %bb.0: // %entry		; CHECK: // %bb.0: // %entry
; CHECK-NEXT: ubfiz w8, w1, #7, #7		; CHECK-NEXT: and w8, w1, #0x7f
; CHECK-NEXT: orr w0, w0, w8		; CHECK-NEXT: orr w0, w0, w8, lsl #7
; CHECK-NEXT: ret		; CHECK-NEXT: ret
entry:		entry:
%and = shl i32 %tmp2, 7		%and = shl i32 %tmp2, 7
%shl = and i32 %and, 16256 ; 0x3f80		%shl = and i32 %and, 16256 ; 0x3f80
%or = or i32 %tmp1, %shl		%or = or i32 %tmp1, %shl
ret i32 %or		ret i32 %or
}		}

!1 = !{!"branch_weights", i32 1, i32 1}		!1 = !{!"branch_weights", i32 1, i32 1}

llvm/test/CodeGen/AArch64/unfold-masked-merge-scalar-constmask-innerouter.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=aarch64-unknown-linux-gnu < %s | FileCheck %s			; RUN: llc -mtriple=aarch64-unknown-linux-gnu < %s | FileCheck %s

	; https://bugs.llvm.org/show_bug.cgi?id=37104			; https://bugs.llvm.org/show_bug.cgi?id=37104

	; X: [byte3] [byte0]			; X: [byte3] [byte0]
	; Y: [byte2][byte1]			; Y: [byte2][byte1]

	define i8 @out8_constmask(i8 %x, i8 %y) {			define i8 @out8_constmask(i8 %x, i8 %y) {
	; CHECK-LABEL: out8_constmask:			; CHECK-LABEL: out8_constmask:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: lsr w8, w0, #2			; CHECK-NEXT: ubfx w8, w0, #2, #4
	; CHECK-NEXT: mov w0, w1			; CHECK-NEXT: mov w0, w1
	; CHECK-NEXT: bfi w0, w8, #2, #4			; CHECK-NEXT: bfi w0, w8, #2, #4
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%mx = and i8 %x, 60			%mx = and i8 %x, 60
	%my = and i8 %y, -61			%my = and i8 %y, -61
	%r = or i8 %mx, %my			%r = or i8 %mx, %my
	ret i8 %r			ret i8 %r
	}			}

	define i16 @out16_constmask(i16 %x, i16 %y) {			define i16 @out16_constmask(i16 %x, i16 %y) {
	; CHECK-LABEL: out16_constmask:			; CHECK-LABEL: out16_constmask:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: lsr w8, w0, #4			; CHECK-NEXT: ubfx w8, w0, #4, #8
	; CHECK-NEXT: mov w0, w1			; CHECK-NEXT: mov w0, w1
	; CHECK-NEXT: bfi w0, w8, #4, #8			; CHECK-NEXT: bfi w0, w8, #4, #8
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%mx = and i16 %x, 4080			%mx = and i16 %x, 4080
	%my = and i16 %y, -4081			%my = and i16 %y, -4081
	%r = or i16 %mx, %my			%r = or i16 %mx, %my
	ret i16 %r			ret i16 %r
	}			}

	define i32 @out32_constmask(i32 %x, i32 %y) {			define i32 @out32_constmask(i32 %x, i32 %y) {
	; CHECK-LABEL: out32_constmask:			; CHECK-LABEL: out32_constmask:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: lsr w8, w0, #8			; CHECK-NEXT: ubfx w8, w0, #8, #16
	; CHECK-NEXT: mov w0, w1			; CHECK-NEXT: mov w0, w1
	; CHECK-NEXT: bfi w0, w8, #8, #16			; CHECK-NEXT: bfi w0, w8, #8, #16
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%mx = and i32 %x, 16776960			%mx = and i32 %x, 16776960
	%my = and i32 %y, -16776961			%my = and i32 %y, -16776961
	%r = or i32 %mx, %my			%r = or i32 %mx, %my
	ret i32 %r			ret i32 %r
	}			}

	define i64 @out64_constmask(i64 %x, i64 %y) {			define i64 @out64_constmask(i64 %x, i64 %y) {
	; CHECK-LABEL: out64_constmask:			; CHECK-LABEL: out64_constmask:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: lsr x8, x0, #16			; CHECK-NEXT: ubfx x8, x0, #16, #32
	; CHECK-NEXT: mov x0, x1			; CHECK-NEXT: mov x0, x1
	; CHECK-NEXT: bfi x0, x8, #16, #32			; CHECK-NEXT: bfi x0, x8, #16, #32
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%mx = and i64 %x, 281474976645120			%mx = and i64 %x, 281474976645120
	%my = and i64 %y, -281474976645121			%my = and i64 %y, -281474976645121
	%r = or i64 %mx, %my			%r = or i64 %mx, %my
	ret i64 %r			ret i64 %r
	}			}
	[181 unchanged lines not shown]
	}			}

	; Various bad variants			; Various bad variants

	define i32 @n0_badconstmask(i32 %x, i32 %y) {			define i32 @n0_badconstmask(i32 %x, i32 %y) {
	; CHECK-LABEL: n0_badconstmask:			; CHECK-LABEL: n0_badconstmask:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: and w8, w1, #0xffffff00			; CHECK-NEXT: and w8, w1, #0xffffff00
	; CHECK-NEXT: and w9, w0, #0xffff00			; CHECK-NEXT: ubfx w9, w0, #8, #16
	; CHECK-NEXT: and w8, w8, #0xff0001ff			; CHECK-NEXT: and w8, w8, #0xff0001ff
	; CHECK-NEXT: orr w0, w9, w8			; CHECK-NEXT: orr w0, w8, w9, lsl #8
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%mx = and i32 %x, 16776960			%mx = and i32 %x, 16776960
	%my = and i32 %y, -16776960 ; instead of -16776961			%my = and i32 %y, -16776960 ; instead of -16776961
	%r = or i32 %mx, %my			%r = or i32 %mx, %my
	ret i32 %r			ret i32 %r
	}			}

	define i32 @n1_thirdvar_constmask(i32 %x, i32 %y, i32 %z) {			define i32 @n1_thirdvar_constmask(i32 %x, i32 %y, i32 %z) {
	[11 more lines not shown]