SRA/SRL instead of CMP/CSEL when folding SDIV by a power-of-2 constant divisor.
Does this differ in an important way from what you get if you remove BuildSDIVPow2? I think you'd get the sign bit shifted all the way to the right followed by an lsr, while this patch shifts the sign bit only as far as it needs to.
Removing BuildSDIVPow2 also works for the scalar cases, but the tests failed on the SVE cases.
I could also shift by 31/63 every time, but I think the current code is also correct.
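For reference, a minimal C++ model (not the actual DAG lowering; function names and the tested k values are made up here) of the two 32-bit sequences being compared: the shift-based expansion (asr of the sign bit followed by an lsr) and the cmp/csel form this patch replaces.

```cpp
// Illustrative model only; relies on arithmetic right shift of negative
// values, which mainstream compilers provide (guaranteed since C++20).
#include <cassert>
#include <cstdint>

// Shift-based expansion: asr #31, lsr #(32-k), add, asr #k.
int32_t sdiv_pow2_shift(int32_t x, unsigned k) {
  int32_t sign = x >> 31;                      // asr #31: 0 or -1
  uint32_t bias = (uint32_t)sign >> (32 - k);  // lsr: 2^k-1 if x is negative
  return (x + (int32_t)bias) >> k;             // add, then asr #k
}

// cmp/csel form: add #(2^k-1), cmp, csel, asr #k.
int32_t sdiv_pow2_csel(int32_t x, unsigned k) {
  int32_t biased = x + ((1 << k) - 1);         // add #(2^k-1)
  int32_t t = (x < 0) ? biased : x;            // cmp #0 + csel ..., lt
  return t >> k;                               // asr #k
}

int main() {
  for (int32_t x : {-1000, -64, -1, 0, 1, 63, 1000})
    for (unsigned k : {1u, 4u, 6u})
      assert(sdiv_pow2_shift(x, k) == x / (1 << k) &&
             sdiv_pow2_csel(x, k) == x / (1 << k));
}
```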
Is that because of the SVE special case on line 13536? What if you left that code and returned SDValue() after it?
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:13536
Should this be checking that the VT is a fixed vector before calling Subtarget->useSVEForFixedLengthVectors()? Without that, it looks like even scalars are blocked when Subtarget->useSVEForFixedLengthVectors() is true.
If I'm following correctly, this is essentially reverting D4438. It was implemented that way at the time to avoid the more complex add w8, w0, w8, lsr #26 operation. Reverting that isn't obviously an improvement; I don't think ARM microarchitecture has changed significantly in this respect since the last time it was measured.
Yeah, I was wondering what made this better. It's smaller at least, which is good, and it should probably be used at minsize if it isn't already. But this kind of shows that it's not necessarily better, if the add and cmp can be executed in parallel and the add+shift takes 2 cycles: https://godbolt.org/z/3zvhxPbsP
I think AArch64's current BuildSDIVPow2 produces two AArch64ISD nodes, which means fewer opportunities to combine.
Add with shift can be implemented in three ways:
- Split into two micro-ops, one add and one shift; this is always slower than an independent add+cmp.
- One pipeline stage to do the shift; generally this gets better IPC than an independent add+cmp, but the worst case will be slower than an independent add+cmp.
- A direct three-input combinational logic circuit; this is always better than an independent add+cmp.

Can someone tell me which of these ways mainstream AArch64 processors use? I know some Arm processors use case 2.
And if we confirm add-with-shift is really slower, how about GCC's implementation for mod (2^N)?

GCC:
  negs  w1, w0
  and   w0, w0, 15
  and   w1, w1, 15
  csneg w0, w0, w1, mi
  ret

Current Clang:
  add  w8, w0, #15
  cmp  w0, #0
  csel w8, w8, w0, lt
  and  w8, w8, #0xfffffff0
  sub  w0, w0, w8
  ret
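Judging from the #15 and #0xfffffff0 masks, both sequences presumably come from something like x % 16 on a signed 32-bit value (an assumption; the original source isn't shown here). A minimal C++ model of the two sequences, with made-up function names:

```cpp
// Illustrative models of the two srem-by-16 sequences above; C truncating
// semantics (x % 16) are assumed.
#include <cassert>
#include <cstdint>

int32_t srem16_gcc_style(int32_t x) {     // negs / and / and / csneg
  uint32_t neg = 0u - (uint32_t)x;        // negs w1, w0 (wraps for INT32_MIN)
  int32_t pos_rem = x & 15;               // and w0, w0, 15
  int32_t neg_rem = (int32_t)(neg & 15);  // and w1, w1, 15
  return x > 0 ? pos_rem : -neg_rem;      // csneg w0, w0, w1, mi
}

int32_t srem16_clang_style(int32_t x) {   // add / cmp / csel / and / sub
  int32_t biased = x + 15;                // add w8, w0, #15
  int32_t t = (x < 0) ? biased : x;       // cmp w0, #0 + csel w8, w8, w0, lt
  t &= ~15;                               // and w8, w8, #0xfffffff0
  return x - t;                           // sub w0, w0, w8
}

int main() {
  for (int32_t x : {-33, -17, -16, -1, 0, 1, 15, 16, 17})
    assert(srem16_gcc_style(x) == x % 16 && srem16_clang_style(x) == x % 16);
}
```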
Is it also possible that in some contexts we may want to avoid setting flags where possible, e.g. in loops with control flow? The architecture only has one flags register, but many GPRs.
And one other point is:
Case @dont_fold_srem_i16_smax saves 6 instructions at the cost of 3 extra add+shift operations.
Case @dont_fold_srem_power_of_two saves 9 instructions at the cost of 1 extra add+shift.
So maybe we can use the general path for the vector case at least?
Where is the savings actually coming from here? I don't think it's related to it being a vector; we're just unrolling it into scalar ops.
Doesn't seem likely to me? From what I've seen, flags don't normally have a long live range in practice.
True; are there specific opportunities that matter?
> Add with shift can be implemented in three ways:
> - Split into two micro-ops, one add and one shift; this is always slower than an independent add+cmp.
> - One pipeline stage to do the shift; generally this gets better IPC than an independent add+cmp, but the worst case will be slower than an independent add+cmp.
> - A direct three-input combinational logic circuit; this is always better than an independent add+cmp.
> Can someone tell me which of these ways mainstream AArch64 processors use? I know some Arm processors use case 2.
You can download the software optimization guides for most Cortex cores from ARM. Generally, add-with-shift is one micro-op, but it has two cycles of latency, where basic arithmetic has one. So the current sequence saves a cycle of latency in most situations.
> And if we confirm add-with-shift is really slower, how about GCC's implementation for mod (2^N)?
That looks like an improvement, sure.
It looks like those two cases use some extra fmovs involving the v1 register in the original version, and they share some SRAs with the non-power-of-2 srem case.
> True; are there specific opportunities that matter?
It's hard to say. There is a lot of combine code in DAGCombiner::visitADD and DAGCombiner::visitSRA, and the SRA can also be shared in the vector case.
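As one illustration of the kind of sharing I mean (a sketch, not the actual DAG combine): when an sdiv and an srem by the same power of two appear together, the shift-based form lets the remainder reuse the quotient with just a shift and a subtract, instead of a second cmp/csel chain. The function name and fixed divisor below are made up for this example.

```cpp
// Illustrative sketch only: combined sdiv/srem by 16, sharing the asr-based
// quotient. Relies on arithmetic right shift of negative values.
#include <cassert>
#include <cstdint>

void sdivrem16(int32_t x, int32_t &q, int32_t &r) {
  int32_t sign = x >> 31;               // asr #31 (shared work)
  uint32_t bias = (uint32_t)sign >> 28; // lsr #28: 15 if x < 0, else 0
  q = (x + (int32_t)bias) >> 4;         // add + asr #4 -> quotient
  r = x - (int32_t)((uint32_t)q << 4);  // lsl + sub reuses q -> remainder
}

int main() {
  for (int32_t x : {-33, -32, -1, 0, 1, 31, 33}) {
    int32_t q, r;
    sdivrem16(x, q, r);
    assert(q == x / 16 && r == x % 16);
  }
}
```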
> That looks like an improvement, sure.
I will use this approach to fix the issue later.