Details

Reviewers

t.p.northover
greened
anhtuyen
efriedma
dmgreen
samparker
SjoerdMeijer
gchatelet

Commits

rG85342c27a303: [ARM] Optimize immediate selection

Summary

Optimize the selection of some specific immediates by splitting them into two
parts instead of loading them from the constant pool.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

benshi001 created this revision. Jul 13 2020, 10:34 PM

Herald added a project: Restricted Project. Jul 13 2020, 10:34 PM

Herald added subscribers: llvm-commits, danielkiss, hiraditya, kristof.beyls.

Harbormaster failed remote builds in B64102: Diff 277667! Jul 13 2020, 11:10 PM

benshi001 updated this revision to Diff 277678. Jul 14 2020, 12:04 AM

benshi001 updated this revision to Diff 277681. Jul 14 2020, 12:14 AM

gchatelet resigned from this revision. Jul 15 2020, 1:23 AM

I am not entirely sure anymore, but I thought I had looked into this once, or at least something similar. I think the problem here is that if the constant is used only once, like in the test, it is a clear win, but as soon as there are multiple uses then it's not better and probably worse. Also, with only one use, the code-size difference is neutral, but that won't be the case with multiple uses. So, I guess this work needs benchmark numbers, unless other reviewers that I've added can immediately tell if this is good or bad.

benshi001 added a comment. Jul 15 2020, 7:04 AM

This comment was removed by benshi001.

benshi001 added a comment. Jul 15 2020, 7:13 AM

This comment was removed by benshi001.

Is there a benchmark test suite I can try?

If you think my change will be worse when there are multiple uses of the constant, then the same thing already happens for positive constants that match ARM_AM::isSOImmTwoPartVal. I think both positive and negative SOImmTwoPartVal constants should be handled in the same way.

My change also benefits the following code:

%x = %y - 0x2323

Previously, LLVM generated 12 bytes:
ldr ...
sub ...
plus one item in the constant pool

My change reduces this to only 8 bytes:
sub %x, %y, 0x2300
sub %x, %x, 0x23
(I did not add a test case for that.)
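
To make the byte counts above concrete, here is a minimal sketch of why 0x2323 cannot be a single A32 immediate while its two halves can. It assumes the usual A32 rule that a data-processing immediate is an 8-bit value rotated right by an even amount; `rotl32` and `isModImm` are hypothetical helper names for illustration, not LLVM functions.

```cpp
#include <cassert>
#include <cstdint>

// Rotate left by n bits (n taken modulo 32).
constexpr uint32_t rotl32(uint32_t v, unsigned n) {
  n &= 31;
  return n == 0 ? v : (v << n) | (v >> (32 - n));
}

// Hypothetical helper (not LLVM's API): true if v is an 8-bit value rotated
// right by an even amount, i.e. encodable as one A32 immediate operand.
constexpr bool isModImm(uint32_t v) {
  for (unsigned rot = 0; rot < 32; rot += 2)
    if ((rotl32(v, rot) & ~0xFFu) == 0)  // undo a right-rotation by rot
      return true;
  return false;
}

static_assert(!isModImm(0x2323u), "0x2323 spans more than 8 bits");
static_assert(isModImm(0x2300u) && isModImm(0x23u), "each half is encodable");
static_assert(0x2300u + 0x23u == 0x2323u, "the halves sum to the constant");

// Byte counts from the comment above, assuming 4-byte A32 instructions.
constexpr unsigned literalPoolBytes = 4 /* ldr */ + 4 /* sub */ + 4 /* pool entry */;
constexpr unsigned twoSubBytes = 4 + 4;

// Runtime spot check, callable from any test driver.
inline void checkSizes() {
  assert(literalPoolBytes == 12);
  assert(twoSubBytes == 8);
}
```

The checks are `static_assert`s, so the sketch verifies itself when compiled as C++14 or later.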

In D83745#2152988, @SjoerdMeijer wrote:

I am not entirely sure anymore, but I thought I had looked into this once, or at least something similar. I think the problem here is that if the constant is used only once, like in the test, it is a clear win, but as soon as there are multiple uses then it's not better and probably worse. Also, with only one use, the code-size difference is neutral, but that won't be the case with multiple uses. So, I guess this work needs benchmark numbers, unless other reviewers that I've added can immediately tell if this is good or bad.

for C code

unsigned int f0(unsigned int a)
{
        return a-0x2323;
}
unsigned int f1(unsigned int a) 
{
        return a+0x2323;
}

GCC generates

f0:
      sub     r0, r0, #8960
      sub     r0, r0, #35
      bx      lr
f1:
      add     r0, r0, #8960
      add     r0, r0, #35
      bx      lr

but llvm generates

f0:
      ldr     r1, .LCPI0_0
      add     r0, r0, r1
      bx      lr
.LCPI0_0:
      .long   4294958301                      @ 0xffffdcdd
f1:
      add     r0, r0, #35
      add     r0, r0, #8960
      bx      lr

My patch makes LLVM follow GCC's behavior.
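
As a sanity check on the two listings, the following sketch models each sequence for f0 with 32-bit wraparound arithmetic and confirms they compute the same result. The function names are made up for illustration; the constants are copied from the assembly above.

```cpp
#include <cassert>
#include <cstdint>

// GCC's sequence for f0: two subtractions with encodable immediates.
constexpr uint32_t subTwoParts(uint32_t a) { return (a - 8960u) - 35u; }

// LLVM's sequence for f0 before the patch: load the negated constant from
// the literal pool and add it.
constexpr uint32_t addFromPool(uint32_t a) { return a + 4294958301u; }

static_assert(4294958301u == 0xffffdcddu, "decimal and hex forms agree");
static_assert(0u - 0x2323u == 0xffffdcddu, "pool entry is -0x2323 mod 2^32");
static_assert(8960u + 35u == 0x2323u, "the two immediates sum to 0x2323");

// Runtime spot check over a few inputs, including wraparound cases.
inline void checkEquivalence() {
  const uint32_t samples[] = {0u, 1u, 0x2322u, 0x2323u, 0x80000000u, 0xffffffffu};
  for (uint32_t a : samples) {
    assert(subTwoParts(a) == a - 0x2323u);
    assert(addFromPool(a) == a - 0x2323u);
  }
}
```

Both sequences are arithmetically equivalent; the difference discussed in this review is only in instruction count, code size, and the literal pool load.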

Okay, I had only a quick first look, perhaps multiple uses isn't a problem.
If it's not too much trouble, confirming this with the llvm test-suite would be good.

benshi001 added a comment. Jul 15 2020, 9:14 PM

This comment was removed by benshi001.

In D83745#2153885, @SjoerdMeijer wrote:

Okay, I had only a quick first look, perhaps multiple uses isn't a problem.
If it's not too much trouble, confirming this with the llvm test-suite would be good.

I have added a new test case, https://reviews.llvm.org/D83928.

We can merge that test first, then we will see how this patch improves armv6's code quality.

benshi001 updated this revision to Diff 279073. Jul 19 2020, 5:13 AM

benshi001 edited the summary of this revision.

ping

Eyeballing the one test that was changed, this indeed makes sense. But I would prefer to see some performance numbers first, just to check we haven't missed anything.

In D83745#2175668, @SjoerdMeijer wrote:

Eyeballing the one test that was changed, this indeed makes sense. But I would prefer to see some performance numbers first, just to check we haven't missed anything.

Is there a performance test suite for LLVM/ARM that I can try? Or is a small piece of code run 1000 times good enough?
Actually, I have made a similar optimization for Go, and Go's official go1 benchmark score shows a slight improvement.

https://github.com/golang/go/commit/6897030fe3de43bbed48adb72f21a6c2d00042cd

In D83745#2175668, @SjoerdMeijer wrote:

Eyeballing the one test that was changed, this indeed makes sense. But I would prefer to see some performance numbers first, just to check we haven't missed anything.

Another improvement in my patch concerns the following:

define i32 @sub2(i32 %0) {
  %2 = sub i32 %0, 8995
  ret i32 %2
}

Current LLVM puts the immediate 8995 in the constant pool and generates an LDR.

My patch optimizes it to

; CHECK-NEXT:    sub r0, r0, #35
; CHECK-NEXT:    sub r0, r0, #8960

That can be seen if https://reviews.llvm.org/D83928 is merged.

In D83745#2175933, @benshi001 wrote:

In D83745#2175668, @SjoerdMeijer wrote:

Eyeballing the one test that was changed, this indeed makes sense. But I would prefer to see some performance numbers first, just to check we haven't missed anything.

Is there a performance test suite for LLVM/ARM that I can try? Or is a small piece of code run 1000 times good enough?
Actually, I have made a similar optimization for Go, and Go's official go1 benchmark score shows a slight improvement.

https://github.com/golang/go/commit/6897030fe3de43bbed48adb72f21a6c2d00042cd

Briefly looking at those numbers, they show a bit of up-and-down behaviour. And while the geomeans show a small improvement, I don't find it quite convincing... i.e., not yet. Hence my request for some numbers. But first of all, I am curious to know what causes the regressions. In other words, what are we missing with this patch, can that be mitigated, and what kind of knock-on effects does this have?
There is the llvm test-suite, see e.g. https://llvm.org/docs/TestSuiteGuide.html. If you can run that, that would be good.

CoreMark and Dhrystone are somewhat obsolete benchmarks, but they are easy to run. They would be good as a first finger on the pulse too.

I understood from your message to the llvm dev list that generating numbers isn't entirely straightforward for you.
I will take this patch, and generate some numbers, will see if I can do that today/tomorrow, and will let you know.

In D83745#2178363, @SjoerdMeijer wrote:

I understood from your message to the llvm dev list that generating numbers isn't entirely straightforward for you.
I will take this patch, and generate some numbers, will see if I can do that today/tomorrow, and will let you know.

Thanks so much for your help. I have another patch, https://reviews.llvm.org/D84100, which also does immediate optimization.

I could combine them into one, but smaller changes are easier to review and test.

So I would really appreciate it if you could also have a look at it.

In D83745#2178363, @SjoerdMeijer wrote:

I understood from your message to the llvm dev list that generating numbers isn't entirely straightforward for you.
I will take this patch, and generate some numbers, will see if I can do that today/tomorrow, and will let you know.

What's more, I do not expect any performance improvement from my patch, but I do expect
a code size reduction in the test suite. That is also a benefit to llvm-arm.

Think I shot myself a little bit in the foot here ;-). Our group focuses mostly on M-cores, and running A32 code wasn't as straightforward as I'd have hoped. I eyeballed codegen differences, and didn't see anything concerning. So I don't have data for making further objections.

So LGTM for now.

This revision is now accepted and ready to land. Jul 28 2020, 11:17 AM

Closed by commit rG85342c27a303: [ARM] Optimize immediate selection (authored by SjoerdMeijer). Jul 29 2020, 5:30 AM

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer added a commit: rG85342c27a303: [ARM] Optimize immediate selection.

SjoerdMeijer mentioned this in D83928: [ARM][TEST] Add a new test case of add-imm & sub-imm. Jul 29 2020, 5:32 AM

Diff 281533

llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp

[Earlier lines not shown. Nearest context: if (Subtarget->isThumb()) {]
     if (ARM_AM::getSOImmVal(Val) != -1) // MOV
       return ForCodesize ? 4 : 1;
     if (ARM_AM::getSOImmVal(~Val) != -1) // MVN
       return ForCodesize ? 4 : 1;
     if (Subtarget->hasV6T2Ops() && Val <= 0xffff) // MOVW
       return ForCodesize ? 4 : 1;
     if (ARM_AM::isSOImmTwoPartVal(Val)) // two instrs
       return ForCodesize ? 8 : 2;
+    if (ARM_AM::isSOImmTwoPartValNeg(Val)) // two instrs
+      return ForCodesize ? 8 : 2;
   }
   if (Subtarget->useMovt()) // MOVW + MOVT
     return ForCodesize ? 8 : 2;
   return ForCodesize ? 8 : 3; // Literal pool load
 }

 bool llvm::HasLowerConstantMaterializationCost(unsigned Val1, unsigned Val2,
                                                const ARMSubtarget *Subtarget,
[Remaining lines not shown.]

llvm/lib/Target/ARM/ARMExpandPseudoInsts.cpp

[Earlier lines not shown. Enclosing function: void ARMExpandPseudo::ExpandMOV32BitImm(MachineBasicBlock &MBB,]
[Lines whose only change is indentation are shown as unchanged.]
   MachineInstrBuilder LO16, HI16;
   LLVM_DEBUG(dbgs() << "Expanding: "; MI.dump());

   if (!STI->hasV6T2Ops() &&
       (Opcode == ARM::MOVi32imm || Opcode == ARM::MOVCCi32imm)) {
     // FIXME Windows CE supports older ARM CPUs
     assert(!STI->isTargetWindows() && "Windows on ARM requires ARMv7+");

-    // Expand into a movi + orr.
+    assert (MO.isImm() && "MOVi32imm w/ non-immediate source operand!");
+    unsigned ImmVal = (unsigned)MO.getImm();
+    unsigned SOImmValV1 = 0, SOImmValV2 = 0;
+
+    if (ARM_AM::isSOImmTwoPartVal(ImmVal)) { // Expand into a movi + orr.
       LO16 = BuildMI(MBB, MBBI, MI.getDebugLoc(), TII->get(ARM::MOVi), DstReg);
       HI16 = BuildMI(MBB, MBBI, MI.getDebugLoc(), TII->get(ARM::ORRri))
                  .addReg(DstReg, RegState::Define | getDeadRegState(DstIsDead))
                  .addReg(DstReg);
+      SOImmValV1 = ARM_AM::getSOImmTwoPartFirst(ImmVal);
+      SOImmValV2 = ARM_AM::getSOImmTwoPartSecond(ImmVal);
+    } else { // Expand into a mvn + sub.
+      LO16 = BuildMI(MBB, MBBI, MI.getDebugLoc(), TII->get(ARM::MVNi), DstReg);
+      HI16 = BuildMI(MBB, MBBI, MI.getDebugLoc(), TII->get(ARM::SUBri))
+                 .addReg(DstReg, RegState::Define | getDeadRegState(DstIsDead))
+                 .addReg(DstReg);
+      SOImmValV1 = ARM_AM::getSOImmTwoPartFirst(-ImmVal);
+      SOImmValV2 = ARM_AM::getSOImmTwoPartSecond(-ImmVal);
+      SOImmValV1 = ~(-SOImmValV1);
+    }

-    assert (MO.isImm() && "MOVi32imm w/ non-immediate source operand!");
-    unsigned ImmVal = (unsigned)MO.getImm();
-    unsigned SOImmValV1 = ARM_AM::getSOImmTwoPartFirst(ImmVal);
-    unsigned SOImmValV2 = ARM_AM::getSOImmTwoPartSecond(ImmVal);
     unsigned MIFlags = MI.getFlags();
     LO16 = LO16.addImm(SOImmValV1);
     HI16 = HI16.addImm(SOImmValV2);
     LO16.cloneMemRefs(MI);
     HI16.cloneMemRefs(MI);
     LO16.setMIFlags(MIFlags);
     HI16.setMIFlags(MIFlags);
     LO16.addImm(Pred).addReg(PredReg).add(condCodeOp());
[Remaining lines not shown.]
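
The `SOImmValV1 = ~(-SOImmValV1)` step in the new `else` branch can look odd at first. The sketch below models the two emitted instructions with plain 32-bit arithmetic to show why it yields the wanted constant: MVN writes the bitwise NOT of its operand, so passing `~(-First)` leaves `-First` in the register, and the SUB then produces `-First - Second`. The function `mvnThenSub` is a made-up name for illustration and is not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Model of the two instructions for a constant V where -V == first + second.
constexpr uint32_t mvnThenSub(uint32_t first, uint32_t second) {
  uint32_t mvnOperand = ~(0u - first);  // mirrors SOImmValV1 = ~(-SOImmValV1)
  uint32_t reg = ~mvnOperand;           // mvn rd, #mvnOperand   => rd = -first
  return reg - second;                  // sub rd, rd, #second   => rd = -first - second
}

// -0x2323 splits as first = 0x23, second = 0x2300.
static_assert(mvnThenSub(0x23u, 0x2300u) == 0u - 0x2323u, "two-instruction result");
static_assert(mvnThenSub(0x23u, 0x2300u) == 0xffffdcddu, "same value in hex");

// Runtime spot check: the identity holds for any pair under wraparound.
inline void checkIdentity() {
  const uint32_t pairs[][2] = {{0x23u, 0x2300u}, {0x1u, 0x100u}, {0xffu, 0xff000000u}};
  for (const auto &p : pairs)
    assert(mvnThenSub(p[0], p[1]) == 0u - (p[0] + p[1]));
}
```

The arithmetic identity always holds; what the patch additionally has to guarantee is that `~(-First)` is itself an encodable immediate, which is the job of the extra check in `isSOImmTwoPartValNeg`.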

llvm/lib/Target/ARM/ARMInstrInfo.td

[Earlier lines not shown. Nearest context: def mod_imm_neg : Operand<i32>, PatLeaf<(imm), [{]
 }], imm_neg_XFORM> {
   let ParserMatchClass = ModImmNegAsmOperand;
 }

 /// arm_i32imm - True for +V6T2, or when isSOImmTwoParVal()
 def arm_i32imm : IntImmLeaf<i32, [{
   if (Subtarget->useMovt())
     return true;
-  return ARM_AM::isSOImmTwoPartVal(Imm.getZExtValue());
+  if (ARM_AM::isSOImmTwoPartVal(Imm.getZExtValue()))
+    return true;
+  return ARM_AM::isSOImmTwoPartValNeg(Imm.getZExtValue());
 }]>;

 /// imm0_1 predicate - Immediate in the range [0,1].
 def Imm0_1AsmOperand: ImmAsmOperand<0,1> { let Name = "Imm0_1"; }
 def imm0_1 : Operand<i32> { let ParserMatchClass = Imm0_1AsmOperand; }

 /// imm0_3 predicate - Immediate in the range [0,3].
 def Imm0_3AsmOperand: ImmAsmOperand<0,3> { let Name = "Imm0_3"; }
[Remaining lines not shown.]

llvm/lib/Target/ARM/MCTargetDesc/ARMAddressingModes.h

[Earlier lines not shown. Enclosing function: inline unsigned getSOImmTwoPartSecond(unsigned V) {]
     // Mask out the first hunk.
     V = rotr32(~255U, getSOImmValRotate(V)) & V;

     // Take what's left.
     assert(V == (rotr32(255U, getSOImmValRotate(V)) & V));
     return V;
   }

+  /// isSOImmTwoPartValNeg - Return true if the specified value can be obtained
+  /// by two SOImmVal, that -V = First + Second.
+  /// "R+V" can be optimized to (sub (sub R, First), Second).
+  /// "R=V" can be optimized to (sub (mvn R, ~(-First)), Second).
+  inline bool isSOImmTwoPartValNeg(unsigned V) {
+    unsigned First;
+    if (!isSOImmTwoPartVal(-V))
+      return false;
+    // Return false if ~(-First) is not a SoImmval.
+    First = getSOImmTwoPartFirst(-V);
+    First = ~(-First);
+    return !(rotr32(~255U, getSOImmValRotate(First)) & First);
+  }
+
   /// getThumbImmValShift - Try to handle Imm with a 8-bit immediate followed
   /// by a left shift. Returns the shift amount to use.
   inline unsigned getThumbImmValShift(unsigned Imm) {
     // 8-bit (or less) immediates are trivially immediate operand with a shift
     // of zero.
     if ((Imm & ~255U) == 0) return 0;

     // Use CTZ to compute the shift amount.
[Remaining lines not shown.]
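
For readers who want to experiment with the new predicate outside of LLVM, here is a standalone sketch. The helpers `soImmRotate`, `firstPart`, `stripFirst` and `isTwoPart` are my own paraphrase of what `getSOImmValRotate`, `getSOImmTwoPartFirst` and `isSOImmTwoPartVal` are understood to do; they are assumptions written for illustration, not copies of the LLVM sources. Only `isTwoPartNeg` follows the hunk above line by line.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t rotr32(uint32_t v, unsigned n) {
  n &= 31;
  return n == 0 ? v : (v >> n) | (v << (32 - n));
}

// Count trailing zeros; v must be non-zero.
constexpr unsigned ctz32(uint32_t v) {
  unsigned n = 0;
  while ((v & 1u) == 0) { v >>= 1; ++n; }
  return n;
}

// Assumed behaviour of getSOImmValRotate: the rotate-right amount whose
// 8-bit window covers the lowest set bits of imm (rotations are even).
constexpr unsigned soImmRotate(uint32_t imm) {
  if ((imm & ~255u) == 0) return 0;
  unsigned rot = ctz32(imm) & ~1u;
  if ((rotr32(imm, rot) & ~255u) == 0) return (32 - rot) & 31;
  if (imm & 63u) {  // the value may wrap around, e.g. 0xF000000F
    unsigned rot2 = ctz32(imm & ~63u) & ~1u;
    if ((rotr32(imm, rot2) & ~255u) == 0) return (32 - rot2) & 31;
  }
  return (32 - rot) & 31;
}

constexpr uint32_t firstPart(uint32_t v) { return rotr32(255u, soImmRotate(v)) & v; }
constexpr uint32_t stripFirst(uint32_t v) { return rotr32(~255u, soImmRotate(v)) & v; }

// Assumed behaviour of isSOImmTwoPartVal: exactly two chunks are needed.
constexpr bool isTwoPart(uint32_t v) {
  v = stripFirst(v);
  if (v == 0) return false;   // a single immediate already suffices
  return stripFirst(v) == 0;  // the second chunk covers the rest
}

// Mirrors isSOImmTwoPartValNeg from the hunk above.
constexpr bool isTwoPartNeg(uint32_t v) {
  uint32_t neg = 0u - v;
  if (!isTwoPart(neg)) return false;
  uint32_t first = ~(0u - firstPart(neg));  // the MVN operand
  return stripFirst(first) == 0;            // it must be encodable too
}

static_assert(isTwoPartNeg(0xffffdcddu), "-8995: two instructions");
static_assert(isTwoPartNeg(0xff55ff55u), "constant from select-imm.ll t4");
static_assert(!isTwoPartNeg(0xfffe0001u), "-131071 still needs the pool");
// In this sketch, -0x232300 is rejected: the MVN operand would be 0x22ff.
static_assert(isTwoPart(0x232300u) && !isTwoPartNeg(0u - 0x232300u), "extra check matters");

inline void checkFirstPart() { assert(firstPart(0x2323u) == 0x23u); }
```

The last `static_assert` shows a value where the negated constant splits into two parts but the extra "`~(-First)` is encodable" condition fails, at least under the helper behaviour assumed here.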

llvm/test/CodeGen/ARM/add-sub-imm.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=arm-eabi %s -o - | FileCheck %s --check-prefix=CHECK

				;; Check how immediates are handled in add/sub.

				define i32 @sub0(i32 %0) {
				; CHECK-LABEL: sub0:
				; CHECK: @ %bb.0:
				; CHECK-NEXT: sub r0, r0, #23
				; CHECK-NEXT: mov pc, lr
				%2 = sub i32 %0, 23
				ret i32 %2
				}

				define i32 @sub1(i32 %0) {
				; CHECK-LABEL: sub1:
				; CHECK: @ %bb.0:
				; CHECK-NEXT: ldr r1, .LCPI1_0
				; CHECK-NEXT: add r0, r0, r1
				; CHECK-NEXT: mov pc, lr
				; CHECK-NEXT: .p2align 2
				; CHECK-NEXT: @ %bb.1:
				; CHECK-NEXT: .LCPI1_0:
				; CHECK-NEXT: .long 4294836225 @ 0xfffe0001
				%2 = sub i32 %0, 131071
				ret i32 %2
				}

				define i32 @sub2(i32 %0) {
				; CHECK-LABEL: sub2:
				; CHECK: @ %bb.0:
				; CHECK-NEXT: sub r0, r0, #35
				; CHECK-NEXT: sub r0, r0, #8960
				; CHECK-NEXT: mov pc, lr
				%2 = sub i32 %0, 8995
				ret i32 %2
				}

				define i32 @add0(i32 %0) {
				; CHECK-LABEL: add0:
				; CHECK: @ %bb.0:
				; CHECK-NEXT: add r0, r0, #23
				; CHECK-NEXT: mov pc, lr
				%2 = add i32 %0, 23
				ret i32 %2
				}

				define i32 @add1(i32 %0) {
				; CHECK-LABEL: add1:
				; CHECK: @ %bb.0:
				; CHECK-NEXT: ldr r1, .LCPI4_0
				; CHECK-NEXT: add r0, r0, r1
				; CHECK-NEXT: mov pc, lr
				; CHECK-NEXT: .p2align 2
				; CHECK-NEXT: @ %bb.1:
				; CHECK-NEXT: .LCPI4_0:
				; CHECK-NEXT: .long 131071 @ 0x1ffff
				%2 = add i32 %0, 131071
				ret i32 %2
				}

				define i32 @add2(i32 %0) {
				; CHECK-LABEL: add2:
				; CHECK: @ %bb.0:
				; CHECK-NEXT: add r0, r0, #8960
				; CHECK-NEXT: add r0, r0, #2293760
				; CHECK-NEXT: mov pc, lr
				%2 = add i32 %0, 2302720
				ret i32 %2
				}
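
The expectations in this test file can be cross-checked with a small classifier. This is a deliberately simplified model, not the selection logic LLVM uses: it assumes A32 without MOVW, treats `sub x, k` as `add x, -k`, uses a brute-force split instead of LLVM's helpers, and ignores the extra restriction on the MVN operand. All names in it are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t rotr32(uint32_t v, unsigned n) {
  n &= 31;
  return n == 0 ? v : (v >> n) | (v << (32 - n));
}

// True if v is an 8-bit value rotated by an even amount.
constexpr bool isModImm(uint32_t v) {
  for (unsigned rot = 0; rot < 32; rot += 2)
    if ((rotr32(v, rot) & ~0xFFu) == 0) return true;
  return false;
}

// True if v is not one modified immediate but splits into two of them.
constexpr bool isTwoModImms(uint32_t v) {
  if (isModImm(v)) return false;
  for (unsigned rot = 0; rot < 32; rot += 2) {
    uint32_t window = rotr32(0xFFu, rot);
    if ((v & window) != 0 && isModImm(v & ~window)) return true;
  }
  return false;
}

enum class Lowering { OneInstr, TwoInstrs, LiteralPool };

// c is the constant operand of an add; a sub of k is passed in as -k.
constexpr Lowering classifyAdd(uint32_t c) {
  uint32_t neg = 0u - c;
  if (isModImm(c) || isModImm(neg)) return Lowering::OneInstr;
  if (isTwoModImms(c) || isTwoModImms(neg)) return Lowering::TwoInstrs;
  return Lowering::LiteralPool;
}

static_assert(classifyAdd(0u - 23u) == Lowering::OneInstr, "sub0");
static_assert(classifyAdd(0u - 131071u) == Lowering::LiteralPool, "sub1");
static_assert(classifyAdd(0u - 8995u) == Lowering::TwoInstrs, "sub2");
static_assert(classifyAdd(23u) == Lowering::OneInstr, "add0");
static_assert(classifyAdd(131071u) == Lowering::LiteralPool, "add1");
static_assert(classifyAdd(2302720u) == Lowering::TwoInstrs, "add2");

inline void checkAdd2Split() { assert(8960u + 2293760u == 2302720u); }
```

Under these assumptions the model agrees with all six functions in the test: one instruction for 23, a literal pool load for 131071, and two instructions for 8995 and 2302720.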

llvm/test/CodeGen/ARM/select-imm.ll

[Earlier lines not shown. Nearest context: ; THUMB2: lsrs r0, r0, #5]
   %0 = icmp eq i32 %a, 160
   %1 = zext i1 %0 to i32
   ret i32 %1
 }

 define i32 @t4(i32 %a, i32 %b, i32 %x) nounwind {
 entry:
 ; ARM-LABEL: t4:
-; ARM: ldr
+; ARM: mvn [[R0:r[0-9]+]], #170
+; ARM: sub [[R0:r[0-9]+]], [[R0:r[0-9]+]], #11141120
 ; ARM: mov{{lt|ge}}

 ; ARMT2-LABEL: t4:
 ; ARMT2: movwlt [[R0:r[0-9]+]], #65365
 ; ARMT2: movtlt [[R0]], #65365

 ; THUMB1-LABEL: t4:
 ; THUMB1: cmp r{{[0-9]+}}, r{{[0-9]+}}
[Remaining lines not shown.]
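
A quick arithmetic check of the updated t4 expectations: assuming the ARM and ARMT2 check lines materialize the same constant, the new mvn/sub pair should produce the value that movw/movt builds. The function names below are invented for this sketch; the immediates are copied from the check lines.

```cpp
#include <cassert>
#include <cstdint>

// ARM check lines: mvn rX, #170 ; sub rX, rX, #11141120
constexpr uint32_t viaMvnSub() { return ~uint32_t{170} - uint32_t{11141120}; }

// ARMT2 check lines: movw rX, #65365 ; movt rX, #65365
constexpr uint32_t viaMovwMovt() { return (uint32_t{65365} << 16) | uint32_t{65365}; }

static_assert(viaMvnSub() == 0xff55ff55u, "mvn + sub result");
static_assert(viaMovwMovt() == 0xff55ff55u, "movw + movt result");

// Runtime form of the same comparison.
inline void checkT4() { assert(viaMvnSub() == viaMovwMovt()); }
```

Both routes give 0xff55ff55, which is consistent with the literal pool load (`ldr`) being replaced by two data-processing instructions on targets without MOVW/MOVT.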

This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Optimize immediate selection
ClosedPublic
