This is an archive of the discontinued LLVM Phabricator instance.

llvm/test/CodeGen/LoongArch/ir-instruction/and.ll
287–288	This is not optimized, since `0xff000` can be composed by a single `lu12i.w`.
325–326	This is not optimized, since `0xff000` can be composed by a single `lu12i.w`.
328–329	This is not optimized, since the constant is used twice.
384	This is not optimized, since the value `-31` can be composed by a single `ADDI.W`.

benshi001 updated this revision to Diff 510175. Mar 31 2023, 9:52 PM

benshi001 added inline comments. Mar 31 2023, 9:55 PM

llvm/test/CodeGen/LoongArch/ir-instruction/and.ll
384	This is not optimized, since the value `-2048` can be composed by a single `ADDI.W`.

Harbormaster completed remote builds in B223113: Diff 510175. Mar 31 2023, 10:28 PM

xen0n added inline comments. Mar 31 2023, 11:56 PM

llvm/test/CodeGen/LoongArch/ir-instruction/and.ll
319–321	Seems wrong? The intended operation is to retain only `a[15:4]` so we should `bstrpick.d $a0, $a0, 15, 4` to retain bits, then `slli.d $a0, $a0, 4` to restore the LSB position. (LoongArch `bstrpick` invariably stores to `rd`'s LSB side, and will not retain the original relative position.)
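
The equivalence described in this comment can be checked with a small software model. The sketch below is illustrative only: `bstrpick`, `slli`, and `andMask` are hypothetical helper functions that mimic the 64-bit instructions, not LLVM or hardware APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of LoongArch `bstrpick.d rd, rj, msb, lsb`:
// extract bits [msb:lsb] of rj, place them at the LSB side of rd,
// and zero the remaining upper bits.
constexpr uint64_t bstrpick(uint64_t rj, unsigned msb, unsigned lsb) {
  const unsigned len = msb - lsb + 1;
  const uint64_t field = rj >> lsb;
  return len == 64 ? field : field & ((uint64_t{1} << len) - 1);
}

// Hypothetical model of `slli.d rd, rj, sa`.
constexpr uint64_t slli(uint64_t rj, unsigned sa) { return rj << sa; }

// `and x, 0xfff0` keeps bits [15:4] in their original position.
constexpr uint64_t andMask(uint64_t x) { return x & 0xfff0; }

// bstrpick alone moves the field down to bit 0, so it does not match the AND.
static_assert(bstrpick(0x12345678, 15, 4) == 0x567, "field lands at the LSB side");
static_assert(bstrpick(0x12345678, 15, 4) != andMask(0x12345678), "position is lost");
// Shifting left by lsb restores the original bit position.
static_assert(slli(bstrpick(0x12345678, 15, 4), 4) == andMask(0x12345678),
              "bstrpick + slli matches the AND");
```

The checks are compile-time `static_assert`s, so the snippet verifies itself when compiled with a C++14 or newer compiler.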

benshi001 updated this revision to Diff 510188. Apr 1 2023, 12:18 AM

benshi001 marked an inline comment as done.

benshi001 added inline comments.

llvm/test/CodeGen/LoongArch/ir-instruction/and.ll
319–321	Thanks. I should not make such a mistake :(

benshi001 marked an inline comment as done. Apr 1 2023, 12:18 AM

benshi001 updated this revision to Diff 510190. Apr 1 2023, 12:22 AM

Good catch, thanks!

IMO you could include as comments your explanations to existing cases where this optimization has not taken place (e.g. "This is not optimized into bstrpick + slli because the constant has multiple uses."), so future readers wouldn't have to do archaeology to see them. The code changes LGTM.

This revision is now accepted and ready to land. Apr 1 2023, 12:23 AM

I've just found there's no commit message. In general it could be helpful to include one or two short sentences describing the actual change, e.g. "Optimize bitfield extractions retaining bit positions with bstrpick + slli", because "optimize foo" otherwise doesn't carry useful information about the actual improvements.

llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
1386	nit: "constant"

In D147368#4238148, @xen0n wrote:

Good catch, thanks!

IMO you could include as comments your explanations to existing cases where this optimization has not taken place (e.g. "This is not optimized into bstrpick + slli because the constant has multiple uses."), so future readers wouldn't have to do archaeology to see them. The code changes LGTM.

Thanks. I have updated https://reviews.llvm.org/D147367 with your suggestion!

benshi001 edited the summary of this revision. Apr 1 2023, 12:46 AM

benshi001 marked an inline comment as done.

benshi001 added inline comments.

llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp
1386	I have fixed this typo in my local repo, and it will be correct when committing. Thanks!

benshi001 marked an inline comment as done. Apr 1 2023, 12:47 AM

Harbormaster completed remote builds in B223124: Diff 510190. Apr 1 2023, 12:54 AM

SixWeining added inline comments. Apr 2 2023, 6:03 AM

llvm/test/CodeGen/LoongArch/ir-instruction/and.ll
339–344	It's also a win if optimized to `bstrpick + slli` since we can save a register `$a2`. Right?
; LA64-NEXT: bstrpick.d $a0, $a0, 15, 4
; LA64-NEXT: slli.d $a0, $a0, 4
; LA64-NEXT: bstrpick.d $a1, $a1, 15, 4
; LA64-NEXT: slli.d $a1, $a1, 4
; LA64-NEXT: mul.d $a0, $a0, $a1
But if the immediate `0xfff0` is used 3 times, we have 2 choices:
- Use fewer registers but one more instruction.
- Use fewer instructions but one more register.
I'm not sure how to balance this.

benshi001 added inline comments. Apr 2 2023, 7:19 AM

llvm/test/CodeGen/LoongArch/ir-instruction/and.ll
339–344	I am also not sure about the case of more than 2 uses. Maybe we make a conservative choice and just leave it unchanged? And we only handle 1 and 2 uses?

SixWeining added inline comments. Apr 2 2023, 6:51 PM

llvm/test/CodeGen/LoongArch/ir-instruction/and.ll
339–344	I agree with you.
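
The trade-off discussed in this thread can be made concrete by counting. The sketch below assumes, as in the `and.ll` test output, that `0xfff0` is materialized once with two instructions (`lu12i.w` + `ori`) and then needs one `and` per use, while the alternative needs one `bstrpick` and one `slli` per use and no constant register. `Cost`, `costWithAnd`, and `costWithBstrpick` are hypothetical names used only for this illustration.

```cpp
#include <cassert>

// Illustrative instruction/register counts for masking N values with 0xfff0.
struct Cost {
  unsigned instructions;
  unsigned extraRegisters; // scratch registers needed to hold the constant
};

// Materialize the constant once (lu12i.w + ori), then one `and` per use.
constexpr Cost costWithAnd(unsigned uses) { return {2 + uses, 1}; }

// One bstrpick plus one slli per use; no register is spent on the constant.
constexpr Cost costWithBstrpick(unsigned uses) { return {2 * uses, 0}; }

// 1 use: 3 vs. 2 instructions, and a register is saved.
static_assert(costWithAnd(1).instructions == 3 &&
              costWithBstrpick(1).instructions == 2, "clear win");
// 2 uses: 4 vs. 4 instructions, but a register is still saved.
static_assert(costWithAnd(2).instructions == 4 &&
              costWithBstrpick(2).instructions == 4, "tie on count");
// 3 uses: 5 vs. 6 instructions, so it becomes register vs. instruction count.
static_assert(costWithAnd(3).instructions == 5 &&
              costWithBstrpick(3).instructions == 6, "one more instruction");
```

Under this counting, the third use is the first point where the rewrite costs an extra instruction, which is where the thread chooses to stop.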

benshi001 updated this revision to Diff 510377. Apr 2 2023, 6:59 PM

benshi001 marked 2 inline comments as done.

benshi001 added inline comments.

llvm/test/CodeGen/LoongArch/ir-instruction/and.ll
339–344	I have updated my code:
1. 1 and 2 uses are handled, more uses are rejected;
2. corresponding tests are added for the above logic.

benshi001 marked an inline comment as done. Apr 2 2023, 7:01 PM

benshi001 added inline comments.

llvm/test/CodeGen/LoongArch/ir-instruction/and.ll
392	This is the case where the immediate has 3 uses, which is not optimized.

LGTM from a cursory look, thanks! Although putting your informative replies into the code as comments would be even better, as others would see the rationale along with the code and not have to do archaeology themselves.

llvm/test/CodeGen/LoongArch/ir-instruction/and.ll
339–344	Fine with me. If we were to aim for the maximum performance possible then that'll have to be backed with actual micro-architecture details, so the optimizer can do the right thing for the right models. Fewer instructions executed usually can't hurt after all. (In my Go benchmarking experience, loop alignment could have a far more profound influence on the numbers than micro-optimizations like this, so it's probably fine to not care as much here.)

Changes to test like llvm/test/CodeGen/LoongArch/alloca.ll are recovered?

In D147368#4239617, @SixWeining wrote:

Changes to test like llvm/test/CodeGen/LoongArch/alloca.ll are recovered?

I have supplemented llvm/test/CodeGen/LoongArch/alloca.ll and other affected tests.

I am working on several things in parallel; sorry for so many typos and slips. :(

In D147368#4239614, @xen0n wrote:

LGTM from a cursory look, thanks! Although putting your informative replies into the code as comments would be even better, as others would see the rationale along with the code and not have to do archaeology themselves.

Thanks for your suggestion. I have added some informative comments to both c++ code and the test code.

In D147368#4239647, @benshi001 wrote:

Thanks for your suggestion. I have added some informative comments to both c++ code and the test code.

Ah. I actually meant putting the justification for the "> 2 uses" (being conservative when faced with register pressure vs. instruction count blah blah), and it's actually fine to not comment what you did when the code is self-documenting. In general just document why and not what you do for a piece of code that may warrant such explanation.

Harbormaster completed remote builds in B223276: Diff 510381. Apr 2 2023, 8:27 PM

In D147368#4239651, @xen0n wrote:

Ah. I actually meant putting the justification for the "> 2 uses" (being conservative when faced with register pressure vs. instruction count blah blah), and it's actually fine to not comment what you did when the code is self-documenting. In general just document why and not what you do for a piece of code that may warrant such explanation.

I have commented in the C++ code as follows:

// Omit if the constant has more than 2 uses. This a conservative
// decision. Whether it is a win depends on the HW microarchitecture.
// However it should always be better for 1 and 2 uses.
if (CN->use_size() > 2)
  return SDValue();

In D147368#4239678, @benshi001 wrote:

I have commented in the C++ code as follows:
// Omit if the constant has more than 2 uses. This a conservative
// decision. Whether it is a win depends on the HW microarchitecture.
// However it should always be better for 1 and 2 uses.
if (CN->use_size() > 2)
  return SDValue();

Looks good, thanks!

SixWeining accepted this revision. Apr 2 2023, 9:04 PM

This revision was landed with ongoing or failed builds. Apr 2 2023, 9:11 PM

Closed by commit rG96dcd8cb9446: [LoongArch] Optimize bitwise and with immediates (authored by benshi001).

This revision was automatically updated to reflect the committed changes.

benshi001 mentioned this in rGd726c753886b: [LoongArch][NFC] Add tests of bitwise and with immediates (for D147368).

benshi001 added a commit: rG96dcd8cb9446: [LoongArch] Optimize bitwise and with immediates.

Harbormaster completed remote builds in B223281: Diff 510390. Apr 2 2023, 9:23 PM

Revision Contents

Path	Size
llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp	33 lines
llvm/test/CodeGen/LoongArch/alloca.ll	17 lines
llvm/test/CodeGen/LoongArch/ir-instruction/and.ll	56 lines
llvm/test/CodeGen/LoongArch/shrinkwrap.ll	10 lines
llvm/test/CodeGen/LoongArch/stack-realignment-with-variable-sized-objects.ll	5 lines

Diff 510394

llvm/lib/Target/LoongArch/LoongArchISelLowering.cpp

Show First 20 Lines • Show All 1,366 Lines • ▼ Show 20 Lines	if (FirstOperandOpc == ISD::SRA || FirstOperandOpc == ISD::SRL) {
     // $dst = and $src, (2**len- 1) , if len > 12
     // => BSTRPICK $dst, $src, msb, lsb
     // where lsb = 0 and msb = len - 1
 
     // If the mask is <= 0xfff, andi can be used instead.
     if (CN->getZExtValue() <= 0xfff)
       return SDValue();
 
-    // Return if the mask doesn't start at position 0.
-    if (SMIdx)
+    // Return if the MSB exceeds.
+    if (SMIdx + SMLen > ValTy.getSizeInBits())
       return SDValue();
 
-    lsb = 0;
+    if (SMIdx > 0) {
+      // Omit if the constant has more than 2 uses. This a conservative
+      // decision. Whether it is a win depends on the HW microarchitecture.
+      // However it should always be better for 1 and 2 uses.
+      if (CN->use_size() > 2)
+        return SDValue();
+      // Return if the constant can be composed by a single LU12I.W.
+      if ((CN->getZExtValue() & 0xfff) == 0)
+        return SDValue();
+      // Return if the constand can be composed by a single ADDI with
+      // the zero register.
+      if (CN->getSExtValue() >= -2048 && CN->getSExtValue() < 0)
+        return SDValue();
+    }
+
+    lsb = SMIdx;
     NewOperand = FirstOperand;
   }
 
   msb = lsb + SMLen - 1;
-  return DAG.getNode(LoongArchISD::BSTRPICK, DL, ValTy, NewOperand,
-                     DAG.getConstant(msb, DL, GRLenVT),
-                     DAG.getConstant(lsb, DL, GRLenVT));
+  SDValue NR0 = DAG.getNode(LoongArchISD::BSTRPICK, DL, ValTy, NewOperand,
+                            DAG.getConstant(msb, DL, GRLenVT),
+                            DAG.getConstant(lsb, DL, GRLenVT));
+  if (FirstOperandOpc == ISD::SRA || FirstOperandOpc == ISD::SRL || lsb == 0)
+    return NR0;
+  // Try to optimize to
+  // bstrpick $Rd, $Rs, msb, lsb
+  // slli $Rd, $Rd, lsb
+  return DAG.getNode(ISD::SHL, DL, ValTy, NR0,
+                     DAG.getConstant(lsb, DL, GRLenVT));
 }
 
 static SDValue performSRLCombine(SDNode *N, SelectionDAG &DAG,
                                  TargetLowering::DAGCombinerInfo &DCI,
                                  const LoongArchSubtarget &Subtarget) {
   if (DCI.isBeforeLegalizeOps())
     return SDValue();

▲ Show 20 Lines • Show All 1,808 Lines • Show Last 20 Lines
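
As a rough illustration of the checks in the new code path, the following standalone sketch approximates the decision for a 64-bit `and` with a constant mask. It is not the LLVM implementation: `BitRange` and `pickBstrpick` are hypothetical names, `uses` stands in for `CN->use_size()`, the shifted-mask detection is written out by hand, and the separate `SRA`/`SRL` branch of the combine is not modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Bit range that would be passed to bstrpick (hypothetical type).
struct BitRange {
  unsigned msb;
  unsigned lsb;
};

// Returns the bstrpick range if the mask qualifies, or nullopt to keep `and`.
// When lsb > 0 the caller would follow bstrpick with `slli ..., lsb`.
std::optional<BitRange> pickBstrpick(uint64_t mask, unsigned uses) {
  if (mask <= 0xfff)
    return std::nullopt; // andi can be used instead
  // Locate the lowest set bit (mask is non-zero here).
  unsigned lsb = 0;
  while (((mask >> lsb) & 1) == 0)
    ++lsb;
  // The remaining bits must form one contiguous run of ones.
  const uint64_t run = mask >> lsb;
  if ((run & (run + 1)) != 0)
    return std::nullopt; // not a shifted mask
  unsigned len = 0;
  while (len < 64 && ((run >> len) & 1))
    ++len;
  if (lsb > 0) {
    if (uses > 2)
      return std::nullopt; // conservative: instruction count vs. registers
    if ((mask & 0xfff) == 0)
      return std::nullopt; // low 12 bits clear: the lu12i.w case in the diff
    const int64_t asSigned = static_cast<int64_t>(mask);
    if (asSigned >= -2048 && asSigned < 0)
      return std::nullopt; // a single addi from $zero builds the constant
  }
  return BitRange{lsb + len - 1, lsb};
}
```

With these rules, `0xfff0` with one or two uses maps to the range `(15, 4)`, while the same mask with three uses, `0xff000`, and `-2048` are all left as a plain `and`, matching the cases discussed in the inline comments.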

llvm/test/CodeGen/LoongArch/alloca.ll

	Show All 28 Lines
 ; LA32-NEXT: ret
 ;
 ; LA64-LABEL: simple_alloca:
 ; LA64: # %bb.0:
 ; LA64-NEXT: addi.d $sp, $sp, -16
 ; LA64-NEXT: st.d $ra, $sp, 8 # 8-byte Folded Spill
 ; LA64-NEXT: st.d $fp, $sp, 0 # 8-byte Folded Spill
 ; LA64-NEXT: addi.d $fp, $sp, 16
-; LA64-NEXT: addi.w $a1, $zero, -16
-; LA64-NEXT: lu32i.d $a1, 1
 ; LA64-NEXT: bstrpick.d $a0, $a0, 31, 0
 ; LA64-NEXT: addi.d $a0, $a0, 15
-; LA64-NEXT: and $a0, $a0, $a1
+; LA64-NEXT: bstrpick.d $a0, $a0, 32, 4
+; LA64-NEXT: slli.d $a0, $a0, 4
 ; LA64-NEXT: sub.d $a0, $sp, $a0
 ; LA64-NEXT: move $sp, $a0
 ; LA64-NEXT: bl %plt(notdead)
 ; LA64-NEXT: addi.d $sp, $fp, -16
 ; LA64-NEXT: ld.d $fp, $sp, 0 # 8-byte Folded Reload
 ; LA64-NEXT: ld.d $ra, $sp, 8 # 8-byte Folded Reload
 ; LA64-NEXT: addi.d $sp, $sp, 16
 ; LA64-NEXT: ret
	Show All 30 Lines
 ;
 ; LA64-LABEL: scoped_alloca:
 ; LA64: # %bb.0:
 ; LA64-NEXT: addi.d $sp, $sp, -32
 ; LA64-NEXT: st.d $ra, $sp, 24 # 8-byte Folded Spill
 ; LA64-NEXT: st.d $fp, $sp, 16 # 8-byte Folded Spill
 ; LA64-NEXT: st.d $s0, $sp, 8 # 8-byte Folded Spill
 ; LA64-NEXT: addi.d $fp, $sp, 32
-; LA64-NEXT: addi.w $a1, $zero, -16
-; LA64-NEXT: lu32i.d $a1, 1
+; LA64-NEXT: move $s0, $sp
 ; LA64-NEXT: bstrpick.d $a0, $a0, 31, 0
 ; LA64-NEXT: addi.d $a0, $a0, 15
-; LA64-NEXT: and $a0, $a0, $a1
-; LA64-NEXT: move $s0, $sp
+; LA64-NEXT: bstrpick.d $a0, $a0, 32, 4
+; LA64-NEXT: slli.d $a0, $a0, 4
 ; LA64-NEXT: sub.d $a0, $sp, $a0
 ; LA64-NEXT: move $sp, $a0
 ; LA64-NEXT: bl %plt(notdead)
 ; LA64-NEXT: move $sp, $s0
 ; LA64-NEXT: addi.d $sp, $fp, -32
 ; LA64-NEXT: ld.d $s0, $sp, 8 # 8-byte Folded Reload
 ; LA64-NEXT: ld.d $fp, $sp, 16 # 8-byte Folded Reload
 ; LA64-NEXT: ld.d $ra, $sp, 24 # 8-byte Folded Reload
	▲ Show 20 Lines • Show All 47 Lines • ▼ Show 20 Lines
 ; LA32-NEXT: ret
 ;
 ; LA64-LABEL: alloca_callframe:
 ; LA64: # %bb.0:
 ; LA64-NEXT: addi.d $sp, $sp, -16
 ; LA64-NEXT: st.d $ra, $sp, 8 # 8-byte Folded Spill
 ; LA64-NEXT: st.d $fp, $sp, 0 # 8-byte Folded Spill
 ; LA64-NEXT: addi.d $fp, $sp, 16
-; LA64-NEXT: addi.w $a1, $zero, -16
-; LA64-NEXT: lu32i.d $a1, 1
 ; LA64-NEXT: bstrpick.d $a0, $a0, 31, 0
 ; LA64-NEXT: addi.d $a0, $a0, 15
-; LA64-NEXT: and $a0, $a0, $a1
+; LA64-NEXT: bstrpick.d $a0, $a0, 32, 4
+; LA64-NEXT: slli.d $a0, $a0, 4
 ; LA64-NEXT: sub.d $a0, $sp, $a0
 ; LA64-NEXT: move $sp, $a0
 ; LA64-NEXT: addi.d $sp, $sp, -32
 ; LA64-NEXT: ori $a1, $zero, 12
 ; LA64-NEXT: st.d $a1, $sp, 24
 ; LA64-NEXT: ori $a1, $zero, 11
 ; LA64-NEXT: st.d $a1, $sp, 16
 ; LA64-NEXT: ori $a1, $zero, 10
	Show All 22 Lines
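
The `bstrpick.d ..., 32, 4` range in these tests follows from the constant that the old sequence built. The sketch below models the two removed instructions to show which mask they produce; `addiWFromZero` and `lu32iD` are hypothetical helpers written for this illustration, and the `lu32i.d` model assumes its immediate is already within the signed 20-bit range.

```cpp
#include <cassert>
#include <cstdint>

// addi.w rd, $zero, imm on LA64: 32-bit result, sign-extended to 64 bits.
constexpr uint64_t addiWFromZero(int32_t imm) {
  return static_cast<uint64_t>(static_cast<int64_t>(imm));
}

// lu32i.d rd, imm20: keep rd[31:0], replace rd[63:32] with sign-extended imm20.
constexpr uint64_t lu32iD(uint64_t rd, int32_t imm20) {
  return (static_cast<uint64_t>(static_cast<int64_t>(imm20)) << 32) |
         (rd & 0xffffffffULL);
}

// Old sequence: addi.w $a1, $zero, -16 ; lu32i.d $a1, 1
constexpr uint64_t kAllocaMask = lu32iD(addiWFromZero(-16), 1);

// The mask has ones exactly in bits [32:4] ...
static_assert(kAllocaMask == 0x1fffffff0ULL, "mask built by the old sequence");
// ... which is a 29-bit field shifted left by 4, hence msb = 32 and lsb = 4.
static_assert(kAllocaMask == (((uint64_t{1} << 29) - 1) << 4),
              "contiguous field of length 29 starting at bit 4");
```

Because the mask is a single contiguous run starting above bit 0, it fits the `bstrpick` + `slli` pattern and the two constant-materialization instructions disappear.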

llvm/test/CodeGen/LoongArch/ir-instruction/and.ll

	Show First 20 Lines • Show All 263 Lines • ▼ Show 20 Lines
 entry:
 %r = and i64 4096, %b
 ret i64 %r
 }

 define signext i32 @and_i32_0xfff0(i32 %a) {
 ; LA32-LABEL: and_i32_0xfff0:
 ; LA32: # %bb.0:
-; LA32-NEXT: lu12i.w $a1, 15
-; LA32-NEXT: ori $a1, $a1, 4080
-; LA32-NEXT: and $a0, $a0, $a1
+; LA32-NEXT: bstrpick.w $a0, $a0, 15, 4
+; LA32-NEXT: slli.w $a0, $a0, 4
 ; LA32-NEXT: ret
 ;
 ; LA64-LABEL: and_i32_0xfff0:
 ; LA64: # %bb.0:
-; LA64-NEXT: lu12i.w $a1, 15
-; LA64-NEXT: ori $a1, $a1, 4080
-; LA64-NEXT: and $a0, $a0, $a1
+; LA64-NEXT: bstrpick.d $a0, $a0, 15, 4
+; LA64-NEXT: slli.d $a0, $a0, 4
 ; LA64-NEXT: ret
 %b = and i32 %a, 65520
 ret i32 %b
 }

 define signext i32 @and_i32_0xfff0_twice(i32 %a, i32 %b) {
 ; LA32-LABEL: and_i32_0xfff0_twice:
 ; LA32: # %bb.0:
-; LA32-NEXT: lu12i.w $a2, 15
-; LA32-NEXT: ori $a2, $a2, 4080
-; LA32-NEXT: and $a1, $a1, $a2
-; LA32-NEXT: and $a0, $a0, $a2
+; LA32-NEXT: bstrpick.w $a1, $a1, 15, 4
+; LA32-NEXT: slli.w $a1, $a1, 4
+; LA32-NEXT: bstrpick.w $a0, $a0, 15, 4
+; LA32-NEXT: slli.w $a0, $a0, 4
 ; LA32-NEXT: sub.w $a0, $a0, $a1
 ; LA32-NEXT: ret
 ;
 ; LA64-LABEL: and_i32_0xfff0_twice:
 ; LA64: # %bb.0:
-; LA64-NEXT: lu12i.w $a2, 15
-; LA64-NEXT: ori $a2, $a2, 4080
-; LA64-NEXT: and $a1, $a1, $a2
-; LA64-NEXT: and $a0, $a0, $a2
+; LA64-NEXT: bstrpick.d $a1, $a1, 15, 4
+; LA64-NEXT: slli.d $a1, $a1, 4
+; LA64-NEXT: bstrpick.d $a0, $a0, 15, 4
+; LA64-NEXT: slli.d $a0, $a0, 4
 ; LA64-NEXT: sub.d $a0, $a0, $a1
 ; LA64-NEXT: ret
 %c = and i32 %a, 65520
 %d = and i32 %b, 65520
 %e = sub i32 %c, %d
 ret i32 %e
 }

 define i64 @and_i64_0xfff0(i64 %a) {
 ; LA32-LABEL: and_i64_0xfff0:
 ; LA32: # %bb.0:
-; LA32-NEXT: lu12i.w $a1, 15
-; LA32-NEXT: ori $a1, $a1, 4080
-; LA32-NEXT: and $a0, $a0, $a1
+; LA32-NEXT: bstrpick.w $a0, $a0, 15, 4
+; LA32-NEXT: slli.w $a0, $a0, 4
 ; LA32-NEXT: move $a1, $zero
 ; LA32-NEXT: ret
 ;
 ; LA64-LABEL: and_i64_0xfff0:
 ; LA64: # %bb.0:
-; LA64-NEXT: lu12i.w $a1, 15
-; LA64-NEXT: ori $a1, $a1, 4080
-; LA64-NEXT: and $a0, $a0, $a1
+; LA64-NEXT: bstrpick.d $a0, $a0, 15, 4
+; LA64-NEXT: slli.d $a0, $a0, 4
 ; LA64-NEXT: ret
 %b = and i64 %a, 65520
 ret i64 %b
 }

 define i64 @and_i64_0xfff0_twice(i64 %a, i64 %b) {
 ; LA32-LABEL: and_i64_0xfff0_twice:
 ; LA32: # %bb.0:
-; LA32-NEXT: lu12i.w $a1, 15
-; LA32-NEXT: ori $a1, $a1, 4080
-; LA32-NEXT: and $a2, $a2, $a1
-; LA32-NEXT: and $a1, $a0, $a1
-; LA32-NEXT: sub.w $a0, $a1, $a2
-; LA32-NEXT: sltu $a1, $a1, $a2
+; LA32-NEXT: bstrpick.w $a1, $a2, 15, 4
+; LA32-NEXT: slli.w $a1, $a1, 4
+; LA32-NEXT: bstrpick.w $a0, $a0, 15, 4
+; LA32-NEXT: slli.w $a2, $a0, 4
+; LA32-NEXT: sub.w $a0, $a2, $a1
+; LA32-NEXT: sltu $a1, $a2, $a1
 ; LA32-NEXT: sub.w $a1, $zero, $a1
 ; LA32-NEXT: ret
 ;
 ; LA64-LABEL: and_i64_0xfff0_twice:
 ; LA64: # %bb.0:
-; LA64-NEXT: lu12i.w $a2, 15
-; LA64-NEXT: ori $a2, $a2, 4080
-; LA64-NEXT: and $a1, $a1, $a2
-; LA64-NEXT: and $a0, $a0, $a2
+; LA64-NEXT: bstrpick.d $a1, $a1, 15, 4
+; LA64-NEXT: slli.d $a1, $a1, 4
+; LA64-NEXT: bstrpick.d $a0, $a0, 15, 4
+; LA64-NEXT: slli.d $a0, $a0, 4
 ; LA64-NEXT: sub.d $a0, $a0, $a1
 ; LA64-NEXT: ret
 %c = and i64 %a, 65520
 %d = and i64 %b, 65520
 %e = sub i64 %c, %d
 ret i64 %e
 }

 ;; This case is not optimized to `bstrpick + slli`,
	Show All 23 Lines
 ; LA32-LABEL: and_i64_minus_2048:
 ; LA32: # %bb.0:
 ; LA32-NEXT: addi.w $a2, $zero, -2048
 ; LA32-NEXT: and $a0, $a0, $a2
 ; LA32-NEXT: ret
 ;
 ; LA64-LABEL: and_i64_minus_2048:
 ; LA64: # %bb.0:
 ; LA64-NEXT: addi.w $a1, $zero, -2048
 ; LA64-NEXT: and $a0, $a0, $a1
 ; LA64-NEXT: ret
 %b = and i64 %a, -2048
 ret i64 %b
 }

 ;; This case is not optimized to `bstrpick + slli`,
 ;; since the immediate 0xfff0 has more than 2 uses.
 define i64 @and_i64_0xfff0_multiple_times(i64 %a, i64 %b, i64 %c) {
 ; LA32-LABEL: and_i64_0xfff0_multiple_times:
 ; LA32: # %bb.0:
 ; LA32-NEXT: lu12i.w $a1, 15
 ; LA32-NEXT: ori $a1, $a1, 4080
 ; LA32-NEXT: and $a3, $a0, $a1
 ; LA32-NEXT: and $a0, $a4, $a1
 ; LA32-NEXT: and $a1, $a2, $a1
	Show All 26 Lines

llvm/test/CodeGen/LoongArch/shrinkwrap.ll

	Show First 20 Lines • Show All 54 Lines • ▼ Show 20 Lines
 ; NOSHRINKW-NEXT: move $a1, $a0
 ; NOSHRINKW-NEXT: st.d $a1, $fp, -24 # 8-byte Folded Spill
 ; NOSHRINKW-NEXT: bstrpick.d $a1, $a0, 31, 0
 ; NOSHRINKW-NEXT: ori $a0, $zero, 32
 ; NOSHRINKW-NEXT: bltu $a0, $a1, .LBB1_2
 ; NOSHRINKW-NEXT: b .LBB1_1
 ; NOSHRINKW-NEXT: .LBB1_1: # %if.then
 ; NOSHRINKW-NEXT: ld.d $a0, $fp, -24 # 8-byte Folded Reload
-; NOSHRINKW-NEXT: addi.w $a1, $zero, -16
-; NOSHRINKW-NEXT: lu32i.d $a1, 1
 ; NOSHRINKW-NEXT: bstrpick.d $a0, $a0, 31, 0
 ; NOSHRINKW-NEXT: addi.d $a0, $a0, 15
-; NOSHRINKW-NEXT: and $a1, $a0, $a1
+; NOSHRINKW-NEXT: bstrpick.d $a0, $a0, 32, 4
+; NOSHRINKW-NEXT: slli.d $a1, $a0, 4
 ; NOSHRINKW-NEXT: move $a0, $sp
 ; NOSHRINKW-NEXT: sub.d $a0, $a0, $a1
 ; NOSHRINKW-NEXT: move $sp, $a0
 ; NOSHRINKW-NEXT: bl %plt(notdead)
 ; NOSHRINKW-NEXT: b .LBB1_2
 ; NOSHRINKW-NEXT: .LBB1_2: # %if.end
 ; NOSHRINKW-NEXT: addi.d $sp, $fp, -32
 ; NOSHRINKW-NEXT: ld.d $fp, $sp, 16 # 8-byte Folded Reload
 ; NOSHRINKW-NEXT: ld.d $ra, $sp, 24 # 8-byte Folded Reload
 ; NOSHRINKW-NEXT: addi.d $sp, $sp, 32
 ; NOSHRINKW-NEXT: ret
 ;
 ; SHRINKW-LABEL: conditional_alloca:
 ; SHRINKW: # %bb.0:
 ; SHRINKW-NEXT: bstrpick.d $a0, $a0, 31, 0
 ; SHRINKW-NEXT: ori $a1, $zero, 32
 ; SHRINKW-NEXT: bltu $a1, $a0, .LBB1_2
 ; SHRINKW-NEXT: # %bb.1: # %if.then
 ; SHRINKW-NEXT: addi.d $sp, $sp, -16
 ; SHRINKW-NEXT: st.d $ra, $sp, 8 # 8-byte Folded Spill
 ; SHRINKW-NEXT: st.d $fp, $sp, 0 # 8-byte Folded Spill
 ; SHRINKW-NEXT: addi.d $fp, $sp, 16
-; SHRINKW-NEXT: addi.w $a1, $zero, -16
-; SHRINKW-NEXT: lu32i.d $a1, 1
 ; SHRINKW-NEXT: addi.d $a0, $a0, 15
-; SHRINKW-NEXT: and $a0, $a0, $a1
+; SHRINKW-NEXT: bstrpick.d $a0, $a0, 32, 4
+; SHRINKW-NEXT: slli.d $a0, $a0, 4
 ; SHRINKW-NEXT: sub.d $a0, $sp, $a0
 ; SHRINKW-NEXT: move $sp, $a0
 ; SHRINKW-NEXT: bl %plt(notdead)
 ; SHRINKW-NEXT: addi.d $sp, $fp, -16
 ; SHRINKW-NEXT: ld.d $fp, $sp, 0 # 8-byte Folded Reload
 ; SHRINKW-NEXT: ld.d $ra, $sp, 8 # 8-byte Folded Reload
 ; SHRINKW-NEXT: addi.d $sp, $sp, 16
 ; SHRINKW-NEXT: .LBB1_2: # %if.end
	Show All 12 Lines

llvm/test/CodeGen/LoongArch/stack-realignment-with-variable-sized-objects.ll

	Show First 20 Lines • Show All 45 Lines • ▼ Show 20 Lines
 ; LA64-NEXT: .cfi_offset 1, -8
 ; LA64-NEXT: .cfi_offset 22, -16
 ; LA64-NEXT: .cfi_offset 31, -24
 ; LA64-NEXT: addi.d $fp, $sp, 64
 ; LA64-NEXT: .cfi_def_cfa 22, 0
 ; LA64-NEXT: srli.d $a1, $sp, 6
 ; LA64-NEXT: slli.d $sp, $a1, 6
 ; LA64-NEXT: move $s8, $sp
-; LA64-NEXT: addi.w $a1, $zero, -16
-; LA64-NEXT: lu32i.d $a1, 1
 ; LA64-NEXT: bstrpick.d $a0, $a0, 31, 0
 ; LA64-NEXT: addi.d $a0, $a0, 15
-; LA64-NEXT: and $a0, $a0, $a1
+; LA64-NEXT: bstrpick.d $a0, $a0, 32, 4
+; LA64-NEXT: slli.d $a0, $a0, 4
 ; LA64-NEXT: sub.d $a0, $sp, $a0
 ; LA64-NEXT: move $sp, $a0
 ; LA64-NEXT: addi.d $a1, $s8, 0
 ; LA64-NEXT: bl %plt(callee)
 ; LA64-NEXT: addi.d $sp, $fp, -64
 ; LA64-NEXT: ld.d $s8, $sp, 40 # 8-byte Folded Reload
 ; LA64-NEXT: ld.d $fp, $sp, 48 # 8-byte Folded Reload
 ; LA64-NEXT: ld.d $ra, $sp, 56 # 8-byte Folded Reload
 ; LA64-NEXT: addi.d $sp, $sp, 64
 ; LA64-NEXT: ret
 %1 = alloca i8, i32 %n
 %2 = alloca i32, align 64
 call void @callee(ptr %1, ptr %2)
 ret void
 }
