
Details

Reviewers

steven.zhang
nemanjai
jsji
qiucf

Group Reviewers

Restricted Project

Commits

rG6e0ad5bc8c34: [PowerPC] Add an ISEL pattern for Mul with Imm.

Summary

If the constant operand of a mul has one of the following forms, the multiply is lowered as shown:

mul with (2^N * int16_imm) -> mulli + rldicr
mul with (2^N + 2^M) -> rldicr + add + rldicr
mul with (2^N - 2^M) -> rldicr + sub + rldicr

Scenarios 2 and 3 have been moved to DAGCombiner in D88201.
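To make scenario 1 concrete, here is a minimal standalone sketch of the constant split it relies on. It is not code from the patch: `MulliShift`, `splitMulConstant`, and `applySplit` are hypothetical names, and the sketch assumes an arithmetic right shift on signed values (guaranteed since C++20).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper type (not from the patch): C == Imm16 * 2^Shift.
struct MulliShift {
  int64_t Imm16;  // fits a signed 16-bit immediate, as mulli requires
  unsigned Shift; // number of trailing zero bits stripped from C
};

// Returns the split for scenario 1, or nullopt when it does not apply.
std::optional<MulliShift> splitMulConstant(int64_t C) {
  // A constant that already fits in 16 bits can be used by mulli directly.
  if (C == 0 || (C >= INT16_MIN && C <= INT16_MAX))
    return std::nullopt;
  // Count trailing zero bits of the constant.
  unsigned Shift = 0;
  for (uint64_t U = static_cast<uint64_t>(C); (U & 1) == 0; U >>= 1)
    ++Shift;
  if (Shift == 0)
    return std::nullopt;
  // Assumption: arithmetic shift, so negative constants keep their sign.
  int64_t Imm = C >> Shift;
  if (Imm < INT16_MIN || Imm > INT16_MAX)
    return std::nullopt;
  return MulliShift{Imm, Shift};
}

// Models "mulli then shift left": (X * Imm16) << Shift, modulo 2^64.
uint64_t applySplit(uint64_t X, MulliShift S) {
  return (X * static_cast<uint64_t>(S.Imm16)) << S.Shift;
}
```

For the constants used in mulli.ll, `splitMulConstant(42949672960000)` yields 625 and a shift of 36, and `splitMulConstant(-4866048)` yields -297 and a shift of 14.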

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	40 ms	linux > LLVM.CodeGen/PowerPC::mulli.ll
	60 ms	windows > LLVM.CodeGen/PowerPC::mulli.ll

Event Timeline

Esme created this revision. Sep 9 2020, 8:12 AM

Herald added a project: Restricted Project. Sep 9 2020, 8:12 AM

Herald added subscribers: llvm-commits, shchenz, kbarton, hiraditya.

Esme requested review of this revision. Sep 9 2020, 8:12 AM

Harbormaster completed remote builds in B71095: Diff 290740. Sep 9 2020, 9:55 AM

Esme edited the summary of this revision. Sep 9 2020, 7:45 PM

LGTM. Thank you for doing this. I assume that you have run it with bmk to make sure that the functionality is fine.

This revision is now accepted and ready to land. Sep 16 2020, 1:15 AM

Why can this NOT be done in DAGCombiner by implementing the decomposeMulByConstant target hook?

llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
5039	Can we add comments about all these scenarios? With simple examples.
5049	`exactly two bits set` ? What do you mean?

In D87384#2279886, @jsji wrote:

Why can this NOT be done in DAGCombiner by implementing the decomposeMulByConstant target hook?

Do you have evidence that (mul (shl %a, N) M) is more profitable than (mul %a, C) on other targets? If so, I suppose we could extend the DAG combine that handles:

mul x, (2^N + 1) --> add (shl x, N), x
mul x, (2^N - 1) --> sub (shl x, N), x

Perhaps it would be good to gauge interest from other RISC targets.

Scenarios 2 and 3 are quite similar to the existing DAG combine so this would be a straightforward implementation. However I am not sure how likely it is that two shifts and an add/sub are better than a multiply on other targets.

llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
5039	I agree 100%. We should illustrate this with some example bit patterns.
5049	2^N + 2^M has exactly two bits set. However, I don't think this is profitable as implemented. This implementation does not benefit from an ILP improvement since the instructions are dependent. What we probably want is to convert `(mul %a, (2^N + 2^M))` to `(add (shl %a, N), (shl %a, M))` since that improves ILP so the total latency of the sequence is actually reduced. That would also mean that for something like this: res1 = a * 0x8800; res2 = a * 0x8080; we only need three total shifts (we should also add that as a test case). A similar argument applies to the subtract case below.
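The ILP point can be illustrated with a small standalone sketch. The function names are hypothetical, and the "dependent" form is only my reading of the `rldicr + add + rldicr` sequence from the summary, not code taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Dependent chain (assumed shape of "rldicr + add + rldicr"), requires N > M:
// each step needs the result of the previous one.
uint64_t mulDependent(uint64_t X, unsigned N, unsigned M) {
  return ((X << (N - M)) + X) << M;
}

// Suggested form: the two shifts are independent and can issue in parallel,
// so the critical path is one shift plus one add.
uint64_t mulParallel(uint64_t X, unsigned N, unsigned M) {
  return (X << N) + (X << M);
}

struct BothResults {
  uint64_t Res1; // a * 0x8800 == a * (2^15 + 2^11)
  uint64_t Res2; // a * 0x8080 == a * (2^15 + 2^7)
};

// Both products share the shift by 15, so three shifts cover two multiplies.
BothResults mulBoth(uint64_t A) {
  uint64_t Sh15 = A << 15;
  return {Sh15 + (A << 11), Sh15 + (A << 7)};
}
```

Both forms compute the same value modulo 2^64; the difference is only in how many operations sit on the dependency chain and whether intermediate shifts can be reused.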

Thank you for your comments! @jsji @nemanjai

For scenarios 2 and 3, I will modify them following @nemanjai's hint and move them to DAGCombiner, since they are really similar to the existing patterns, and then implement our PPCTargetLowering::decomposeMulByConstant. If other targets are interested in this, they can add conditions for it in their decomposeMulByConstant.

But for scenario 1, I am not sure whether there are benefits for other targets, since it is inspired by our hardware instruction MULLI. So I prefer to keep it in ISEL.

Keep scenario 1 implemented in ISEL.
For scenarios 2 and 3, I will post another patch to implement them in DAGCombiner.

Harbormaster completed remote builds in B72499: Diff 293409. Sep 22 2020, 5:15 AM

I think we already have a similar pattern for scenario 1 as well:

// Change (mul (shl X, C), Y) -> (shl (mul X, Y), C) when the shift has one
// use.

It is just that (shl X, C) is not a constant there.

So I would assume that handling a similar situation under the control of TLI.decomposeMulByConstant will be easy and will do no harm to other targets.
Targets can control and add this scenario in their TLI if they also want it.

BTW: it looks like x86 also has imul.

llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
5038	Maybe it would be clearer if we use DAG expression in comments. eg: `(mul X, c2 << c1) -> (rldicr (mulli X, c2 >> c1) c1)`

Hi @jsji, an infinite loop will occur if we handle scenario 1 (mul X, c2 << c1) -> (shl (mul X, c2), c1) in DAGCombiner, because there exists a reverse conversion (shl (mul x, c1), c2) -> (mul x, c1 << c2).

In D87384#2289477, @Esme wrote:

Hi @jsji, an infinite loop will occur if we handle scenario 1 (mul X, c2 << c1) -> (mul (shl X, c1), c2) in DAGCombiner, because there exists a reverse conversion (mul (shl X, c1), c2) -> (mul X, c2 << c1).

Hmm, then it is OK to keep this in ISEL, but please add a comment noting that DAGCombiner prefers (mul X, c2 << c1). Thanks.
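The ping-pong between the two rewrites can be modelled with a toy sketch. Nothing here is DAGCombiner code: `Expr`, `foldShlIntoMul`, and `splitMulIntoShl` are hypothetical stand-ins for a node of the form `(shl (mul X, MulC), Shl)` and for the two opposing combines discussed above.

```cpp
#include <cassert>
#include <cstdint>

// Toy node: (shl (mul X, MulC), Shl); Shl == 0 means a plain (mul X, MulC).
struct Expr {
  uint64_t MulC;
  unsigned Shl;
  bool operator==(const Expr &O) const {
    return MulC == O.MulC && Shl == O.Shl;
  }
};

// Models the existing fold: (shl (mul X, c1), c2) -> (mul X, c1 << c2).
Expr foldShlIntoMul(Expr E) { return {E.MulC << E.Shl, 0}; }

// Models the proposed split: (mul X, c << s) -> (shl (mul X, c), s).
Expr splitMulIntoShl(Expr E) {
  if (E.Shl != 0 || E.MulC == 0)
    return E;
  unsigned S = 0;
  while ((E.MulC & 1) == 0) {
    E.MulC >>= 1;
    ++S;
  }
  return {E.MulC, S};
}
```

Applying `splitMulIntoShl` and then `foldShlIntoMul` returns the starting node, so a combiner that runs both rules to a fixed point would never settle. Doing the split during instruction selection avoids that, because the fold no longer runs afterwards.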

Update comments.

Esme edited the summary of this revision. Sep 24 2020, 12:29 AM

Esme mentioned this in D88201: [DAGCombiner] Add decomposition patterns for Mul-by-Imm. Sep 24 2020, 12:54 AM

Harbormaster completed remote builds in B72765: Diff 293950. Sep 24 2020, 12:59 AM

Esme updated this revision to Diff 293968. Sep 24 2020, 1:32 AM

Esme edited the summary of this revision.

Thank you so much for your comments. :D @jsji @nemanjai
I have posted another patch, D88201, to handle scenarios 2 and 3 in DAGCombiner.

Harbormaster completed remote builds in B72780: Diff 293968. Sep 24 2020, 2:24 AM

Format.

Harbormaster completed remote builds in B72791: Diff 293996. Sep 24 2020, 3:54 AM

Looking forward to your further comments :) @jsji @nemanjai

Esme mentioned this in rGe9fd8823baf5: [DAGCombiner] Add decomposition patterns for Mul-by-Imm. Oct 9 2020, 1:52 AM

ping

LGTM. Thanks for improving this!

Closed by commit rG6e0ad5bc8c34: [PowerPC] Add an ISEL pattern for Mul with Imm. (authored by Esme). Nov 9 2020, 10:53 PM

This revision was automatically updated to reflect the committed changes.

Esme added a commit: rG6e0ad5bc8c34: [PowerPC] Add an ISEL pattern for Mul with Imm.

Esme mentioned this in D129708: [PowerPC] Add an ISEL pattern for i32 MULLI. Jul 13 2022, 5:29 PM

Esme mentioned this in rG28b1ba1c0742: [PowerPC] Add an ISEL pattern for i32 MULLI. Jul 18 2022, 1:41 AM

Diff 293950

llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp


@@ if (isOpcWithIntImmediate(N->getOperand(0).getNode(), ISD::AND, Imm) && @@
                          getI32Imm(ME, dl) };
       CurDAG->SelectNodeTo(N, PPC::RLWINM, MVT::i32, Ops);
       return;
     }
 
     // Other cases are autogenerated.
     break;
   }
 
+  case ISD::MUL: {
+    SDValue Op0 = N->getOperand(0);
+    SDValue Op1 = N->getOperand(1);
+    if (Op1.getOpcode() != ISD::Constant || Op1.getValueType() != MVT::i64)
+      break;
+
+    // If the multiplier fits int16, we can handle it with mulli.
+    uint64_t Imm = cast<ConstantSDNode>(Op1)->getZExtValue();
+    unsigned Shift = countTrailingZeros<uint64_t>(Imm);
+    if (isInt<16>(Imm) || !Shift)
+      break;
+
+    // If the shifted value fits int16, we can do this transformation:
+    // (mul X, c1 << c2) -> (rldicr (mulli X, c1) c2). We do this in ISEL due to
+    // DAGCombiner prefers (shl (mul X, c1), c2) -> (mul X, c1 << c2).
+    uint64_t ImmSh = Imm >> Shift;
+    if (isInt<16>(ImmSh)) {
+      uint64_t SextImm = SignExtend64(ImmSh & 0xFFFF, 16);
+      SDValue SDImm = CurDAG->getTargetConstant(SextImm, dl, MVT::i64);
+      SDNode *MulNode = CurDAG->getMachineNode(PPC::MULLI8, dl, MVT::i64,
+          N->getOperand(0), SDImm);
+      CurDAG->SelectNodeTo(N, PPC::RLDICR, MVT::i64, SDValue(MulNode, 0),
+                           getI32Imm(Shift, dl), getI32Imm(63 - Shift, dl));
+      return;
+    }
+    break;
+  }
+
   // FIXME: Remove this once the ANDI glue bug is fixed:
   case PPCISD::ANDI_rec_1_EQ_BIT:
   case PPCISD::ANDI_rec_1_GT_BIT: {
     if (!ANDIGlueBug)
       break;
 
     EVT InVT = N->getOperand(0).getValueType();
     assert((InVT == MVT::i64 || InVT == MVT::i32) &&

Lint: Pre-merge checks
clang-tidy: warning: unused variable 'Op0' [clang-diagnostic-unused-variable] (at `SDValue Op0 = N->getOperand(0);`)
clang-format: please reformat the code (at `N->getOperand(0), SDImm);`)
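The RLDICR operands `Shift` and `63 - Shift` in the new case, and the `sldi` lines in the updated test expectations, are two views of the same operation. The sketch below is a simplified software model (the helper names are mine, not LLVM's) of rotate-left-then-clear-right in IBM bit numbering, where bit 0 is the most significant bit.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 64-bit value left by N bits.
uint64_t rotl64(uint64_t V, unsigned N) {
  N &= 63;
  return N ? (V << N) | (V >> (64 - N)) : V;
}

// Simplified model of "rldicr RA, RS, SH, ME": rotate left by SH, then keep
// IBM bits 0..ME (the ME + 1 most significant bits) and clear the rest.
// Assumes 0 <= ME <= 63.
uint64_t rldicrModel(uint64_t RS, unsigned SH, unsigned ME) {
  uint64_t Mask = ~uint64_t{0} << (63 - ME);
  return rotl64(RS, SH) & Mask;
}

// With ME == 63 - SH the mask clears exactly the bits that wrapped around,
// so the result is a plain left shift.
uint64_t sldiModel(uint64_t RS, unsigned SH) {
  return rldicrModel(RS, SH, 63 - SH);
}
```

Under this model `sldiModel(x, 36)` equals `x << 36`, which matches the `mulli 3, 3, 625` followed by `sldi 3, 3, 36` expected for test1.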

llvm/test/CodeGen/PowerPC/mulli.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -verify-machineinstrs -mcpu=pwr9 -mtriple=powerpc64le-unknown-linux-gnu < %s | FileCheck %s
 
 define i64 @test1(i64 %x) {
 ; CHECK-LABEL: test1:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: li 4, 625
-; CHECK-NEXT: sldi 4, 4, 36
-; CHECK-NEXT: mulld 3, 3, 4
+; CHECK-NEXT: mulli 3, 3, 625
+; CHECK-NEXT: sldi 3, 3, 36
 ; CHECK-NEXT: blr
   %y = mul i64 %x, 42949672960000
   ret i64 %y
 }
 
 define i64 @test2(i64 %x) {
 ; CHECK-LABEL: test2:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: li 4, -625
-; CHECK-NEXT: sldi 4, 4, 36
-; CHECK-NEXT: mulld 3, 3, 4
+; CHECK-NEXT: mulli 3, 3, -625
+; CHECK-NEXT: sldi 3, 3, 36
 ; CHECK-NEXT: blr
   %y = mul i64 %x, -42949672960000
   ret i64 %y
 }
 
 define i64 @test3(i64 %x) {
 ; CHECK-LABEL: test3:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: lis 4, 74
-; CHECK-NEXT: ori 4, 4, 16384
-; CHECK-NEXT: mulld 3, 3, 4
+; CHECK-NEXT: mulli 3, 3, 297
+; CHECK-NEXT: sldi 3, 3, 14
 ; CHECK-NEXT: blr
   %y = mul i64 %x, 4866048
   ret i64 %y
 }
 
 define i64 @test4(i64 %x) {
 ; CHECK-LABEL: test4:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: lis 4, -75
-; CHECK-NEXT: ori 4, 4, 49152
-; CHECK-NEXT: mulld 3, 3, 4
+; CHECK-NEXT: mulli 3, 3, -297
+; CHECK-NEXT: sldi 3, 3, 14
 ; CHECK-NEXT: blr
   %y = mul i64 %x, -4866048
   ret i64 %y
 }
 
 define i64 @test5(i64 %x) {
 ; CHECK-LABEL: test5:
 ; CHECK: # %bb.0:
@@ ; CHECK-NEXT: blr @@
   %y = mul i64 %x, -17179860992
   ret i64 %y
 }
 
 define i64 @test9(i64 %x) {
 ; CHECK-LABEL: test9:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: lis 4, 16
-; CHECK-NEXT: li 5, 8193
 ; CHECK-NEXT: ori 4, 4, 1
-; CHECK-NEXT: sldi 5, 5, 19
 ; CHECK-NEXT: sldi 4, 4, 12
 ; CHECK-NEXT: mulld 4, 3, 4
-; CHECK-NEXT: mulld 3, 3, 5
+; CHECK-NEXT: mulli 3, 3, 8193
+; CHECK-NEXT: sldi 3, 3, 19
 ; CHECK-NEXT: sub 3, 4, 3
 ; CHECK-NEXT: blr
   %y = mul i64 %x, 4294971392
   %z = mul i64 %x, 4295491584
   %res = sub i64 %y, %z
   ret i64 %res
 }


This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Add ISEL patterns for Mul with Imm.
Closed · Public
