Details

Reviewers

efriedma
dmgreen

Summary

The pattern Or(And(A, MaskValue), And(B, ~MaskValue)), where ~MaskValue = Xor(MaskValue, -1), is lowered to a bitselect instruction when NEON is available. However, when this pattern is in a loop and MaskValue is defined outside the immediate basic block, instruction selection cannot choose bitselect and we end up with a sequence of ORs and ANDs. This patch sinks such a MaskValue into the basic block so that the backend can select bitselect instructions.
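For readers unfamiliar with the idiom, the and/or form in the summary can be checked against a bit-by-bit selection on plain integers. This is only a scalar sketch: the function names are made up for illustration, and the patch itself operates on LLVM IR vector values, not on `uint32_t`.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of Or(And(A, Mask), And(B, Xor(Mask, -1))).
// Hypothetical helper name; not part of LLVM.
inline std::uint32_t selectViaAndOr(std::uint32_t Mask, std::uint32_t A,
                                    std::uint32_t B) {
  return (A & Mask) | (B & ~Mask);
}

// Reference version: for every bit position, take the bit from A when the
// mask bit is set and from B otherwise.
inline std::uint32_t selectBitByBit(std::uint32_t Mask, std::uint32_t A,
                                    std::uint32_t B) {
  std::uint32_t Result = 0;
  for (int Bit = 0; Bit < 32; ++Bit) {
    std::uint32_t Probe = std::uint32_t{1} << Bit;
    Result |= (Mask & Probe) ? (A & Probe) : (B & Probe);
  }
  return Result;
}
```

With Mask = 0xFFFF0000, A = 0x12345678 and B = 0xABCDEF01, both functions return 0x1234EF01: the upper half comes from A and the lower half from B.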

This will solve performance bugs mentioned in this comment: https://github.com/llvm/llvm-project/issues/49305#issuecomment-1440828393

VBSL intrinsics can be found here: https://developer.arm.com/architectures/instruction-sets/intrinsics/#q=vbsl

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,100 ms	x64 debian > MLIR.Examples/standalone::test.toy

Event Timeline

pranavk created this revision. Mar 30 2023, 1:46 PM

Herald added a project: Restricted Project. Mar 30 2023, 1:46 PM

Herald added subscribers: hiraditya, kristof.beyls.

pranavk requested review of this revision. Mar 30 2023, 1:46 PM

Herald added projects: Restricted Project, Restricted Project. Mar 30 2023, 1:46 PM

Herald added subscribers: llvm-commits, cfe-commits.

Harbormaster completed remote builds in B222624: Diff 509512. Mar 30 2023, 3:49 PM

scw added a reviewer: eli.friedman. May 1 2023, 4:23 PM

scw edited reviewers, added: efriedma; removed: eli.friedman.

scw added subscribers: scw, echristo.

The primary tradeoff here is that existing optimizations won't understand the intrinsic... for example, we can't constant-fold, or automatically invert the mask. But making the intrinsics more predictably produce efficient sequences is probably worthwhile.

(Any other opinions here?)

I guess I should note both the examples in https://github.com/llvm/llvm-project/issues/49305 could probably be fixed in other ways... we have heuristics to, for example, sink logic ops into loops when it's profitable. But that requires someone to notice the specific issues, and take the time to diagnose/fix them.

My preference would be for fixing the code we have, not introducing new intrinsics. Intrinsics act as black-boxes for the optimizer, and I'm pretty sure I've heard of cases in the past of the compiler optimizing the or/and/xor's to nicer sequences of instructions. It would be a shame to lose that.

The number of instructions in bsl is quite high compared to many intrinsics, they probably have a high chance of going wrong. But if we know of ways to optimize them (through shouldSinkOperands), then that would be my preference. It can end up helping all cases, not just the intrinsics.
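The trade-off discussed above can be illustrated with a scalar analogy. The sketch below uses C++ constant evaluation rather than LLVM's optimizer, and `bitSelect` is a hypothetical name, so it only shows the kind of algebra that stays visible when the operation is written with plain and/or/xor instead of an opaque call.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar bitselect written with ordinary operators.
constexpr std::uint32_t bitSelect(std::uint32_t Mask, std::uint32_t A,
                                  std::uint32_t B) {
  return (A & Mask) | (B & ~Mask);
}

// Constant folding: with constant inputs the whole expression reduces to a
// constant at compile time.
static_assert(bitSelect(0x0000FFFFu, 0xAAAAAAAAu, 0x55555555u) == 0x5555AAAAu,
              "and/or form folds to a constant");

// Mask inversion: selecting with ~Mask is the same as selecting with Mask
// and swapping the two data operands, so a separate Not can be absorbed.
static_assert(bitSelect(~0x0F0F0F0Fu, 1u, 2u) == bitSelect(0x0F0F0F0Fu, 2u, 1u),
              "inverted mask equals swapped operands");
```

An intrinsic call would hide both identities from generic passes unless each pass were taught about it, which is the concern raised in the comments above.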

Change shouldSinkOperand to allow backend to generate bitselect instructions

I agree. I changed the implementation so that it no longer introduces the intrinsic. Handling case #1 from the GitHub bug report needs a further change in InstCombine, which I will send as a separate patch. Thanks.

pranavk requested review of this revision. May 10 2023, 10:59 AM

Harbormaster completed remote builds in B231128: Diff 521040. May 10 2023, 12:09 PM

[AArch64][InstCombine] Bail out for bitselect instructions

[AArch64] Change shouldSinkOperand to allow bitselect instructions

pranavk retitled this revision from [AArch64] Add IR intrinsics for vbsl* C intrinsics to [AArch64] Sink operands to allow for bitselect instructions. May 10 2023, 3:14 PM

pranavk edited the summary of this revision.

Harbormaster completed remote builds in B231187: Diff 521115. May 10 2023, 4:15 PM

Thanks for working on this. I noticed there was another instance of vbsl being reported recently in https://github.com/llvm/llvm-project/issues/62642. Hopefully it can be addressed via extra optimizations too.

Can you add a testcase for the issues in https://github.com/llvm/llvm-project/issues/49305? And look into the existing tests.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14343	That sounds like it might be a bug that happens if it tries to sink too many operands? From what I remember the order they are put into Ops might matter. And if it is sinking to the Or it might need to add both the And as well as the Not.
14346	I->hasOneUse()
14348	I->user_back();
14364	Can this be simplified with a m_Not matcher? In general, instructions will be canonicalized so that constants are on the RHS. If we are sinking the Not, I feel this should test that the pattern makes up a bsl, and if it does, then sink the operand of I, i.e. something like checking that OI is `m_c_Or(m_c_And(m_Value(A),m_Value(B)),m_Specific(I))`, and then checking that `I` is `m_c_And(m_Not(m_Specific(A)),m_Value(D))`, or the other way around.
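The two-step commutative match suggested above can be sketched on a toy expression tree. Everything here (`Kind`, `Expr`, `isBitSelect`) is hypothetical and stands in for LLVM's `Instruction` and the PatternMatch.h matchers; the mask is compared by pointer identity, which is the role `m_Specific` plays in the real code.

```cpp
#include <cassert>

// Hypothetical miniature expression tree; none of these names are LLVM's.
enum class Kind { Leaf, Not, And, Or };

struct Expr {
  Kind K;
  const Expr *LHS;
  const Expr *RHS;
  explicit Expr(Kind K, const Expr *L = nullptr, const Expr *R = nullptr)
      : K(K), LHS(L), RHS(R) {}
};

// If E is And(X, Other) or And(Other, X), return Other; otherwise nullptr.
static const Expr *otherAndOperand(const Expr *E, const Expr *X) {
  if (E->K != Kind::And)
    return nullptr;
  if (E->LHS == X)
    return E->RHS;
  if (E->RHS == X)
    return E->LHS;
  return nullptr;
}

// True when Root is Or(And(M, A), And(Not(M), B)) in any operand order.
bool isBitSelect(const Expr *Root) {
  if (Root->K != Kind::Or)
    return false;
  const Expr *Sides[2] = {Root->LHS, Root->RHS};
  for (int I = 0; I < 2; ++I) {
    const Expr *Main = Sides[I], *Other = Sides[1 - I];
    if (Main->K != Kind::And)
      continue;
    // Step 1: find a Not operand on one And; its operand is the mask.
    const Expr *Ops[2] = {Main->LHS, Main->RHS};
    for (const Expr *Op : Ops)
      // Step 2: the other And must use that very same mask.
      if (Op->K == Kind::Not && otherAndOperand(Other, Op->LHS))
        return true;
  }
  return false;
}
```

Trying both sides of the Or and both operands of each And is what the `m_c_` (commutative) matchers provide in the actual patch, so the hand-written loops here exist only because the toy tree has no such helpers.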

pranavk planned changes to this revision. May 11 2023, 3:27 PM

pranavk marked 2 inline comments as done.

pranavk added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14343	When I tried to sink the Or, I didn't add both Ands to the vector, so the operand was sunk to just before the Or even though it was used by an And one instruction earlier. So I don't think it's a bug; I just forgot to add the Ands to the vector. With the Ands added, it works. I have also changed my implementation to a switch case under Instruction::Or, as I think that makes the code easier to read.
14364	Absolutely. It seems I didn't look closely enough at PatternMatch.h. After discovering a whole bunch of pattern matchers, I have shortened and simplified this implementation using m_Not, etc. Please look at the latest patch. I have also added more guards to prevent the sinking when one of the Ands, or one of their operands, is not in the same basic block.

Address reviewer comments

tests coming

Harbormaster completed remote builds in B231461: Diff 521474. May 11 2023, 4:52 PM

add test

Harbormaster completed remote builds in B231486: Diff 521505. May 11 2023, 6:14 PM

More concise pattern matching

Harbormaster completed remote builds in B231520: Diff 521545. May 11 2023, 9:27 PM

Thanks. LGTM with a few extra suggestions.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14357	I'm not sure if this is necessary, so long as some of the operands can be sunk, but it is probably OK for the moment to keep as-is.
14361–14363	I think this can avoid the loop if we just use `Ops.push_back(&MainAnd->getOperandUse(MainAnd->getOperand(0) == IA ? 1 : 0));`
llvm/test/CodeGen/AArch64/aarch64-bit-gen.ll
148	Can you run utils/update_llc_test_checks.py on the file, to generate the runtime checks? There will be more of them but that should be OK in this case. It doesn't look too large.

This revision is now accepted and ready to land. May 14 2023, 1:51 AM

address reviewer comments

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14361–14363	llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

pranavk added inline comments. May 15 2023, 1:01 PM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14361–14363	Done. This was left over from my final refactoring. Uploaded a new patch.

Harbormaster completed remote builds in B232094: Diff 522309. May 15 2023, 1:56 PM

Thanks. LGTM

I noticed there was another instance of vbsl being reported recently in https://github.com/llvm/llvm-project/issues/62642. Hopefully it can be addressed via extra optimizations too.

This is another InstCombine problem: as soon as it sees a constant, InstCombine runs the demanded-bits analysis and tries to do clever tricks with the and/and/or sequence. We need to teach InstCombine not to touch the sequence, since we expect it to be lowered to a vector bitselect during instruction lowering. I will try to fix this when looking at the other InstCombine problem we are talking about (https://reviews.llvm.org/D150316).

pranavk added a comment. May 18 2023, 11:20 AM

This comment was removed by pranavk.

I forgot to update the Differential link in the commit, but this patch was merged as https://github.com/llvm/llvm-project/commit/726785b1594c6b567c5c8ddd59075aee726590c6

Diff 522309

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

		if (areExtractShuffleVectors(Ext1->getOperand(0), Ext2->getOperand(0))) {
		Ops.push_back(&Ext2->getOperandUse(0));
		}

		Ops.push_back(&I->getOperandUse(0));
		Ops.push_back(&I->getOperandUse(1));

		return true;
		}
		case Instruction::Or: {
		// Pattern: Or(And(MaskValue, A), And(Not(MaskValue), B)) ->
		// bitselect(MaskValue, A, B) where Not(MaskValue) = Xor(MaskValue, -1)
		if (Subtarget->hasNEON()) {
		Instruction *OtherAnd, *IA, *IB;
		Value *MaskValue;
		// MainAnd refers to And instruction that has 'Not' as one of its operands
		if (match(I, m_c_Or(m_OneUse(m_Instruction(OtherAnd)),
		m_OneUse(m_c_And(m_OneUse(m_Not(m_Value(MaskValue))),
		m_Instruction(IA)))))) {
		if (match(OtherAnd,
		m_c_And(m_Specific(MaskValue), m_Instruction(IB)))) {
		Instruction *MainAnd = I->getOperand(0) == OtherAnd
		? cast<Instruction>(I->getOperand(1))
		: cast<Instruction>(I->getOperand(0));

		// Both Ands should be in same basic block as Or
		if (I->getParent() != MainAnd->getParent() ||
		I->getParent() != OtherAnd->getParent())
		return false;

		// Non-mask operands of both Ands should also be in same basic block
		if (I->getParent() != IA->getParent() ||
		I->getParent() != IB->getParent())
		return false;

		Ops.push_back(&MainAnd->getOperandUse(MainAnd->getOperand(0) == IA ? 1 : 0));
		Ops.push_back(&I->getOperandUse(0));
		Ops.push_back(&I->getOperandUse(1));

		return true;
		}
		}
		}

		return false;
		}
		case Instruction::Mul: {
		int NumZExts = 0, NumSExts = 0;
		for (auto &Op : I->operands()) {
		// Make sure we are not already sinking this operand
		if (any_of(Ops, [&](Use *U) { return U->get() == Op; }))
		continue;

		if (match(&Op, m_SExt(m_Value()))) {

llvm/test/CodeGen/AArch64/aarch64-bit-gen.ll

				; CHECK-NEXT: bit v0.16b, v1.16b, v2.16b
				; CHECK-NEXT: ret
				%and = and <16 x i8> %C, %B
				%neg = xor <16 x i8> %C, <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>
				%and1 = and <16 x i8> %neg, %A
				%or = or <16 x i8> %and, %and1
				ret <16 x i8> %or
				}

				define <4 x i32> @test_bit_sink_operand(<4 x i32> %src, <4 x i32> %dst, <4 x i32> %mask, i32 %scratch) {
				; CHECK-LABEL: test_bit_sink_operand:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: sub sp, sp, #32
				; CHECK-NEXT: .cfi_def_cfa_offset 32
				; CHECK-NEXT: cmp w0, #0
				; CHECK-NEXT: mov w8, wzr
				; CHECK-NEXT: cinc w9, w0, lt
				; CHECK-NEXT: asr w9, w9, #1
				; CHECK-NEXT: .LBB11_1: // %do.body
				; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: bit v1.16b, v0.16b, v2.16b
				; CHECK-NEXT: add x10, sp, #16
				; CHECK-NEXT: bfi x10, x8, #2, #2
				; CHECK-NEXT: mov x11, sp
				; CHECK-NEXT: bfi x11, x8, #2, #2
				; CHECK-NEXT: add w8, w8, #1
				; CHECK-NEXT: cmp w8, #5
				; CHECK-NEXT: str q1, [sp, #16]
				; CHECK-NEXT: str w0, [x10]
				; CHECK-NEXT: ldr q1, [sp, #16]
				; CHECK-NEXT: str q0, [sp]
				; CHECK-NEXT: str w9, [x11]
				; CHECK-NEXT: ldr q0, [sp]
				; CHECK-NEXT: b.ne .LBB11_1
				; CHECK-NEXT: // %bb.2: // %do.end
				; CHECK-NEXT: mov v0.16b, v1.16b
				; CHECK-NEXT: add sp, sp, #32
				; CHECK-NEXT: ret

				entry:
				%0 = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
				%div = sdiv i32 %scratch, 2
				br label %do.body

				do.body:
				%dst.addr.0 = phi <4 x i32> [ %dst, %entry ], [ %vecins, %do.body ]
				%src.addr.0 = phi <4 x i32> [ %src, %entry ], [ %vecins1, %do.body ]
				%i.0 = phi i32 [ 0, %entry ], [ %inc, %do.body ]
				%vbsl3.i = and <4 x i32> %src.addr.0, %mask
				%vbsl4.i = and <4 x i32> %dst.addr.0, %0
				%vbsl5.i = or <4 x i32> %vbsl3.i, %vbsl4.i
				%vecins = insertelement <4 x i32> %vbsl5.i, i32 %scratch, i32 %i.0
				%vecins1 = insertelement <4 x i32> %src.addr.0, i32 %div, i32 %i.0
				%inc = add nuw nsw i32 %i.0, 1
				%exitcond.not = icmp eq i32 %inc, 5
				br i1 %exitcond.not, label %do.end, label %do.body

				do.end:
				ret <4 x i32> %vecins
				}

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Sink operands to allow for bitselect instructions
ClosedPublic
