This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][InstCombine] Don't scalarize for bitselect instructions
Needs Review · Public

Authored by pranavk on May 10 2023, 3:14 PM.

Details

Summary

The pattern Or(And(A, MaskValue), And(B, ~MaskValue)), where ~MaskValue = Xor(MaskValue, -1), gets lowered to a bitselect instruction when NEON is available. However, when this pattern is indexed into, InstCombine scalarizes it on the assumption that scalarizing is cheaper. It is not.
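
For concreteness, a minimal IR sketch of the pattern with a lane extract (the function, value names, and element type are illustrative, not taken from the actual test case):

define i32 @bitselect_extract(<4 x i32> %a, <4 x i32> %b, <4 x i32> %mask, i64 %idx) {
  ; (a & mask) | (b & ~mask): selects bits of %a where %mask is set, %b elsewhere
  %not.mask = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
  %sel.a = and <4 x i32> %a, %mask
  %sel.b = and <4 x i32> %b, %not.mask
  %sel = or <4 x i32> %sel.a, %sel.b
  ; extracting a single (here variable) lane is what currently triggers scalarization
  %elt = extractelement <4 x i32> %sel, i64 %idx
  ret i32 %elt
}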

This will address the performance bugs mentioned in this comment: https://github.com/llvm/llvm-project/issues/49305#issue-1077369905

Diff Detail

Event Timeline

pranavk created this revision. May 10 2023, 3:14 PM
Herald added a project: Restricted Project. · View Herald Transcript · May 10 2023, 3:14 PM
pranavk requested review of this revision. May 10 2023, 3:14 PM
Herald added a project: Restricted Project. · View Herald Transcript · May 10 2023, 3:14 PM
pranavk retitled this revision from [AArch64][InstCombine] Bail out for bitselect instructions to [AArch64][InstCombine] Don't scalarize for bitselect instructions. May 10 2023, 3:17 PM
pranavk edited the summary of this revision.
pranavk added inline comments. May 10 2023, 3:21 PM
llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
74

Oops, this doesn't belong here.

pranavk updated this revision to Diff 521121. May 10 2023, 3:22 PM
pranavk edited the summary of this revision.

clang-format and remove print statement

Hmm. I suspect that people will not like something target-specific in InstCombine. I'm a little surprised to see InstCombine trying to do scalarization without any form of cost model, though. I would expect extract(binop(a,b)) to generally be better than binop(extract(a),extract(b)) if the extracts are expensive, and more so if the binop is cheaper on the vector side. It would be better if the extracts were optimized further, though.
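
For illustration, the two shapes being compared look roughly like this in IR (a generic example, not taken from the patch):

; extract(binop(a, b)): one vector op, a single extract
%v = add <4 x i32> %a, %b
%e = extractelement <4 x i32> %v, i64 %idx

; binop(extract(a), extract(b)): two extracts, one scalar op
%a.elt = extractelement <4 x i32> %a, i64 %idx
%b.elt = extractelement <4 x i32> %b, i64 %idx
%e2 = add i32 %a.elt, %b.elt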

In the example in https://gcc.godbolt.org/z/YPe3TK79P there are variable lane extracts, which makes them even more expensive than usual. If you are interested in that case specifically, then it might make sense to prevent this optimization for variable lane extracts, as they are often expensive.

Otherwise we might need to fix this in the backend, perhaps by adding an AArch64 DAG combine that converts these back into vector operations with a single extract.

    t12: i32 = extract_vector_elt t4, t11
      t13: i32 = extract_vector_elt t6, t11
    t15: i32 = xor t13, Constant:i32<-1>
  t16: i32 = and t12, t15
    t10: v4i32 = and t2, t6
  t17: i32 = extract_vector_elt t10, t11
t18: i32 = or t16, t17

I agree that it's not ideal to have target-specific logic in this file.

The current InstCombine implementation is too simple: it uses the cheapToScalarize function to decide whether the IR should be scalarized. This function calls itself recursively until it finds an operand that really is cheap to scalarize. The problem is that it then returns true for the whole instruction DAG. That is what happens here as well: the recursion reaches Xor(A, {-1, -1, -1, -1}), looks at {-1, -1, -1, -1}, and immediately concludes that the whole instruction DAG is cheap to scalarize.
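
To make that concrete, here is roughly what the scalarized form looks like, reusing the names from the summary's sketch (illustrative only, not the exact InstCombine output); the single extract of the or is traded for one extract per vector input:

; cheapToScalarize recurses or -> and -> xor, sees the all-ones splat
; constant, and reports the whole chain as cheap to scalarize, so the
; one extractelement of %sel becomes roughly:
%a.elt = extractelement <4 x i32> %a, i64 %idx
%b.elt = extractelement <4 x i32> %b, i64 %idx
%mask.elt = extractelement <4 x i32> %mask, i64 %idx
%not.elt = xor i32 %mask.elt, -1
%sel.a.elt = and i32 %a.elt, %mask.elt
%sel.b.elt = and i32 %b.elt, %not.elt
%elt = or i32 %sel.a.elt, %sel.b.elt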

@dmgreen I think the right solution to this problem is to come up with some sort of cost model here so that these things are handled appropriately. But I would like your or someone's input on this as well. Let me know.

> @dmgreen I think the right solution to this problem is to come up with some sort of cost model here so that these things are handled appropriately. But I would like your or someone's input on this as well. Let me know.

Do you mean to get cheapToScalarize to not introduce more extracts than it removes? As in the FIXME a few lines up? That sounds sensible to me on paper. I looked at it briefly before; there may be some issues getting it working well, but if the performance looks OK then that sounds like a good way to go. InstCombine shouldn't really be inserting more instructions, and some extracts are not expected to be cheap.

As for using the actual cost model through TTI's getInstructionCost: InstCombine is expected to be a target-independent canonicalization pass and shouldn't use those methods. But hopefully that isn't needed in this case.

Otherwise it might be possible to fix this up in the backend, if improving this code looks like it will cause too many other issues.

nikic added a comment. May 23 2023, 5:34 AM

I don't have the necessary context here, but if limiting this to not insert extra instructions works, that sounds reasonable. If we do need target cost-modelling, the transform should be moved into VectorCombine.