This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][InstCombine] Don't scalarize for bitselect instructions
Needs Review · Public

Authored by pranavk on May 10 2023, 3:14 PM.

Details

Summary

The pattern Or(And(A, MaskValue), And(B, ~MaskValue)), where ~MaskValue = Xor(MaskValue, -1), gets lowered to a bitselect instruction when NEON is available. However, when this pattern is indexed into, InstCombine scalarizes it on the assumption that scalarizing is cheaper. It is not.
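
For concreteness, a minimal IR sketch of the pattern with a lane extract (the function, value names, and element type are illustrative, not taken from the actual test case):

define i32 @bitselect_extract(<4 x i32> %a, <4 x i32> %b, <4 x i32> %mask, i64 %idx) {
  ; (a & mask) | (b & ~mask): selects bits of %a where %mask is set, %b elsewhere
  %not.mask = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
  %sel.a = and <4 x i32> %a, %mask
  %sel.b = and <4 x i32> %b, %not.mask
  %sel = or <4 x i32> %sel.a, %sel.b
  ; extracting a single (here variable) lane is what currently triggers scalarization
  %elt = extractelement <4 x i32> %sel, i64 %idx
  ret i32 %elt
}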

This will address the performance bugs mentioned in this comment: https://github.com/llvm/llvm-project/issues/49305#issue-1077369905

Diff Detail

Event Timeline

pranavk created this revision. May 10 2023, 3:14 PM
Herald added a project: Restricted Project. · View Herald Transcript · May 10 2023, 3:14 PM
pranavk requested review of this revision. May 10 2023, 3:14 PM
Herald added a project: Restricted Project. · View Herald Transcript · May 10 2023, 3:14 PM
pranavk retitled this revision from [AArch64][InstCombine] Bail out for bitselect instructions to [AArch64][InstCombine] Don't scalarize for bitselect instructions. May 10 2023, 3:17 PM
pranavk edited the summary of this revision.
pranavk added inline comments. May 10 2023, 3:21 PM
llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
74

Oops, this doesn't belong here.

pranavk updated this revision to Diff 521121. May 10 2023, 3:22 PM
pranavk edited the summary of this revision.

clang-format and remove print statement

Hmm. I suspect that people will not like something target-specific in InstCombine. I'm a little surprised to see InstCombine trying to do scalarization without any form of cost model, though. I would expect extract(binop(a,b)) to generally be better than binop(extract(a),extract(b)) if the extracts are expensive, and more so if the binop is cheaper on the vector side. It would be better if the extracts were optimized further, though.
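
For illustration, the two shapes being compared look roughly like this in IR (a generic example, not taken from the patch):

; extract(binop(a, b)): one vector op, a single extract
%v = add <4 x i32> %a, %b
%e = extractelement <4 x i32> %v, i64 %idx

; binop(extract(a), extract(b)): two extracts, one scalar op
%a.elt = extractelement <4 x i32> %a, i64 %idx
%b.elt = extractelement <4 x i32> %b, i64 %idx
%e2 = add i32 %a.elt, %b.elt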

In the example in https://gcc.godbolt.org/z/YPe3TK79P there are variable lane extracts, which makes them even more expensive than usual. If you are interested in that case specifically, then it might make sense to prevent this optimization for variable lane extracts, as they are often expensive.

Otherwise we might need to fix this in the backend, perhaps by adding an AArch64 DAG combine that converts these back into vector operations with a single extract.

    t12: i32 = extract_vector_elt t4, t11
      t13: i32 = extract_vector_elt t6, t11
    t15: i32 = xor t13, Constant:i32<-1>
  t16: i32 = and t12, t15
    t10: v4i32 = and t2, t6
  t17: i32 = extract_vector_elt t10, t11
t18: i32 = or t16, t17

I agree that it's not ideal to have target-specific logic in this file.

The current InstCombine implementation is too simple: it uses the cheapToScalarize function to decide whether the IR should be scalarized. This function calls itself recursively until it finds an operand that really is cheap to scalarize. The problem is that it then returns true for the whole instruction DAG. That is what happens here as well: the recursion reaches Xor(A, {-1, -1, -1, -1}), looks at {-1, -1, -1, -1}, and immediately concludes that the whole instruction DAG is cheap to scalarize.
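
To make that concrete, here is roughly what the scalarized form looks like, reusing the names from the summary's sketch (illustrative only, not the exact InstCombine output); the single extract of the or is traded for one extract per vector input:

; cheapToScalarize recurses or -> and -> xor, sees the all-ones splat
; constant, and reports the whole chain as cheap to scalarize, so the
; one extractelement of %sel becomes roughly:
%a.elt = extractelement <4 x i32> %a, i64 %idx
%b.elt = extractelement <4 x i32> %b, i64 %idx
%mask.elt = extractelement <4 x i32> %mask, i64 %idx
%not.elt = xor i32 %mask.elt, -1
%sel.a.elt = and i32 %a.elt, %mask.elt
%sel.b.elt = and i32 %b.elt, %not.elt
%elt = or i32 %sel.a.elt, %sel.b.elt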

@dmgreen I think the right solution to this problem is to come up with some sort of cost model here so that these things are handled appropriately. But I would like your or someone's input on this as well. Let me know.

> @dmgreen I think the right solution to this problem is to come up with some sort of cost model here so that these things are handled appropriately. But I would like your or someone's input on this as well. Let me know.

Do you mean to get cheapToScalarize to not introduce more extracts than it removes? As in the FIXME a few lines up? That sounds sensible to me on paper. I looked at it briefly before; there may be some issues getting it working well, but if the performance looks OK then that sounds like a good way to go. InstCombine shouldn't really be inserting more instructions, and some extracts are not expected to be cheap.

As for using the actual cost model through TTI's getInstructionCost: InstCombine is expected to be a target-independent canonicalization pass and shouldn't use those methods. But hopefully that isn't needed in this case.

Otherwise it might be possible to fix this up in the backend, if improving this code looks like it will cause too many other issues.

nikic added a comment. May 23 2023, 5:34 AM

I don't have the necessary context here, but if limiting this to not insert extra instructions works, that sounds reasonable. If we do need target cost-modelling, the transform should be moved into VectorCombine.