This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Compare BFI and ORR with left-shifted operand for OR instruction selection.
ClosedPublic

Authored by mingmingl on Oct 3 2022, 1:52 PM.

Details

Summary

Before this patch:

  • For r = or op0, op1, tryBitfieldInsertOpFromOr combines it to BFI when
    1. one of the two operands is a bit-field-positioning or bit-field-extraction op; and
    2. the bits from the two operands don't overlap

After this patch:

  • Right before the OR is combined to BFI, evaluate whether an ORR with a left-shifted operand is better.

A motivating example (https://godbolt.org/z/rnMrzs5vn), added as the test case test_orr_not_bfxil in CodeGen/AArch64/bitfield-insert.ll.

For the IR:

define i64 @test_orr_not_bfxil(i64 %0) {
  %2 = and i64 %0, 127    ; keep bits 0..6 of %0
  %3 = lshr i64 %0, 1
  %4 = and i64 %3, 16256  ; 0x3F80: bits 7..13 of %3, i.e. bits 8..14 of %0
  %5 = or i64 %4, %2
  ret i64 %5
}

Before:

lsr     x8, x0, #1          // x8 = x0 >> 1
and     x8, x8, #0x3f80     // keep bits 7..13 (bits 8..14 of x0)
bfxil   x8, x0, #0, #7      // insert bits 0..6 of x0; each op depends on the previous one

After:

ubfx    x8, x0, #8, #7      // extract bits 8..14 of x0
and     x9, x0, #0x7f       // keep bits 0..6; independent of the ubfx
orr     x0, x9, x8, lsl #7  // combine; the two ops above can issue in parallel

Diff Detail

Event Timeline

mingmingl created this revision.Oct 3 2022, 1:52 PM
Herald added a project: Restricted Project.Oct 3 2022, 1:52 PM
mingmingl requested review of this revision.Oct 3 2022, 1:52 PM
mingmingl updated this revision to Diff 464896.Oct 4 2022, 12:15 AM
mingmingl retitled this revision from [AArch64] Combine or(and(val,shifted-mask), op) to or(shl(and(val>>N, mask), N), op) to [AArch64] Combine or(and(val,shifted-mask), op) to or(shl(and(srl(val,N), mask), N), op).
mingmingl edited the summary of this revision. (Show Details)

Fix code style issues (nested if, etc.)

mingmingl edited the summary of this revision. (Show Details)Oct 4 2022, 12:16 AM
mingmingl planned changes to this revision.Oct 4 2022, 9:59 AM

Actually the current implementation causes an infinite loop for 'test_nouseful_strb' in 'llvm/test/CodeGen/AArch64/bitfield-insert.ll' (check-llvm hangs at 99%). Will make fixes for that. Sorry about it.

mingmingl added a comment.EditedOct 11 2022, 10:46 AM

An update while trying to figure out the cause of the infinite loop:

  1. How the infinite loop takes place (with the diff as shown)
    1. By rewriting AND(val, shifted-mask) to shl(and(srl(val,N), mask), N), the patch creates two more SDNodes in the DAG combiner; the added nodes can easily interact badly with existing combine logic, causing a repeating cycle of {node expansion (as this patch does), node combination (existing logic)}. A two-line piece of LLVM IR is attached in [1] to exemplify this.
    2. The infinite loop can be avoided by adding the lines in [2] (atop the current patch), but it's hard to prove that rewriting one DAG node into three (AND(val, shifted-mask) to shl(and(srl(val,N), mask), N)) is not fragile (i.e., that it won't interact badly with future combine logic).
  2. The motivating test case (including {5-line C++ source, current codegen, optimal codegen}) is https://godbolt.org/z/h96b1sGco
    • The source of the over-eager BFM usage is this in AArch64DAGToDAGISel::Select: there is no DAG node for the BFM instruction, so BFM selection happens after the DAG combiner, and is written in C++ so it can see through bit-simplification opportunities between the two operands.
    • "Bit-simplification opportunities" means: getUsefulBits scans the users of the ISD::OR (with a limited recursion depth) to shrink the number of useful bits; if bits can be proved not useful (by the users), using BFM eliminates DAG nodes and thereby reduces the number of instructions. A hypothetical IR sketch follows this list.
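For illustration, a minimal sketch of such an opportunity (the function name and constants here are hypothetical, not from the patch):

; The truncating store makes only bits 0..7 of %or useful. Scanning the
; users (as getUsefulBits does) proves the higher bits of both OR operands
; dead, so BFM-based selection can drop the masking of those bits.
define void @store_low_byte(i64 %x, i64 %y, ptr %p) {
  %shl = shl i64 %x, 4
  %lo = and i64 %y, 15
  %or = or i64 %shl, %lo
  %t = trunc i64 %or to i8
  store i8 %t, ptr %p
  ret void
}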

What I learnt from 1 and 2

  • The bit-simplification opportunities from BFM should be retained, since in the best case they eliminate instructions.
  • Instruction selection needs to be enhanced to choose between ORR (with shifted register) and BFM (for the motivating test case in 2). One way to do this is to introduce a SelectionDAG node for the 'BFM' instruction, let instructions go through the DAG combiner for evaluation, and rewrite the getUsefulBits function (which currently relies on MachineOpcode, i.e., on the users of an SDNode already being selected) so that it can analyze useful bits before SDNodes are selected.

To enhance ISel to choose between 'ORR' and 'BFM', I'm planning changes that add a SelectionDAG node for 'BFM' (and probably UBFM, since UBFM is helpful for seeing through useful bits). Going to make some revisions and send them out in stacked diffs.

[1] https://godbolt.org/z/qexqzYx1W (an AND with a shifted-mask operand is rewritten by this patch, and combined back by shouldFoldConstantShiftPairToMask to an AND with a shifted mask again, causing the infinite loop)
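To make [1] concrete, a minimal hypothetical reconstruction (the exact two-line reproducer is at the godbolt link above):

; A single-use AND with a shifted mask (16256 = 0x3F80 = 0x7F << 7).
define i64 @shifted_mask(i64 %x) {
  %r = and i64 %x, 16256
  ret i64 %r
}
; The patch rewrites (and x, 0x3F80) to (shl (and (srl x, 7), 0x7F), 7);
; shouldFoldConstantShiftPairToMask then folds the shift pair back into
; (and x, 0x3F80), and the two rewrites alternate forever.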

[2]

diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index fc1893b6d61d..a2d4fe280134 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -14220,6 +14220,13 @@ bool AArch64TargetLowering::shouldFoldConstantShiftPairToMask(
     return (!C1 || !C2 || C1->getZExtValue() >= C2->getZExtValue());
   }
 
+  // 1. It's known that N is a shl or srl.
+  // 2. If N has one use, and that use could fold the shift, return false.
+  // FIXME: Extend this from ORR to other ops.
+  if (N->hasNUsesOfValue(1, 0) && N->use_begin()->getOpcode() == ISD::OR) {
+    return false;
+  }
+
   return true;
 }
khchen added a subscriber: khchen.Oct 12 2022, 7:52 AM
mingmingl updated this revision to Diff 472121.Oct 31 2022, 3:28 PM
mingmingl retitled this revision from [AArch64] Combine or(and(val,shifted-mask), op) to or(shl(and(srl(val,N), mask), N), op) to [AArch64] Compare BFI and ORR with left-shifted operand for OR instruction selection..
mingmingl edited the summary of this revision. (Show Details)

In order to generate an ORR with a shifted operand (not BFI) when ORR is better, this patch does a comparison inside tryBitfieldInsertOpFromOr (i.e., after the DAG combiner, alongside the C++-based instruction selection).

This patch optimizes many existing test cases (as shown by the updated tests), but introduces one regression.

  • Before the patch, the generation of BFM conveys which bits are used in Rd and Rm, respectively (see getUsefulBitsFromBFM).
  • After this patch, ORRWrs op0, op1, lsl #imm (with a shifted register) is generated rather than BFM in some cases; however, the bit-field usage information for op0 is not preserved (and there isn't a way to express this in class SDNode without adding a specialized class that derives from SDNode, like LoadSDNode for example).

A side question: is it a typical use case to convey metadata (e.g., that op0 and op1 inside ORR op0, op1 contribute bits that don't overlap) in the SDNode class?

Two alternative options:

  1. Introduce a DAG node for BFM, so as to compare BFM and ORR in the DAG combiner.
    • The drawback of generating BFM earlier (i.e., in the DAG combiner) is that all other bit-field-processing nodes (AND, SHL, etc.) need to be taught to combine with BFM. In other words, introducing a BFM DAG node without regressing existing combines requires a lot of work.
    • I had a local patch that actually adds the BFM node, where missed combines of BFM with existing nodes manifest.
  2. Do the transformation in the aarch64-mi-peephole-opt pass.
    • Since aarch64-mi-peephole-opt optimizes based on ISel output, it's a net optimization (compared with the current patch, i.e., no drawback of lost information).
    • However, the implementation inside aarch64-mi-peephole-opt would handle MachineInstrs (and MachineOpcodes; ISD::AND fleshes out in many forms, like ANDWrs, ANDWrr, etc.), and cannot reuse the helper functions in the ISel pass.

I think alternative #2 (inside aarch64-mi-peephole-opt) is better than alternative #1 (which could be a can of worms due to missed combines between BFM and the existing AND/OR nodes), and in some sense better than the current patch (at the cost of more implementation work).

Feedback/thoughts on where (the peephole pass or the current patch) to pursue this optimization would be appreciated!

Hmm. I had not considered an ORR with shift to be cheaper than a BFM before. From what I can tell it doesn't seem to be universal across all CPUs, but it does look like it will be faster or equal.

A side question: is it a typical use case to convey metadata (e.g., that op0 and op1 inside ORR op0, op1 contribute bits that don't overlap) in the SDNode class?

That sounds like it would usually be calculated with KnownBits, like in haveNoCommonBitsSet. Unfortunately post-isel the amount of information we can extract is much less than from the generic DAG nodes.
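In generic DAG/IR form this is straightforward to prove; a hypothetical illustration (my own, not from the patch):

; KnownBits shows %a can only set bits 0..7 and %b can only set bits
; 8..63, so haveNoCommonBitsSet(%a, %b) holds and the OR is disjoint.
define i64 @disjoint(i64 %x, i64 %y) {
  %a = and i64 %x, 255
  %b = shl i64 %y, 8
  %or = or i64 %a, %b
  ret i64 %or
}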

[About the two/three options]

Is the motivating pattern just the one from the commit message, or any bfm that could be a shifted orr? aarch64-mi-peephole-opt is an option - we always run into problems implementing things there but if it is easier to write that is always an option. (The machine combiner too, if scheduling info is useful). Larger patterns might be more difficult though. The (existing) code in DAG2DAG doesn't feel like it scales super well. But equally like you say adding ISel nodes has downsides. What would this look like from GlobalISel? How much code would need to be added to make aarch64-mi-peephole-opt work?

Are the only regressions on uaddo_v4i1 and umulo_v4i1? I'm not against ignoring those, if they are just overflowing nodes on i1 types being awkwardly expanded and it doesn't come up in other places.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2835 ↗(On Diff #472121)

You say #7 here, but #8 elsewhere, including the test. I think 7 is correct for this example.

mingmingl updated this revision to Diff 472519.Nov 2 2022, 12:47 AM
mingmingl edited the summary of this revision. (Show Details)

fix a bug around width computation (updated the commit message and patch summary as well), and adjust whitespace in a few test cases (mainly those not updated by utils/update_llc_test_checks.py) to remove irrelevant diffs.

Hmm. I had not considered an ORR with shift to be cheaper than a BFM before. From what I can tell it doesn't seem to be universal across all CPUs, but it does look like it will be faster or equal.

Yes, it's true that ORR with shift can be the same as BFM (e.g. Cortex-A57) or faster (e.g. Neoverse N1, Cortex-A77).

A side question: is it a typical use case to convey metadata (e.g., that op0 and op1 inside ORR op0, op1 contribute bits that don't overlap) in the SDNode class?

That sounds like it would usually be calculated with KnownBits, like in haveNoCommonBitsSet. Unfortunately post-isel the amount of information we can extract is much less than from the generic DAG nodes.

Yes, SelectionDAG::computeKnownBits handles generic DAG nodes but doesn't handle machine opcodes (with NodeType < 0); as a result, computing something similar to haveNoCommonBitsSet won't work out of the box.

[About the two/three options]

Is the motivating pattern just the one from the commit message, or any bfm that could be a shifted orr?

Yes, the motivating test case is just the one from the commit message; the other updated tests are the results of other, simpler patterns (added to show that ORR-instead-of-BFM is a more general problem to solve).

aarch64-mi-peephole-opt is an option - we always run into problems implementing things there but if it is easier to write that is always an option. (The machine combiner too, if scheduling info is useful). Larger patterns might be more difficult though. The (existing) code in DAG2DAG doesn't feel like it scales super well. But equally like you say adding ISel nodes has downsides. What would this look like from GlobalISel? How much code would need to be added to make aarch64-mi-peephole-opt work?

BFM is not used in GlobalISel (four instructions are generated: https://godbolt.org/z/MMvMe34zv), so a BFM pattern matcher (inside the peephole pass or machine combiner) won't optimize GlobalISel output in this case.

Regarding the amount of work inside the peephole pass (or machine combiner), I don't have a demo at hand, but the number of lines should be within a few hundred (not thousands) just for the motivating test case.

However, without the context that this BFI came from an ISD::OR, building up that context (to show it's correct to convert the BFI back to ORR) and fixing the other affected tests in this patch would require some implementation work.

Are the only regressions on uaddo_v4i1 and umulo_v4i1? I'm not against ignoring those, if they are just overflowing nodes on i1 types being awkwardly expanded and it doesn't come up in other places.

In the affected test cases, only uaddo_v4i1 and umulo_v4i1 regressed. More generally, the useful-bits info (available from BFM, lost with ORR) simplifies away one AND node on Dst (as shown in the code link, when the AND zeros exactly the bits that are going to be inserted from Src), so other cases might show up (not just type-extended small integers); a hypothetical example follows.
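A sketch of that shape (hypothetical; the affected tests use vector overflow intrinsics instead):

; The AND clears exactly bits 8..15 of %dst, the bits then inserted from
; %src. A BFI overwrites those bits anyway, so with the useful-bits info a
; single bfi w0, w1, #8, #8 absorbs both ANDs and the shift; an ORR with a
; shifted operand keeps them as separate masking instructions.
define i32 @insert_second_byte(i32 %dst, i32 %src) {
  %cleared = and i32 %dst, -65281 ; 0xFFFF00FF: clear bits 8..15
  %low = and i32 %src, 255
  %shl = shl i32 %low, 8
  %or = or i32 %cleared, %shl
  ret i32 %or
}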

Maybe I could write a working demo in the peephole pass or machine combiner for one motivating case as a start? For the rest of the tests, I could file GitHub issues to track them.

mingmingl marked an inline comment as done.Nov 2 2022, 12:50 AM
mingmingl added inline comments.
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2835 ↗(On Diff #472121)

Thanks for the good catch!

Fixed it (I overlooked that the width is 'imms - immr + 1' for UBFX; e.g. ubfx x8, x0, #8, #7 encodes immr = 8 and imms = 14, so the width is 14 - 8 + 1 = 7).

llvm/test/CodeGen/AArch64/trunc-to-tbl.ll
238 ↗(On Diff #472519)

This test case was generated by utils/update_llc_test_checks.py, but for some reason the whitespace causes more diff than expected.

I'm going to run the auto-updater on a clean branch and see whether the whitespace diff shows up without this patch.

I read through the code. I'm not the biggest expert on this DAGToDAG code, but what is here seems sensible to me. All the tests look OK too.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
2872 ↗(On Diff #472519)

LLVM usually adds a message to every assert. It can help distinguish them, especially when the condition is fairly generic.

llvm/test/CodeGen/AArch64/trunc-to-tbl.ll
238 ↗(On Diff #472519)

Feel free to regenerate the files that need it and check those in to reduce the differences here.

mingmingl updated this revision to Diff 472802.Nov 2 2022, 4:33 PM
mingmingl marked 3 inline comments as done.
mingmingl edited the summary of this revision. (Show Details)

resolve comments.

Thanks for the reviews! PTAL.

llvm/test/CodeGen/AArch64/trunc-to-tbl.ll
238 ↗(On Diff #472519)
mingmingl updated this revision to Diff 472846.Nov 2 2022, 10:43 PM

This update only runs git clang-format HEAD~1; no functional change. Without this, the pre-merge checks fail with "ERROR git-clang-format returned an non-zero exit code 1".

mingmingl updated this revision to Diff 472988.Nov 3 2022, 11:24 AM
mingmingl edited the summary of this revision. (Show Details)

rebase after D137296

dmgreen accepted this revision.Nov 3 2022, 12:11 PM

Thanks. LGTM

This revision is now accepted and ready to land.Nov 3 2022, 12:11 PM

Thanks for the reviews! Going to submit and implement the FIXME (for SRL) in follow-up patches.