Page MenuHomePhabricator

[DAGCombiner] Create rotates more aggressively
Needs RevisionPublic

Authored by kparzysz on Jun 4 2018, 12:04 PM.

Details

Summary

The DAG combiner can recognize a pattern of ORed shifts that evaluate to a bit rotation. When the rotation is ORed with another value, the OR operations can get reassociated in such a way that the rotation will no longer be identified. This patch implements a more aggressive analysis of OR operations to detect rotation patterns.

Diff Detail

Repository
rL LLVM

Event Timeline

kparzysz updated this revision to Diff 149980.Jun 5 2018, 7:25 AM

Rebased on top of D47725.

efriedma added inline comments.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5498

Please use while (!WorkQ.empty()) instead of this for loop.

5533–5534

This check is redundant.

5538

Could you sort the nodes so that shifts with the same LHS end up together, and use that to change this loop into something that isn't O(N^2)?

5547

Use of "break" here is weird; it breaks out of the inner loop, but not the outer loop.

I'd like to see a testcase with multiple rotates or'ed together.

kparzysz marked 4 inline comments as done.Jun 7 2018, 8:29 AM
kparzysz updated this revision to Diff 150345.Jun 7 2018, 8:43 AM

Changed pair-matching to work on smaller segments (only on shifts of the same value).

Added a testcase of rotates ORed with other rotates, or with unrelated operations.

kparzysz retitled this revision from [SelectionDAG] Create rotates more aggressively to [DAGCombiner] Create rotates more aggressively.Jun 7 2018, 8:45 AM

Is this related to D47681 ?
This only has tests for Hexagon, can you please also add test[s] for X86, maybe AArch64?

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5497

You can do

std::queue<SDValue, SmallVector<SDValue, 8>> WorkQ;

to get the usual small-size-optimization benefits.

5513

I would think you'd want DenseMap here.

kparzysz marked an inline comment as done.Jun 18 2018, 10:45 AM

This patch and the one you mentioned coincidentally both apply to rotates, but there wasn't any coordination between them.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5497

SmallVector doesn't have pop_front, so that won't work.

kparzysz updated this revision to Diff 151750.Jun 18 2018, 10:46 AM

Responded to comments, added x86-64 testcase.

This patch and the one you mentioned coincidentally both apply to rotates, but there wasn't any coordination between them.

What i guess i was asking, there is no overlap, they are working on a slightly different problems, although related to rotates.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5497

Right, i was thinking of std::stack, not queue, sorry.

test/CodeGen/X86/rotate-multi.ll
1

X86(and most others) tests mostly use utils/update_llc_test_checks.py

kparzysz added inline comments.Jun 18 2018, 11:21 AM
test/CodeGen/X86/rotate-multi.ll
1

The output looks worse: it checks almost every instruction and has specific register names (outside of argument registers).

RKSimon added inline comments.Jun 18 2018, 12:31 PM
test/CodeGen/X86/rotate-multi.ll
1

But its much less likely to allow mistakes to get through. Plus on x86 its often very useful to check all the surrounding code as that often hides issues.

Once the test file is finalised best practice is to use update_llc_test_checks on it against trunk, commit that and then update the patch to show the codegen diff.

Also, your RUN line should use a triple, not an arch.

kparzysz updated this revision to Diff 151777.Jun 18 2018, 1:06 PM

Updated x86 testcase.

Please can you add the tests to trunk with the current codegen and rebase this patch to show the codegen diff.

kparzysz updated this revision to Diff 155698.Jul 16 2018, 9:00 AM

Rebased on top of trunk.

Assuming we're not too far off from adding IR intrinsics to represent rotate ops (D49242), would transforming to those intrinsics in IR take care of the motivating problem?

At the moment D49242 expands these intrinsics into individual DAG operations. If the intrinsics were transformed into ROTL, and if instcombine could reassociate "or" operations to expose more fshl opportunities, then I guess it would be sufficient.

A benefit of having it in the DAG combiner is that it could handle IR generators that have not generated funnel shifts.

At the moment D49242 expands these intrinsics into individual DAG operations. If the intrinsics were transformed into ROTL, and if instcombine could reassociate "or" operations to expose more fshl opportunities, then I guess it would be sufficient.

A benefit of having it in the DAG combiner is that it could handle IR generators that have not generated funnel shifts.

The next steps as I see it after D49242:

  1. Convert the intrinsics directly to ROTL/ROTR nodes (this should be a ~2 line patch in SelectionDAGBuilder).
  2. Expose the intrinsic as clang or other front-end builtins (the first of these will be builtin_rotate* rather than the more general builtin_funnel_shift).
  3. Add simplifications/folds/analysis for the intrinsics to IR passes.
  4. Canonicalize to the intrinsics in instcombine.

So if these rotate ops can be created/matched sooner (and likely more easily) in IR, then I think it's a better investment to get those intrinsics into the IR rather than trying to put the patterns back together again here in the DAG.

@kparzysz Do we still need this? Does the IR funnel shift work that @spatel did last year make this redundant?

Herald added a project: Restricted Project. · View Herald TranscriptFeb 19 2019, 10:34 AM
kparzysz added a comment.EditedFeb 20 2019, 2:02 PM

One goal was to be able to generate rol-and-accumulate instruction (on Hexagon), specifically for the accumulate operation being | (see f11 in rotate.ll). For the C code we still don't generate it:

unsigned blah(unsigned s, unsigned x) {
  return s | (x << 27) | (x >> 5);
}

Using clang -S -target hexagon -O2 fs.c -o - gives

{
        r0 |= asl(r1,#25)
}
{
        r0 |= lsr(r1,#7)
        jumpr r31
}

What we'd want is r0 |= rol(r1,#7).

One goal was to be able to generate rol-and-accumulate instruction (on Hexagon), specifically for the accumulate operation being | (see f11 in rotate.ll). For the C code we still don't generate it:

unsigned blah(unsigned s, unsigned x) {
  return s | (x << 27) | (x >> 5);
}

We need reassociation to happen in the IR optimizer if we want to recognize this as rotate (there could be any number of intermediate ops separating the 2 halves of the rotate):

define i32 @blah(i32 %x, i32 %s) {
  %shl = shl i32 %x, 27
  %or = or i32 %shl, %s
  %shr = lshr i32 %x, 5
  %or1 = or i32 %or, %shr <--- the 1st 'or' is between the rotated halves
  ret i32 %or1
}

D45842 was hoping to do something like that too. Note also that we don't currently canonicalize shl/shr/or with a constant shift amount to a rotate because that wasn't noted as a problem case, but this example suggests that we should do that.

Are you opposed to having this done in the DAG combiner?

Are you opposed to having this done in the DAG combiner?

Looking back at the comments from July 2018 - we have completed most of those tasks (in particular, we don't just expand the intrinsics now). So I think it would be nicer to get this part done with 'opt -reassociate', but I can't commit to doing that work myself immediately, so I can't be too opposed. :)

@efriedma - you started reviewing this, so I assume you're ok with a DAG patch. Does the introduction of the funnel shift intrinsics change your opinion of the implementation strategy?

I think the general approach is still fine. Given we have funnel shifts, we might want to reassociate to form funnel shifts, rather than just rotates, on targets which have native funnel shift instructions. (We'd still want to prefer rotates where possible, I think.)

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5518

getNodeId()?

kparzysz updated this revision to Diff 187864.Feb 21 2019, 2:22 PM

Changed getIROrder to getNodeId.

This patch uses MatchRotate to do the actual matching, so it's only going to look for opportunities to create a rotate. Maybe that function should be replaced with MatchRotateOrFunnelShift in a future patch.

Please can you commit the rotate-multi.ll test files with trunk's current codegen, and rebase this patch to show the codegen delta?

Please can you commit the rotate-multi.ll test files with trunk's current codegen, and rebase this patch to show the codegen delta?

And the additional rotate.ll tests as well - cheers!

Oof, I somehow missed the comments... :o

Committed the test cases (with checks against current trunk) in r356683.

Oof, I somehow missed the comments... :o

Committed the test cases (with checks against current trunk) in r356683.

.. and rebase this diff to show the changes?

I have a question: why do we want to do this here, in the backend?
Does back-end itself create these patterns?
Now that we have funnel-shift, we really really should be doing this in the middle-end.
In particular, yes, the reassoc pass may need some work. That did come up previously.

kparzysz updated this revision to Diff 191736.Mar 21 2019, 10:25 AM

Rebased on top of the pre-committed testcases.

There can be many changes to the compiled code between the IR combiner and the DAG combiner, so these patterns can certainly appear before DAG combining takes place. Also, we already combine for rotates in the DAG, this patch only makes it more comprehensive.

RKSimon edited reviewers, added: efriedma, lebedev.ri; removed: eli.friedman.Apr 26 2019, 1:07 AM
lebedev.ri requested changes to this revision.Jul 10 2019, 3:29 PM

After thinking about it, i think it may make sense to have this after all.
I left some comments.

Also, it would be good to have some compile-time stats comparison here.
I really think you want do add some safety limits from the getgo.
E.g. would it make sense to have more than 8 levels of these ors?
More than 256 nodes? Etc.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5499–5500

Do you want to assert that you start with ISD::OR?

5504

This comment should be next to the loop itself.
It would be best to explain the data structures here instead.

5516–5527

Can you not do this

if (Opc == ISD::SHL || Opc == ISD::SRL)
  OpMap[V.getOperand(0)->getNodeId()].push_back(I);

inside

} else
  OredOps.push_back(V);

?

One less loop.

5521
// for each shifted value, create a list of shifts.
5522

This data structure really needs a better name/description.

5538–5539

I'm not sure whether or not auto is good here..

5541

Do we care about the order within those two groups? Is this still good in reverse-iteration mode?

5542–5545

llvm::partition_point()?

5547–5549

I'm not sure what is going on here, would be good to have a high-level description comment.

5552

Early continue?

5569

Likewise, this really needs to have a high-level description comment.

This revision now requires changes to proceed.Jul 10 2019, 3:29 PM