The DAG combiner can recognize a pattern of ORed shifts that evaluate to a bit rotation. When the rotation is ORed with another value, the OR operations can get reassociated in such a way that the rotation will no longer be identified. This patch implements a more aggressive analysis of OR operations to detect rotation patterns.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4920 | Please use while (!WorkQ.empty()) instead of this for loop. | |
4955–4956 | This check is redundant. | |
4960 | Could you sort the nodes so that shifts with the same LHS end up together, and use that to change this loop into something that isn't O(N^2)? | |
4969 | Use of "break" here is weird; it breaks out of the inner loop, but not the outer loop. I'd like to see a testcase with multiple rotates or'ed together. |
Changed pair-matching to work on smaller segments (only on shifts of the same value).
Added a testcase of rotates ORed with other rotates, or with unrelated operations.
Is this related to D47681 ?
This only has tests for Hexagon, can you please also add test[s] for X86, maybe AArch64?
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4919 | You can do std::queue<SDValue, SmallVector<SDValue, 8>> WorkQ; to get the usual small-size-optimization benefits. | |
4935 | I would think you'd want DenseMap here. |
This patch and the one you mentioned coincidentally both apply to rotates, but there wasn't any coordination between them.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4919 | SmallVector doesn't have pop_front, so that won't work. |
What i guess i was asking, there is no overlap, they are working on a slightly different problems, although related to rotates.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4919 | Right, i was thinking of std::stack, not queue, sorry. | |
test/CodeGen/X86/rotate-multi.ll | ||
2 | X86(and most others) tests mostly use utils/update_llc_test_checks.py |
test/CodeGen/X86/rotate-multi.ll | ||
---|---|---|
2 | The output looks worse: it checks almost every instruction and has specific register names (outside of argument registers). |
test/CodeGen/X86/rotate-multi.ll | ||
---|---|---|
2 | But its much less likely to allow mistakes to get through. Plus on x86 its often very useful to check all the surrounding code as that often hides issues. Once the test file is finalised best practice is to use update_llc_test_checks on it against trunk, commit that and then update the patch to show the codegen diff. Also, your RUN line should use a triple, not an arch. |
Please can you add the tests to trunk with the current codegen and rebase this patch to show the codegen diff.
Assuming we're not too far off from adding IR intrinsics to represent rotate ops (D49242), would transforming to those intrinsics in IR take care of the motivating problem?
At the moment D49242 expands these intrinsics into individual DAG operations. If the intrinsics were transformed into ROTL, and if instcombine could reassociate "or" operations to expose more fshl opportunities, then I guess it would be sufficient.
A benefit of having it in the DAG combiner is that it could handle IR generators that have not generated funnel shifts.
The next steps as I see it after D49242:
- Convert the intrinsics directly to ROTL/ROTR nodes (this should be a ~2 line patch in SelectionDAGBuilder).
- Expose the intrinsic as clang or other front-end builtins (the first of these will be builtin_rotate* rather than the more general builtin_funnel_shift).
- Add simplifications/folds/analysis for the intrinsics to IR passes.
- Canonicalize to the intrinsics in instcombine.
So if these rotate ops can be created/matched sooner (and likely more easily) in IR, then I think it's a better investment to get those intrinsics into the IR rather than trying to put the patterns back together again here in the DAG.
One goal was to be able to generate rol-and-accumulate instruction (on Hexagon), specifically for the accumulate operation being | (see f11 in rotate.ll). For the C code we still don't generate it:
unsigned blah(unsigned s, unsigned x) { return s | (x << 27) | (x >> 5); }
Using clang -S -target hexagon -O2 fs.c -o - gives
{ r0 |= asl(r1,#25) } { r0 |= lsr(r1,#7) jumpr r31 }
What we'd want is r0 |= rol(r1,#7).
We need reassociation to happen in the IR optimizer if we want to recognize this as rotate (there could be any number of intermediate ops separating the 2 halves of the rotate):
define i32 @blah(i32 %x, i32 %s) { %shl = shl i32 %x, 27 %or = or i32 %shl, %s %shr = lshr i32 %x, 5 %or1 = or i32 %or, %shr <--- the 1st 'or' is between the rotated halves ret i32 %or1 }
D45842 was hoping to do something like that too. Note also that we don't currently canonicalize shl/shr/or with a constant shift amount to a rotate because that wasn't noted as a problem case, but this example suggests that we should do that.
Looking back at the comments from July 2018 - we have completed most of those tasks (in particular, we don't just expand the intrinsics now). So I think it would be nicer to get this part done with 'opt -reassociate', but I can't commit to doing that work myself immediately, so I can't be too opposed. :)
@efriedma - you started reviewing this, so I assume you're ok with a DAG patch. Does the introduction of the funnel shift intrinsics change your opinion of the implementation strategy?
I think the general approach is still fine. Given we have funnel shifts, we might want to reassociate to form funnel shifts, rather than just rotates, on targets which have native funnel shift instructions. (We'd still want to prefer rotates where possible, I think.)
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4940 | getNodeId()? |
Changed getIROrder to getNodeId.
This patch uses MatchRotate to do the actual matching, so it's only going to look for opportunities to create a rotate. Maybe that function should be replaced with MatchRotateOrFunnelShift in a future patch.
Please can you commit the rotate-multi.ll test files with trunk's current codegen, and rebase this patch to show the codegen delta?
Oof, I somehow missed the comments... :o
Committed the test cases (with checks against current trunk) in r356683.
I have a question: why do we want to do this here, in the backend?
Does back-end itself create these patterns?
Now that we have funnel-shift, we really really should be doing this in the middle-end.
In particular, yes, the reassoc pass may need some work. That did come up previously.
There can be many changes to the compiled code between the IR combiner and the DAG combiner, so these patterns can certainly appear before DAG combining takes place. Also, we already combine for rotates in the DAG, this patch only makes it more comprehensive.
After thinking about it, i think it may make sense to have this after all.
I left some comments.
Also, it would be good to have some compile-time stats comparison here.
I really think you want do add some safety limits from the getgo.
E.g. would it make sense to have more than 8 levels of these ors?
More than 256 nodes? Etc.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
5011–5012 | Do you want to assert that you start with ISD::OR? | |
5016 | This comment should be next to the loop itself. | |
5028–5039 | Can you not do this if (Opc == ISD::SHL || Opc == ISD::SRL) OpMap[V.getOperand(0)->getNodeId()].push_back(I); inside } else OredOps.push_back(V); ? One less loop. | |
5033 | // for each shifted value, create a list of shifts. | |
5034 | This data structure really needs a better name/description. | |
5050–5051 | I'm not sure whether or not auto is good here.. | |
5053 | Do we care about the order within those two groups? Is this still good in reverse-iteration mode? | |
5054–5057 | llvm::partition_point()? | |
5059–5061 | I'm not sure what is going on here, would be good to have a high-level description comment. | |
5064 | Early continue? | |
5081 | Likewise, this really needs to have a high-level description comment. |
You can do
to get the usual small-size-optimization benefits.