Now, that we have funnel shift intrinsics, it should be safe to convert this form of rotate to it. In the worst case (a target that doesn't have rotate instructions), we will expand this into a branch-less sequence of ALU ops (neg/and/and/lshr/shl/or) in the backend, so it's still very likely to be a perf improvement over the original code.
The motivating source code pattern for this is shown in:
I looked at several different options before deciding where to try this - instcombine, simplifycfg, CGP - because it doesn't fit cleanly anywhere AFAIK.
The backend (CGP, SDAG, GlobalIsel?) is too late for what we're trying to accomplish. We want to have the IR converted before we reach things like vectorization because the simplified code can make a loop much simpler to transform.
Technically, this could be included in instcombine, but it's a large pattern match that includes control-flow, so it just felt wrong to stuff into there (although I have a draft of that patch). Similarly, this could be part of simplifycfg, but all of this pattern matching is a stretch.
So we're left with our relatively new dumping ground for homeless transforms: aggressive-instcombine. This only runs at -O3, but that seems like a reasonable limitation given that source code has many options to avoid this pattern (including the recently added clang intrinsics for rotates).
I'm including a PhaseOrdering test because we require the teamwork of 3 passes (aggressive-instcombine, instcombine, simplifycfg) to get this into the minimal IR form that we want.
And that showed a surprise - the new pass manager fails to reduce this. That's because the new PM includes this:
// Speculative execution if the target has divergent branches; otherwise nop. FPM.addPass(SpeculativeExecutionPass());
...in the default simplification pipeline. That pass hoists all of the instructions into the entry block which allows simplifycfg to turn this sequence into straight-line code with a select. That then allows instcombine to recognize this sequence as a rotate and remove the select. But instcombine is not yet canonicalizing the 6 op ALU sequence to funnel-shift yet. I'm planning to add that, but this really looks like a bug in the new pass manager because hoisting a sequence of ops isn't supposed to be happening for all targets - from SpeculativeExecution.h:
// This pass hoists instructions to enable speculative execution on // targets where branches are expensive. This is aimed at GPUs.