[Intrinsics] define funnel shift IR intrinsics + DAG builder support

Authored by spatel on Jul 12 2018, 8:27 AM.



As discussed here:

We want to add rotate intrinsics because the IR expansion of that pattern is 4+ instructions, and we can lose pieces of the pattern before it gets to the backend. Generalizing the operation by allowing 2 different input values (plus the 3rd shift/rotate amount) gives us a "funnel shift" operation which may also be a single hardware instruction.

Initially, I thought we needed to define new DAG nodes for these ops, and I spent time working on that (much larger patch), but then I concluded that we don't need that. At least as a first step, we have all of the backend support necessary to match these ops...because it was required. And shepherding these through the IR optimizer is the primary concern, so the IR intrinsics are likely all that we'll ever need.

There was also a question about converting the intrinsics to the existing ROTL/ROTR DAG nodes (along with improving the oversized shift documentation). Again, I don't think that's strictly necessary (as the test results here prove). That can be an efficiency improvement as a small follow-up patch.

So all we're left with is documentation, definition of the IR intrinsics, and DAG builder support. I've included tests for AArch64, PowerPC, and x86 because those are the targets I'm most familiar with.

Diff Detail

spatel created this revision.Jul 12 2018, 8:27 AM
lebedev.ri added inline comments.Jul 12 2018, 9:09 AM
11917 ↗(On Diff #155179)

Missing newline.
Might be a good idea to check that these docs build

RKSimon added inline comments.Jul 12 2018, 9:48 AM
11915 ↗(On Diff #155179)

Is the shift amount going to be treated as a modulo value?

spatel added inline comments.Jul 12 2018, 9:57 AM
11915 ↗(On Diff #155179)

Yes - at least that's what I think we want. It's the last line under 'Semantics' because this Overview/Arguments/Semantics organization is what was copy/pasted from the existing intrinsic docs.

Let me move/repeat that bit, so we state that from the beginning.

And I'll make a docs build target as suggested to avoid formatting bugs.

spatel updated this revision to Diff 155251.Jul 12 2018, 12:22 PM

Patch updated:

  1. Moved unsigned modulo description of the shift amount to the 'Overview' which makes 'Semantics' unnecessary, so those paragraphs are gone.
  2. Fixed formatting bug.
davezarzycki resigned from this revision.Jul 12 2018, 12:27 PM
fabiang added inline comments.
11903 ↗(On Diff #155251)

"Concatenated" is ambiguous, it should say explicitly which of %a or %b is the low (less significant) half and which is the high half. (From the example, looks like %a is more, %b is less significant.)

11947 ↗(On Diff #155251)

Clarify concatenation here too. (Looks like in this one, %a is low and %b is high.)

fabiang added inline comments.Jul 13 2018, 12:40 AM
5695 ↗(On Diff #155251)

Wait, I don't think this lowering works when Z=0 (mod BitWidth) when BitWidth is the width of the register; not on shift-amounts-modulo-register-width architectures anyway. (It should work on PPC.)

e.g. for 32-bit fshl and with shift widths taken modulo 32, we get (X << 0) | (Y >> 32) == (X >> 0) | (Y >> 0) == X | Y, not the expected X.

For compile-time constant Z the Z=0 case can be handled, and when X=Y (i.e. rotate) it turns out to be harmless, but for X!=Y and variable Z that happens to be congruent to 0 I think this is trouble.

spatel added inline comments.Jul 13 2018, 8:50 AM
5695 ↗(On Diff #155251)

Oops...I stepped in the UB this was supposed to avoid. And even if we fix it for rotates, we still have the problem for proper funnel shifts (X != Y) as you've noted. We'd have to select the 0-shift case out I think.

Should we just go back to the original plan and make rotate intrinsics instead of funnel shift? I don't see much demand for the more general funnel shift.

spatel updated this revision to Diff 155445.Jul 13 2018, 11:58 AM

Patch updated:
Given the problems with the general funnel shift, I didn't bother to update the LangRef as suggested, but I (hopefully) fixed the translation to be safe, so we can see what that code and current codegen looks like.

We modulo both shift amounts now, and if it's not a rotate, we have to cmp/select the 0 case. I added tests for constant shifting by the bitwidth to verify that we're getting the expected output in those cases.

spatel updated this revision to Diff 155481.Jul 13 2018, 2:03 PM

Patch updated:

  1. Made LangRef text more specific.
  2. Added/changed examples to show modulo shift behavior.
  3. Added tests (and fixed copy/paste naming inaccuracies).

It might not be as bad as I feared . We should be able to match the variable funnel cases (or create the DAG nodes for those if we really need to). And we're not overselling the capabilities to end users until clang or other front-ends adds their own builtin/intrinsics that maps to this, so we should have time to adjust if needed. Exposing the more common and specific rotate functionality as a builtin can happen first I think.

fabiang added inline comments.Jul 13 2018, 2:03 PM
5678 ↗(On Diff #155445)

It's definitely rare and, perhaps more importantly, not nearly as commonly supported as rotates, especially for the variable-amount case.

OTOH, even if the resulting DAG is a bit weird, any backends that want to match it can do so easily _if_ it has a canonical form. It's the almost-but-not-quite-canonical cases where the general shape is clear but there are lots of equivalent minor variations that turn out to be brittle and over-specific in my experience. Plus there's always the option of adding real DAG funnel shift ops in a later patch, should the demand arise.

It might be better to split the tests down into separate files a little - illegal types, rotation patterns etc? I have a feeling these test files might grow and become unwieldy. Possibly add i686 tests as well after that.

spatel updated this revision to Diff 155560.Jul 14 2018, 10:04 AM

Patch updated:

  1. Split the test files according to general funnel-shift vs. rotate (repeated input operand).
  2. Added i686 run for x86. I let that one show SSE2 and x64 show AVX2 rather than adding every permutation because there's not much meaningful difference in the outputs.

No more comments from me - anyone else?

For reference, let me repeat the next steps assuming we're close to good here (I just wrote this in D47735 ):

  1. Convert the intrinsics directly to ROTL/ROTR nodes (this should be a ~2 line patch in SelectionDAGBuilder).
  2. Expose the intrinsic as clang or other front-end builtins (the first of these will be builtin_rotate* rather than the more general builtin_funnel_shift).
  3. Add simplifications/folds/analysis for the intrinsics to IR passes.
  4. Canonicalize to the intrinsics in instcombine.

Let me know if I'm missing anything.

I added a few more notes on the docs.

11924 ↗(On Diff #155560)

I think this should be extract((contact(%x, %y) << %z), 8, 15) since it grabs the more significant half of the shifted double-width value.

11970 ↗(On Diff #155560)

Conversely, this one should extract bits 0 through 7 not 8 through 15.

11971 ↗(On Diff #155560)

I think this example needs the first and second arguments swapped. (Alternatively, the result should be 0xfe = 0b11111110)

spatel added inline comments.Jul 16 2018, 11:27 AM
11924 ↗(On Diff #155560)

Do we have any precedence/documentation for bit numbering of LLVM IR values?
IIUC, you're seeing this:
MSB:15 -> LSB:0

but I was referencing this:
MSB:0 -> LSB 15

Not sure if this is relevant, but we count vector elements from 0-on-the-left:
"The elements of the two input vectors are numbered from left to right across both of the vectors."

Either way, I will add an explanatory comment.

11970 ↗(On Diff #155560)

Same problem as above - we need to clarify the bit numbering that we're using.

11971 ↗(On Diff #155560)

Yes - good catch.

fabiang added inline comments.Jul 16 2018, 11:44 AM
11924 ↗(On Diff #155560)

A quick search doesn't turn up anything. What I can find in other parts of the language reference seems to carefully sidestep the issue of bit numbering, which is probably for the best given the holy wars involved.

Just rewrite as the non-ambiguous (... >> 8) & 0xff (high extract) and ... & 0xff (low extract)? It's just an example anyway.

spatel updated this revision to Diff 155739.Jul 16 2018, 12:28 PM

Patch updated:

  1. Fixed LangRef example text to avoid bit-numbering ambiguity.
  2. Fixed LangRef example that showed the wrong result.
  3. Added regression tests (to the AArch file) for the LangRef examples to verify that the examples match the code.
kparzysz added inline comments.Jul 16 2018, 1:07 PM
5685 ↗(On Diff #155739)

You can use TargetLowering::getSetCCResultType instead.

kparzysz accepted this revision.Jul 16 2018, 1:23 PM

The setcc type can be addressed later. This looks good to me.

This revision is now accepted and ready to land.Jul 16 2018, 1:23 PM
spatel added a comment.EditedJul 16 2018, 1:32 PM

The setcc type can be addressed later. This looks good to me.


I did look at using TLI.getSetCCResultType(), but it's not clear to me that that is ideal. I think we want to keep the type i1 at the start to allow general DAG folds that are written like their IR counterparts. Eg, this is the builder code for an icmp/fcmp:

EVT DestVT = DAG.getTargetLoweringInfo().getValueType(DAG.getDataLayout(), I.getType());

so I think that will always be an i1? (and vectors are N x i1)

But there doesn't appear to be any consistent usage here. Sometimes, it's written like that, others hardcode as i1, and others use getSetCCResultType().

This revision was automatically updated to reflect the committed changes.