Page MenuHomePhabricator

[DAG] convert vector select-of-constants to logic/math
ClosedPublic

Authored by spatel on Aug 17 2017, 11:44 AM.

Details

Summary

This goes back to a discussion about IR canonicalization. We'd like to preserve and convert more IR to 'select' than we currently do because that's likely the best choice in IR:
http://lists.llvm.org/pipermail/llvm-dev/2016-September/105335.html
...but that's often not true for codegen, so we need to account for this pattern coming in to the backend and transform it to better DAG ops.

Steps in this patch:

  1. Add an EVT param to the existing convertSelectOfConstantsToMath() TLI hook to more finely enable this transform. Other targets will probably want that anyway to distinguish scalars from vectors, but we need it here because AVX512 vectors infinite loop with these folds.
  2. Convert vselect to math or logic. Credit to @RKSimon for suggesting the xor/and/xor hack instead of add/sub for the general case ( also see https://graphics.stanford.edu/~seander/bithacks.html#MaskedMerge ). Bitwise ops should always be more amenable to further folding, so we don't need to add even more special cases for -1/0 constants here.

Try to verify the logic in Alive:
http://rise4fun.com/Alive/yFR

For x86, blendv* is always a multi-uop / multi-cycle instruction according to Agner's docs, so it always makes sense to replace that with simpler instructions. I'm not sure if the same is true for PPC.

Ie, if this:
-; CHECK-NEXT: xxsel 34, 51, 50, 34
+; CHECK-NEXT: xxland 0, 34, 50
+; CHECK-NEXT: xxlxor 34, 0, 51

...is not a good optimization in general, we could make the TLI hook distinguish between constants as a further refinement. Another possibility is to convert that back into a select as a machine-instruction-level fold predicated on uarch details?

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Aug 17 2017, 11:44 AM

> For x86, blendv* is always a multi-uop / multi-cycle instruction according to Agner's docs

Are you sure?

Bulldozer, Piledriver, Ryzen, and Skylake seem to list PBLENDVB and BLENDVPS as 1 uop.

Oops...maybe it's not so clear anymore. Page references to the doc dated May 2, 2017:

Bulldozer, Piledriver: 1 uop and 2 cycle latency for 128-bit; 2 uops and 2 cycles for 256-bit. (p. 48, 61)
Ryzen: 1 uop and 1 cycle for 128-bit; 2 uops and 1 cycle for 256-bit. (p. 88)
Skylake: 1 uop and 1 cycle for 128-bit (or just legacy encoded xmm?); 2 uops and 2 cycles for 256-bit (or any vex version?) (p. 239, 243)

So yes, it seems the recent uarch are putting more effort into making blendv a fast op. I think logic ops are still the better choice for default x86 (~SandyBridge). As I suggested in the description, we could re-form a select in machine combiner or some other machine pass if that's the better choice for a particular uarch. WDYT?

Adding some more x86 experts.

If the general case is controversial, another possibility would be to split that off as its own patch. All of the special case (0, -1, off-by-one) diffs are clear wins?

On Thu, Aug 17, 2017 at 3:52 PM, <escha@apple.com> wrote:

btw, to note, blendv *is* a “fast op” on most arches; what’s one extra uop is the fact that the V form takes 3 inputs. iirc, it’s the same reason that ADC, CMOV, etc can’t be 1 uop (2 inputs + flags). the forms that don’t take 3 inputs are usually 1 uop.

Yeah, my reality has been warped by being on x86 too long. But there's apparently good news if I'm reading Agner's numbers correctly this time: both Ryzen and Skylake can do cmov or blendv in a single cycle - getting wider vector ops to be one uop just requires widening the vector unit implementations to actually match the ISA...no problem, right? :)

That means we can canonicalize to 'select' in IR, and we're mostly done: 'select' IR becomes a (v)select node and gets lowered to the matching select instruction.

We can still detect and optimize the special cases here. Optimizing to bit-logic-ops for weak x86 becomes the uarch-specific MI transform.

Let me remove that "general" case from this patch.
IOW, gcc is doing the wrong thing here for skylake by extending the dependency chain:
https://godbolt.org/g/5WJytM

spatel updated this revision to Diff 111583.Aug 17 2017, 4:36 PM

Patch updated:
Remove the general case fold which actually isn't a win if vselect is fast. We only convert the vselect when there's a clear optimization now.

Patch updated:
Remove the general case fold which actually isn't a win if vselect is fast. We only convert the vselect when there's a clear optimization now.

That means we shouldn't need a TLI hook now, but I'd like to make that a follow-up because I'm not sure how other targets will change.

nemanjai edited edge metadata.Aug 20 2017, 12:13 AM

This patch certainly looks good from the PPC perspective as far as I can tell.

This patch certainly looks good from the PPC perspective as far as I can tell.

Thanks. I saw one potential oddity, so I filed:
https://bugs.llvm.org/show_bug.cgi?id=34246

Ping.

Can someone have a look at the x86 diffs? I can then remove the TLI hook as a follow-up and see what effect this has on other targets. Or I can go straight to that step if people think that's better.

craig.topper edited edge metadata.Aug 24 2017, 3:02 PM

I think the X86 diffs looks reasonable.

aaboud edited edge metadata.Aug 24 2017, 3:13 PM

These changes looks good to me.

I think the X86 diffs looks reasonable.

Thanks! So both affected targets are approved. Can I take that as approval of the patch?

aaboud accepted this revision.Aug 24 2017, 3:25 PM
This revision is now accepted and ready to land.Aug 24 2017, 3:25 PM
This revision was automatically updated to reflect the committed changes.