This is an archive of the discontinued LLVM Phabricator instance.

[x86] use a single shufps when it can save instructions
ClosedPublic

Authored by spatel on Dec 12 2016, 4:40 PM.

Details

Summary

This is a tiny patch with a big pile of test changes.

My motivating case looks like this:

  • vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,2]
  • vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
  • vpblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]

+ vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]

And this happens several times in the diffs. I think the instruction count and size reduction overcomes any potential domain-crossing penalty due to using an FP op in a sequence of int ops, but let me know if you see problem cases.

I think these are all improvements except one test in vector-shuffle-combining.ll where we miss an opportunity to use a shift to generate zero elements and one test in combine-sra.ll where I'm not sure what is happening yet.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 81159.Dec 12 2016, 4:40 PM
spatel retitled this revision from to [x86] use a single shufps when it can save instructions.
spatel updated this object.
spatel added a subscriber: llvm-commits.
RKSimon edited edge metadata.Dec 13 2016, 3:59 AM

Thanks for looking at this - the domain fixes on D27684 are an early step for adding shufps combining with target shuffles. It should all eventually help us efficiently switch domains if the reduced number of shuffles outweighs any penalty.

combine-sra.ll appears to run into target shuffle combining bailing out if an input has multiple uses - we need a better metric for this (when is it worth keeping both shuffle paths?) but haven't spent much time on it yet.

Thanks, Simon. I see 4 potential outcomes for this patch:

  1. Abandon; we should take a different approach starting at a different point in the lowering sequence.
  2. Add some limitations/refinements (eg: opt-for-size, check the input/output to see if they can avoid a domain switch?).
  3. Accept it, but add fixes for the known problem cases before this lands.
  4. Accept it as-is; the wins outweigh the losses. We can convert a shufps back to integer-equivalent shuffles in MachineCombiner as needed.

Any other options/thoughts?

There's a summary of Agner's findings on this here:
https://llvm.org/bugs/show_bug.cgi?id=27885#c3
...it seems the cross-domain bubble does not actually exist in a lot of common cases (or it's hard to measure).

zansari edited edge metadata.Dec 13 2016, 10:10 AM

Thanks for the link, Sanjay. Yes, I was just about to comment on this in the other review as i just got confirmation.. The info in that link is right. The h/w shufflers cross both domains after IVB, therefore, not suffering the bypass penalty when switching through such instructions (perm/shuf/unpack...).

I think this should go in, just forget the domain penalties, as it shouldn't be an issue with most cpus. When I looked at this in Agner Fog's guides, my conclusion was that it is probably only really an issue with Nehalem and Via Nanos. If a cpu has just one clock additional latency back and forth it's still worth it replacing 3 shuffle instructions with one from the wrong domain (albeit the latency chain will be the same then) - and if it manages to only replace 2 shuffle instructions from the right domain it might be worse or better in such a case. (If it is actually worse with Nehalem with its 2 clock penalty back and forth would of course depend if some instruction mix is latency bound or throughput bound.)
Also, plenty of the more odd cpus either place all shuffles in int domain anyway or even do something more odd (like the original core2 merom). I suppose ideally the shuffle lowering code would take into account such hw cost differences, but the truth is right now it doesn't really model any of this (e.g. on some cpus unpacks might not have the same cost as pshufd neither and so on).
So, I'm all for it (and suggested fixing it the same in https://llvm.org/bugs/show_bug.cgi?id=27885).

Thanks for the link, Sanjay. Yes, I was just about to comment on this in the other review as i just got confirmation.. The info in that link is right. The h/w shufflers cross both domains after IVB, therefore, not suffering the bypass penalty when switching through such instructions (perm/shuf/unpack...).

So does that mean both this and D27684 can be committed safely? For recent hardware it makes no difference (and D27684 possibly saves a few instruction bytes). For older hardware we still save cycles compared to performing the extra shuffles and we should fix up the domain switches where possible to help a little more.

mkuper edited edge metadata.Dec 13 2016, 11:06 AM
mkuper added a subscriber: wmi.

Just wanted to point out the other direction for this also exists.

@wmi ran into this:

#include <pmmintrin.h>

__m128 c, d, e;

void foo(__m128 a, __m128 b) {
  e = a;
  a = _mm_shuffle_ps(a, a, 0x0);
  c = _mm_mul_ps(e, b);
  d = _mm_add_ps(a, b);
}

We generate:

movaps  %xmm0, e(%rip)
movaps  %xmm0, %xmm2
shufps  $0, %xmm2, %xmm2        # xmm2 = xmm2[0,0,0,0]
mulps   %xmm1, %xmm0
movaps  %xmm0, c(%rip)
addps   %xmm1, %xmm2
movaps  %xmm2, d(%rip)
retq

Because we don't even try to match a pshufd in the float domain, even though we could do something like:

movaps  %xmm0, e(%rip)
pshufd  $0, %xmm0, %xmm2        # xmm2 = xmm0[0,0,0,0]
mulps   %xmm1, %xmm0
movaps  %xmm0, c(%rip)
addps   %xmm1, %xmm2
movaps  %xmm2, d(%rip)
retq

So does that mean both this and D27684 can be committed safely? For recent hardware it makes no difference (and D27684 possibly saves a few instruction bytes). For older hardware we still save cycles compared to performing the extra shuffles and we should fix up the domain switches where possible to help a little more.

I have no objections to this, or D27684.

Just wanted to point out the other direction for this also exists.

Because we don't even try to match a pshufd in the float domain, even though we could do something like:

That is quite true however it is imho much less severe. Because shufps can do everything pshufd can do, minus the destructive 2-op syntax. Hence this is only a) an issue with pre-avx targets (whereas missing shufps is definitely an issue with both avx-128 and avx-256 too, albeit this patch in question doesn't address the v8i32 cases yet, but it works all the same). And b) even then, it is only one additional mov, the most simple instruction, subject to never reaching execution units and being handled in renaming stage on some cpus even (with 0 latency, albeit of course there's still some cost with having additional instructions). And last d) it is unclear how often that additional mov is actually needed - I don't think it's even possible the shuffle lowering code knows about additional movs needed to preserve regs?
In any case, not arguing this wouldn't be worth looking at, but the benefits look much smaller to me, it only makes a difference with pre-avx target, and you really want to make sure you do this only if there's zero domain transition penalties for using pshufd in a float sequence.

Just wanted to point out the other direction for this also exists.

Because we don't even try to match a pshufd in the float domain, even though we could do something like:

That is quite true however it is imho much less severe. Because shufps can do everything pshufd can do, minus the destructive 2-op syntax. Hence this is only a) an issue with pre-avx targets (whereas missing shufps is definitely an issue with both avx-128 and avx-256 too, albeit this patch in question doesn't address the v8i32 cases yet, but it works all the same). And b) even then, it is only one additional mov, the most simple instruction, subject to never reaching execution units and being handled in renaming stage on some cpus even (with 0 latency, albeit of course there's still some cost with having additional instructions). And last d) it is unclear how often that additional mov is actually needed - I don't think it's even possible the shuffle lowering code knows about additional movs needed to preserve regs?
In any case, not arguing this wouldn't be worth looking at, but the benefits look much smaller to me, it only makes a difference with pre-avx target, and you really want to make sure you do this only if there's zero domain transition penalties for using pshufd in a float sequence.

I'm not sure I completely agree on the details (In particular, I'm not certain the older non-AVX arches where the mov is needed have zero-latency movs), but I mostly agree. Especially since it looks like the same platforms that need the mov are the ones that *do* have transition penalty. So using a shufps may be worth it only with something like -march=nehalem -mtune=ivb, where we generate SSE code, but expect to run it on a newer platform.

So this is definitely lower priority, just wanted to point out it exists.

I'm not sure I completely agree on the details (In particular, I'm not certain the older non-AVX arches where the mov is needed have zero-latency movs), but I mostly agree. Especially since it looks like the same platforms that need the mov are the ones that *do* have transition penalty. So using a shufps may be worth it only with something like -march=nehalem -mtune=ivb, where we generate SSE code, but expect to run it on a newer platform.

You are right, I shouldn't have said "pre-avx targets", there are no cpus which can't do avx but can do move elimination. But code compiled with just sse flags but running on newer cps will still benefit from move elimination.
But using pshufd instead of shufps would actually be beneficial on some cpus which do have transition penalties (I neglected that previously) - e.g. core2 wolfdale, because shufps (along with all other float, but not double shuffles) is in the int domain anyway, just like pshufd, so the penalties are all the same (all the bulldozers also fall into that category but they of course can do avx, and they also support mov elimination). But expressing these things correctly would really require very model specific shuffle lowering. Which right now just isn't there - the code simply assumes that the int/float domains apply to shuffles as well as arithmetic instructions (hence should avoid using shuffles from wrong domain), which is woefully inaccurate for quite some cpus.

So this is definitely lower priority, just wanted to point out it exists.

I'd like to propose the following:

1 - we get this patch and D27684 approved and committed, providing v4i32 lowering to shufps and avoiding some of the more unnecessary domain switches.
2 - get shufps lowering added to target shuffle combining, I added shufpd recently and it's just been the domain issues that I wanted to tidyup up before adding shufps as well
3 - add support for v8i32 (and v16i32?) lowering to shufps
4 - other missing domain switch patterns (scalar stores and vpermilps/vpshufd come to mind)
5 - add support for domain switching to target shuffle combine when the shuffle depth is 3 or more - this will allow pshufd use on pre-AVX targets and seems to introduce some good uses of insertps as well.

That seems within scope for 4.0 and doesn't involve anything too exotic. After 4.0 we should be in a better position to begin work on moving some of this work to MC combines to better make use of specific scheduler models

I'd like to propose the following:

1 - we get this patch and D27684 approved and committed, providing v4i32 lowering to shufps and avoiding some of the more unnecessary domain switches.
2 - get shufps lowering added to target shuffle combining, I added shufpd recently and it's just been the domain issues that I wanted to tidyup up before adding shufps as well
3 - add support for v8i32 (and v16i32?) lowering to shufps
4 - other missing domain switch patterns (scalar stores and vpermilps/vpshufd come to mind)
5 - add support for domain switching to target shuffle combine when the shuffle depth is 3 or more - this will allow pshufd use on pre-AVX targets and seems to introduce some good uses of insertps as well.

That seems within scope for 4.0 and doesn't involve anything too exotic. After 4.0 we should be in a better position to begin work on moving some of this work to MC combines to better make use of specific scheduler models

Sounds like a good plan to me. As for 3) it is pretty trivial (as seen by my patch) albeit I only did it for v8i32, not v16i32. The latter can always use another native perm shuffle I think though that might be more expensive (well it will have a memory op for sure for the shuffle mask, but beyond that I have no idea neither for KNL nor SKL-E - for that matter I have absolutely no idea if KNL would have domain transition penalties...)

I'd love to see https://llvm.org/bugs/show_bug.cgi?id=31151 addressed as well, either something along the lines of the patch there or differently, then I'm happy with all the shuffles we need :-).

I'd like to propose the following:

1 - we get this patch and D27684 approved and committed, providing v4i32 lowering to shufps and avoiding some of the more unnecessary domain switches.
2 - get shufps lowering added to target shuffle combining, I added shufpd recently and it's just been the domain issues that I wanted to tidyup up before adding shufps as well
3 - add support for v8i32 (and v16i32?) lowering to shufps
4 - other missing domain switch patterns (scalar stores and vpermilps/vpshufd come to mind)
5 - add support for domain switching to target shuffle combine when the shuffle depth is 3 or more - this will allow pshufd use on pre-AVX targets and seems to introduce some good uses of insertps as well.

That seems within scope for 4.0 and doesn't involve anything too exotic. After 4.0 we should be in a better position to begin work on moving some of this work to MC combines to better make use of specific scheduler models

FWIW, this sounds like a very good plan to me too.

Thanks, everyone. I think all comments so far are in favor of committing this patch. If someone will officially approve, I'll check it in. :)

OK to commit this and D27684?

andreadb accepted this revision.Dec 15 2016, 6:32 AM
andreadb added a reviewer: andreadb.

OK to commit this and D27684?

Both patches LGTM.

This revision is now accepted and ready to land.Dec 15 2016, 6:32 AM
This revision was automatically updated to reflect the committed changes.