This is an archive of the discontinued LLVM Phabricator instance.

[x86] narrow a shuffle that doesn't use or set any high elements
ClosedPublic

Authored by spatel on Jan 24 2019, 7:19 AM.

Details

Summary

This isn't the final fix for our reduction/horizontal codegen, but it takes care of a lot of the problems. After we narrow the shuffle, existing combines for insert/extract and binops kick in, and we end up with cheaper 128-bit ops.
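
To make the condition concrete, here is a minimal, self-contained sketch of the mask test implied by "doesn't use or set any high elements". This is not the code in the patch; the function name is hypothetical, though the -1 undef sentinel mirrors LLVM's shuffle-mask convention.

```cpp
#include <optional>
#include <vector>

// -1 marks a "don't care" output lane, matching LLVM's shuffle-mask convention.
constexpr int SM_SentinelUndef = -1;

// Given a shuffle mask over two NumElts-wide inputs, return a narrowed
// (NumElts/2)-wide mask if the shuffle neither defines any high-half output
// lane nor reads any high-half input lane; otherwise return std::nullopt.
std::optional<std::vector<int>>
narrowMaskToLowHalf(const std::vector<int> &Mask) {
  const int NumElts = static_cast<int>(Mask.size());
  const int HalfElts = NumElts / 2;

  // The high half of the result must be entirely undef ("doesn't set").
  for (int i = HalfElts; i < NumElts; ++i)
    if (Mask[i] != SM_SentinelUndef)
      return std::nullopt;

  // The low half may only read low-half lanes of either source: lanes
  // [0, HalfElts) of op0 or [NumElts, NumElts + HalfElts) of op1
  // ("doesn't use").
  std::vector<int> NarrowMask(HalfElts, SM_SentinelUndef);
  for (int i = 0; i < HalfElts; ++i) {
    const int M = Mask[i];
    if (M == SM_SentinelUndef)
      continue;
    const bool FromOp1 = M >= NumElts;
    const int LaneInOp = FromOp1 ? M - NumElts : M;
    if (LaneInOp >= HalfElts)
      return std::nullopt; // reads a high element
    NarrowMask[i] = FromOp1 ? HalfElts + LaneInOp : LaneInOp;
  }
  return NarrowMask;
}
```

Once a mask passes this test, the shuffle can be rebuilt on the extracted low halves of its operands and the result widened with undef, which is what lets the existing insert/extract and binop combines fire and produce the 128-bit ops.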

The avg and mul reduction tests show an existing shuffle lowering hole for AVX2/AVX512. I think in its most minimal form this is:
https://bugs.llvm.org/show_bug.cgi?id=40434
...but we might need multiple fixes to get it right. I could try to patch that first to avoid the regression if that seems like a bigger loss than the wins in the other tests.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Jan 24 2019, 7:19 AM
spatel planned changes to this revision. Jan 24 2019, 8:09 AM

Simon mentioned offline that we have "lowerVectorShuffleWithUndefHalf()", which I had managed not to find.
It seems like we could reuse at least part of that during combining rather than lowering and get this effect.

spatel updated this revision to Diff 183420. Jan 24 2019, 3:47 PM

Patch updated:
Reduced logic duplication by refactoring the existing lowering code that does this transform and more.

I tried to share even more code by just making the existing code take a mode param to know if we were combining or lowering, but that doesn't work. During lowering, we require a web of subtarget checks to avoid fighting with other shuffle lowerings (infinite looping). We don't want those during this combine, or we'll lose optimizations.

I made this rev of the patch have the same constraints as the earlier version, so the test diffs are identical. It may be possible to allow some of the other shuffle narrowing transforms that we do during lowering, but I think that should be a follow-up (we may need other transforms to avoid regressions).

Note: Tests with AVX512 and 512-bit vectors are not showing the same improvements as other targets/tests because we don't know how to collapse an (extract (extract V, 0), 0) into 1 extract node yet. I'm imagining that patch is similar to the insert-of-insert fold in D56604. Once we have that fold, we'll get several more binop narrowings like the ones we see here. But like this patch, that one also appears to expose a shuffle lowering hole, so it might need a preliminary patch to avoid a regression.
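
As an aside, the reason the nested extract is redundant is just that extracting subvector 0 takes a prefix, and a prefix of a prefix is the shorter prefix. A toy model (not LLVM code; the names are made up) to illustrate:

```cpp
#include <cassert>
#include <vector>

// Models an EXTRACT_SUBVECTOR with a constant starting index.
std::vector<int> extractSubvector(const std::vector<int> &V, size_t Idx,
                                  size_t NumElts) {
  return std::vector<int>(V.begin() + Idx, V.begin() + Idx + NumElts);
}

int main() {
  // A 16 x i32 (512-bit) vector.
  const std::vector<int> V512 = {0, 1, 2,  3,  4,  5,  6,  7,
                                 8, 9, 10, 11, 12, 13, 14, 15};

  // (extract (extract V, 0), 0): 512 -> 256 -> 128 bits.
  const auto V256 = extractSubvector(V512, 0, 8);
  const auto Nested = extractSubvector(V256, 0, 4);

  // The single-extract form the missing fold would produce: 512 -> 128 bits.
  const auto Direct = extractSubvector(V512, 0, 4);

  assert(Nested == Direct); // the two-step extract is redundant
  return 0;
}
```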

RKSimon accepted this revision. Jan 25 2019, 12:40 AM

LGTM - thanks!

Note: Tests with AVX512 and 512-bit vectors are not showing the same improvements as other targets/tests because we don't know how to collapse an (extract (extract V, 0), 0) into 1 extract node yet. I'm imagining that patch is similar to the insert-of-insert fold in D56604. Once we have that fold, we'll get several more binop narrowings like the ones we see here. But like this patch, that one also appears to expose a shuffle lowering hole, so it might need a preliminary patch to avoid a regression.

Please can you raise a bug about this?

This revision is now accepted and ready to land. Jan 25 2019, 12:40 AM

Please can you raise a bug about this?

Yes - I still need to minimize that one. If I find an obvious fix, I'll post a patch.

This revision was automatically updated to reflect the committed changes.