Page MenuHomePhabricator

[X86] Rewrite the max and min reduction intrinsics to make better use of other functions and to reduce width to 256 and 128 bits were possible.
ClosedPublic

Authored by craig.topper on May 25 2018, 5:24 PM.

Details

Summary

We only need to use 512 bit vectors all the way through v8i64 reductions since those max instructions are new to avx512f and only available in 512 bits until SKX.

For v16i32 and floating point we have legacy 128/256 bit instructions we can use.

I've tried to use other intrinsics to reduce the verbosity of the code and avoid having to mention all the shuffles. I've also removed all the -1 shuffle indices so the output sequence is fully specified and not left to backend optimization.

Diff Detail

Repository
rL LLVM

Event Timeline

craig.topper created this revision.May 25 2018, 5:24 PM
This revision was not accepted when it landed; it landed in state Needs Review.May 26 2018, 11:59 AM
This revision was automatically updated to reflect the committed changes.
craig.topper reopened this revision.May 26 2018, 12:04 PM
RKSimon added inline comments.May 26 2018, 2:06 PM
cfe/trunk/lib/Headers/avx512fintrin.h
9589

Would it be dumb to allow VLX capable CPUs to use 128/256 variants of the VPMAXUQ etc ? Or is it better to focus on improving SimplifyDemandedElts to handle this (and many other reduction cases that all tend to keep to the original vector width)?

craig.topper added inline comments.May 26 2018, 6:03 PM
cfe/trunk/lib/Headers/avx512fintrin.h
9589

I'm not sure how to do that from clang. Should we be using a reduction intrinsic and do custom lowering in the backend?

Is there any llvm side fast isel tests for these?

cfe/trunk/lib/Headers/avx512fintrin.h
9589

I believe we're still missing the clang reduction intrinsics.

We also have the problem with the 'fast math only' accumulator argument on the 'experimental' reductions that we can't actually change because arm64 already relies on them in production code.......

But we could maybe propose clang integer reduction intrinsics?

TBH I don't think any of that needs to be done for this patch, its just an indication that x86 reductions aren't great yet.

Rebase. It doesn't look like we have fast-isel tests so I'll add those and post a separate review.

RKSimon accepted this revision.Jun 19 2018, 6:01 AM

OK, if the llvm side tests are incoming I'm happy with this patch

This revision is now accepted and ready to land.Jun 19 2018, 6:01 AM
This revision was automatically updated to reflect the committed changes.

Fast-isel tests were added for previous codegen in r335068 and updated for new codegen in r335071.

One addtiional observation I didn't catch before. The epi32 and epu32 min/max intrinsics were doing a 64-bit element extract as the final step previously because they just did a [0] on _m128i which is really v2di. It didn't functionally matter because it would be truncated after the extract. The new code uses [0] on a v4si type so we get a 32-bit extract.