Page MenuHomePhabricator

[SLP] respect target register width for GEP vectorization (PR43578)
ClosedPublic

Authored by spatel on Tue, Oct 8, 2:58 PM.

Details

Summary

We failed to account for the target register width (max vector factor) when vectorizing starting from GEPs. This causes vectorization to proceed to obviously illegal widths as in:
https://bugs.llvm.org/show_bug.cgi?id=43578

For x86, this also means that SLP can produce rogue AVX or AVX512 code even when the user specifies a narrower vector width.

The AArch64 test in ext-trunc.ll appears to be better using the narrower width. I'm not exactly sure what getelementptr.ll is trying to do, but it's testing with "-slp-threshold=-18", so I'm not worried about those diffs. The x86 test is an over-reduction from SPEC h264; this patch appears to restore the perf loss caused by SLP when using -march=haswell.

Diff Detail

Event Timeline

spatel created this revision.Tue, Oct 8, 2:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptTue, Oct 8, 2:58 PM

I can confirm that h264 benchmark is now atleast as good as plain -O3.

Generally, I think there are more bugs for -march=haswell. Only in rare cases the perf of binaries with -march=haswell is better than plain -O3.
I tried this patch with zstd but nothing improved.

Plain -O3
./zstd -b selesiafiles/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 182.0 MB/s , 923.2 MB/s

-O3 -march=haswell
/zstd -b selesiafiles/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 185.7 MB/s , 866.9 MB/s

-O3 -march=haswell -mprefer-vector-width=128
./zstd -b bench/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 188.5 MB/s , 806.8 MB/s

for example gcc-10's results for -march=haswell
./zstd -b bench/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 188.7 MB/s ,1032.8 MB/s

spatel added a comment.EditedWed, Oct 9, 4:18 AM

Generally, I think there are more bugs for -march=haswell. Only in rare cases the perf of binaries with -march=haswell is better than plain -O3.
I tried this patch with zstd but nothing improved.

Plain -O3
./zstd -b selesiafiles/* -f

3# 13 files         : 251919670 ->  97724903 (2.578), 182.0 MB/s , 923.2 MB/s

-O3 -march=haswell
/zstd -b selesiafiles/* -f

3# 13 files         : 251919670 ->  97724903 (2.578), 185.7 MB/s , 866.9 MB/s

-O3 -march=haswell -mprefer-vector-width=128
./zstd -b bench/* -f

3# 13 files         : 251919670 ->  97724903 (2.578), 188.5 MB/s , 806.8 MB/s

for example gcc-10's results for -march=haswell
./zstd -b bench/* -f

3# 13 files         : 251919670 ->  97724903 (2.578), 188.7 MB/s ,1032.8 MB/s

Thanks for testing! I suspect that this problem (ignoring the target-based register width) is more widespread than only the transform starting from GEP, but I want to make sure we have proper tests in place if we change the behavior in other places. Can you file another bug for "zstd"?

Yes, I will do.

ABataev accepted this revision.Wed, Oct 9, 7:58 AM

Looks good.

This revision is now accepted and ready to land.Wed, Oct 9, 7:58 AM
This revision was automatically updated to reflect the committed changes.