This is an archive of the discontinued LLVM Phabricator instance.

[SLP] respect target register width for GEP vectorization (PR43578)
Closed, Public

Authored by spatel on Oct 8 2019, 2:58 PM.

Details

Summary

We failed to account for the target register width (max vector factor) when vectorizing starting from GEPs. This causes vectorization to proceed to obviously illegal widths as in:
https://bugs.llvm.org/show_bug.cgi?id=43578

For x86, this also means that SLP can produce rogue AVX or AVX512 code even when the user specifies a narrower vector width.

The AArch64 test in ext-trunc.ll appears to be better off using the narrower width. I'm not exactly sure what getelementptr.ll is trying to do, but it's testing with "-slp-threshold=-18", so I'm not worried about those diffs. The x86 test is an over-reduction from SPEC h264; this patch appears to undo the perf loss caused by SLP when using -march=haswell.
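
As a rough standalone sketch of the clamping idea (MaxVecRegSize and getVectorElementSize are names that come up later in this review; the code below is a simplified model of the arithmetic, not the actual SLPVectorizer change):

```cpp
#include <algorithm>
#include <cstdio>

// Simplified model: how many lanes are worth forming for a bundle of scalars,
// given the target's widest vector register and the scalar element size.
// In the real pass these values would come from the target info and
// getVectorElementSize(); here they are plain integers for illustration.
unsigned maxVectorFactor(unsigned MaxVecRegSizeInBits, unsigned ElemSizeInBits) {
  // Never report fewer than 2 lanes; a 1-wide "vector" is just a scalar.
  return std::max(2u, MaxVecRegSizeInBits / ElemSizeInBits);
}

int main() {
  // A target limited to 128-bit vectors (SSE, or AVX-512 hardware built with
  // -mprefer-vector-width=128) should only get 2 lanes of 64-bit GEP indices,
  // not the wider bundles SLP could form before this fix.
  std::printf("128-bit regs, 64-bit elements: VF = %u\n", maxVectorFactor(128, 64));
  std::printf("256-bit regs, 64-bit elements: VF = %u\n", maxVectorFactor(256, 64));
  std::printf("128-bit regs, 32-bit elements: VF = %u\n", maxVectorFactor(128, 32));
  return 0;
}
```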

Diff Detail

Event Timeline

spatel created this revision. Oct 8 2019, 2:58 PM
Herald added a project: Restricted Project. · View Herald Transcript · Oct 8 2019, 2:58 PM

I can confirm that the h264 benchmark is now at least as good as plain -O3.

Generally, I think there are more bugs for -march=haswell. Only in rare cases is the performance of binaries built with -march=haswell better than plain -O3.
I tried this patch with zstd, but nothing improved.

Plain -O3
./zstd -b selesiafiles/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 182.0 MB/s , 923.2 MB/s

-O3 -march=haswell
./zstd -b selesiafiles/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 185.7 MB/s , 866.9 MB/s

-O3 -march=haswell -mprefer-vector-width=128
./zstd -b bench/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 188.5 MB/s , 806.8 MB/s

For example, gcc-10's results for -march=haswell:
./zstd -b bench/* -f
3# 13 files : 251919670 -> 97724903 (2.578), 188.7 MB/s , 1032.8 MB/s

spatel added a comment. Edited Oct 9 2019, 4:18 AM

> Generally, I think there are more bugs for -march=haswell. Only in rare cases is the performance of binaries built with -march=haswell better than plain -O3.
> I tried this patch with zstd, but nothing improved.
>
> Plain -O3
> ./zstd -b selesiafiles/* -f
> 3# 13 files : 251919670 -> 97724903 (2.578), 182.0 MB/s , 923.2 MB/s
>
> -O3 -march=haswell
> ./zstd -b selesiafiles/* -f
> 3# 13 files : 251919670 -> 97724903 (2.578), 185.7 MB/s , 866.9 MB/s
>
> -O3 -march=haswell -mprefer-vector-width=128
> ./zstd -b bench/* -f
> 3# 13 files : 251919670 -> 97724903 (2.578), 188.5 MB/s , 806.8 MB/s
>
> For example, gcc-10's results for -march=haswell:
> ./zstd -b bench/* -f
> 3# 13 files : 251919670 -> 97724903 (2.578), 188.7 MB/s , 1032.8 MB/s

Thanks for testing! I suspect that this problem (ignoring the target-based register width) is more widespread than only the transform starting from GEP, but I want to make sure we have proper tests in place if we change the behavior in other places. Can you file another bug for "zstd"?

Yes, I will.

ABataev accepted this revision. Oct 9 2019, 7:58 AM

Looks good.

This revision is now accepted and ready to land. Oct 9 2019, 7:58 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a subscriber: fhahn. Jun 22 2020, 9:40 AM

I tracked down a 7% regression in h264 on AArch64 -O3 LTO & PGO to this commit. The regressions in the AArch64 tests seem a bit suspicious, and from the description the changes seem unintentional (4 x i32 vectors should be perfectly legal on AArch64). I'll take a look to see what's going on.

> I tracked down a 7% regression in h264 on AArch64 -O3 LTO & PGO to this commit. The regressions in the AArch64 tests seem a bit suspicious, and from the description the changes seem unintentional (4 x i32 vectors should be perfectly legal on AArch64). I'll take a look to see what's going on.

Oh I now see what's going on. The actual compute is done on i64 x 4.

> I tracked down a 7% regression in h264 on AArch64 -O3 LTO & PGO to this commit. The regressions in the AArch64 tests seem a bit suspicious, and from the description the changes seem unintentional (4 x i32 vectors should be perfectly legal on AArch64). I'll take a look to see what's going on.
>
> Oh I now see what's going on. The actual compute is done on i64 x 4.

But not in the getelementptr_4x32() test, right? Maybe we need to refine getVectorElementSize() in some way.

fhahn added a comment. Jun 23 2020, 3:47 PM

>> I tracked down a 7% regression in h264 on AArch64 -O3 LTO & PGO to this commit. The regressions in the AArch64 tests seem a bit suspicious, and from the description the changes seem unintentional (4 x i32 vectors should be perfectly legal on AArch64). I'll take a look to see what's going on.
>>
>> Oh I now see what's going on. The actual compute is done on i64 x 4.
>
> But not in the getelementptr_4x32() test, right? Maybe we need to refine getVectorElementSize() in some way.

Initially I was looking at @test2, but the more interesting one is indeed getelementptr_4x32. I think the issue might be that we use the width of the pointer to limit the list size, rather than the width of the index computations. IIUC we only vectorize the index computations, so I think it would make sense to limit the width based on the GEP index width rather than the GEP itself. I put up D82418, which restores getelementptr_4x32 and also catches the important h264 pattern on AArch64, while not regressing the test case on X86.
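
To make the distinction concrete, here is a small model of the two choices (illustrative only; the real heuristic lives in getVectorElementSize() and the follow-up D82418, and the values below are assumptions for AArch64): sizing the bundle by the 64-bit pointer type allows only 2 lanes in a 128-bit NEON register, while sizing it by the 32-bit index type allows the 4 lanes that getelementptr_4x32 and the h264 pattern need.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative model of the width limit for a bundle of GEPs, depending on
// whether the element size is taken from the pointer type or from the index
// type that is actually vectorized.
unsigned lanesFor(unsigned VecRegBits, unsigned ElemBits) {
  return std::max(2u, VecRegBits / ElemBits);
}

int main() {
  const unsigned NeonBits = 128; // AArch64 NEON register width
  const unsigned PtrBits = 64;   // the GEP itself produces a 64-bit pointer
  const unsigned IdxBits = 32;   // but only the i32 index math gets vectorized

  // Limiting by the pointer width caps the bundle at 2 lanes.
  std::printf("limit by pointer width: %u lanes\n", lanesFor(NeonBits, PtrBits));
  // Limiting by the index width allows the 4-wide index computation, while the
  // register width still bounds it.
  std::printf("limit by index width:   %u lanes\n", lanesFor(NeonBits, IdxBits));
  return 0;
}
```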

fhahn added a comment. Jun 23 2020, 3:49 PM

(There are also a bunch of slightly odd cost-model decisions I ran into; e.g., all extracts on AArch64 for indices > 0 have a cost of 3 for some reason. I'll prepare patches for some of the issues I encountered when I have a bit more time.)