This is an archive of the discontinued LLVM Phabricator instance.

[Passes][VectorCombine] enable early run generally and try load folds
ClosedPublic

Authored by spatel on Nov 19 2022, 7:47 AM.

Details

Summary

An early run of VectorCombine was added with D102496 specifically to deal with unnecessary vector ops produced with the C matrix extension. This patch is proposing to try those folds in general and add a pair of load folds to the menu.

The load transform will partly solve (see PhaseOrdering diffs) a longstanding vectorization perf bug by removing redundant loads via GVN:
issue #17113

The main reason for not enabling the extra pass generally in the initial patch was compile-time cost. The cost of VectorCombine was significantly (surprisingly) improved with:
87debdadaf18
https://llvm-compile-time-tracker.com/compare.php?from=ffe05b8f57d97bc4340f791cb386c8d00e0739f2&to=87debdadaf18f8a5c7e5d563889e10731dc3554d&stat=instructions:u

...so the extra run is going to cost very little now - the total cost of the 2 runs should be less than the 1 run before that micro-optimization:
https://llvm-compile-time-tracker.com/compare.php?from=5e8c2026d10e8e2c93c038c776853bed0e7c8fc1&to=2c4b68eab5ae969811f422714e0eba44c5f7eefb&stat=instructions:u

It may be possible to reduce the cost slightly more with a few more earlier-exits like that, but it's probably in the noise based on timing experiments.

Diff Detail

Event Timeline

spatel created this revision.Nov 19 2022, 7:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 19 2022, 7:47 AM
spatel requested review of this revision.Nov 19 2022, 7:47 AM
lebedev.ri added inline comments.Nov 19 2022, 7:54 AM
llvm/lib/Passes/PassBuilderPipelines.cpp
618–621

What does "reduced" here mean? "obscured"?

spatel added inline comments.Nov 19 2022, 8:15 AM
llvm/lib/Passes/PassBuilderPipelines.cpp
618–621

No, that was supposed to mean "enable more folds".
In the motivating example from #17113, we have:

%2 = load float, ptr %0, align 16
%3 = insertelement <4 x float> undef, float %2, i64 0
%4 = getelementptr inbounds [4 x float], ptr %0, i64 0, i64 1
%5 = load float, ptr %4, align 4

VectorCombine can widen the first load (with legality/profitability constraints):

%2 = load <4 x float>, ptr %0, align 16
%3 = shufflevector <4 x float> %2, <4 x float> poison, <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
%4 = getelementptr inbounds [4 x float], ptr %0, i64 0, i64 1
%5 = load float, ptr %4, align 4

And GVN then replaces the redundant 2nd load:

%2 = load <4 x float>, ptr %0, align 16
%3 = shufflevector <4 x float> %2, <4 x float> poison, <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
%4 = getelementptr inbounds [4 x float], ptr %0, i64 0, i64 1
%5 = bitcast <4 x float> %2 to i128
%6 = lshr i128 %5, 32
%7 = trunc i128 %6 to i32
%8 = bitcast i32 %7 to float

And then InstCombine manages to remove all of those extra instructions.

lebedev.ri added inline comments.Nov 19 2022, 8:23 AM
llvm/lib/Passes/PassBuilderPipelines.cpp
618–621

I would suggest something closer to

// Try vectorization/scalarization transforms
// that are both improvements themselves,
// and can allow further GVN and InstCombine folds.

then, maybe, optionally.

spatel updated this revision to Diff 476681.Nov 19 2022, 8:30 AM
spatel marked an inline comment as done.

Patch updated:
Improved code comment in PassBuilder, so it's more clear why VectorCombine is run (and anchor the position ahead of GVN/InstCombine).

Sounds reasonable to me.

LGTM - any other comments?

lebedev.ri accepted this revision.Nov 21 2022, 7:36 AM
This revision is now accepted and ready to land.Nov 21 2022, 7:36 AM
This revision was landed with ongoing or failed builds.Nov 21 2022, 10:58 AM
This revision was automatically updated to reflect the committed changes.