This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] Add option to only run scalarization transforms.
ClosedPublic

Authored by fhahn on Oct 14 2021, 6:09 AM.

Details

Summary

This patch adds a pass option to only run transforms that scalarize
vector operations and do not create new vector instructions.

When running VectorCombine early in the pipeline introducing new vector
operations can have negative effects, like blocking loop or SLP
vectorization. To avoid regressions, restrict the early VectorCombine
run (when using -enable-matrix) to only perform scalarization and not
introduce new vector operations.

This is done as option to the pass directly, which is then set when
adding the pass to the pipeline. This is done for the new pass manager
only.

Diff Detail

Event Timeline

fhahn created this revision.Oct 14 2021, 6:09 AM
fhahn requested review of this revision.Oct 14 2021, 6:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 14 2021, 6:09 AM

What if we just did better in VectorCombine?

We'd need to chain a bunch of combines together on paper or just implement this:
https://llvm.org/PR52178
...to know if it gets us to the minimal set of shuffles in IR and/or codegen for the 'hadd' example, but it might be enough?

What if we just did better in VectorCombine?

We'd need to chain a bunch of combines together on paper or just implement this:
https://llvm.org/PR52178
...to know if it gets us to the minimal set of shuffles in IR and/or codegen for the 'hadd' example, but it might be enough?

Yes, I think that would get us a bit further, especially on the ARM64 test case. For X86 the shuffles/add chains are a bit more difficult to tackle: it converts the 4 scalar adds to 4 vector adds which each process a single lane. I'm not sure if we will be able to cover this in VectorCombine.

What if we just did better in VectorCombine?

We'd need to chain a bunch of combines together on paper or just implement this:
https://llvm.org/PR52178
...to know if it gets us to the minimal set of shuffles in IR and/or codegen for the 'hadd' example, but it might be enough?

Yes, I think that would get us a bit further, especially on the ARM64 test case. For X86 the shuffles/add chains are a bit more difficult to tackle: it converts the 4 scalar adds to 4 vector adds which each process a single lane. I'm not sure if we will be able to cover this in VectorCombine.

I just drafted a patch for PR51278, and it got the hadd example down to:

What if we just did better in VectorCombine?

We'd need to chain a bunch of combines together on paper or just implement this:
https://llvm.org/PR52178
...to know if it gets us to the minimal set of shuffles in IR and/or codegen for the 'hadd' example, but it might be enough?

Yes, I think that would get us a bit further, especially on the ARM64 test case. For X86 the shuffles/add chains are a bit more difficult to tackle: it converts the 4 scalar adds to 4 vector adds which each process a single lane. I'm not sure if we will be able to cover this in VectorCombine.

I drafted a patch for PR52178 , and I see that we get the first pair folded, but we're stuck on the next pair:

define <4 x float> @reverse_hadd_v4f32(<4 x float> %a, <4 x float> %b) local_unnamed_addr #0 {
  %shift = shufflevector <4 x float> %a, <4 x float> poison, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %shift1 = shufflevector <4 x float> %a, <4 x float> poison, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
  %1 = shufflevector <4 x float> %shift, <4 x float> %shift1, <4 x i32> <i32 undef, i32 undef, i32 6, i32 0>
  %2 = shufflevector <4 x float> %a, <4 x float> poison, <4 x i32> <i32 undef, i32 undef, i32 2, i32 0>
  %3 = fadd <4 x float> %1, %2
  %shift2 = shufflevector <4 x float> %b, <4 x float> poison, <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>
  %4 = fadd <4 x float> %shift2, %b
  %5 = shufflevector <4 x float> %3, <4 x float> %4, <4 x i32> <i32 undef, i32 5, i32 2, i32 3>
  %shift3 = shufflevector <4 x float> %b, <4 x float> poison, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
  %6 = fadd <4 x float> %shift3, %b
  %7 = shufflevector <4 x float> %5, <4 x float> %6, <4 x i32> <i32 6, i32 1, i32 2, i32 3>
  ret <4 x float> %7
}
spatel accepted this revision.Oct 15 2021, 10:56 AM

I posted the proposed vector-combine change here:
D111901

But this patch LGTM anyway. We don't want vector-combine to make things harder for LV or SLP.
Those passes should have a better view of total costs, so it seems right to have vector-combine only trying its minor changes after the others have run.

This revision is now accepted and ready to land.Oct 15 2021, 10:56 AM

I drafted a patch for PR52178 , and I see that we get the first pair folded, but we're stuck on the next pair:

Thanks for putting up that patch! I'll take a closer look in the next few days. Is it possible that we need to add additional instructions to the worklist to catch the earlier cases?

I posted the proposed vector-combine change here:
D111901

But this patch LGTM anyway. We don't want vector-combine to make things harder for LV or SLP.
Those passes should have a better view of total costs, so it seems right to have vector-combine only trying its minor changes after the others have run.

Sounds good, happy to iterate on improvements to vector-combine separately :)