This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Excluded external uses from the reordering estimation.
ClosedPublic

Authored by ABataev on Jan 5 2022, 12:16 PM.

Details

Summary

Compiler adds the estimation for the external uses during operands
reordering analysis, which makes it tend to prefer duplicates in the
lanes rather than diamond/shuffled match in the graph. It changes the sizes of
the vector operands and may prevent some vectorization. We don't need
this kind of estimation for the analysis phase, because we just need to
choose the most compatible instruction and it does not matter if it has
external user or used in the non-matching lane. Instead, we count the number
of unique instruction in the lane and see if the reassociation changes
the number of unique scalars to be power of 2 or not. If we have power
of 2 unique scalars in the lane, it is considered more profitable rather
than having non-power-of-2 number of unique scalars.

Metric: SLP.NumVectorInstructions

Program results results0 diff

        test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test   70.00   86.00   22.9%
                    test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 4527.00 4630.00    2.3%
           test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test  346.00  353.00    2.0%
          test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test  346.00  353.00    2.0%
     test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9100.00 9275.00    1.9%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test  235.00  239.00    1.7%
       test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test  235.00  239.00    1.7%
   test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 8737.00 8859.00    1.4%
               test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 1051.00 1064.00    1.2%
        test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1628.00 1646.00    1.1%
       test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1628.00 1646.00    1.1%
  test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3565.00 3577.00    0.3%
   test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3565.00 3577.00    0.3%
     test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4240.00 4250.00    0.2%
            test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1996.00 1998.00    0.1%
               test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1671.00 1672.00    0.1%

test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 783.00 782.00 -0.1%

              test-suite :: SingleSource/Benchmarks/Misc/oourafft.test   69.00   68.00   -1.4%
test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test  207.00  192.00   -7.2%
 test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test  207.00  192.00   -7.2%

test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test 89.00 80.00 -10.1%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 89.00 80.00 -10.1%

test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  260.00  215.00  -17.3%

test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test 256.00 211.00 -17.6%

MultiSource/Benchmarks/Prolangs-C/TimberWolfMC - pretty the same.
SingleSource/Benchmarks/Misc/oourafft.test - 2 <2 x > loads replaced by
one <4 x> load.
External/SPEC/CINT2017speed/641.leela_s - function gets vectorized and
not inlined anymore.
External/SPEC/CINT2017rate/541.leela_r - same
xternal/SPEC/CINT2017rate/531.deepsjeng_r - changed the order in
multi-block tree, the result is pretty the same.
External/SPEC/CINT2017speed/631.deepsjeng_s - same.
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a - the result is the same
as before.
MultiSource/Benchmarks/MiBench/consumer-jpeg - same.

Diff Detail

Unit TestsFailed

Event Timeline

ABataev created this revision.Jan 5 2022, 12:16 PM
ABataev requested review of this revision.Jan 5 2022, 12:16 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 5 2022, 12:16 PM
vporpo added a comment.Jan 5 2022, 6:32 PM

I don't fully understand why you are completely removing the cost of external uses. This is modeling the additional extract instructions that will be needed if we decide to proceed with that specific operand order. This is useful when you have to break ties, for example if we to choose between instructions of the same opcode, one with external uses and the other without. In that case we would prefer to vectorize the instructions without uses. Perhaps the current implementation is causing issues because the external cost is always subtracted (line 1233) and is not just used as a tie breaker only when the costs are the same.
If I understand correctly the splat cost is orthogonal to the the external uses? Can't we have both?

Also could you split the MainAlt changes into a separate patch?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1143–1144

Please explain in the comments why it is more profitable.

1144

Could you also add to the comment what OpIdx and Idx are.

1146

This needs a better name because it gets a bit confusing later on when you introduce OpV. Perhaps call this OpIdxV and the other one IdxV ?

1150

Could you rename I to Ln ?

1159–1161

Perhaps replace these lines with:
int UniquesCountWithV = Uniques.contains(V) ? UniquesCount : UniquesCount + 1;

1163–1165

same

1270

Could you add to the comment about the what each unsigned is in the pair? I think it is operand index and lane.

1327

Why skip if score is 0 ?

1328

Is it Score or Cost? Score usually suggests the higher the better (and Cost the lower the better).

3047

Why is it better to do this bottom-up ?
Could this be a separate patch?

I don't fully understand why you are completely removing the cost of external uses. This is modeling the additional extract instructions that will be needed if we decide to proceed with that specific operand order. This is useful when you have to break ties, for example if we to choose between instructions of the same opcode, one with external uses and the other without. In that case we would prefer to vectorize the instructions without uses. Perhaps the current implementation is causing issues because the external cost is always subtracted (line 1233) and is not just used as a tie breaker only when the costs are the same.
If I understand correctly the splat cost is orthogonal to the the external uses? Can't we have both?

Also could you split the MainAlt changes into a separate patch?

Sure, I will split the patch (already did it for the first part).

There are several reasons.

  1. External uses. It will affect the cost in any case (does not matter in which lane the scalar is used) so we can just ignore it.
  2. In-tree use or use in graph. Currently it is too pessimistic. We subtract 1 for each such use and the total cost of the lane becomes NumLanes less, though in the worst case it should be just 1-2 less because we end up just with a single shuffle, not with the gather of extracts. Better just to ignore all these uses and just count number of unique scalar to see if we can vectorize the lane after removing reused scalars.
ABataev added inline comments.Jan 6 2022, 7:18 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1143–1144

Do you suggest to ad thу explanation here?

1146

Ok

1150

Ok

1159–1161

Ok

1270

Sure, will do

1327

Score 0 means failed match, it won't be vectorized for sure so no need to check for the reused scalars.

1328

It is cost currently, will rework it to score.

3047

Yes, see D116740

I agree with your second point, it is too pessimistic and should be fixed.
But I don't fully follow your first point. What do you mean that it will affect the cost in all lanes?

Here is an example where the external uses cost can help:

%1 = load A[0]
%2 = load A[1] // %2 has external use
%3 = load B[0]
%4 = load A[1]
%Ln1 = add %1, %3
%Ln2 = add %2, %4
...
... = %2 // External use of %2

While doing the operand reordering we can choose to vectorize either {%1, %2} or {%1, %4}.
Both have the same opcodes etc. so the rest of the cost calculation will give them the exact same score.
But wouldn't we prefer to vectorize {%1, %4} rather than {%1, %2} to avoid the extract instruction?
How would we do this without taking into account the cost of the external uses?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1143–1144

Yes, please add the explanation in the comments.

I agree with your second point, it is too pessimistic and should be fixed.
But I don't fully follow your first point. What do you mean that it will affect the cost in all lanes?

Here is an example where the external uses cost can help:

%1 = load A[0]
%2 = load A[1] // %2 has external use
%3 = load B[0]
%4 = load A[1]
%Ln1 = add %1, %3
%Ln2 = add %2, %4
...
... = %2 // External use of %2

While doing the operand reordering we can choose to vectorize either {%1, %2} or {%1, %4}.
Both have the same opcodes etc. so the rest of the cost calculation will give them the exact same score.
But wouldn't we prefer to vectorize {%1, %4} rather than {%1, %2} to avoid the extract instruction?
How would we do this without taking into account the cost of the external uses?

These 2 loads from A[1] will be combined by instcombine before SLP.

vporpo added a comment.Jan 6 2022, 1:02 PM

This is obviously a contrived example, but it highlights the issue. It still holds if you replace the loads with other instructions of the same opcode that lead to a similar situation requiring tie-breaking using the cost of external uses.

This is obviously a contrived example, but it highlights the issue. It still holds if you replace the loads with other instructions of the same opcode that lead to a similar situation requiring tie-breaking using the cost of external uses.

If the instructions are the same, they will be combined. If the instructions are different, still better to choose the instruction with the higher score, even if it is externally used (we may have a deeper graph, which is still better for the vectorization). There is only one relevant case - if the instructions are very-very similar (their scores are equal). If the instruction is externally used, it still might be vectorized as part of another tree. The only preference here - instruction with a single use (or all vectorized users) and instruction with many uses, I believe. We can check for something like this and consider instruction with a single use (or all vectorized users) as a better choice rather than the instruction with not all vectorized users.

vporpo added a comment.Jan 6 2022, 4:04 PM

Yes, I agree, we need a better way of dealing with uses. We should consider the score first and only check for uses if the scores are exactly the same.
Checking for all-vectorized uses versus not-all-vectorized also makes sense.

Could you please add a TODO (perhaps near line 1336?) with a brief description on how we should be dealing with uses?

Yes, I agree, we need a better way of dealing with uses. We should consider the score first and only check for uses if the scores are exactly the same.
Checking for all-vectorized uses versus not-all-vectorized also makes sense.

Could you please add a TODO (perhaps near line 1336?) with a brief description on how we should be dealing with uses?

I was going to add the analysis of the usage to the patch, have a quick prototype but it has some regressions, will work on the improvements tomorrow.

ABataev updated this revision to Diff 398233.Jan 7 2022, 2:02 PM

Added score for all vectorized users.

vporpo added inline comments.Jan 7 2022, 3:13 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1180

Could you add a comment explaining why we return ScoreAllUserVectorized for this type of instructions?

1326–1336

Since the score calculation is a bit more complicated now, I think it makes sense to move all the score calculation logic into a separate function like getScore() which will help hide all the calls to the separate score functions getLookAheadScore(), getSplatScore(), getExternalScore() and the score scaling. What do you think?

1340

nit: use a static constexpr for the scaling factor

RKSimon retitled this revision from [SLP]Excluded external uses from the reprdering estimation. to [SLP]Excluded external uses from the reordering estimation..Jan 12 2022, 2:58 AM
ABataev marked an inline comment as done.Feb 1 2022, 10:17 AM
ABataev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1326–1336

I would keep all these smaller functions but will create getScore() for a final score.

1340

Yep, will fix it.

ABataev updated this revision to Diff 405365.Feb 2 2022, 10:55 AM

Rebase + address comments

vporpo accepted this revision.Feb 2 2022, 11:22 AM
This revision is now accepted and ready to land.Feb 2 2022, 11:22 AM
This revision was landed with ongoing or failed builds.Feb 3 2022, 6:55 AM
This revision was automatically updated to reflect the committed changes.