This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Look-ahead operand reordering heuristic.
ClosedPublic

Authored by vporpo on Apr 18 2019, 4:36 PM.

Diff Detail

Event Timeline

vporpo created this revision.Apr 18 2019, 4:36 PM
RKSimon added inline comments.Apr 27 2019, 8:35 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
153

"slp-max-look-ahead-depth" might be better?

753

These kinds of hard coded costs scare me, especially without any context/description.

vporpo updated this revision to Diff 197190.Apr 29 2019, 2:31 PM
vporpo marked 2 inline comments as done.
vporpo added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
753

I agree, they look quite scary but the values themselves are not very important.
What we we really care about is figuring out whether the values could be potentially vectorized or not. The relative differences between the scores is not very important either. It just shows that matching loads is usually preferable to matching instructions with the same opcode etc.

We could still use the same score of 1 for all of them and we would still get decent reordering. Another alternative would be to check TTI for each potential candidate vector, but it looks like an overkill.

Tests?

lib/Transforms/Vectorize/SLPVectorizer.cpp
765

Seems to me, the code is not formatted

766

auto *

767

auto *

773

auto *

774

auto *

778

auto *

779

auto *

824

auto *

825

auto *

vporpo updated this revision to Diff 197195.Apr 29 2019, 3:13 PM
vporpo marked 9 inline comments as done.

Better to commit the test itself at first, without this patch

RKSimon added inline comments.May 1 2019, 9:41 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
769

(style) remove outer brackets

782

What happens in the case where we have alt opcodes? Should we have a preference for all the same opcode vs with alt-opcode? Sometimes the alt-opcodes will fold away (shl + mul etc.) - other times it won't (shl + lshr).

vporpo updated this revision to Diff 197686.May 1 2019, 6:37 PM
vporpo marked 2 inline comments as done.

Addressed comments and updated lit test.

lib/Transforms/Vectorize/SLPVectorizer.cpp
782

Hmm good point. Well, currently 'getScoreAtLevelRec()' will simply walk past the alt instructions and will assign them ScoreSameOpcode. This is does not look very accurate because alt opcodes usually require shuffles and should have a lower score.
I introduced a new ScoreAltOpcodes = 1 so that alt opcodes are not given the same score as identical opcodes (please see the new functions in lit test).

As for the alt opcodes that fold away, maybe that should be fixed in the getSameOpcode() and in struct InstructionState ? If the get folded then maybe isAltShuffle() should return false?

vporpo updated this revision to Diff 198119.May 3 2019, 6:41 PM

Updated getBestOperand() to use getLookAheadScore() for Load and Constant, not just Opcode.

please can you rebase this?

dtemirbulatov added inline comments.May 22 2019, 12:26 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
932

one extra space here.

dtemirbulatov added a comment.EditedMay 23 2019, 11:57 AM

oh, I notice some regression with the change for AArch64/matmul.ll and AArch64/transpose.ll. Maybe there is a way to isolate it with heuristics?

Yes, I will take a look. Maybe it is worth using TTI for the scores after all.

rcorcs added a subscriber: rcorcs.May 28 2019, 4:57 AM
vporpo updated this revision to Diff 202491.May 31 2019, 2:54 PM

I investigated the two AArch64 failing tests. These tests feature the exact problem that we are trying to solve with this look-ahead heuristic. A commutative instruction had operands of the same opcode that the current heuristic has no way of reordering in an informed way. The current reordering was just lucky to pick the proper one, while the look-ahead heuristic was reordering the operands according to the score. However, the problem was that the score calculation was not considering external uses and was therefore favoring a sub-optimal operand ordering.

I updated the patch to factor in the cost of external uses, and both failures are now gone. I also updated the lit-test with a test that shows the problem with the external-uses.

vporpo updated this revision to Diff 203242.Jun 5 2019, 1:41 PM

Rebased.

hmm, I have another failure with this change on my setup, now it is PR39774.ll. Probably, it might be sort algorithm differents of similar, since it just swap of two inserts. I don't have this failure if PR39774.ll is intact.

vporpo updated this revision to Diff 204408.Jun 12 2019, 7:44 PM

Removed changes in PR39774.ll.

Hmm I am now getting the same failure as you @dtemirbulatov . I am not sure what was wrong before, but it seems that the change in PR39774.ll is no longer needed.

This revision is now accepted and ready to land.Jun 13 2019, 5:35 AM

@vporpo I've committed the lookahead.ll changes at rL363925 with current (trunk) codegen - please can you rebase?

RKSimon requested changes to this revision.Jun 20 2019, 6:15 AM
This revision now requires changes to proceed.Jun 20 2019, 6:15 AM
RKSimon accepted this revision.Jun 21 2019, 3:59 AM

LGTM - thanks!

This revision is now accepted and ready to land.Jun 21 2019, 3:59 AM

Thank you for the reviews. Please commit the patch.

This revision was automatically updated to reflect the committed changes.
vporpo reopened this revision.Jun 25 2019, 12:05 AM
vporpo added a subscriber: rnk.

This was reverted in r364111 since it was causing a failure in Chromium reported by @rnk.

This revision is now accepted and ready to land.Jun 25 2019, 12:05 AM
RKSimon requested changes to this revision.Jun 25 2019, 12:18 AM
RKSimon added a reviewer: rnk.

@rnk do you have a repro yet please?

This revision now requires changes to proceed.Jun 25 2019, 12:18 AM

Yes, @RKSimon . It was posted in llvm-commits. I did reproduce it and I will update the patch with the fix + lit test.

vporpo updated this revision to Diff 206383.Jun 25 2019, 1:08 AM
vporpo edited the summary of this revision. (Show Details)

Fixed crash in chromium reported by @rnk.

The crash was caused by two call instructions with different number of arguments (see lookahead_crash() function in lit test).

vporpo marked an inline comment as done.Jun 25 2019, 1:15 AM
vporpo added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
910

This was the cause of the crash. There was no std::min here, so ToIdx could be OpIdx1 + 1 even if I2 had fewer operands than that.

RKSimon accepted this revision.Jun 26 2019, 4:31 AM

LGTM

This revision is now accepted and ready to land.Jun 26 2019, 4:31 AM
This revision was automatically updated to reflect the committed changes.
phosek added a subscriber: phosek.Jun 28 2019, 9:02 AM

This change has caused a massive increase in build times when using LTO. On my workstations when building Clang toolchain itself with LTO, the build time has increased from 50 minutes to 5+ hours. On our continuous builders, we seen timeouts everywhere because they aren't able to finish within the allocated time (5 hours). Would it be possible to revert this change?

RKSimon reopened this revision.Jun 28 2019, 9:54 AM

This change has caused a massive increase in build times when using LTO. On my workstations when building Clang toolchain itself with LTO, the build time has increased from 50 minutes to 5+ hours. On our continuous builders, we seen timeouts everywhere because they aren't able to finish within the allocated time (5 hours). Would it be possible to revert this change?

Reversion makes sense to me - @phosek are you in position to profile it to see where the time is going in SLP please?

This revision is now accepted and ready to land.Jun 28 2019, 9:54 AM
RKSimon requested changes to this revision.Jun 28 2019, 9:54 AM
This revision now requires changes to proceed.Jun 28 2019, 9:54 AM

I think There are two possible causes for the compilation time increase:

  1. Line 901 : We can restrict the number of operands to a max of 2
  2. Line 820: We can restrict the visited users to a ma x of 2.

I can either create a quick patch, or I can revert it. Either is fine.

I think There are two possible causes for the compilation time increase:

  1. Line 901 : We can restrict the number of operands to a max of 2
  2. Line 820: We can restrict the visited users to a ma x of 2.

I can either create a quick patch, or I can revert it. Either is fine.

If you can create a quick patch, I'd be happy to try it locally to see if it addresses the problem.

Thanks, that would be really helpful @phosek . Let me create the quick patch.

The fix for the compilation-time issue reported by @phosek is in D63948 .

vporpo updated this revision to Diff 207417.Jul 1 2019, 3:06 PM

Fixed the compilation-time issue with the introduction of the user budget limit from D63948. Also added lit test for it.

RKSimon accepted this revision.Jul 2 2019, 1:11 AM

LGTM

This revision is now accepted and ready to land.Jul 2 2019, 1:11 AM
This revision was automatically updated to reflect the committed changes.
Carrot added a subscriber: Carrot.Oct 17 2019, 4:01 PM

Following is a simplified test case that shows the performance regression of eigen. The number of instructions in inner loop is increased from 85 to 93. Hope it can be helpful.

#include "third_party/eigen3/Eigen/Core"

template <typename ValueType>
inline void CheckNonnegative(ValueType value) {

if (!(value >= 0)) {
  fprintf(stderr, "Check failed.\n");
}

}

template <typename ValueType>
inline decltype(std::norm(ValueType(0))) SquaredMagnitude(ValueType value) {

typedef decltype(std::norm(ValueType(0))) ComponentType;
ComponentType r = std::real(value);
ComponentType i = std::imag(value);
return r * r + i * i;

}

template <typename VectorType2>
class DotEigen {
public:

DotEigen(): first_vector_(16), second_vector_(16) {
}

inline void Execute() {
  result_ = first_vector_.dot(second_vector_);
  CheckNonnegative(SquaredMagnitude(result_));
}

public:

Eigen::Matrix<float, 16, 1> first_vector_;
VectorType2 second_vector_;
std::complex<float> result_;

};

typedef DotEigen<Eigen::Matrix<std::complex<float>, 16, 1>> MOperation;
MOperation* operation;

void BM_Dot_RealComplex_EigenDotFixed(int num_iterations) {

for (int iter = 0; iter < num_iterations; ++iter) {
  operation->Execute();
}

}

vporpo reopened this revision.Oct 17 2019, 6:55 PM

Thanks @Carrot, I will investigate the issue.

This revision is now accepted and ready to land.Oct 17 2019, 6:55 PM

@Carrot What build settings are you using? I'm not seeing this here: https://godbolt.org/z/4jq0g8

@RKSimon, my command line options are:

-O3 -m64 -msse4.2 -mpclmul -maes -mprefer-vector-width=128 -fexperimental-new-pass-manager -fPIE

I compared r364964 and r364963.

vporpo updated this revision to Diff 225923.Oct 21 2019, 10:59 AM

This fixes the issue reported by @Carrot. It updates getShallowScore() to better cope with extracts from consecutive locations of the same vector (see last lit test). @Carrot, if you have the time please verify that this patch no longer causes a regression. Thanks!

@vporpo, it's verified that the eigen test doesn't regress now, actually the inner loop is the same as before.
Thanks

Please reland this patch again.

vporpo added a comment.Nov 4 2019, 1:12 PM

@RKSimon any comments about the latest changes?

@vporpo Please can you rebase this?

vporpo updated this revision to Diff 228629.Nov 10 2019, 8:34 PM

Rebased.

RKSimon accepted this revision.Nov 11 2019, 7:55 AM

LGTM - cheers

This revision was automatically updated to reflect the committed changes.