This patch introduces a new heuristic for guiding operand reordering. The new "look-ahead" heuristic can look beyond the immediate predecessors. This helps break ties when the immediate predecessors have identical opcodes (see lit test for an example).
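For illustration, here is a hypothetical example (not taken from the patch or its lit test) of the kind of tie the look-ahead heuristic is meant to break: both operands of each add are loads, so an opcode-only heuristic cannot choose an order, while looking one level past the loads, at their addresses, shows which operand order yields consecutive loads.

```cpp
// Hypothetical C++ example; identifiers are illustrative only.
void tie(double *A, double *B, double *C) {
  A[0] = B[0] + C[0];
  // Written with the operands swapped: an opcode-only heuristic sees two loads
  // either way, but looking at the load addresses shows that reordering this to
  // B[1] + C[1] lines up consecutive loads from B and from C across both lanes.
  A[1] = C[1] + B[1];
}
```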
Details

- Reviewers: RKSimon, ABataev, dtemirbulatov, Ayal, hfinkel, rnk
- Commits:
  - rG6a18a9548761: [SLP] Look-ahead operand reordering heuristic.
  - rGcf47ff5ffb1a: [SLP] Recommit: Look-ahead operand reordering heuristic.
  - rL364964: [SLP] Recommit: Look-ahead operand reordering heuristic.
  - rG574cb0eb3a7a: [SLP] Look-ahead operand reordering heuristic.
  - rL364478: [SLP] Look-ahead operand reordering heuristic.
  - rG5698921be2d5: [SLP] Look-ahead operand reordering heuristic.
  - rL364084: [SLP] Look-ahead operand reordering heuristic.

Diff Detail

- Repository: rG LLVM Github Monorepo

Event Timeline
| lib/Transforms/Vectorize/SLPVectorizer.cpp | | |
|---|---|---|
| 744 ↗ | (On Diff #195824) | I agree, they look quite scary, but the values themselves are not very important. We could still use the same score of 1 for all of them and we would still get decent reordering. Another alternative would be to check TTI for each potential candidate vector, but that looks like overkill. |
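The comment above is about the relative score values. A minimal sketch of the idea, with illustrative names and values (these are assumptions for exposition, not necessarily the patch's actual constants): only the ordering of the scores really matters, which is why even a flat score of 1 would still give decent reordering.

```cpp
// Illustrative relative scores for a candidate pair of operands; the exact
// numbers matter less than their relative order (names/values are assumed).
static const int ScoreConsecutiveLoads = 3; // loads from adjacent addresses
static const int ScoreSameOpcode       = 2; // same opcode, no shuffle needed
static const int ScoreAltOpcodes       = 1; // alternate opcodes, likely a shuffle
static const int ScoreFail             = 0; // no profitable pairing found
```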
Tests?
| lib/Transforms/Vectorize/SLPVectorizer.cpp | | |
|---|---|---|
| 756 ↗ | (On Diff #197190) | Seems to me, the code is not formatted |
| 757 ↗ | (On Diff #197190) | auto * |
| 758 ↗ | (On Diff #197190) | auto * |
| 764 ↗ | (On Diff #197190) | auto * |
| 765 ↗ | (On Diff #197190) | auto * |
| 769 ↗ | (On Diff #197190) | auto * |
| 770 ↗ | (On Diff #197190) | auto * |
| 815 ↗ | (On Diff #197190) | auto * |
| 816 ↗ | (On Diff #197190) | auto * |
| lib/Transforms/Vectorize/SLPVectorizer.cpp | | |
|---|---|---|
| 760 ↗ | (On Diff #197366) | (style) remove outer brackets |
| 773 ↗ | (On Diff #197366) | What happens in the case where we have alt opcodes? Should we have a preference for all the same opcode vs with alt-opcode? Sometimes the alt-opcodes will fold away (shl + mul etc.) - other times it won't (shl + lshr). |
Addressed comments and updated lit test.
| lib/Transforms/Vectorize/SLPVectorizer.cpp | | |
|---|---|---|
| 773 ↗ | (On Diff #197366) | Hmm, good point. Currently getScoreAtLevelRec() will simply walk past the alt instructions and assign them ScoreSameOpcode. This does not look very accurate, because alt opcodes usually require shuffles and should have a lower score. As for the alt opcodes that fold away, maybe that should be fixed in getSameOpcode() and in struct InstructionState? If they get folded, then maybe isAltShuffle() should return false? |
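For context, a hypothetical C++ illustration (not from the patch) of the fold-away vs. genuine alt-opcode cases discussed above: a multiply by a power of two is canonicalized to a shift, so that pair ends up with a single opcode, while an shl/lshr pair stays mixed and would need an alternate-opcode shuffle, arguably deserving a lower score.

```cpp
// Hypothetical source patterns; identifiers are illustrative only.
void alt(unsigned *A, unsigned *B) {
  A[0] = B[0] << 2; // shl
  A[1] = B[1] * 4;  // mul that folds to shl: effectively the same opcode as lane 0
  A[2] = B[2] << 2; // shl
  A[3] = B[3] >> 2; // lshr: a genuine alt-opcode pair, needs a shuffle
}
```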
Updated getBestOperand() to use getLookAheadScore() for Load and Constant, not just Opcode.
| lib/Transforms/Vectorize/SLPVectorizer.cpp | | |
|---|---|---|
| 932 ↗ | (On Diff #200564) | one extra space here. |
Oh, I noticed some regressions with this change in AArch64/matmul.ll and AArch64/transpose.ll. Maybe there is a way to isolate it with heuristics?
I investigated the two failing AArch64 tests. These tests feature the exact problem that we are trying to solve with this look-ahead heuristic: a commutative instruction had operands of the same opcode, which the current heuristic has no way of reordering in an informed way. The current reordering just happened to pick the proper one, while the look-ahead heuristic reorders the operands according to the score. The problem, however, was that the score calculation did not consider external uses and was therefore favoring a sub-optimal operand ordering.
I updated the patch to factor in the cost of external uses, and both failures are now gone. I also updated the lit test with a case that shows the external-uses problem.
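As a hypothetical illustration (not the failing AArch64 tests themselves) of the external-use issue described above: a value that feeds the would-be vectorized tree but also escapes to a scalar user adds an extraction cost that the score has to account for.

```cpp
// Hypothetical C++ example; identifiers are illustrative only.
double externalUse(double *A, double *B) {
  double t = A[0] + B[0];
  A[0] = t;            // in-tree use of t
  A[1] = A[1] + B[1];
  return t;            // external scalar use of t: its extraction cost matters
}
```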
Hmm, I have another failure with this change on my setup; now it is PR39774.ll. It is probably a difference in how similar candidates are sorted, since it is just a swap of two inserts. I don't see this failure if PR39774.ll is left intact.
Hmm, I am now getting the same failure as you, @dtemirbulatov. I am not sure what was wrong before, but it seems that the change in PR39774.ll is no longer needed.
This was reverted in r364111 since it was causing a failure in Chromium reported by @rnk.
Yes, @RKSimon. It was posted on llvm-commits. I did reproduce it and I will update the patch with the fix + lit test.
Fixed crash in chromium reported by @rnk.
The crash was caused by two call instructions with different number of arguments (see lookahead_crash() function in lit test).
| lib/Transforms/Vectorize/SLPVectorizer.cpp | | |
|---|---|---|
| 911 ↗ | (On Diff #206383) | This was the cause of the crash. There was no std::min here, so ToIdx could be OpIdx1 + 1 even if I2 had fewer operands than that. |
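A minimal sketch of the fix described in the comment above, with names taken from the comment rather than from the patch itself: clamp the end of the operand range so the look-ahead never indexes past the operands of the shorter instruction (e.g. when the two calls have a different number of arguments).

```cpp
#include <algorithm>
#include "llvm/IR/Instruction.h"

// Sketch only: previously the upper bound was simply OpIdx1 + 1, which could
// exceed I2->getNumOperands() when I2 was a call with fewer arguments.
static unsigned clampToIdx(const llvm::Instruction *I2, unsigned OpIdx1) {
  return std::min(OpIdx1 + 1, I2->getNumOperands());
}
```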
This change has caused a massive increase in build times when using LTO. On my workstation, building the Clang toolchain itself with LTO, the build time has increased from 50 minutes to 5+ hours. On our continuous builders, we have seen timeouts everywhere because they aren't able to finish within the allocated time (5 hours). Would it be possible to revert this change?
Reversion makes sense to me. @phosek, are you in a position to profile it to see where the time is going in SLP, please?
I think there are two possible causes for the compilation-time increase:
- Line 901: we can restrict the number of operands to a max of 2.
- Line 820: we can restrict the visited users to a max of 2.
I can either create a quick patch, or I can revert it. Either is fine.
If you can create a quick patch, I'd be happy to try it locally to see if it addresses the problem.
Fixed the compilation-time issue with the introduction of the user budget limit from D63948. Also added a lit test for it.
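A minimal sketch of what a "user budget" of this kind can look like, with assumed names and an assumed limit (not the actual code from D63948): stop scanning a value's user list after a small fixed number of users, so look-ahead scoring stays bounded even on values with thousands of uses, which is the LTO scenario reported above.

```cpp
#include "llvm/IR/User.h"
#include "llvm/IR/Value.h"

// Sketch only: walk at most UsersBudget users of V and report how many were
// visited; a real implementation would score each visited user.
static unsigned countUsersWithinBudget(const llvm::Value *V,
                                       unsigned UsersBudget = 2) {
  unsigned Seen = 0;
  for (const llvm::User *U : V->users()) {
    (void)U;               // scoring of U would happen here
    if (++Seen >= UsersBudget)
      break;               // budget exhausted: bail out to bound compile time
  }
  return Seen;
}
```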
Following is a simplified test case that shows the performance regression with Eigen. The number of instructions in the inner loop increased from 85 to 93. Hope it can be helpful.
#include "third_party/eigen3/Eigen/Core"
template <typename ValueType>
inline void CheckNonnegative(ValueType value) {
if (!(value >= 0)) { fprintf(stderr, "Check failed.\n"); }
}
template <typename ValueType>
inline decltype(std::norm(ValueType(0))) SquaredMagnitude(ValueType value) {
typedef decltype(std::norm(ValueType(0))) ComponentType; ComponentType r = std::real(value); ComponentType i = std::imag(value); return r * r + i * i;
}
template <typename VectorType2>
class DotEigen {
public:
DotEigen(): first_vector_(16), second_vector_(16) { } inline void Execute() { result_ = first_vector_.dot(second_vector_); CheckNonnegative(SquaredMagnitude(result_)); }
public:
Eigen::Matrix<float, 16, 1> first_vector_; VectorType2 second_vector_; std::complex<float> result_;
};
typedef DotEigen<Eigen::Matrix<std::complex<float>, 16, 1>> MOperation;
MOperation* operation;
void BM_Dot_RealComplex_EigenDotFixed(int num_iterations) {
for (int iter = 0; iter < num_iterations; ++iter) { operation->Execute(); }
}
@Carrot What build settings are you using? I'm not seeing this here: https://godbolt.org/z/4jq0g8
@RKSimon, my command line options are:
-O3 -m64 -msse4.2 -mpclmul -maes -mprefer-vector-width=128 -fexperimental-new-pass-manager -fPIE
I compared r364964 and r364963.
@vporpo, it is verified that the Eigen test doesn't regress now; the inner loop is actually the same as before.
Thanks