This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Peek into loads when hitting the RecursionMaxDepth
ClosedPublic

Authored by dmgreen on Mar 21 2022, 8:28 AM.

Details

Summary

I have an issue where an SLP tree hits the RecursionMaxDepth limit in the SLP vectorizer and is unable to analyze the whole tree because of it. This patch effectively extends the limit slightly, but only when it hits a load (or a zext/sext of a load). This allows it to peek through in the places where it will be the most valuable, without ballooning the complexity by any 2^n factors.
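As an illustration (this is not the actual SLPVectorizer code - the types and names below are invented stand-ins for llvm::Value and friends), the shape of the check is roughly: recursion normally stops at RecursionMaxDepth, but a bundle made up entirely of loads, or of zext/sext fed directly by loads, is allowed a couple of extra levels.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for IR nodes; the real patch works on llvm::Value.
enum class Op { Load, ZExt, SExt, Add };

struct Node {
  Op Opcode;
  std::vector<Node *> Operands;
};

constexpr unsigned RecursionMaxDepth = 12;

// A load, or a zext/sext whose operand is a load, may peek past the limit.
static bool isLoadOrExtOfLoad(const Node &N) {
  if (N.Opcode == Op::Load)
    return true;
  if ((N.Opcode == Op::ZExt || N.Opcode == Op::SExt) &&
      N.Operands.size() == 1 && N.Operands[0]->Opcode == Op::Load)
    return true;
  return false;
}

// Sketch of the depth check: bail out at the limit unless every value in the
// bundle is a load (or an extension of one), which effectively extends the
// limit by up to 2 levels without any 2^n blow-up.
static bool keepRecursing(const std::vector<Node *> &Bundle, unsigned Depth) {
  if (Depth < RecursionMaxDepth)
    return true;
  for (const Node *N : Bundle)
    if (!isLoadOrExtOfLoad(*N))
      return false;
  return Depth < RecursionMaxDepth + 2;
}
```

Since only loads and single-operand extensions are let through, the extra recursion terminates after at most two further levels, which is why the compile-time impact stays small.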

The compile times I measured for this (and D122145 together, as percentage differences) were small:

ClamAV		0.086
7zip		-0.012
tramp3d-v4	0.098
kimwitu++	0.008
sqlite3		0.046
mafft		0.003
lencod		0.169
SPASS		0.006
consumer-typeset -0.019
Bullet		0.011

Diff Detail

Event Timeline

dmgreen created this revision.Mar 21 2022, 8:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2022, 8:28 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
dmgreen requested review of this revision.Mar 21 2022, 8:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2022, 8:28 AM
dmgreen updated this revision to Diff 416965.Mar 21 2022, 8:29 AM

Fix some spelling

Have you checked the compile-time overhead of increasing the max depth by 1?

dmgreen added a comment.EditedMar 22 2022, 7:58 AM

> Have you checked the compile-time overhead of increasing the max depth by 1?

Sure - I can check. But it would need to increase the max depth by 2.

OK - it didn't increase the time very much either. Without the influence of D122145, these are the compile times I measured. The first column is this patch (almost no change); the second is increasing the limit from 12 to 14.

			This (%)	14 limit (%)
ClamAV			0.006357042	0.132623139
7zip			0.01145899	0.003936277
tramp3d-v4		-0.00551849	0.009901878
kimwitu++		-0.000151601	0.003420248
sqlite3			-0.005542751	0.002919453
mafft			0.019282126	-0.006002062
lencod			-0.0100786	0.029042137
SPASS			0.006203808	0.014101154
consumer-typeset	0.015334486	0.031552953
Bullet			0.010231917	-0.017328356

Increasing the limit is still a very small change, but a little bigger in places than this patch (and may behave worse in some degenerate cases).

Thank you for taking the time to measure the compilation time with an increased max-depth limit.

I am not sure what the max-depth limit was designed for - probably for testing gathers in lit tests? But I don't think that it is a very good way of limiting compilation time. A better option would be to limit the total number of nodes in the graph, by simply counting the entries into buildTree_rec(). So I would assume that increasing the max-depth limit should be relatively safe. If we run into compile-time issues, we can introduce the buildTree_rec() entry counter. @ABataev @RKSimon any thoughts on this?
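For illustration, the entry-counting alternative suggested here could be modeled like this (hypothetical names, not actual LLVM code): a shared budget consumed on every recursive entry bounds the total graph size regardless of its shape, so wide-but-shallow and narrow-but-deep trees are limited alike.

```cpp
#include <cassert>

// Hypothetical sketch of bounding total graph size rather than depth:
// every entry into the tree-building recursion consumes from a shared budget.
struct TreeBudget {
  unsigned Remaining;
  // Returns false once the whole-tree node budget is exhausted.
  bool consumeEntry() {
    if (Remaining == 0)
      return false;
    --Remaining;
    return true;
  }
};

// Model of a recursion guarded by the counter instead of a depth check;
// returns how many nodes were actually visited.
static unsigned buildTreeRec(unsigned Fanout, unsigned Depth, TreeBudget &B) {
  if (!B.consumeEntry())
    return 0; // Budget exhausted: treat this bundle as a gather.
  unsigned Visited = 1;
  if (Depth == 0)
    return Visited;
  for (unsigned I = 0; I < Fanout; ++I)
    Visited += buildTreeRec(Fanout, Depth - 1, B);
  return Visited;
}
```

A full binary tree of depth 3 has 15 nodes, so a budget of 100 visits all of them, while a budget of 10 cuts the walk off after exactly 10 entries.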

I don't think it is a good idea to increase or completely remove the limit on tree height. I tried something similar recently and found lots of regressions. I think we can only do this safely once we have an implementation for partial subtree/subgraph vectorization; before that it is not profitable.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4117

It might not be profitable, e.g. if the vector extension is not free while the scalar extension is free, and the loads are not vectorizable. Also, what if you have an extension of a load alongside some other instructions, not loads/extractelements, etc.?

I think I would still recommend this over increasing the max-depth limit. It is more targeted, and less likely to run into degenerate cases where the compile time does increase. It just raises the limit where we know it will be most valuable. They are not mutually exclusive, of course - we can always increase the limit separately if we need to.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4117

I'm not sure I understand what you mean by unprofitable? This just stops zext(load) from being forced to be a gather when it hits the max depth. It should just mean that those nodes are either better (not gathers) or the same, and shouldn't lead to regressions. It was previously hitting an arbitrary limit - you could equally say that any arbitrary limit causes arbitrary problems. Giving the loads the ability to order nicely should be a bigger win.

For the second part - do you mean an AltOp? If so then that makes sense; we can add a check for that, making sure it is the same as the MainOp.

dmgreen updated this revision to Diff 417871.Mar 24 2022, 3:31 AM

Add a check that S.MainOp == S.AltOp

ABataev added inline comments.Mar 24 2022, 5:51 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4117
  1. The cost of a vector sext/zext is larger than the cost of a scalar sext/zext (which might be free in many cases).
  2. If S.MainOp is a zext/sext(load), it does not mean that all values are zext/sext(load); they might be zext/sext(load, extract, binop, etc.), since you're checking only the MainOp.
ktkachov added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4114

typo nit: should be "peek"

dmgreen added inline comments.Mar 29 2022, 3:05 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4114

Oh right, yeah. Cheers. I'll fix that.

4117

Can't that be true for any limit though? We have an arbitrary limit of 12 at the moment. Decreasing the limit to 11 would mean some zexts are treated as gathers and not vectorized, and the cost of zexting the loads may be cheaper for scalars than it is for vectors. The same would be true for decreasing the limit to 10 or 9. We would end up picking the limit where we most expect to find zext(load), which is probably very low. (Or just never vectorizing zext(load) if the load is a gather.)

But in general, if the loads can be vectorized nicely (either contiguously or in clusters) then it should be a gain. The better vectorization of the loads would overcome the difference in cost between the scalar and vector zext. Across the code out there, we should expect this to improve performance more often than it decreases it.

For the second point, do you have any suggestions? As a simple heuristic, this seemed like a nice enough check to me, balancing the complexity of the check against the expected outcome. Should I make it an all_of?

ABataev added inline comments.Mar 29 2022, 4:25 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4117

> Can't that be true for any limit though? We have an arbitrary limit of 12 at the moment. Decreasing the limit to 11 would mean some zexts are treated as gathers and not vectorized, and the cost of zexting the loads may be cheaper for scalars than it is for vectors. The same would be true for decreasing the limit to 10 or 9. We would end up picking the limit where we most expect to find zext(load), which is probably very low. (Or just never vectorizing zext(load) if the load is a gather.)

Yes, but we already have some optimizations/numbers/data for this limit. This limit is not optimal, but it is stable, and lots of apps show that it is good enough. Any changes here are very sensitive.

> But in general, if the loads can be vectorized nicely (either contiguously or in clusters) then it should be a gain. The better vectorization of the loads would overcome the difference in cost between the scalar and vector zext. Across the code out there, we should expect this to improve performance more often than it decreases it.

The problem here is that you're not checking all of the values; you just check that you have a single load (the MainOp). That works for plain loads, but not for zext/sext.

> For the second point, do you have any suggestions? As a simple heuristic, this seemed like a nice enough check to me, balancing the complexity of the check against the expected outcome. Should I make it an all_of?

Yes, better to have an all_of here, at least for zext/sext. For the loads themselves it is enough to just check S.MainOp.

dmgreen updated this revision to Diff 418902.Mar 29 2022, 3:20 PM
dmgreen retitled this revision from [SLP] Peak into loads when hitting the RecursionMaxDepth to [SLP] Peek into loads when hitting the RecursionMaxDepth.
dmgreen edited the summary of this revision. (Show Details)

Update with more precise zext/sext checks. The clang-formatting comes out a bit strange - let me know if this should be put into a lambda.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4117

I tried the LLVM test suite - there were only 2 changes I saw with this patch (according to the file hashes). One was the same code in a different order, and the other was essentially the same but with slightly better reuse of values. Both had the same number of instructions.

I don't think I agree that the old limit was particularly special. It was an arbitrary point, and it would seem that increasing it should be beneficial in general. A lot of graphs don't get that big, but the ones that do can benefit from including the loads.

Can you give more details about where you expect this to cause problems? Is there a particular benchmark you are worried about?

ABataev added inline comments.Mar 29 2022, 3:25 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4117

I already said that I tried to increase it and there were lots of regressions. Try your patch on SPEC CPU or another benchmark suite.

I've tried SPEC for AArch64 - that was one of the places that showed the need for something like this. For AArch64, along with D122145, this helps one of the routines in x264 enough to speed up the whole thing. The exact speedup is quite dependent on the ordering of shuffles chosen, and may need some more work to come out as good as it can. The introduction of select shuffles wasn't great for targets without them. We are talking about something like a 5% improvement.

I've been trying to run X86 too, but I don't have a great setup for it. On what I think is some sort of "Xeon 6148" thing, these were the performance scores I saw from running SPEC 2017 with -march=native (again with this and D122145):

500.perlbench_r	0.829412281
502.gcc_r	0.398122431
505.mcf_r	-0.352758469
520.omnetpp_r	0.256906516
523.xalancbmk_r	-0.789195872
525.x264_r	0.198126785
531.deepsjeng_r	-0.024245574
541.leela_r	0.003397322
557.xz_r	-0.722254612

They can be quite noisy though, I'm afraid, even though those results are averaged across three runs. I wouldn't be surprised if all those changes were down to machine noise or knock-on alignment changes. We have a better setup for AArch64 than we do for X86, but there is always some noise.

I tried SPEC 2006 on AArch64 too. According to the file hashes, only the perlbench binary changed there, and its performance was the same.

I believe that this patch and D122145 can be landed, just need to tweak them a bit.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4119–4125

I would also add a check that we have VL.size() >= 4 values - for 2 elements it may cause regressions - and probably a check for single uses, to avoid extra extractelements.

dmgreen marked an inline comment as done.Apr 5 2022, 3:10 AM

I believe that this patch and D122145 can be landed, just need to tweak them a bit.

Ah, great. Thanks. I guess I have to find some way to improve the shuffle order for AArch64 now... :)

dmgreen updated this revision to Diff 420432.Apr 5 2022, 3:10 AM

Update with OneUse checks and a VL limit.

ABataev added inline comments.Apr 5 2022, 3:23 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4121–4124

I would also check that the sext/zext has just one use too.

dmgreen updated this revision to Diff 420534.Apr 5 2022, 10:27 AM

Add an extra oneuse check on the zext/sext

ABataev added inline comments.Apr 5 2022, 10:36 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4119–4125

Can you merge these 2 checks into one with single all_of to avoid 2 similar checks?

dmgreen updated this revision to Diff 421115.Apr 7 2022, 12:52 AM

Attempt to turn the two loops into one by checking that the opcode matches the MainOp.
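The merged check discussed above can be sketched as follows (again with invented stand-in types and names rather than the actual SLPVectorizer code): a single loop over the bundle VL requires every value to match S.MainOp's opcode, to be a single-use load or a single-use zext/sext of a single-use load, and requires at least 4 values so small bundles don't regress.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of the final shape of the check.
enum class Op { Load, ZExt, SExt, Other };

struct Value {
  Op Opcode;
  unsigned NumUses = 1;
  Value *Operand = nullptr; // Single operand, used for zext/sext.
};

static bool mayPeekPastMaxDepth(const std::vector<Value *> &VL,
                                Op MainOpcode) {
  // Require at least 4 values to avoid regressions on small bundles.
  if (VL.size() < 4)
    return false;
  for (const Value *V : VL) {
    // All values must share the MainOp opcode and have a single use.
    if (V->Opcode != MainOpcode || V->NumUses != 1)
      return false;
    if (V->Opcode == Op::Load)
      continue;
    if (V->Opcode != Op::ZExt && V->Opcode != Op::SExt)
      return false;
    // The extended value must itself be a single-use load.
    if (!V->Operand || V->Operand->Opcode != Op::Load ||
        V->Operand->NumUses != 1)
      return false;
  }
  return true;
}
```

Folding the opcode comparison into the loop is what lets the two earlier loops (one for loads, one for extensions) collapse into a single all_of-style pass.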

This revision is now accepted and ready to land.Apr 7 2022, 5:55 AM

OK I'm going to give this a go to get the better lane ordering. Please let me know if any problems arise.

This revision was landed with ongoing or failed builds.Jul 4 2022, 6:22 AM
This revision was automatically updated to reflect the committed changes.