This is an archive of the discontinued LLVM Phabricator instance.

[SLPVectorizer] Fix getSpillCost() calculation
AbandonedPublic

Authored by nikic on Jul 10 2019, 11:28 AM.

Details

Summary

Original motivation here is a performance problem encountered with the following IR, for which getSpillCost() takes an unreasonable amount of time: https://gist.github.com/nikic/a5bbc62632d68f76dd8fa9a4e81d9594

However, it turns out that the source of the perf problem is a correctness problem as well: The current implementation works by going through the tree nodes, remembering the previous node we looked at and inspecting all instructions between the previous and current node. Of course, because this is a tree and not just a chain, the "previous" node in the tree vector may occur after the current one. In this case the implementation ends up scanning through the whole block (even going through the effort of wrapping around from the start of the block to the end and continuing from there). If this happens often enough, this is not just very slow, but also results in gigantic spill costs (I saw costs >500000 for the sample code).

I'm changing the implementation to compute the cost in a single pass from the root node to the start of the basic block, keeping track of which values are currently live along the way. This resolves both the performance and the correctness issues.

The downside here is that I'm stopping at the edge of the basic block. I'm not sure how I would extend this to work across multiple basic blocks and not sure if that would even be worthwhile in the context of what the SLP vectorizer does.

(Side note: Intrinsic calls probably shouldn't count towards the spill cost. In this example the "calls" are all fshr/fshl, which are certainly not going to cause any spilling...)

Diff Detail

Event Timeline

nikic created this revision.Jul 10 2019, 11:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 10 2019, 11:28 AM

You should be able to at least test the now-reasonable runtime with some huge-ish IR?

I guess it would be better if it visited all the BBs. You could traverse the VectorizableTree once and collect the BBs to be visited before you walk through them. Btw the TreeEntries now have operands, which can help you collect the BBs in the right order.
But I am not sure this is worth the trouble, because it looks to me that the spill cost estimation function is not very accurate in the first place. It only considers the spill cost over calls, and while doing so it is not comparing it against the spill cost of the scalar code, which if I am not mistaken could be equal to that of the vector code in many cases. Someone who knows more about the assumptions made here should probably comment on it.

Another point is that if we need to avoid this extra traversal, we could perhaps implement this within the scheduler. It already walks through all the blocks while building the dependence graph. For example we could update the LiveValues in tryScheduleBundle() and compute the cost over calls in some function like extendSchedulingRegion() ?

nikic added a comment.Jul 12 2019, 2:39 PM

I guess it would be better if it visited all the BBs. You could traverse the VectorizableTree once and collect the BBs to be visited before you walk through them. Btw the TreeEntries now have operands, which can help you collect the BBs in the right order.

I'm not familiar with the SLP vectorizer, is there a guarantee that the BBs form a chain of dominating BBs? I.e. is there always a single BB we can go to when the edge of the current one is reached?

But I am not sure this is worth the trouble, because it looks to me that the spill cost estimation function is not very accurate in the first place. It only considers the spill cost over calls, and while doing so it is not comparing it against the spill cost of the scalar code, which if I am not mistaken could be equal to that of the vector code in many cases. Someone who knows more about the assumptions made here should probably comment on it.

The spill cost calculation is only used on AArch64. I believe the intention is mainly to avoid some "trivial" vectorizations that would be detrimental because AArch64 has scalar, but not vector, callee-save registers.

The spill cost estimation is pretty inaccurate for multiple reasons. One is the implementation issue I try to address here, another is the fact that intrinsics are assumed to also cause spilling, another is that each call is considered independently, even though spill costs will likely be amortized across multiple calls. And, as you say, spill costs for the scalar code are not modeled.

No, there is no single BB. There can be more than one operand BBs that will have to be walked. For example, a TreeEntry in BB0 could have all its left operands in BB1 and all its right operands in BB2. In such case we would need to walk through both of them and perhaps keep the maximum cost? But anyway, given the lack of accuracy in the spill cost estimation I am not sure this is worth the trouble.

nikic added a comment.Dec 15 2019, 9:28 AM

No, there is no single BB. There can be more than one operand BBs that will have to be walked. For example, a TreeEntry in BB0 could have all its left operands in BB1 and all its right operands in BB2. In such case we would need to walk through both of them and perhaps keep the maximum cost? But anyway, given the lack of accuracy in the spill cost estimation I am not sure this is worth the trouble.

Hm, the approach used here doesn't really extend well to multiple BBs, or at least it's not obvious to me how I would manage the LiveValues in that case.

Not sure how to move forward here. The goal of this patch was mainly to make the calculation "not wrong" (in the sense of not getting completely astronomical results), but I don't think I know enough about the SLP vectorizer to actually make it "right".

nikic abandoned this revision.Aug 8 2020, 5:47 AM

This has since been addressed by D82444.