This is an archive of the discontinued LLVM Phabricator instance.

Fix quadratic spill behavior?
AbandonedPublic

Authored by qcolombet on Feb 16 2016, 12:49 PM.

Details

Summary

WIP. In a program with a long chain of basic blocks (small loops in sequence), the Hopfield network grows extraordinarily slowly, hitting behavior quadratic in the number of basic blocks and taking 5 minutes to compile. This patch puts a cap on the iteration count to try to fix this. The generated code doesn't get worse.

No comments/etc yet (still a WIP, still awaiting CPU perf verification).

Diff Detail

Repository
rL LLVM

Event Timeline

escha updated this revision to Diff 48097.Feb 16 2016, 12:49 PM
escha retitled this revision from to Fix quadratic spill behavior?.
escha updated this object.
escha set the repository for this revision to rL LLVM.
escha added a subscriber: stoklund.
escha updated this object.Feb 16 2016, 12:49 PM
escha added reviewers: stoklund, MatzeB, resistor, qcolombet.
escha added a subscriber: resistor.
qcolombet requested changes to this revision.Feb 16 2016, 1:50 PM
qcolombet edited edge metadata.

Hi Fiona,

Yes, in the worst case we can grow the region for each basic block x each register x each variable.
That being said, the limit you set is way too low for CPUs, which have only a few registers. For instance, on X86, the LLVM test suite can go up to 126 iterations.

Given that we don't know how high this value can go in the wild, that this is the first time this has been reported as a problem, and that stopping the iteration too soon may harm the performance of the generated code, I would go with a finer-grained method of setting that limit.
A target hook taking, for instance, the machine function as a parameter seems more appropriate. That way the target can choose the limit based on all the different parameters I've mentioned earlier.
The default implementation would return UINT_MAX so that this patch is NFC.

What do you think?

Cheers,
-Quentin

lib/CodeGen/RegAllocGreedy.cpp
1045

s/j/iteration/

1046

Simply use iteration here.
E.g.,
for (unsigned iteration = TRI.getMaxHopfieldIteration(MF); iteration; --iteration)

This revision now requires changes to proceed.Feb 16 2016, 1:50 PM
resistor edited edge metadata.Feb 16 2016, 1:57 PM

Do you have any suggestions on how a target might make a reasonable choice for the limit? Does the limit trend down as the size of the register file goes up?

escha added a comment.Feb 16 2016, 1:58 PM

It seems strange to me that the limit would relate to the register file, given that (if I understand the code right) it's iterating over basic blocks, not registers?

escha added a comment.Feb 16 2016, 2:10 PM

To elaborate on this precise example: we have 4000 basic blocks. They're a linear chain of small loops (artificial test). The hopfield network grows at a rate of 2 blocks per call, which results in O(N^2) overall behavior. 256 registers * 4000^2 basic blocks == enormous, big, scary number.

stoklund edited edge metadata.Feb 16 2016, 9:41 PM

To elaborate on this precise example: we have 4000 basic blocks. They're a linear chain of small loops (artificial test). The hopfield network grows at a rate of 2 blocks per call, which results in O(N^2) overall behavior. 256 registers * 4000^2 basic blocks == enormous, big, scary number.

Thanks for identifying this problem.

I think I understand the bad behavior now, but please correct me if I'm wrong: You have a function with a long linear chain of basic blocks, and you have a live range spanning most of this chain, but with very few uses. Perhaps only two. The Hopfield network is only seeded with two active blocks where the uses are, and each iteration of the outer loop in RAGreedy::growRegion() only adds two new nodes to the network due to the completely linear shape of the CFG. Meanwhile, SpillPlacer->iterate() visits the whole set of discovered nodes, which adds up to a quadratic algorithm.

It is interesting that this has not shown up before. I suspect that in most functions, when RAGreedy is in so much trouble that it needs to use region splitting, the live register matrix is already pretty full, and there is not a lot of blank space for the Hopfield network to grow in. You must have very high register pressure in a small part of your function while the long linear chain of basic blocks has very low register pressure.

It is quite reasonable to bound the radius of the Hopfield network like you are proposing here, but it is tricky to pick a maximum radius that doesn't regress performance somewhere else.

I think the quadratic behavior of this algorithm is an accidental effect of its history, and it could be fixed to behave linearly in your case too. In its first incarnation, the Hopfield network was constructed completely before SpillPlacer->iterate() was called, and that function was only called once. This way of computing a region was linear in the size of the CFG, but slow because in a large function, most regions are much much smaller than the whole CFG. The current implementation, which grows the Hopfield network incrementally as nodes become positive, was added in r129188.

The new incremental algorithm is quadratic, but much faster than the old linear algorithm. Most of the time.

We can fix this. When the Hopfield network is expanding, most of the action is happening on the frontier where new nodes are being added. The internal nodes in the network are not likely to be flip-flopping much, or they will at least settle down very quickly. This means that while SpillPlacer->iterate() is recomputing all the nodes in the network, it is probably only the two frontier nodes that are changing their output. Instead of recomputing the whole network on each iteration, we can maintain a SparseSet of nodes that need to be updated:

  • SpillPlacement::activate() adds the node to the todo list.
  • When a node changes value (i.e., update() returns true), its neighbors are added to the todo list.
  • SpillPlacement::iterate() grabs a copy of the todo list, clears the original, and only updates the nodes in the list.

The backwards-then-forwards pattern in SpillPlacement::iterate() is not necessary when the network is being grown iteratively. In fact, a similar effect could be had by iterating over a 'live' SparseSet<unsigned>:

for (unsigned i = 0; i < Todo.size(); ++i) {
  unsigned n = *(Todo.begin() + i);
  if (!nodes[n].update(nodes, Threshold))
    continue;
  Changed = true;
  nodes[n].getLinkedBundles(Todo);
  if (nodes[n].preferReg())
    RecentPositive.push_back(n);
} 
Todo.clear();
escha added a comment.Feb 16 2016, 9:45 PM

I think I understand the bad behavior now, but please correct me if I'm wrong: You have a function with a long linear chain of basic blocks, and you have a live range spanning most of this chain, but with very few uses. Perhaps only two.

Yes, I believe this is *exactly* it.

Hi Jakob,

Thanks for root-causing this issue.

Will you provide a patch or should Fiona or I work on it?

Cheers,
-Quentin

Hi Jakob,

Thanks for root-causing this issue.

Will you provide a patch or should Fiona or I work on it?

I'd be happy to help with understanding the code or reviewing patches, but I don't think I'll have time to make the patch and run benchmarks myself.

Sounds good.

Thanks Jakob!

Fiona, I won’t have time before Friday. Do you want to have a look in the meantime or do you want to leave it to me?

Cheers,
-Quentin

escha added a comment.Feb 17 2016, 3:14 PM

This isn't super urgent (it's an artificial test and the bug report's been around for quite a while) so I think I'll leave it to you if that's acceptable.

Sounds good to me :).

Hi Jakob,

I’d like to have your feedback on the attached patch.
I’ve more or less blindly applied your suggestions from the other day, and I get a bunch of diffs in spill placement on the LLVM test-suite for x86.
Before I look closer, could you tell me if the patch is what you had in mind or if I am off?

Apologies for not being more thorough on the implementation and follow-up for this patch; I had less time than I expected, so I did the strict minimum :(.

Anyhow, your help is appreciated!

Thanks,
-Quentin
PS: Don’t pay attention to the useless unique_ptr; it is a leftover of the suggestion “grabs a copy of the todo list, clears the original, and only updates the nodes in the list”, which, if I understood correctly, is superseded by “a similar effect could be had by iterating over a 'live' SparseSet”.

wmi added a subscriber: wmi.Mar 14 2016, 2:34 PM
qcolombet commandeered this revision.Mar 14 2016, 2:36 PM
qcolombet abandoned this revision.
qcolombet edited reviewers, added: escha; removed: qcolombet.