This is an archive of the discontinued LLVM Phabricator instance.

shrink-wrap: implement more advanced algorithm
Needs ReviewPublic

Authored by thegameg on Mar 9 2017, 8:46 PM.

Details

Reviewers
MatzeB
Summary

THIS IS A WORK IN PROGRESS.

Putting this up for early review.

This pass is an improvement of the current shrink-wrapping pass, based, this time, on the dataflow analysis described in "Minimizing Register Usage Penalty at Procedure Calls - Fred C. Chow" [1]. The aim of this improvement is to remove the restriction that the current shrink-wrapping pass is having, which is having only one save / restore point.

Diff Detail

Event Timeline

thegameg created this revision.Mar 9 2017, 8:46 PM
thegameg edited the summary of this revision. (Show Details)Mar 9 2017, 8:48 PM
thegameg removed a subscriber: mgorny.
MatzeB edited edge metadata.Mar 10 2017, 10:55 AM

Thanks for working on this and posting the early version. On a first high-level glance this looks like a great start! I may need a few more days to dive into in and give detailed comments.

thegameg updated this revision to Diff 91431.Mar 10 2017, 4:05 PM

Update data structures to use the MachineBasicBlock number instead of a pointer.
Update data structures to use register sets instead of boolean maps.
Add new test case.

Hi Francis,

Overal this is heading in the right way and it looks like a great start!

Here's a first round of comments. I mostly looked at stylistic issues and did not look at any of the code in ShrinkWrap2.cpp yet.

  • I really like all those tests getting created during development also bonus points for them being reasonably/small reduced.

Naming: There are a lot of different names for the same/similar things around:

  • Saves/Restores seems to be predominantly used for callee saves.
  • Spills/Reloads (or Spills/Fills we don't even agree there) seem to be used by the register allocator.

I think you are picking names from both domains. I probably talked too much about spills/reloads to you as I am a regalloc guy, unfortunately this leads to more mixups here (i.e. having insertSpills() and insertRestores()).

include/llvm/CodeGen/MachineFrameInfo.h
18

Looks like this change is not needed (anymore?). I only see usage of MachineBasicBlock*

296–297

This looks like the main information you compute so it would deserve a few more comments explaining details. I wondered about:

  • I assume we can have multiple entries for the same register (as long as they are in different blocks)?
  • Can we have different FrameIndexes for the same register in different blocks?
  • How does this relate to the existing CSInfo struct?
684–696

Would it make sense to sprinkle some asserts here to make sure you are not accidentally mixing up the different systems and only use one at a time?

include/llvm/CodeGen/MachineFunction.h
478 ↗(On Diff #91431)

Unrelated to this patch and obvious. Feel free to just commit this without review.

lib/CodeGen/PrologEpilogInserter.cpp
400

Do not repeat function names in doxygen comments (A lot of old code does the same, but we try to avoid it for new code).

403

In llvm we try to use ArrayRef<T> instead of cont std::vector<T>&. It typically provides the same performance but works with std::vector, llvm::SmallVector and other containers (as long as the containers stores all values in a continuous memory region).

409

I tend to use references (=const TargetRegisterInfo &TRI = for stuff that cannot be nullptr).
Similar in a number of other places.

415–417

Use range based for.

432

Maybe call it RestoreBB as the variable doesn't hold a restore instruction but is about a basic block.

442

Avoiding auto is friendlier to the reader if the type isn't immediately obvious.

446

How about I = Restore.getFirstTerminator().

466–473

Would it make sense to just reverse the loop over CSI instead of using the subtle AtStart/BeforeI logic?

lib/Target/X86/X86FrameLowering.cpp
1162–1165

odd linebreak. Also could be written as:

NumBytes = FrameSize;
if (!MFI.getSaves().empty())
  NumBytes -= X86FI->getCalleeSavedFrameSize();
1675–1681

see above

test/CodeGen/X86/ShrinkWrapping/2006-04-27-ISelFoldingBug-Reduced.mir
4 ↗(On Diff #91431)

Maybe move this CHECK: next to the other CHECKS. It is also often a good idea to have some CHECK-LABEL for a point where the function begins (not sure if there actually is such a thing in the debug output here though).
Similar in the other tests.

thegameg added inline comments.Mar 14 2017, 2:45 PM
include/llvm/CodeGen/MachineFrameInfo.h
18

It is, unfortunately. DenseMap<K, V> uses DenseMapInfo<K>, which is specialized on DenseMapInfo<K*>, which uses PointerLikeTypeTraits<T*>, which uses alignof (T) to check the number of low bits available.

thegameg updated this revision to Diff 91809.Mar 14 2017, 6:37 PM
thegameg marked 14 inline comments as done.
thegameg edited the summary of this revision. (Show Details)

Thank you for the review Matthias!

Hope I addressed all your concerns so far in this patch.

I also "added support" for AArch64, fixed some region-related bugs and add some more tests.

thegameg removed subscribers: aemerson, qcolombet.

Hi Francis,

Out of curiosity, what is the compile time impact of this early version compared to the performance gain?

I am not aware of the improvement that Lupo and Wilken did to the algorithm, but back in 2009, we worked on Fred's shrink-wrapping and the compile time impact was just not worth the improvement.
You can have a look at 378553cb for instance. Thus, I am wondering if we are heading in the same direction and if yes, if instead iterative improvements of the current approach wouldn't be more productive.

Cheers,
-Quentin

Out of curiosity, what is the compile time impact of this early version compared to the performance gain?

For the current version, I'm expecting it to be quite high, since the implementation of the algorithm is quite basic, but I didn't measure that yet, and of course, I'm going to look into it after I get most of it fixed.

I am not aware of the improvement that Lupo and Wilken did to the algorithm, but back in 2009, we worked on Fred's shrink-wrapping and the compile time impact was just not worth the improvement.
You can have a look at 378553cb for instance. Thus, I am wondering if we are heading in the same direction and if yes, if instead iterative improvements of the current approach wouldn't be more productive.

I'm aware of the 2009 version of shrink-wrapping, and yes, I'm heading in the same direction as that, but I think we can improve it to reduce compile time. I also took a look at your implementation before, and I couldn't think of a way to produce multiple save / restore points for one register (which seems to be the feature that we're the most interested in?) using dominators (I could look into it if you have any ideas, of course).

And one of my concerns related to the improvement by Lupo and Wilken is actually compile time, since we actually need SESE regions, which are expensive to compute and unused through the pipeline, so we wouldn't be able to reuse the analysis results as we do with dominator trees (RegionInfo.h:25).

Thanks,
Francis

Cheers,
-Quentin

I couldn't think of a way to produce multiple save / restore points for one register (which seems to be the feature that we're the most interested in?) using dominators (I could look into it if you have any ideas, of course).

Is that what will bring performance?
I remember looking at candidate we missed with the current implementation and the main problem was the fact that we don't support several restore/save points, not that we don't split the set of CSRs. At that time we were not interested into pushing the approach further so we actually didn't categorize what is needed to get more interested candidates.
What does your categorization tell you?

Put differently, what is the percentage that we miss because of the granularity of the CSRs set we track, the number of save/restore points, "poor" allocation and so on?

Basically what I am saying is that your approach may make sense but I would rather have the implementation data-driven. My worries is that the end of your work we will end up in the same situation as 2009 and decide not to use it.

My 2-cent.

Cheers,
-Quentin

I couldn't think of a way to produce multiple save / restore points for one register (which seems to be the feature that we're the most interested in?) using dominators (I could look into it if you have any ideas, of course).

Is that what will bring performance?
I remember looking at candidate we missed with the current implementation and the main problem was the fact that we don't support several restore/save points, not that we don't split the set of CSRs. At that time we were not interested into pushing the approach further so we actually didn't categorize what is needed to get more interested candidates.
What does your categorization tell you?

Well, splitting the CSRs set and the save / restore points is, to me, very related. What I noticed (maybe I should look further into that?) is that once we have more than one CSR use, our allocation gets dropped because we have to merge the points with the other register's points (same for stack operations). But I agree, that having multiple save / restore points should be more beneficial than splitting the CSRs.

Put differently, what is the percentage that we miss because of the granularity of the CSRs set we track, the number of save/restore points, "poor" allocation and so on?

I don't have any numbers, and I'm not sure my estimation would be correct, and I actually think that I should look into that (maybe I should have gathered those numbers from the beginning?).

Basically what I am saying is that your approach may make sense but I would rather have the implementation data-driven. My worries is that the end of your work we will end up in the same situation as 2009 and decide not to use it.

I see, thank you for taking a look, I'll see if I can get some measurements asap.

Hey, here's a nother round of comments on the data flow analysis. Remember that dataflow is just a mental model/theoretical framework. For a concrete instance you can often find shortcuts/optimizations:

  • The code is usually more elegant if you just have a single map from blocks (numbers) to a struct that contains all of anticipated,available for that block. Rather than having a separate map for anticipated and available.
  • Instead of calculating each register on its own, it is often beneficial to calculate all register at the same time. That enables the use of bit operations (like and/or) for the join operations!
  • In your specific case making a difference between in/out is probably not worth it. For example looking ANTOUT/ANTIN: Every time you want to compute ANTIN from ANTOUT you just take the information from ANTOUT and add the stuff from usesReg(). You could just not store ANTOUT (in memory) at all, instead every time where you would change ANTOUT, you just go ahead and add usesReg() immediately and store the results in ANTIN.
  • Similarily you could not store the results of a block with 0 or 1 successor (immediately forward the results in the 1 successor case).
  • The anticipated information seems to be strictly growing (= you are never removing bits), in that cases you can probably put the results of the usesRegs() analysis immediately into the ANTOUT array instead of keeping a separate thing around.
lib/CodeGen/PrologEpilogInserter.cpp
746

llvm coding style recommends against curly braces to call constructors.

lib/CodeGen/ShrinkWrap2.cpp
69–72 ↗(On Diff #91809)

At least implement this with a function taking a lambda instead of using preprocessor macros (though a patch to add iterators to BitVector would of course be apreciated as well).

99 ↗(On Diff #91809)

BitVector should be the better choice over SmallSet for block numbers as the block numbers should be pretty dense.

100–104 ↗(On Diff #91809)

We use typedef over using in llvm when possible.

I actually think that I should look into that (maybe I should have gathered those numbers from the beginning?).

I guess it depends what is the goal of this implementation. For instance, if you want to have a baseline regarding what ShrinkWrapping can do, but don't actually plan to use this implementation in production, your approach makes sense. It will help to identify what are the cases that we miss right now and give elements to draw a path forward for fixing the existing shrink-wrapping for the important case or coming up with a new approach for shrink-wrapping.
Now, if you approach this as "this is the v2 shrink-wrapping", then yes, I would suggest to gather numbers first :).

Cheers,
-Quentin

thegameg updated this revision to Diff 95242.Apr 13 2017, 4:46 PM
thegameg edited the summary of this revision. (Show Details)
  • Handle irreducible loops.
  • Process the CFG as a DAG, by merging all the blocks in a SCC into a single block.
  • Remove loop, use RPO and PO traversal for AV and AN.
  • Struct of maps -> map of structs.
  • Remove curly braces for constructors.
  • Add BitVector iterators (D32060).
  • using -> typedef.
  • More tests.
thegameg removed a subscriber: rengolin.Apr 13 2017, 4:47 PM
kbarton added a subscriber: kbarton.May 3 2017, 7:23 PM
thegameg updated this revision to Diff 102450.Jun 13 2017, 4:41 PM
thegameg marked 3 inline comments as done.
thegameg edited the summary of this revision. (Show Details)
  • Fixed many bugs.
  • Refactor the pass as a helper object used by PEI.
  • Improve compile-time.
  • Add CFI support.
  • Clean tests.
  • More and more PEI hacks.
thegameg updated this revision to Diff 103086.EditedJun 19 2017, 11:45 AM

Fix some CFI-related changes that regressed during the refactoring.

It is worth noting that D18046 seems to solve this problem in a more general way, which would definitely help with stack-setup shrink-wrapping, which needs CFA updates between basic blocks.

Also, enable the SP-bump combining on AArch64 and start XFAILing some tests that need to be looked into later.

davide added a subscriber: davide.Jul 28 2017, 11:58 AM

This now needs to update the "Restore" flag from CalleeSavedInfo, as of r310619: Add "Restored" flag to CalleeSavedInfo.