This is an archive of the discontinued LLVM Phabricator instance.

Add callee-saved registers as implicit uses in return instructions
Needs ReviewPublic

Authored by kparzysz on May 9 2017, 8:07 AM.

Details

Summary

Update the block live-ins to have the saved registers on the paths from the entry blocks to the save blocks, and have the restored registers on the paths from the restore blocks to the returns.

Diff Detail

Repository
rL LLVM

Event Timeline

kparzysz created this revision.May 9 2017, 8:07 AM

This patch doesn't update any other code that may be simplified as a result.

gberry added a subscriber: gberry.May 9 2017, 8:49 AM
MatzeB edited edge metadata.

I really like the direction of this as it would get rid of the "pristine registers" and the "how to calculate live-outs of a return block" special cases.

Some notes:

  • This needs some compiletime checking to ensure the pristine registers were really just a premature optimization rather than something that fundamentally improves compiletime. Though in the interest of simplicity/maintainability I'd even accept smallish compiletime changes.
  • I still have to dive into the patches details, at a first glance I am surprised this needs the new getSavedRegisters()/getRestoredRegisters() callbacks. Seems like there is some exception handling magic going on, why do we need to special case it?
  • I still have to dive into the patches details, at a first glance I am surprised this needs the new getSavedRegisters()/getRestoredRegisters() callbacks. Seems like there is some exception handling magic going on, why do we need to special case it?

Magic #1: ARM saves LR, but restores the value directly into PC, and in general there is no way of telling that LR does not need to be restored. The general assumption is that the registers that are saved in the save blocks are the same that are later restored, and also the same registers that are live-out from the function (i.e. consumed by the return instructions). Without this, LR would have ended up as imp-use on return instructions, and caused "use of undefined register" errors.

Magic #2: X86 on Windows has some concept of funclets, whose entry points are treated as save blocks, but which do not save all of the registers in CSI. The exception handling runtime saves a few of them, and those ones are excluded from saving/restoring. I think it actually happens to save/restore all of them (based on my reading of the code). In this case, again we have to keep track which saves actually reached a given return instruction if we are to add the implicit uses correctly.

I don't like these hooks that much either.

I really like the direction of this as it would get rid of the "pristine registers"

This patch wouldn't allow us to get rid of pristine registers, unless it added them as live-ins to all blocks (and implicit uses of them to all returns). Is that what you have in mind?

kparzysz updated this revision to Diff 98763.May 12 2017, 7:30 AM

Add pristine registers as live-in to all blocks that are on a path from entry to a return block.

This causes one X86 testcase to fail: CodeGen/X86/break-false-dep.ll

test/CodeGen/X86/break-false-dep.ll:317:7: error: expected string not found in input
;AVX: vcvtsi2sdq {{.*}}, %xmm6, {{%xmm[0-9]+}}
      ^
<stdin>:496:2: note: scanning from here
 #APP
 ^
<stdin>:585:2: note: possible intended match here
 vcvtsi2sdq %r11, %xmm2, %xmm4
 ^

The difference is that XMM6 is a pristine register, and is started showing up as live-in in basic blocks. This lead the ExecutionDepsFix to pick a different register to use in an undef use (and it happened to be XMM2).

Any comments since the patch has been updated? It now adds pristine registers as live-in to all blocks between entry and exit. It caused one testcase failure, but the failure seems harmless.

As far as I understand it the main trick here is introducing the getSavedRegisters()/getRestoredRegisters() callbacks, because as it turns out the CalleeSavedInfo is not flexible enough to express what actually happens in practice (as you explained above exceptions and EH landing pads).

I found the code in updateLiveness() perfectly esp. compared to the new one that visited the blocks to fill a bitvector and then does something with every set bit, no idea why the step through the bitvector is necessary. Should be easy enough to do this while working like the original code. There are some good updates there like using a bitvector for the visited set or block numbers for the worklist. But factoring this out into the getAllPaths() function feels cumbersome: It requires you to first transform the list of save blocks/restore blocks which is already a nice list into a bitset just so the bitset is iterated again and put into the worklist. Similar with the results: You already visit the relevant blocks but instead of updating the liveins lists you set a bit in the bitset just so that caller has to iterate the bitset again to do something for each visited block.

If this patch works then we should be able to simplify the code in LivePhysRegs::addLiveOutsNoPristines() by removing all special handling for return blocks and LivePhysRegs::addLiveOuts() to simply add the pristine registers in the return block case.

lib/CodeGen/PrologEpilogInserter.cpp
975–981
  • Those two loops should be merged.
  • I guess it's customary to call block variables MBB in CodeGen.
  • On a general note I wish that we could make better assumptions about EH landing pads, but I understand why you did it this way and it's fine for this patch.
982

Maybe increasing this to 16 or 32; I'd expect more than 4 CSRs to be common.

983

Could use MCPhysReg instead of unsigned.

984–985

Francis landed a patch recently that finally allows for(unsigned PRegs : Pristine.set_bits()). Similar in a few other places here (though as noted in the general comments, it feels like too many unnecessary intermediate bitsets to me).

990

Sorry for the confusion about the pristine registers.
Handling them like this is wrong: You need to push them into every single livein list in the function, not just to the ones betweeen entry/save and restore/return.

I think we better leave that to another patch then (after measuring the impact on compiletime/codesize).

1008

A normal list should do here, as you already habe visited sets around. You only add stuff to the worklist that is not in the visited set yet. That allows you to use a simple Vector again.

lib/Target/ARM/ARMLoadStoreOptimizer.cpp
1930–1936

Maybe factor out the duplicated code.

But factoring this out into the getAllPaths() function feels cumbersome: It requires you to first transform the list of save blocks/restore blocks which is already a nice list into a bitset just so the bitset is iterated again and put into the worklist. Similar with the results: You already visit the relevant blocks but instead of updating the liveins lists you set a bit in the bitset just so that caller has to iterate the bitset again to do something for each visited block.

One of the factors that contributed to this choice was that getting the set of all blocks on paths between two other sets of blocks (e.g. restore blocks and returning blocks) seems like a useful utility to have. The need for calculating this could reasonably arise outside of PEI (I remember needing/wanting this for some work op Hexagon optimizations). It could be placed in another file, but there is no "precedent" for that (we have BasicBlockUtils.h for IR-level blocks, but not for CodeGen).

Regarding the use of BitVector---sure, I can change it to something more universal (any model of a set that can be iterated over would do). BitVector has both: cheap test for membership and iteration, and as a bonus a predictable traversal ordering (based on block numbers). Copying the argument into another structure (work list) may seem cumbersome, but it would need to happen regardless of what structure is used.

I'll leave the bit vector in for now and address your other comments.

kparzysz updated this revision to Diff 101414.Jun 5 2017, 9:14 AM
kparzysz marked 6 inline comments as done.

Addressed the comments. Removed the handling of pristine registers from this version.

lib/CodeGen/PrologEpilogInserter.cpp
984–985

Removed this particular statement, but used the set_bits range in other places.