This patch contains a pass that runs after basic block layout and inserts CFI instructions in epilogue. It assumes that there are CFI instructions inserted in prologue. It then finds the value of the offset/register values that are set and uses them to compute the offset/register values that should be set in epilogue. Also, it handles the case with multiple epilogues or epilogue block in the middle of the function by adding CFI instructions where needed.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Hi Violeta,
I'm in the middle of the review, but there's one thing I want to say about the general approach. What bothers me a bit is that recognizing the instructions that modify the stack pointer is separated from where these are actually are emitted. So, if we start making sp changes in the epilogue in some new, unpredictable way, this will break. On the other hand, I understand the need for doing this post-block-layout. (Unfortunately, I missed the original email, only found it now.)
I've been thinking, though - if we're ok with assuming that the sp only changes using a specific set of instructions, why not emit all cfa adjustment post-layout, instead of during frame lowering? And if we're not ok with that, then perhaps we need to solve the epilogue problem in a different way as well?
lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp | ||
---|---|---|
577 ↗ | (On Diff #50273) | Have you checked this with the Darwin people? |
lib/Target/X86/X86CFIInstrInserter.cpp | ||
102 ↗ | (On Diff #50273) | The LLVM convention is to generally frown upon braces for single-line blocks, except when needed for readability. This isn't a hard-and-fast rule (and clang-format does not enforce it), but to be consistent with the rest of the code base, I'd suggest removing them. |
102 ↗ | (On Diff #50273) | What if we don't have debug info, but have EH? |
108 ↗ | (On Diff #50273) | Why not TRI->getStackRegister(MF)? |
113 ↗ | (On Diff #50273) | I think TRI->getPtrSizedFrameRegister(MF) already does what you want. |
123 ↗ | (On Diff #50273) | Why are these two on the heap? |
124 ↗ | (On Diff #50273) | Actually, why is this a member? Isn't it local to AnalyzeMBB()? |
128 ↗ | (On Diff #50273) | Can you use a range for? |
165 ↗ | (On Diff #50273) | This doesn't look it's used anywhere. |
181 ↗ | (On Diff #50273) | Maybe have CFIInstructions[CFIIndex].getOperation() as a temp variable? |
219 ↗ | (On Diff #50273) | This doesn't look right. Why would the first block that has pops be the epilogue? More generally, will it make sense to mark the epilogue block while it's emitted? Sure, it can move it around, but do we expect it to be split? If we do, and can't mark it ahead of time, then I think we need a better way to recognize it. |
test/CodeGen/X86/epilogue-cfi-fp.ll | ||
2 | Any chance the tests can be made smaller? |
lib/Target/X86/X86CFIInstrInserter.cpp | ||
---|---|---|
322 ↗ | (On Diff #50273) | Reformat this to make the comment look better? (I guess there was an 80-char violation there, and clang-format broke the comment up?) |
332 ↗ | (On Diff #50273) | Don't you have this in StackPtr already? |
347 ↗ | (On Diff #50273) | You mean just "pops", right? It's not a "pop %esp". |
348 ↗ | (On Diff #50273) | Same as above re comment. |
354 ↗ | (On Diff #50273) | Oh, so you did mean "pop %esp"? If not, then shouldn't this be SlotSize? |
365 ↗ | (On Diff #50273) | Can we maybe have different code only to get the offset based on the opcode, and then use the same code to actually construct the CFI instruction? It looks like most of the inside of these 3 ifs is duplicated. |
lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp | ||
---|---|---|
577 ↗ | (On Diff #50273) | No, I haven't. I will post a question to llvm-dev. |
lib/Target/X86/X86CFIInstrInserter.cpp | ||
102 ↗ | (On Diff #50273) | I didn't know about the decision concerning -g and .eh_frame. Yes, these changes would modify .eh_frame now, since there is no support for saying which CFI directives should go in .eh_frame, and which in .debug_frame. Are there any plans for adding the support for generating different .eh_frame and .debug_frame? |
123 ↗ | (On Diff #50273) | I changed BlocksToAnalyze not to be, as there was no particular reason for it. And as for MBBInfoList, I wanted to create an array and index its elements so I created one on heap because its length is variable (number of MBBs in MF). Should I use SmallVector or something else instead? |
219 ↗ | (On Diff #50273) | FoundEpilogue is a badly named variable. The thing I am trying to recognize here isn't 'the epilogue', instead, I'm looking for each BB that contains instructions that change the value of SP (or FP). That is what I consider to be 'an epilogue' here. The flag FoundEpilogue would probably be better called something such as ShouldUpdateCFI, because it is set to true when a BB contains instructions that cause the need for adding CFI instructions that update values of offset/register. |
354 ↗ | (On Diff #50273) | No, the comment is wrong, I did not mean "pop %esp". InitialOffset is set to SlotSize at the beginning. |
Thank you for the review, Michael!
One of the reasons against inserting CFI instructions in epilogue during frame lowering, that I have come across, is the fact that their existence in the epilogue block interferes with the work of the tail duplication pass; it doesn't duplicate blocks which it would otherwise and therefore produces different code.
Generating all cfa adjustment post-layout sounds like a good idea. However, if support for generating different .eh_frame and .debug_frame is added, I am not sure if moving all CFA adjustments to a late pass would somehow affect the decision on which CFI directive should end up in .eh_frame and which shouldn't. Maybe generating CFI directives for .eh_frame could be left as it is right now (done in frame lowering), and generating all CFI for .debug_frame could be done in a late pass.
I addressed your comments in the patch.
Comments inline.
Regardless of those comments, I'm still not sure I'm comfortable with the approach.
Reid, I think you originally supported this on the mailing list. Can you please take a look and see whether this is what you had in mind?
lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp | ||
---|---|---|
577 ↗ | (On Diff #50955) | May also be worthwhile to add someone relevant to this review. |
lib/Target/X86/X86CFIInstrInserter.cpp | ||
103 ↗ | (On Diff #50955) | I don't think there are any plans to do that right now, unfortunately. |
124 ↗ | (On Diff #50955) | Or an std::vector - I don't think SmallVector is too appropriate here, there's no reason for MF.size() to be small. |
202 ↗ | (On Diff #50955) | Can we maybe rip this whole if out into a separate function that returns ShouldUpdateCFI? |
220 ↗ | (On Diff #50955) | Can you also update the comments/names, as well as the comment in the class definition that explains what InsertCFIInEpilogue() does? |
238 ↗ | (On Diff #50955) | This still seems somewhat odd. The logic in InsertCFIInEpilogue made sense to me, when it was meant specifically for epilogues. But now, this will affect every block - including non-epilogue blocks that happen to pop / change esp. If I understand what's going on correctly, this will add a .cfi_def_cfa_offset at the end of each block that modifies the cfa offset. But doesn't CorrectCFA already do this, regardless? Or am I getting something wrong here? In any case, perhaps it will be easier to understand once the names and comments are updated. |
355 ↗ | (On Diff #50955) | It may be so, but this has nothing to do with the initial offset - this is explicitly the slot size. |
test/CodeGen/X86/epilogue-cfi-fp.ll | ||
3 | This is a good start, but the tests still look like they contain much more than needed to check the specific things they check. Both in terms of debug info and attributes, and, for the EH test, a lot of code that seems to me to be redundant. |
lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp | ||
---|---|---|
577 ↗ | (On Diff #50955) | I posted a question to llvm-dev and the response http://lists.llvm.org/pipermail/llvm-dev/2016-March/097044.html made me think about the possible solution to the problem concerning generating compact unwind encoding as well as the problem of impacting eh_frame in the case when -g is passed. The solution consisted of marking CFI instructions with "frame destroy/frame setup" tags and then disregarding the CFI instructions marked with 'frame destroy' when generating compact unwind, or emitting eh_frame. However, the problem with this solution is the case when there are already generated .s files, and then .o files are generated: in that case, none of the CFI instructions are marked with 'frame setup/frame destroy' tags, because information about the tags does not exist in the .s file. Because of this, and because the response on the mailing list said that "The Darwin compact unwinder is never going to learn anything to its benefit from a CFI instruction in the epilogue.", I decided to disable the pass for Darwin. |
lib/Target/X86/X86CFIInstrInserter.cpp | ||
103 ↗ | (On Diff #50955) | No, there isn't. The CFI instructions that I'm generating already end up in eh_frame. |
238 ↗ | (On Diff #50955) | No, InsertCFIInEpilogue() will add appropriate CFI instructions after each instruction that changes the SP (FP). For example, it will add a .cfi_def_cfa_offset with the updated offset after each pop instruction. I added a check to see whether instructions that cause the CFI instructions to be inserted are marked as FrameDestroy. That way, it can be assumed that they are part of the epilogue. On the other hand, CorrectCFA() inserts a CFI instruction at the beginning of a MBB if it's needed. The case when this is needed is when epilogue block (with updated CFI) comes before another MBB that should have the offset that is set by the prologue. Then, a cfi_def_cfa_offset is inserted at the beginning of that MBB so it would override the offset set in the epilogue, and therefore provide correct offset for the MBB in question. |
Hi Michael,
Thank you for the review.
I addressed your comments in the patch and answered your questions inline.
I would like to explain why I went with this approach.
My first solution was to add the CFI instructions during emitEpilogue() in frame lowering. I realized that I would need to cover the cases when epilogue was in the middle of the function (when it was setting the wrong offset for blocks below it). The solution for that problem was to add a pass that would run after block placement and insert additional CFI instructions that correct the offset for BBs that have the wrong offset set by the epilogue above them (this is basically what the CorrectCFA method does now). However, then I found out another problem with inserting CFI instructions during emitEpilogue(): they interfered with later passes, e.g. the tail duplication pass wouldn't duplicate blocks that contained them. That is when I decided to move the insertion of CFI instructions in epilogue to the implemented pass.
What is it about this approach that you are not comfortable with? Do you have any suggestions on how it could be improved?
Sorry, I'm late to the party here (I was about halfway through implementing this myself before I found this). I'd also like to understand why a separate pass is needed. For the epilogue in the middle of a function problem, why are save/restore CFI instructions not sufficient? Can you elaborate on the problem in TailDuplication?
Rather than using a heuristic to decide what is an epilogue, why not check the FrameDestroy flag on the instructions?
Wrt to the the Darwin problem, the current solution that's implemented in X86AsmBackend doesn't seem great to me (since it generically doesn't handle cfi instructions in epilogues). Perhaps the correct solution is to have a heuristic that detects which cfi instructions are part of the prologue? This wouldn't be a problem of course if compact unwind info were generated earlier (as it used to be before rL190290), but the concern about not getting compact unwind info from the .s if we could have gotten it from the .ll is quite valid. How does GCC handle this situation?
Also, minor review comment inline.
lib/Target/X86/X86CFIInstrInserter.cpp | ||
---|---|---|
112 ↗ | (On Diff #53270) | Have you checked that this works with x32, I saw problems there in my implementation. |
Two more comments I found while testing this out.
lib/Target/X86/X86CFIInstrInserter.cpp | ||
---|---|---|
105 ↗ | (On Diff #53270) | && !MF->getFunction()->needsUnwindTableEntry() would be good. That's what's used elsewhere. |
377 ↗ | (On Diff #53270) | Missing X86::ADD64ri32. A sufficiently large alloca would probably work for a test. |
The problem with TailDuplication is the main reason why a separate pass is needed. In shouldTailDuplicate() method, while going through instructions in the BB, the pass finds CFI instructions (that are marked as NotDuplicable in Target.td) and decides not to duplicate the given BB.
Using save/restore CFI instructions could probably simplify solving the problem of epilogue in the middle, if CFI instructions were inserted in epilogue in frame lowering.
Another problem that complicates the case when CFI instructions are inserted in epilogue during frame lowering is shrink wrapping. It can cause epilogue to split (e.g. to a BB containing 'pop' instructions and a BB consisting of 'ret' instruction). Those BBs can be reordered later, and there could exist a situation where 'ret' instruction comes before 'pop' instructions. That ret instruction should have def_cfa_offset set by the epilogue, not prologue. This problem could be solved in some way, but the solution would probably be complicated and scattered among different passes.
Also, I believe that problems with other passes could come up, but that we just haven't come across them yet. Those, and mainly the TailDuplication issue, are the reasons why I decided to implement a separate pass to solve this issue.
As for deciding what is an epilogue, I will check if just checking the FrameDestroy flag is sufficient.
Concerning the problem with Darwin, I tried setting FrameSetup tag to CFI instructions added in prologue and using that when generating compact unwind, and it works, but not for the case when it is generated from the .s file (because the information about the tag is lost). I haven't checked what GCC does yet.
lib/Target/X86/X86CFIInstrInserter.cpp | ||
---|---|---|
112 ↗ | (On Diff #53270) | I didn't notice any problems with x32 so far. What problems have you come across? |
377 ↗ | (On Diff #53270) | Thanks for pointing it out! |
Apparently I did say that. This pass ended up being surprisingly complicated, I thought it would be a lot simpler. =/
I guess I think we do need a late pass, but it should coordinate with X86FrameLowering so that it doesn't have to reason too much about arbitrary X86 code. We shouldn't have to try to pattern match instructions that reset ESP from EBP, for example. We shouldn't need to run a dataflow algorithm to calculate the CFA on exit-entry to MBB. X86FrameLowering should *know* this information, and should write it down in some side table that we can refer to in this late pass.
Does that make sense? The late pass should only exist to handle the special case of machine block placement creating mid-function return blocks. The majority of the CFA logic should be in the prologue/epilogue generation.
Could we have an intermediate state where between frame lowering and the final cleanup pass, we essentially assume that at entry to every basic block we're in the CFI state we would have been after the prologue and then use the late pass to insert save/restore instruction when it detects that is not the case? Eventually the late pass could also do cfi optimization.
As for the Darwin problem, I think I've come around to the idea that the only possible way to do this properly is to generate the unwind info while we still know what the prologue is, record it in the assembly (I think cfi_fde_data lets you do that) and only do the heuristic inference thing if no such annotation is found.
lib/Target/X86/X86CFIInstrInserter.cpp | ||
---|---|---|
112 ↗ | (On Diff #53270) | Passing StackPtr to getDwarfRegNum didn't work. Note that x32 is 32bit pointers on 64bit not x86. |
The key point with that proposal being (which I realized I didn't mention) that one could handle tail duplication by inserting save state at the beginning of the basic block, then remember state just before the duplicated instructions and have everything else as usual. Of course it would then be nice for the late pass to clean this up (e.g. remove save/restore pairs when there's no instructions in between), but at least it would be correct.
I think we want to make sure that we move in a direction that makes it easier to do optimizations that affect the CFI between X86FrameLowering and this late pass. For example, we cannot schedule the pushes generated by the X86CallFrameOptimization pass without moving the CFI along with the push. So we generate very poor code in cases like this where the push operands get in the way of outgoing inreg arguments:
target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128" target triple = "i386-unknown-linux-gnu" declare i32 @f1(i32 inreg, i32 inreg, i32 inreg, i32, i32) define i32 @f2(i32 inreg %a, i32 inreg %b, i32 inreg %c, i32 %d, i32 %e) nounwind { entry: %call = tail call i32 @f1(i32 inreg 1, i32 inreg 2, i32 inreg 3, i32 %a, i32 %b) %add = add nsw i32 %call, 1 ret i32 %add }
LLVM generates this:
f2: pushl %edi pushl %esi pushl %eax movl %edx, %esi movl %eax, %edi subl $8, %esp movl $1, %eax movl $2, %edx movl $3, %ecx pushl %esi pushl %edi calll f1 addl $16, %esp incl %eax addl $4, %esp popl %esi popl %edi retl
icc generates much cleaner code (gcc is similar):
f2: subl $20, %esp movl $3, %ecx pushl %edx pushl %eax movl $1, %eax movl $2, %edx call f1 incl %eax addl $28, %esp ret
We would also like the ability to accumulate the stack-cleanup "add %esp" instructions for a series of calls like this:
target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128" target triple = "i386-unknown-linux-gnu" declare void @C(i32, i32, i32, i32) define void @F() nounwind { entry: tail call void @C(i32 1, i32 2, i32 3, i32 4) tail call void @C(i32 5, i32 6, i32 7, i32 8) tail call void @C(i32 9, i32 10, i32 11, i32 12) ret void }
Instead of what is currently generated
F: subl $12, %esp pushl $4 pushl $3 pushl $2 pushl $1 calll C addl $16, %esp pushl $8 pushl $7 pushl $6 pushl $5 calll C addl $16, %esp pushl $12 pushl $11 pushl $10 pushl $9 calll C addl $28, %esp retl
we can eliminate both "addl $16, %esp" instructions and bump up the last %esp adjust to "addl $60, %esp". This is simpler to do without having separate CFI & stack-adjust instructions.
To put this into a concrete proposal, I would suggest making this new pass responsible not only for generating proper epilog CFI but also for generating the CFI for simple stack adjusts. That would not only help enable optimizations like the above, but also eliminate the need for transformations that generate fixed stack adjusts to worry about also generating CFI. There have recently been at least 3 patches that added CFI for transforms involving stack adjusts (see D13767, D14021, D18246) all with their own logic for adding the CFI and deciding whether or not it's necessary.
To avoid having to calculate CFA on entry/end in the late pass, we could remember information about entry/end offset (and register) of a MBB in frame lowering and just use that info in the pass later. Possible problems could be BB merging, splitting and creating in later passes. Reid, is this what you had in mind? I am not sure if you meant to leave the insertion of CFI instructions in epilogue in the frame lowering, or to use the information available during frame lowering in order to simplify the implemented pass (but leave the insertion of CFI instructions in the late pass). Also, is storing information in side tables already used somewhere in the code?
We could store certain information about instructions: how they affect the offset/register. This information could be written down when the instruction is created, and then used in the late pass to generate and insert CFI instructions. That could make it easier to do optimizations that affect the CFI between X86FrameLowering and this pass.
I will start investigating these ideas, to see what kind of problems may arise. Do you have any thoughts on them?
Keno, could you explain your proposal a bit further, maybe provide an example?
My thinking was the following: There's generally three phases to CFI:
- Setting up everything (prologue)
- The function body
- (Potentially multiple epilogues)
On entry to most basic blocks, the current state of the CFI program will be that which we established after the prologue.
The idea is basically to, in the early passes have the invariant that no basic block other than the prologue changes the CFI
state (i.e. the state at the beginning of the basic block must be the same as the state at the end of the basic block). I think such an invariant
could be easily maintained in all the pre-layout passes (worst case by just inserting remember/restore pairs). I think in
most situations this will actually be already correct (since the only places inside the function where the CFI state changes is the epilogue).
The job of the late pass would then simply be to clean everything up once the layout is known (e.g. removing redundant remember/restore pairs, coalescing multiple instructions etc.
Hi Keno,
Thanks for the explanation. Just to check if I am following you, my understanding of your idea is to:
- insert CFI instructions in epilogue during frame lowering (I suppose to use cfi_adjust_cfa_offset)
- surround epilogue with remember state/restore state CFI instructions (also in frame lowering)
- teach TailDuplication to recognize and allow this pattern (a pair of remember/restore state instructions + adjust instructions in between). By "allowing" I mean to duplicate that block even though it contains CFI instructions.
- have the late pass remove duplicate remember/restore state instructions
Is this what you had in mind?
Hi, I would like to propose a potential solution to this problem. If anyone knows a reason why this would fail, or if someone can think of a better way to do it, share your thoughts, help would be much appreciated. Here is the proposal:
- insert cfi instructions in epilogue during frame lowering
- add a flag to each MBB that specifies the type of that MBB based on its beginning and end cfi offset:
- type 1: MBBs that are inside the prologue-epilogue path and have the same beginning and end offset (the one set by prologue)
- type 2: prologue block that has initial frame offset as the beginning offset and sets its own end offset
- type 3: epilogue blocks that have beginning offset the same as end prologue offset and end offset the same as initial frame offset
- type 4: MBBs that are outside the prologue-epilogue path, that have beginning and end offset the same as the initial frame offset
Default value of the flag would be type 1. During frame lowering, prologue and epilogue blocks would have their appropriate flags set, as well as blocks that are outside the prologue-epilogue path. Maybe this could also be done in shrink wrapping, by finding predecessors of the prologue and successors of the epilogue. This would probably require some tree searching, but it would be limited to basic blocks outside of prologue-epilogue path. Does anyone know of a better way to find information about which blocks are outside of the prologue-epilogue path?
- teach tail duplication and any other pass that causes problems to allow cfi instructions in the epilogue and not to change the code generation
- maintain information about added flags throughout different passes (during merging, splitting of MBBs)
- the late pass would go through all MBBs while keeping track of the current cfi offset that is set and add cfi instructions if needed (based on the information from the added flags). Could information about offset set by the prologue be obtained from getStackSize() (for the case without FP)?
What are your thoughts? Can anyone think of something that breaks this solution? One thing that comes to mind would be the existence of multiple prologues, i.e. multiple prologue-epilogue pairs with different frame sizes. Can this happen in LLVM?
Sorry for not getting back to this thread sooner. The plan looks reasonable to me, though I worry about the possibility of an epilogue split over two MBBs, which doesn't seem like it would be doable in this scheme. I don't think there are cases in LLVM where there is more than one prologue.
Hi Keno, thanks for commenting. As for the splitting of the epilogue over two MBBs, I have come across a case where 'pop' instructions were separated from the 'ret' instruction ('pop' instrs were in one MBB, and 'ret' was in another MBB). That case would be covered with the proposed solution. I haven't come across a case where 'add', 'pop' instructions that form an epilogue are split over multiple MBBs.
I would start working on implementing this if no one has anything against the proposed solution.
Here is the current implementation of the proposed solution.
I have added a new flag to a MBB that represents its type based on its beginning and end cfi offset. There are 4 types of MBBs: PROLOGUE, EPILOGUE, IN_PATH and OUT_PATH. Default type of the MBB is IN_PATH. During prologue and epilogue emission in X86FrameLowering, corresponding MBBs are set as PROLOGUE and EPILOGUE, and also, MBBs that are not in the prologue/epilogue path are marked as OUT_PATH. There are changes in some of the passes that split or merge MBBs, where these types are updated correctly. There is also a late pass that inserts additional CFI instructions to the beginnings of the MBBs for the cases where epilogue is in the middle of the function and sets incorrect CFI offset and register values for those MBBs.
Does this look like a step in the right direction? Does anyone have any comments?
The more I think about this, the more I think this is the wrong approach.
As I understand it, the problem is that CFI line tables are linear based on the assembly output, but that the actual CFI adjustments can be anywhere, and we need some way to "reset" the values for the linear layout.
This isn't just an epilogue problem, we are limited in changing stores to pushes because of this. If we have a diamond with a spill in the top of the diamond and a restore in the join, we will get this wrong if we don't lay the diamond out linearly. This is just an example, but there are other things that having better cfa tracking would enable.
If we're going to add information to a basic block, why don't we store the incoming and outgoing cfa_offset, and cfa_register? Then we can allow CFI directives to be copied.
This allows several beneficial things:
- We can shrink-wrap things that need to use a smaller prologue (meaning there are actually 2 prologues in the function)
- We can add a verification pass that all outgoing offsets of predecessors match incoming offsets of successors.
- We can do more store-to-push conversions
- This is platform independent. CFI cfa_def directives are platform independent, and only support register,offset or dwarf expressions. We can ignore dwarf expressions for now.
Then the pass that's needed simply inserts a directive between blocks that are linear-mismatches, and we can even handle rbp,rsp mismatches.
Hi Kyle, thank you for commenting!
I am not sure that I fully understand what you are proposing.
Are you saying that we should attach info about incoming and outgoing cfa_offset and register to a MBB or actually add CFI instructions to the beginning and end of a MBB (and then remove unnecessary CFI instructions later)?
Do you think that CFI instructions in epilogue should be inserted during frame lowering, when instructions that change the SP are created? Same question for the store-to-push optimization. The problem is that CFI instructions that are inserted earlier (before MBB reordering, merging, splitting, that is done in the later passes (tail duplication, control flow optimizer)) affect code generation. If we add information about incoming and outgoing CFI offset and register to a MBB, we would definitely be more flexible. That would support multiple MBBs that change CFI offset (that have different CFI offset at their beginning and end) besides prologue and epilogue. However, the problem with CFI instructions impacting code generation would still be present, and would be handled similarly to the way it is handled in the proposed implementation. Do you have something else in mind for solving this issue?
What do you think is wrong with this approach?
lib/CodeGen/BranchFolding.cpp | ||
---|---|---|
305 | How can I use TII to get this information? To do more tail merging (CFI instructions have different CFI indices, so isIdenticalTo() returns false, and prevents tail merging that would have happened otherwise). |
Attach the info
Do you think that CFI instructions in epilogue should be inserted during frame lowering, when instructions that change the SP are created?
Yes.
Same question for the store-to-push optimization.
Yes.
The problem is that CFI instructions that are inserted earlier (before MBB reordering, merging, splitting, that is done in the later passes (tail duplication, control flow optimizer)) affect code generation.
They currently affect code generation. This is my biggest complaint. I don't see why they should affect these passes.
If we add information about incoming and outgoing CFI offset and register to a MBB, we would definitely be more flexible. That would support multiple MBBs that change CFI offset (that have different CFI offset at their beginning and end) besides prologue and epilogue. However, the problem with CFI instructions impacting code generation would still be present, and would be handled similarly to the way it is handled in the proposed implementation. Do you have something else in mind for solving this issue?
What do you think is wrong with this approach?
I've answered above, but I'm still not convinced that CFI instructions, being only assembler directives to generate dwarf info should affect code generation. I think this is the big problem worth solving, and then the rest of the work will be small problems.
They currently affect code generation. This is my biggest complaint. I don't see why they should affect these passes.
This patch modifies those passes with changes that prevent CFI instructions from affecting code generation, so, with these changes we will avoid affecting code generation.
Those kinds of modifications are exactly what I don't want to see.
Here's what I'd like to see:
Change CFI Instructions so that the following 3 things are no longer true:
- CFI Instructions are marked as not duplicable.
- CFI Instructions don't compare as equal (and so can't be merged)
- CFI Instructions count as instructions when counting for tail-duplicating or tail-merging.
The first 2 shouldn't require any changes to existing passes, and the third should be minor, and definitely not platform specific.
These will require other work as well, but basically what I described above.
As a second best, special casing these instructions in a platform-independent way for platforms that can handle them later would work.
That means adding something to Target Info to indicate if a platform can handle these instructions, defaulting it to false, and then setting it to true for X86.
That would be sufficient for tail-duplication and tail merging.
lib/CodeGen/BranchFolding.cpp | ||
---|---|---|
305 | CFI Instructions should be more like the debug instructions that are skipped below. | |
lib/CodeGen/TailDuplicator.cpp | ||
588 | This tells me that marking CFI Instructions as not duplicable is wrong for all platforms. | |
607–608 | And this tells me that either it needs to be marked as a debug instruction, or we need a broader category for assembler directives that are not instructions. | |
lib/CodeGen/TargetInstrInfo.cpp | ||
396 ↗ | (On Diff #92010) | I read the DWARF spec on CFI. I didn't see where it said this only worked on X86. |
Here is an implementation of the approach that was suggested.
The following set of changes is introduced:
- Changes to CFI instructions:
- they are now marked as duplicable.
- they can compare as equal (they are compared based on operation/offset/register values, and not on CFIIndex).
- they do not count as instructions when tail-duplicating or tail-merging.
These changes ensure that CFI instructions in epilogue do not affect code generation.
- Attached information about cfa offset and cfa register to MachineBasicBlock. Each basic block now has info about cfa offset and register valid at it's entry and exit. When a CFI instruction gets inserted into a basic block, it is checked whether block's outgoing cfa offset and register need to be updated. It is checked whether incoming and outgoing information of the block's successors needs to be updated as well. This information is also updated when blocks are split, merged, duplicated or created.
This information is used by a late pass that inserts CFI instructions in order to set correct cfa offset and register for basic blocks. This needs to be done if blocks get reordered in a way that some have incorrect cfa offset and register set by previous blocks.
A verification pass is added that checks that outgoing cfa offset and register of predecessor blocks match incoming offset and register of their successors.
The current implementation adds CFI instructions in epilogue for X86. However, the changes described above can be used for all platforms.
Some initial thoughts. I would like to hide the actual CFI algorithms from the existing passes as much as possible.
lib/CodeGen/BranchFolding.cpp | ||
---|---|---|
450 | I'd like to see this factored out into MachineBasicBlock. Something like recalculateCFI(bool useExistingIncoming) | |
1012 | This looks like the same code above. Please factor this out. | |
lib/CodeGen/TailDuplicator.cpp | ||
607–608 | Can you create a function on MI: isDirective() and have it return true for debugvalue and CFIInstruction? | |
924 | It would nice to not have this in the pass, but rather as an abstraction: PrevBB->mergeCFIInfo(TailBB); |
This is coming along nicely. I forgot to say last time that I was pleased overall.
I have a few more things, but they're slowly getting smaller.
lib/CodeGen/BranchFolding.cpp | ||
---|---|---|
349 | should this be isDirective? | |
366 | Is this block still necessary if the tests above are changed to isDirective? I'm having trouble making sense of them | |
lib/CodeGen/MachineBasicBlock.cpp | ||
1456 | Why is this offset negative? Can you explain it to me? |
Fixed a bug related to calculating values of incoming and outgoing cfa offset. For def_cfa and def_cfa_offset, negative offset value was used for calculation, and that was not the case for adjust_cfa_offset. Now, the values stored as in/out cfa offsets are the actual offsets set by cfi directives (and not their negative values).
lib/CodeGen/BranchFolding.cpp | ||
---|---|---|
349 | I don't think that this logic can be applied to CFI instructions. It seems that this code aims to set the iterators to include consecutive DBG_VALUE instructions above last common instruction in one block (when those DBG instructions do not exist in the other block). I don't think that that's the desired behaviour for CFI instructions. | |
366 | It is. The tests above make sure that CFI instructions are not compared with isIdenticalTo(), and are not included when calculating TailLen. BB2: In this example, BB1 and BB2 will have 4 common instructions (RET, POP, POP and ADD). When INSTRUCTION_A and INSTRUCTION_B are compared as not equal, after incrementing the iterators (++I1; ++I2;), I1 will point to ADD (the last common instruction, as it should), however I2 will point to the CFI instruction. This will, later on, result in BB2 being 'hacked off' (in ReplaceTailWithBranchTo()) at the wrong place (starting from this CFI instruction, instead of ADD below it), so this CFI instruction will be lost. | |
lib/CodeGen/MachineBasicBlock.cpp | ||
1456 | A short answer would be: because the value of cfa offset that is passed when creating a CFI instruction with createDefCfa() and createDefCfaOffset() is negated. That is the value that will end up as the actual offset in cfi directives. I was calculating in/out cfa offset with this non-negated value (that is passed when creating these instructions) so I had to negate the offset when reading the value with getOffset(). I am not sure why the correct value is not passed in the first place (e.g. in createAdjustCfaOffset(), the passed value is not negated). |
This is looking pretty good. I'm going to go back over, but most of my initial concerns have been satisfied.
lib/CodeGen/BranchFolding.cpp | ||
---|---|---|
366 | Thanks. Can you shrink the example: BB2: and include it in the comments? |
OK. It's looking pretty good overall. I looked a lot closer at the actual CFI Code. Mostly it looks great. I don't think you need AdjustCFAOffset at all. You spend a lot of work maintaining it, but in the one case that it's used, it's just (OutgoingOffset - IncomingOffset). It's only used with blocks that don't contain an offset def.
That was the hardest part for me to understand, and I don't think you need it. If you can remove it, feel free to mark the comments as done by way of code removal.
include/llvm/CodeGen/MachineBasicBlock.h | ||
---|---|---|
770 | Should this always be equal to OutgoingCFAOffset - IncomingCFAOffset? I think I get it. It's the total amount of adjustments that occur in this block, either since the beginning or since the last def_cfa_offset directive in the function. Can you be more explicit in the comments? | |
lib/CodeGen/CFIInfoVerifier.cpp | ||
65 | Can you name this something like Pred? | |
90 | Same here. | |
lib/CodeGen/CFIInstrInserter.cpp | ||
66 | you should clear this value in CorrectCFA. | |
89 | The negative here is confusing. Can you make it so you don't have to remember to flip the sign here? | |
98 | Same here. | |
lib/CodeGen/MachineBasicBlock.cpp | ||
1362 | I think this would be more clear if you used a local variable. | |
1433 | Isn't there a missing case here, where if you encounter both a def_cfa_offset and a def_cfa_register, you can early return, like below? | |
1486 | Should this short-circuit if they already match? | |
1514 | Same here. It looks like you could short-circuit if they're equal. | |
1577 | It feels weird to me that you would scan the block twice. This could be done in the loop above, couldn't it? |
Removed AdjustCFAOffset.
Addressed other comments.
Changed the code in updateCFIInfoSucc() a bit since AdjustCFAOffset is no longer used.
This is looking really good. Thanks.
I still don't get the negation. I'm glad that you pinned it down to one location in the code, but I don't understand why it's there at all. Can you explain it to me?
lib/CodeGen/CFIInstrInserter.cpp | ||
---|---|---|
44 | I'm still not sure I get the negatives. Do the CFI instructions return the negative of what they're created with? That doesn't seem to make any sense. |
Hi,
I see that your patch is mostly about the state of CFA, but I think this applies to other things, like callee-saved registers.
As part of my work on shrink-wrapping, I needed to add support for saving / restoring registers in multiple locations. So, in order to get correct CFI information for the registers, we have to add the corresponding .cfi_offset and .cfi_restore to the save and restore points, which results in the following code:
// BLOCK0 if (i) { // BLOCK1 save reg1 if (a) { // BLOCK2 save reg2 [... use reg1 reg2 ...] restore reg2 restore reg1 ret } else { // BLOCK3 save reg3 [... use only reg1, reg3 ...] restore reg3 restore reg1 ret } } else { // BLOCK4 ret }
So, in this case, it's not correct to assume that the CFI state starts at the entry block, nor to assume that it is the same in the whole function (pretty much the same issue as CFA). For this, I emit a CFI_INSTRUCTION .cfi_offset along with every save, and a CFI_INSTRUCTION .cfi_restore along with every restore. After that, in the AsmPrinter I run another pass to fill the gaps and add more .cfi_offset or .cfi_restore where the fall through state is not correct.
Imagine the order of the basic blocks was:
BLOCK0 BLOCK1 BLOCK4 BLOCK2 BLOCK3
then we need an extra .cfi_restore reg1 at the end of BLOCK1 and an extra .cfi_offset reg1 at the beginning of BLOCK2.
This could currently happen with the current shrink-wrapping implementation, but the block placement is never placing blocks in a way that the prologue comes after the epilogue, so we didn't hit that issue yet.
I guess this issue can be also addressed with the same infrastructure you're building here for CFA. What do you think?
Also, I might be missing something here, but do we really need to maintain the state of CFA in every MBB starting from PEI to (almost) AsmPrinter? Can't we just do one simple pass during (or before) AsmPrinter which collects all the CFI directives and does the analysis based on the CFG and the final block layout (since at that point, we have that information) ? It looks like TailDuplication was the problem, but looks like you fixed that to skip CFI instructions.
Let me know what you think.
lib/CodeGen/CFIInstrInserter.cpp | ||
---|---|---|
44 | The thing is, they are created with the negative of what you pass to the create* method: static MCCFIInstruction createDefCfa(MCSymbol *L, unsigned Register, int Offset) { return MCCFIInstruction(OpDefCfa, L, Register, -Offset, ""); } static MCCFIInstruction createDefCfaOffset(MCSymbol *L, int Offset) { return MCCFIInstruction(OpDefCfaOffset, L, 0, -Offset, ""); } So, if you want to create a '.cfi_def_cfa_offset 16' you would pass -16 to the createDefCfaOffset() method. static MCCFIInstruction createAdjustCfaOffset(MCSymbol *L, int Adjustment) { return MCCFIInstruction(OpAdjustCfaOffset, L, 0, Adjustment, ""); } And here, for in/out cfa offset, I am storing these positive values that end up in cfi directives, and then creating def_cfa or def_cfa_offset instruction, hence the '-MBB.getIncomingCFAOffset()'. |
It would be nice to get rid of that negative completely. Not in this patch, but just remove it completely wherever we create CFI Instructions.
Hi Francis,
I am not sure if I am missing something, but yes, it looks like you could do something similar for cfi_offset and cfi_restore; keep track of registers that should be saved/restored at the beginning and end of a basic block and then insert save/restore CFI instructions for appropriate registers if the basic blocks get reordered in a way that this information is no longer correct.
For the second question, I have started with something similar (inserted CFI instructions in epilogue and added additional CFI instrs that correct the CFA calculation rule both in the late pass), however, that was not preferable, because I was analyzing each BB, checking it for CFI instructions, and analyzing if additional instructions should be inserted, basically, I should have known this information already, and not need to do this analysis in the pass.
Hi Violeta,
Thanks for taking your time to answer.
Yes, exactly. I'll try to merge this with your code from CFIInstrInserter.
Sorry for being so late to the review.
I'm very concerned about adding members to MachineBasicBlock, that makes this information effectively part of our machine representation that needs to be modified appropriately by transformation passes. In that respect it is not natural and I expect maintenance problems long term.
I know this was already discussed but I fail to find the exact reasons why producing all the CFA information as part of the AsmPrinter was dismissed.
Intuitively I would expect it to be very well possible to do some abstract simulation of the stack pointer during emission (possibly with target callbacks helping out to determine which instructions modify the stack/callframe access in which way). And I would take such a solution any day over introducing more "state" into machine functions/blocks. Even if it is more/complexer code than doing it as part of Prolog Epilog emission.
Should this always be equal to OutgoingCFAOffset - IncomingCFAOffset?
I think I get it. It's the total amount of adjustments that occur in this block, either since the beginning or since the last def_cfa_offset directive in the function. Can you be more explicit in the comments?