This is an archive of the discontinued LLVM Phabricator instance.

[LLD][COFF] Deduplicate .pdata entries
Needs ReviewPublic

Authored by aganea on Jun 30 2020, 8:28 AM.

Details

Summary

When targetting x64, before this patch LLD would generate duplicate RUNTIME_FUNCTION entries. They are not exactly duplicates stricto sensu, because the unwind RVA is different. However the two target UNWIND_INFOs would have the same content. An example out of the final EXE before this patch (I used dumpbin /unwindinfo ...):

0049D0DC 0554E9D0 0554E9F7 06ACBCE8   <- RUNTIME_FUNCTION entry (in .pdata)
   Unwind version: 1                  <- pointed UNWIND_INFO entry (in .xdata)
   Unwind flags: EHANDLER UHANDLER
   Size of prologue: 0x0A
   Count of codes: 2
   Unwind codes:
     0A: ALLOC_SMALL, size=0x30
     06: PUSH_NONVOL, register=rbp
   Handler: 05503340 
0049D0E8 0554E9D0 0554E9F7 06ACB0E8   <- note the last value (the unwind RVA) is different
   Unwind version: 1                  <- different UNWIND_INFO entry, but same content as above
   Unwind flags: EHANDLER UHANDLER
   Size of prologue: 0x0A
   Count of codes: 2
   Unwind codes:
     0A: ALLOC_SMALL, size=0x30
     06: PUSH_NONVOL, register=rbp
   Handler: 05503340

Unfourtunately this prevents us from using some external (commercial) post-processing tools, which assert if the EXE contains duplicate .pdata entries. This behavior is probably wrong anyway, how would the NT loader handle duplicate entries for the same function?

After this patch, we thusly remove any .pdata entries that have the same starting address. We make those changes directly into the final output stream, because the input .pdata streams need to merged, reallocated & sorted first. This has two side-effects: 1. the output PE section that receives the .pdata ends-up a bit larger than needed. 2. the .xdata entries are not stripped currently since they are part of .text sections (same chunks). But all in all we're talking about a few kbytes wasted. In our case, out of 300,000 RUNTIME_FUNCTION entries, this patch removes about 700, so that's about 14 kb wasted out of a 96 MB EXE.

If reviewers would think of a better way to implement this, I would gladly oblige.

Fixes PR45950.

Diff Detail

Event Timeline

aganea created this revision.Jun 30 2020, 8:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 30 2020, 8:28 AM
aganea edited the summary of this revision. (Show Details)Jun 30 2020, 8:30 AM
aganea edited the summary of this revision. (Show Details)Jun 30 2020, 8:34 AM

Just curious - do you know why you end up with duplicates? In most cases where there could be duplicates, they should be in a comdat section where only one is picked? And if not, you'd have a duplicate definition of a symbol in the corresponding text section - right?

lld/COFF/Writer.cpp
1869

Would it be possible to adjust the virtual size of the .pdata section as well, to exclude the pruned bits? I could imagine that some runtime introspection tools locate the section directly instead of using the data directory.

Just curious - do you know why you end up with duplicates? In most cases where there could be duplicates, they should be in a comdat section where only one is picked? And if not, you'd have a duplicate definition of a symbol in the corresponding text section - right?

Poking into this again - so in the testcase, you made the .text section a comdat so that doesn't conflict, but not the .pdata/.xdata. As far as I know, when llvm generates .pdata/.xdata for a comdat .text section, the .pdata/.xdata sections also should be comdat (and made associative to the .text section).

So in the case when you're running into this issue, aren't .pdata made into an associative comdat at all? Or does it work in general, but is failing in some corner case? Or is it a case of handwritten assembly, or code generated by some other external tool that doesn't get this aspect entirely right?

aganea marked an inline comment as done.Jul 13 2020, 8:18 AM

@mstorsjo This happens on VS2019 libs. I don't know if this happens on VS2017. I don't have an exact repro, but the test below exhibits the same effect as the problem I was observing. I will dig more into this.

lld/COFF/Writer.cpp
1869

Unfortunately the .pdata is merged into .rdata, there're other things following the merged .pdata in the output, and at this point all the RVAs have been calculated and sections already merged & written.

I've tried pruning earlier, but it's a tail biting snake. We need the entire merged, reallocated & sorted .pdata stream, and the computed RVAs, but we don't have that unless all sections have been written. But once they're written, we can't prune them, unless we would re-write again.

If we did a two-step "writeSection" and perhaps computed everything in advance, before writing, it could maybe be possible to resize the output section, but that would involve some sizeable refactoring. I figured it wasn't worth it, but if anyone has a different view on this, please let me know.

@mstorsjo This happens on VS2019 libs. I don't know if this happens on VS2017. I don't have an exact repro, but the test below exhibits the same effect as the problem I was observing. I will dig more into this.

Right, so in the object file bits that end up linked in statically? Interesting.

What does MSVC link.exe do in that case - does it also dedup them, or keep the duplicates?

lld/COFF/Writer.cpp
1869

Right, I agree it's probably a sensible tradeoff to keep it like this - it's probably not worth rearchitecting things just to be able to prune it better.

To see that I understand the issue correctly: Before we've fixed the layout and set RVAs, the .pdata section is unsorted. At that stage, we should still be able to check which pdata entries actually point to the same symbol (although it might be messy, having to dereference the relocations?) - but without sorting it, a duplicate check would end up rather costly - and without RVAs we can't really sort it. Am I understanding the situation correctly?

Given the (presumed) small extent of the issue, this solution does sound like a good tradeoff.

Also, right, if merging .pdata into another section, there's no need to update section sizes (and in that case, a few noop bytes in the middle of the .rdata are just orphaned bytes anyway).

IIRC with mingw setups, .pdata doesn't normally end up merged though. Is it (within reasonable effort) possible to check if .pdata is left unmerged, and in that case fix the section size? Or is the association from the enclosing section lost earlier at some point? If not, then this is clearly good enough.

aganea marked an inline comment as done.Jul 13 2020, 9:46 AM
aganea added inline comments.
lld/COFF/Writer.cpp
1869

Yes, the description of the situation is correct. I think all this would potentially wouldn't be a problem if /Gy was used for compiling (which is not the case here), then perhaps de-dup would be done correctly. However even at that I'm not sure, it seems the de-dup in ICF.cpp is done by hashing the chunk content, which doesn't work in the case of .pdata. Even if the realloc was applied and RVA computed, the ICF would still need to understand that two different .pdata records (RUNTIME_FUNCTION) are equal if they point to two different .xdata records (UNWIND_INFO) but with the same content. I've also ran into a peculiar case where two .xdata records were almost the same, bearing the fact they were using a different unwind handler (__CxxFrameHandler4 instead of 3) - however the unwind codes were all the same.

link.exe does seems to merge all .pdata records, because we didn't see this issue with the post-link tool I've mentioned (which asserts if two contigous .pdata records point to the same .text function).

As for the output section size, I'll extend the patch with your suggestion!

I've digged a bit more. The issue is that the ICF improperly folds .text$x sections (catch funclets) but not always the "associated" .pdata and .xdata records, in part because those additional sections are attached to .text$mn not to .text$x. This occurs only with MSVC-cl built OBJs, Clang does not create separate .text$x sections for catch functlets (rather, they are merged into the function .text). I'm seeing the bug when linking with the FBX SDK 2016.1 VS2015 (lib\vs2015\x64\release\libfbxsdk-mt.lib).

The COFF table for fbxmesh.obj in the archive above looks like this:

11C 00000000 SECT88 notype       Static       | .text$mn
    Section length   52, #relocs    2, #linenums    0, checksum 270664C2, selection    2 (pick any)
11E 00000000 SECT89 notype       Static       | .text$x            <<<< ----- this is de-duplicated, probably it shouldn't be ----- >>>>
    Section length   20, #relocs    2, #linenums    0, checksum 7C6EA17D, selection    5 (pick associative Section 0x88)
...
817 00000000 SECT160 notype       Static      | .xdata
    Section length   10, #relocs    2, #linenums    0, checksum DFA6EA80, selection    5 (pick associative Section 0x88)
819 00000000 SECT160 notype       Static      | $unwind$?_Buyheadnode@?$_Tree_comp_alloc@V?$_Tmap_traits@HHU?$less@H@std@@V?$allocator@U?$pair@$$CBHH@std@@@2@$00@std@@@std@@QEAAPEAU?$_Tree_node@U?$pair@$$CBHH@std@@PEAX@2@XZ
81A 00000000 SECT161 notype       Static      | .pdata
    Section length    C, #relocs    3, #linenums    0, checksum 1703F4BB, selection    5 (pick associative Section 0x88)
81C 00000000 SECT161 notype       Static      | $pdata$?_Buyheadnode@?$_Tree_comp_alloc@V?$_Tmap_traits@HHU?$less@H@std@@V?$allocator@U?$pair@$$CBHH@std@@@2@$00@std@@@std@@QEAAPEAU?$_Tree_node@U?$pair@$$CBHH@std@@PEAX@2@XZ
81D 00000000 SECT162 notype       Static      | .rdata
    Section length   28, #relocs    3, #linenums    0, checksum  849A3B3, selection    5 (pick associative Section 0x88)
81F 00000000 SECT162 notype       Static      | $cppxdata$?_Buyheadnode@?$_Tree_comp_alloc@V?$_Tmap_traits@HHU?$less@H@std@@V?$allocator@U?$pair@$$CBHH@std@@@2@$00@std@@@std@@QEAAPEAU?$_Tree_node@U?$pair@$$CBHH@std@@PEAX@2@XZ
820 00000000 SECT163 notype       Static      | .xdata
    Section length   10, #relocs    0, #linenums    0, checksum 8999943C, selection    5 (pick associative Section 0x88)
822 00000000 SECT163 notype       Static      | $stateUnwindMap$?_Buyheadnode@?$_Tree_comp_alloc@V?$_Tmap_traits@HHU?$less@H@std@@V?$allocator@U?$pair@$$CBHH@std@@@2@$00@std@@@std@@QEAAPEAU?$_Tree_node@U?$pair@$$CBHH@std@@PEAX@2@XZ
823 00000000 SECT164 notype       Static      | .xdata
    Section length   14, #relocs    1, #linenums    0, checksum 570F4CF1, selection    5 (pick associative Section 0x88)
825 00000000 SECT164 notype       Static      | $tryMap$?_Buyheadnode@?$_Tree_comp_alloc@V?$_Tmap_traits@HHU?$less@H@std@@V?$allocator@U?$pair@$$CBHH@std@@@2@$00@std@@@std@@QEAAPEAU?$_Tree_node@U?$pair@$$CBHH@std@@PEAX@2@XZ
826 00000000 SECT165 notype       Static      | .xdata
    Section length   14, #relocs    1, #linenums    0, checksum   677058, selection    5 (pick associative Section 0x88)
828 00000000 SECT165 notype       Static      | $handlerMap$0$?_Buyheadnode@?$_Tree_comp_alloc@V?$_Tmap_traits@HHU?$less@H@std@@V?$allocator@U?$pair@$$CBHH@std@@@2@$00@std@@@std@@QEAAPEAU?$_Tree_node@U?$pair@$$CBHH@std@@PEAX@2@XZ
829 00000000 SECT166 notype       Static      | .xdata
    Section length   28, #relocs    5, #linenums    0, checksum F9326582, selection    5 (pick associative Section 0x88)
82B 00000000 SECT166 notype       Static      | $ip2state$?_Buyheadnode@?$_Tree_comp_alloc@V?$_Tmap_traits@HHU?$less@H@std@@V?$allocator@U?$pair@$$CBHH@std@@@2@$00@std@@@std@@QEAAPEAU?$_Tree_node@U?$pair@$$CBHH@std@@PEAX@2@XZ
82C 00000000 SECT167 notype       Static      | .xdata
    Section length   10, #relocs    2, #linenums    0, checksum 5840A276, selection    5 (pick associative Section 0x88)
82E 00000000 SECT167 notype       Static      | $unwind$?catch$0@?0??_Buyheadnode@?$_Tree_comp_alloc@V?$_Tmap_traits@HHU?$less@H@std@@V?$allocator@U?$pair@$$CBHH@std@@@2@$00@std@@@std@@QEAAPEAU?$_Tree_node@U?$pair@$$CBHH@std@@PEAX@2@XZ@4HA
82F 00000000 SECT168 notype       Static      | .pdata            <<<< ----- but this is not de-duplicated ----- >>>>
    Section length    C, #relocs    3, #linenums    0, checksum F9766256, selection    5 (pick associative Section 0x88)
831 00000000 SECT168 notype       Static      | $pdata$?catch$0@?0??_Buyheadnode@?$_Tree_comp_alloc@V?$_Tmap_traits@HHU?$less@H@std@@V?$allocator@U?$pair@$$CBHH@std@@@2@$00@std@@@std@@QEAAPEAU?$_Tree_node@U?$pair@$$CBHH@std@@PEAX@2@XZ@4HA

Note how all associative COMDATs are attached to .text$mn, none to .text$x, this seems to be by design.

This gives this non-trivial graph: https://miro.com/app/board/o9J_klyfWoY=/

In essence, we end up with several .pdata entries in the PE pointing to the same .text$x address.

I think there should be a rule in the ICF to prevent .text$x (SECT89) from being folded if its pointee .xdata and .pdata cannot be folded as well, like here. Although the ICF algorithm seems to work based only on forward dependencies.

I can't find a repro, something else could be involved (the counter for the equivalence classes?).
I've tried the following:

// --- foo.h
void bar(int);
void bar(unsigned);
template <typename T> struct A {
  template <typename X> void foo(X a) {
    try {
      bar(a);
    } catch (...) {
      throw;
    }
  }
};
// --- a.cpp
#include "foo.h"
void test(int a) {
  A<int> val;
  val.foo(a);
}
// --- b.cpp
#include "foo.h"
template <> struct A<unsigned> {
  void foo(unsigned a) {
    try {
      bar(a);
      bar(a);
    } catch (...) {
      throw;
    }
  }
};
void alttest(unsigned a) {
  A<unsigned> val;
  val.foo(a);
}

But that doesn't trigger the bug. The issue seems to be occuring on the MS-STL std::_Tree::_Insert_nohint function, although from MS-STL VS2015.

There are many other libraries linked in, compiled with VS2019 or VS2017, so the issue could be from elsewhere.

I'm a bit stuck at the point. Would you go with the current patch, or something different in lld/COFF/ICF.cpp?
Ideas? @majnemer @rnk @ruiu

rnk added a comment.Sep 10 2020, 5:13 PM

This all sounds like this bug: https://bugs.llvm.org/show_bug.cgi?id=35337
Which was supposed to be fixed in rG107f55005bc9c9de2378057f56ae02016795a3ae

As I understand it, we have several catch blocks that are functionally identical (same text), but they have different xdata (EH states), and we probably shouldn't fold them together.

Note how all associative COMDATs are attached to .text$mn, none to .text$x, this seems to be by design.

I see, so that prevents our fix from doing what it's supposed to. However, I notice that all the .text$x sections are associated with the .text$mn section. Is ICF running on the .text$x sections even though they are associated with something else? It probably shouldn't be if it is. That could be the fix.


By the way, it would be a great improvement to LLVM to emit funclets in a separate .text$x section. It's a natural way to implement hot/cold code partitioning, it separates code that only runs during EH from non-exceptional code. This is actually one of the few advantages of funclets over landingpads: it's really easy to implement this partitioning.