This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Implement basic export trie
ClosedPublic

Authored by int3 on Mar 28 2020, 1:05 AM.

Details

Summary

Build the trie by performing a three-way radix quicksort: We start by
sorting the strings by their first characters, then sort the strings
with the same first characters by their second characters, and so on
recursively. Each time the prefixes diverge, we add a node to the trie.
Thanks to @ruiu for the idea.

I used llvm-mc's radix quicksort implementation as a starting point. The
trie offset fixpoint code was taken from
MachONormalizedFileBinaryWriter.cpp.

Depends on D76908.

Diff Detail

Event Timeline

int3 created this revision.Mar 28 2020, 1:05 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 28 2020, 1:05 AM
Harbormaster failed remote builds in B50786: Diff 253311!
Harbormaster failed remote builds in B50785: Diff 253310!
int3 edited the summary of this revision. (Show Details)Mar 28 2020, 9:43 AM
int3 marked 6 inline comments as done.Mar 28 2020, 10:12 AM
int3 added inline comments.
lld/MachO/ExportTrie.cpp
41–43

may want to consider making this a BumpPtrList

91

The original implementation used a unique BumpPtrAllocator instance for the export trie construction, which gets deallocated once trie serialization is complete. Here I am using the global bAlloc that doesn't deallocate till the end of the program. Not if it's worth using a local allocator instance to reduce memory consumption

193–196

This is basically a topological sort. It's unclear to me though why this approach (of finding every element starting from the root) was chosen vs doing a single preorder traversal of the trie...

int3 updated this revision to Diff 253762.Mar 30 2020, 6:29 PM

update test to use symtab addresses

ruiu added inline comments.Mar 31 2020, 1:52 AM
lld/MachO/ExportTrie.cpp
11

I'd add a file comment to explain the structure of the export table, that is, the symbol table is a trie instead of the usual runs of null-terminated strings. So it can be prefix-compressed and more compact, though it's more complicated.

48

I wonder if you can use std::vector.

ruiu added inline comments.Mar 31 2020, 1:52 AM
lld/MachO/ExportTrie.cpp
52

Do you need this destructor?

66

Initialize with = 0 here.

75

addSymbol seem to keep the internal trie consistent all the time, but I don't think we need to split the work between addSymbol and build. We can make addSymbol just to store an argument to an array, and move the code for trie construction to build.

lld/MachO/ExportTrie.h
23

I don't think we need to be efficient here. I'd use std::vector<> instead.

int3 marked 4 inline comments as done.Mar 31 2020, 7:07 PM
int3 added inline comments.
lld/MachO/ExportTrie.cpp
48

yeah that's an option. Probably not worth testing until we have more functionality implemented and can profile against real inputs though

lld/MachO/ExportTrie.h
23

There doesn't seem to be a way to wrap an std::vector in a raw_ostream though, which I need for the ULEB encoding functions. Would you prefer I use an std::string and a raw_string_ostream instead?

int3 marked an inline comment as done.Mar 31 2020, 7:26 PM
int3 added inline comments.
lld/MachO/ExportTrie.cpp
75

Not sure I understand what you're going for. Do you mean to have TrieBuilder::addSymbol collect the symbols to export in a vector, and then have build() call TrieNode::addSymbol() for all the symbols right before serializing the trie?

int3 updated this revision to Diff 254076.Mar 31 2020, 7:39 PM

address some comments

ruiu added inline comments.Mar 31 2020, 8:03 PM
lld/MachO/ExportTrie.cpp
75

Yes, that's what I meant.

ruiu added inline comments.Mar 31 2020, 9:13 PM
lld/MachO/ExportTrie.cpp
48

super nit: substring is a single word

75

What I wanted to say is that it looks like the trie implemented in this file is designed with the generic Map use cases in mind. You can add a new string one at a time to an existing trie, and you can also look it up as if it were a map (it is actually a map). In order to that, we create a trie in memory. But I think we can avoid that. We can just accumulate added strings to a vector, and in build() we sort the strings and directly emit a serialized trie from the sorted strings.

lld/MachO/ExportTrie.h
23

Yeah, I guess so. Symbol tables are not that small, and there's only one symbol table for each output, so we don't need to optimize it by using an LLVM-specific class. std::string would just work fine.

smeenai added inline comments.Apr 8 2020, 9:41 PM
lld/MachO/ExportTrie.cpp
75

+1 – we should only need the trie for output purposes here.

193–196

Not having looked at this in detail yet, would they give comparable results?

ruiu added inline comments.Apr 8 2020, 11:15 PM
lld/MachO/ExportTrie.cpp
75

It is probably better to rewrite the code instead of borrowing it from the existing code. I believe it shouldn't take too much time, and it might even be a bit of fun to write from scratch.

int3 planned changes to this revision.Apr 9 2020, 4:04 PM
int3 marked 2 inline comments as done.
int3 added inline comments.
lld/MachO/ExportTrie.cpp
75

We can just accumulate added strings to a vector, and in build() we sort the strings and directly emit a serialized trie from the sorted strings.

Ah, I see what you mean. That would definitely be more memory/allocation-efficient. I guess we should do a radix sort instead of a simple std::sort so we don't compare prefixes needlessly.

193–196

I think the only difference would be the order (and therefore the offsets) of the serialized nodes. Might make things more or less compact, it's unclear to me...

int3 marked an inline comment as done.Apr 9 2020, 8:02 PM
int3 added inline comments.
lld/MachO/ExportTrie.cpp
75

oh I see llvm-mc has a radix sort implementation in StringTableBuilder.cpp. Might steal that

int3 updated this revision to Diff 256907.Apr 12 2020, 6:48 PM
int3 edited the summary of this revision. (Show Details)

build trie via sorting

int3 edited the summary of this revision. (Show Details)Apr 12 2020, 6:48 PM
int3 added inline comments.Apr 21 2020, 11:40 PM
lld/MachO/ExportTrie.cpp
193–196

Aha, I see that ld64 has an -exported_symbols_order flag, so the export trie "can be optimized to make lookups of popular symbols faster". The implementation also secondarily sorts the trie by address. I guess that means dyld loads this trie lazily. I'll add a TODO to support this...

I think we can still implement it more efficiently than looking up every element from the root: While building the trie, we should keep track of the lowest user-ordered leaf that it contains. Then we can sort the nodes based on that.

int3 marked an inline comment as done.Apr 22 2020, 5:03 AM
int3 added inline comments.
lld/MachO/ExportTrie.cpp
75

Btw, the current implementation still builds a trie explicitly in memory -- I think we need that in order to iteratively calculate the uleb offsets. But now the trie is built via sorting rather than repeated insertion. Compared to repeatedly inserting symbols, this approach is better because we don't have to repeatedly update / split nodes.

Let me know if this was roughly what you had in mind, or if I'm missing some way to avoid having the trie in memory altogether...

I haven't though about this too hard, but if -exported_symbols_order wasn't a thing, we wouldn't have to do the loop to convergence, right? We could just do a post-order traversal, so that the the offset of each child of a node is fixed before you reach that node. (The wrinkle is that the root node needs to be at the start of the trie, since all the offsets are positive, but you could just traverse everything except the root node and calculate their offsets, then calculate the size of the root node, then add that size to all the other offsets.)

I haven't thought very hard about how -exported_symbols_order will work in this scheme either, but your idea for that makes sense at a high level. It doesn't seem ld64 handles this super optimally either. For example, if you have symbols _a and _ab, and you want _ab to be ordered in the trie before _a (or just only have _ab in the exported symbols order file), ld64 produces:

<root>
  |
  _a
  |
  |-----
  |    |
  *b  *''

(terminal nodes prefixed with a *)

whereas without any order file, you would have gotten

<root>
  |
 *_a
  |
  *b

Both tries take the same number of node traversals to get to _ab, but the first one needlessly pessimizes _a in the process.

I'd be curious about exactly how dyld processes the export trie, to understand how much of a difference -exported_symbols_order might make.

Assuming that my post-order traversal scheme actually works, and assuming that we don't plan to support -exported_symbols_order any time soon, would it make sense to just use that scheme for now and figure out -exported_symbols_order when we get there? I suspect we'll run into a bunch more considerations when we actually get around to implementing that, so I don't think it's super important to be coding with that in mind right now. (As in, the post-order definitely won't work when we have to respect -exported_symbols_order, but we can think about that later.)

lld/MachO/ExportTrie.cpp
16

From what I understand of the trie structure, the strings for children are actually stored in the parent. This is interesting by itself, and it also implies that the root node is always the empty string. I think it's worth making a note of that and reflecting that in the diagram as well.

56

Nit: can this be part of the anonymous namespace above?

57

Should add a TODO about adding proper support for the reexport and stub-and-resolver flags.

72

I'd add a comment about what the return value signifies.

73

This is the 1 byte for the TerminalSize member, right? Might be clearer to explain that in the comment.

77

uleb128 of length of symbol info?

91

I think LLD generally just allocates everything using the global allocator and deallocates it in one fell swoop at the end (or just lets the OS clean it up), so this should be fine. I'd be curious if anything in LLD COFF or ELF uses local allocators though.

100

Why this assert? The nodeSize (or TerminalSize as macho2yaml terms it) is a ULEB128, right?

108

You'd just have to split up the node in that case, right?

138

Super nit: have the comments in the order of the parameters (or vice versa).

153

Would it be preferable to use either the median or a random pivot? (I'm assuming the standard quicksort considerations for pivots apply, namely that always using the first element would trigger the quadratic worst case for an already sorted input.)

185

Did you mean sortAndBuild?

Is there an easy way to either express this as a loop or trust that the compiler will do the tail call optimization? I don't mind the goto very much, but I'm wondering how it compares to the alternatives.

193–196

Yeah, -exported_symbols_order flag is a wrinkle :/ What you said makes sense for when we have to support that though.

int3 marked 5 inline comments as done.Apr 28 2020, 5:52 PM

Post-order wouldn't work because the offsets are from the start of the export trie section, not from the parent node... not sure why it was designed that way :/

lld/MachO/ExportTrie.cpp
16

good catch, will fix

100

nope, it's a byte. Only the offsets are ulebs

108

yeah

153

yeah they do apply. I'll pick the median

185

Oops, this was copypasta from llvm-mc's original implementation (they called it multikeySort there)

The TCO was also from that implementation, but I think it's a "standard" quicksort optimization too, at least for the normal two-way quicksort. Wikipedia says that it's "suggested by Sedgwick and widely used in practice": https://en.wikipedia.org/wiki/Quicksort#Optimizations

int3 added a comment.Apr 28 2020, 5:58 PM

Re -exported_symbols_order, I think the scheme I have in mind would only change how the nodes are ordered, but not which ones are created, so we will hopefully not run into the weird ld64 case you described.

int3 marked an inline comment as done.Apr 28 2020, 6:11 PM
int3 added inline comments.
lld/MachO/ExportTrie.cpp
91

I made that comment on the original build-trie-via-insertion diff, which did a bunch of unnecessary string copying. The current diff just uses StringRefs that point to the character buffers of the symbols themselves and assumes they are immutable while the trie is being built and serialized, so this is no longer an issue.

int3 marked 5 inline comments as done.Apr 28 2020, 6:18 PM
int3 added inline comments.
lld/MachO/ExportTrie.cpp
100

nvm, I'm wrong, oops. It's a uleb indeed

Post-order wouldn't work because the offsets are from the start of the export trie section, not from the parent node... not sure why it was designed that way :/

Ah, right ... I was initially thinking you could just compute the size of the root node once you reach it and then adjust all the other offsets accordingly, but of course adjusting the offset might change the length of its ULEB encoding, and then you're back in the same situation.

int3 marked 9 inline comments as done.Apr 28 2020, 8:39 PM
int3 added inline comments.
lld/MachO/ExportTrie.cpp
16

I feel like the root node corresponding to the entry string is pretty standard for tries though. Also I think it makes more sense to say the strings are stored on the edges rather than the parent. I'll update the comment + diagram to reflect that; hopefully it's clearer from the new diagram that the root node corresponds to the empty string

100

I realized that the usage of nodeSize is inconsistent and confusing. In updateOffset() it includes the terminal size and the edges, but in writeTo() it doesn't. I'll rename nodeSize to terminalSize in writeTo().

185

It seems like there's no way to force the compiler to do TCO. There are only attributes to tell clang *not* to do TCO.

int3 updated this revision to Diff 260831.Apr 28 2020, 8:39 PM

address comments

smeenai accepted this revision.Apr 28 2020, 9:27 PM

LGTM.

lld/MachO/ExportTrie.cpp
16

Looks good!

165

Nit: use vec.empty()

185

Yeah. LLVM has a musttail attribute, but I don't think that's exposed to clang, and IIRC even that doesn't guarantee a tail call. Ideally the optimizer would be able to figure this out on its own, but there's no guarantee of that.

Any opinions on the current goto structure vs. wrapping the whole thing in a while (!vec.empty()) and adding an explicit return in the isTerminal case? I think that'd be equivalent, though I guess the goto makes the tail recursive call more apparent.

This revision is now accepted and ready to land.Apr 28 2020, 9:27 PM
MaskRay added inline comments.Apr 28 2020, 10:39 PM
lld/MachO/ExportTrie.cpp
139

Nit: auto *

lld/test/MachO/symtab.s
6 ↗(On Diff #260831)

I wish we can have a conciser llvm-readobj output... Due to llvm-readobj's verbosity, I prefer llvm-readelf for ELF objects...

int3 updated this revision to Diff 260841.Apr 28 2020, 10:56 PM
int3 marked 4 inline comments as done.

address comments

lld/MachO/ExportTrie.cpp
185

Yeah I think while() here would make for more confusing code for the reason you mentioned (plus add more indentation :P)

lld/test/MachO/symtab.s
6 ↗(On Diff #260831)

Hm, llvm-objdump --syms is terser, though I'm not sure it prints all the symbol info (in particular the flags). For the other tests where I just want the addresses of the symbols, I use objdump; but since this test is specifically for the symtab, I think it makes sense to get the most detailed representation of the symbols.

This revision was automatically updated to reflect the committed changes.
smeenai added inline comments.Oct 13 2021, 9:12 PM
lld/MachO/ExportTrie.cpp
185

I randomly came across this thread again. https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html is a fun related read :)

Herald added a project: Restricted Project. · View Herald TranscriptOct 13 2021, 9:12 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript