This is an archive of the discontinued LLVM Phabricator instance.

[Metadata] Use temporary MD nodes when appending module flags during module linking
Needs ReviewPublic

Authored by wolfgangp on Jan 27 2022, 4:12 PM.

Details

Reviewers
dexonsmith
Summary

This is a proposal to fix issue 51893.

As described there, we're seeing excessive memory usage at link time when using LTO + IPGO, and we traced it back to the appending of module flags during module linking.

This patch suggests to use temporary MD nodes for the list nodes that are newly created when module flags with SrcBehavior "append" are handled. Since we have to turn them into permanent nodes at some point, we keep track of the temp nodes in the module, and make them permanent when we know we're done with the linking.

If anyone has any better suggestions, please let me know.

@dexonsmith Apologies if you are not the right reviewer. The code has been implemented by Rafael originally, but IIRC he's no longer working on llvm.

Diff Detail

Event Timeline

wolfgangp created this revision.Jan 27 2022, 4:12 PM
wolfgangp requested review of this revision.Jan 27 2022, 4:12 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 27 2022, 4:12 PM

Seems like a good problem solve. IIUC, the root problem is that there are O(N) "versions" of the module flags being created, something like:

  • Flags1 = !{!1}
  • Flags2 = !{!1, !2}
  • Flags3 = !{!1, !2, !3}
  • ...
  • FlagsN = !{!1, !2, !3, ... !N}

IIUC, the status quo (without this patch) is that all these iterations of Flags are "leaked" onto the LLVMContext.

  • With this patch, each iteration is instead made a temporary node so it can be deleted after. That seems reasonable (although the delete operation seems a bit error-prone).
  • This patch does not fix the root problem, which is that "append" is O(N) instead of amortized O(1). As a result, O(N^2) work is still being done here.

I think it'd be better to fix the workload to use a "vector" data structure.

One approach would be to (carefully) add a capacity and resize operation to MDNode operands:

  • Change MDNode to allow hung-off operands; e.g., steal a bit from NumOperands to indicate whether hung-off, and store the pointer to operands and a capacity where co-allocated operands usually go.
  • Add a function MDNode::resize() that only appropriate subclasses enable (only some MDNode subclasses would want to enable this feature). Asserts if the subclass doesn't allow it, and asserts if the node is Uniqued (i.e., only legal for Temporary or Distinct, which have a real "identity" and are not stored in uniquing sets).
  • Ensure that an empty temporary or distinct node can be resized later by co-allocating enough space; as a side effect, even an empty node would start with some small storage, but that's probably okay. (An empty uniqued node wouldn't need anything co-allocated.)
  • (The capacity of a distinct node would not be serialized in textual IR or bitcode...)
  • (Probably there are other implementation strategies.)

With MDNode::resize() in place:

  • Change module flags in "append" or "append-unique" mode to use a distinct metadata node to store the list of nodes. The flag tuple that references it should be distinct as well. (distinct means "don't store in the uniquing table"; appropriate here since the content can change.)
  • Ideally, add an IR verifier check that these nodes are distinct. Fix textual IR testcases to use distinct in these places and add a bitcode upgrade to apply distinct retroactively.
  • During module linking, call resize() (and/or a derived push_back()) to extend the existing metadata node in-place rather than creating a new one. (If the "ideal" verifier/upgrade step above is skipped, then the append operation would need to check "are the right nodes distinct?" and if not create new distinct ones that can be modified in place.)

A second approach would be to add first-class support for module flags. E.g., could create an MDModuleFlag metadata type to use instead of the 3-part tuple. This node type could support the necessary operations efficiently.

A third approach would be to not fix the IR at all, but instead would be to change llvm::Module to have a "lazy module flags" mode. When enabled, !llvm.module.flags is NOT kept up-to-date with changes. Append and AppendUnique flags that have been modified are stored in staging data structures (maybe, SmallVector and SmallSetVector, respectively). LTO could put the module into this "lazy" mode before linking, and put it back into "normal" mode after it was finished linking to commit the changes.

IMO, the first approach I mention above would be best (possibly with the second approach as a follow up!) since the resize() operation would be generally useful; there are lots of metadata nodes that are really just lists of arbitrary size, and that get extended later on (e.g., @dblaikie @aprantl, I think there are places in debug info IR where new nodes are allocated for this sort of thing, is that right?).

ychen added a subscriber: ychen.Jan 31 2022, 1:06 PM

IMO, the first approach I mention above would be best (possibly with the second approach as a follow up!) since the resize() operation would be generally useful; there are lots of metadata nodes that are really just lists of arbitrary size, and that get extended later on (e.g., @dblaikie @aprantl, I think there are places in debug info IR where new nodes are allocated for this sort of thing, is that right?).

I'm actually not sure how many debug info nodes there are that get incrementally appended to... - the CU list itself, yes, but otherwise mostly the lists are made by frontends and made the right size the first time, I think? yeah, DIBuilder keeping a SmallVector of retained type nodes, then creating the necessary vector to create an MDNode from that in one shot, without incrementally appending.

But I could be wrong - just my rough estimate/recollection.

IMO, the first approach I mention above would be best (possibly with the second approach as a follow up!) since the resize() operation would be generally useful; there are lots of metadata nodes that are really just lists of arbitrary size, and that get extended later on (e.g., @dblaikie @aprantl, I think there are places in debug info IR where new nodes are allocated for this sort of thing, is that right?).

I'm actually not sure how many debug info nodes there are that get incrementally appended to... - the CU list itself, yes, but otherwise mostly the lists are made by frontends and made the right size the first time, I think? yeah, DIBuilder keeping a SmallVector of retained type nodes, then creating the necessary vector to create an MDNode from that in one shot, without incrementally appending.

But I could be wrong - just my rough estimate/recollection.

Yeah, I think it was the CU list (hits this kind of problem in LTO, probably has some custom code now) and the retained types (delays creation in IRBuilder to work around this) that I was thinking of.

Since distinct and temporary metadata are non-const anyway, it seems better to me to allow the size to change dynamically, and teaching the clients to take advantage of that rather than having extra side data, or temporarily invalid state (note that the verifier fails if it finds a temporary node).

IMO, the first approach I mention above would be best (possibly with the second approach as a follow up!) since the resize() operation would be generally useful; there are lots of metadata nodes that are really just lists of arbitrary size, and that get extended later on (e.g., @dblaikie @aprantl, I think there are places in debug info IR where new nodes are allocated for this sort of thing, is that right?).

I'm actually not sure how many debug info nodes there are that get incrementally appended to... - the CU list itself, yes, but otherwise mostly the lists are made by frontends and made the right size the first time, I think? yeah, DIBuilder keeping a SmallVector of retained type nodes, then creating the necessary vector to create an MDNode from that in one shot, without incrementally appending.

But I could be wrong - just my rough estimate/recollection.

Yeah, I think it was the CU list (hits this kind of problem in LTO, probably has some custom code now) and the retained types (delays creation in IRBuilder to work around this) that I was thinking of.

Since distinct and temporary metadata are non-const anyway, it seems better to me to allow the size to change dynamically, and teaching the clients to take advantage of that rather than having extra side data, or temporarily invalid state (note that the verifier fails if it finds a temporary node).

Sounds OK to me, if it makes sense to you!

Thanks for looking into this.

It seems best to abandon the proposed change and introduce a resize() capability for MDNodes.

Not being all that intimately familiar with MD, I might have some silly questions later on.

Thanks for looking into this.

It seems best to abandon the proposed change and introduce a resize() capability for MDNodes.

Not being all that intimately familiar with MD, I might have some silly questions later on.

(A bit behind on reviews); sounds great! I think the main gotcha is that an MDNode in a "uniqued" state shouldn't be resizable since it's constant-like.

I'm finally getting around to take a stab at this. One thing in particular is giving me trouble:

  • Ensure that an empty temporary or distinct node can be resized later by co-allocating enough space; as a side effect, even an empty node would start with some small storage, but that's probably okay. (An empty uniqued node wouldn't need anything co-allocated.)

I'm largely following your implementation strategy. On first allocation I'm creating all nodes they way they are currently, i.e. with co-allocated operands. For temporary and distinct nodes I'm allocating extra space for a potentially needed hung-off-storage descriptor, if the number of operands is small (i.e. the space allocated for them is not enough for what's needed for the descriptor). For uniqued nodes I don't allocate anything extra.

The resize operation is fairly straightforward to implement, but the trouble I'm having is at deallocation. Temporary MDnodes can be made unique, and later at deallocation time we can't distinguish between uniqued nodes that have originally been allocated as temporary nodes and the ones that were allocated as unique from the start. A possible solution would be to steal another bit from NumOperands (reducing it to 30 bits) and capture the original allocation state there, but I'm still looking for alternatives.

Cloning these small temporary nodes (without the extra size) at uniquing time would be another option, though it seems like the infrastructure is not available for this operation. I assume we don't want to saddle all MDnodes (even the uniqued ones) with the extra space.

At the moment I'm trying the additional bit approach, which seems to be working, but I wouldn't mind hearing your thoughts on this.

Herald added a project: Restricted Project. · View Herald TranscriptMar 25 2022, 11:34 AM

I'm finally getting around to take a stab at this. One thing in particular is giving me trouble:

  • Ensure that an empty temporary or distinct node can be resized later by co-allocating enough space; as a side effect, even an empty node would start with some small storage, but that's probably okay. (An empty uniqued node wouldn't need anything co-allocated.)

I'm largely following your implementation strategy. On first allocation I'm creating all nodes they way they are currently, i.e. with co-allocated operands. For temporary and distinct nodes I'm allocating extra space for a potentially needed hung-off-storage descriptor, if the number of operands is small (i.e. the space allocated for them is not enough for what's needed for the descriptor). For uniqued nodes I don't allocate anything extra.

The resize operation is fairly straightforward to implement, but the trouble I'm having is at deallocation. Temporary MDnodes can be made unique, and later at deallocation time we can't distinguish between uniqued nodes that have originally been allocated as temporary nodes and the ones that were allocated as unique from the start. A possible solution would be to steal another bit from NumOperands (reducing it to 30 bits) and capture the original allocation state there, but I'm still looking for alternatives.

Cloning these small temporary nodes (without the extra size) at uniquing time would be another option, though it seems like the infrastructure is not available for this operation. I assume we don't want to saddle all MDnodes (even the uniqued ones) with the extra space.

At the moment I'm trying the additional bit approach, which seems to be working, but I wouldn't mind hearing your thoughts on this.

Stealing a bit down to NumOperands=30 seems pretty reasonable. I think an MDNode with more than a billion operands probably doesn't need to be supported.

I think it's okay to saddle anything that came through the "temporary" or "distinct" path with one extra word; not that bitcode layout of metadata is designed to avoid metadata temporaries in most cases for uniqued nodes.

But I'm not sure you need it. I think you can do something like this pseudo-code for the co-allocation:

struct HungOffType {
  uint32_t Capacity;
  uint32_t NumOperands;
  MDOperand Ops[Capacity];
};
union CoallocationType {
  MDOperand Small[SmallCapacity];
  std::unique_ptr<HungOffType> Large;
};

You can cut up NumOperands without loss of generality once you have that setup:

uint32_t IsSmall : 1;          // Whether it's small.
uint32_t SmallCapacity : 8;    // Co-allocate up to 256 operands.
uint32_t SmallNumOperands : 8; // How much small storage is used?

size_t getNumOperands() const {
  return IsSmall ? SmallNumOperands : getLargeOps().NumOperands;
}

This limits the co-allocation to 256 operands, but if the initial number of operands is too big for that it can hang them off (maybe we want to limit small size further; a histogram might tell us). It also buries getNumOperands() behind a an extra pointer, but anyone checking the size is probably iterating through anyway so it's a pointer they were already going to traverse.

This is a breakdown of MDNodes by number of operands using a bootstrap and a couple of other internal builds (-g -O2). The second column gives the percentage of MDnodes with the number of operands in the first column.
99.9 percent of all nodes have 15 operands or fewer. Almost half have 2 operands.

OperandsPctCumulative
03.73.7
120.223.8
247.070.8
32.873.6
45.679.2
54.283.5
68.992.3
70.192.4
85.297.6
9-152.399.9
16-310.1100.0
32-640.0100.0
> 640.0100.0

Great; seems like maybe we'd be okay to max out at 16 operands in "small" mode.