This is an archive of the discontinued LLVM Phabricator instance.

[IR] Lazily number instructions for local dominance queries
ClosedPublic

Authored by rnk on Sep 4 2018, 4:45 PM.

Details

Summary

Essentially, fold OrderedBasicBlock into BasicBlock, and make it
auto-invalidate the instruction ordering when new instructions are
added. Notably, we don't need to invalidate it when removing
instructions, which is helpful when a pass mostly delete dead
instructions rather than transforming them.

The downside is that Instruction grows from 56 bytes to 64 bytes. The
resulting LLVM code is substantially simpler and automatically handles
invalidation, which makes me think that this is the right speed and size
tradeoff. There's more low-hanging fruit in MemorySSA and DSE, which
maintain their own instruction orderings today.

The important change is in SymbolTableTraitsImpl.h, where the numbering
is invalidated. Everything else should be straightforward.

We probably want to implement a fancier re-numbering scheme so that
local updates don't invalidate the ordering, but I plan for that to be
future work, maybe for someone else.

Event Timeline

rnk created this revision.Sep 4 2018, 4:45 PM
rnk added inline comments.Sep 4 2018, 5:05 PM
llvm/include/llvm/IR/SymbolTableListTraits.h
61–64

Oops, I'll revert this.

rnk updated this revision to Diff 163955.Sep 4 2018, 5:17 PM
  • remove stale changes

I have zero context on the various ways we can/do add instructions to a BB, so I can't immediately be helpful with the tricky part of this review. :) I've no complaints about the general direction of this patch, though.

When the code's more polished, would it be reasonable to also stick a verifier of this somewhere (that we maybe just run per-query or per-pass on EXPENSIVE_CHECKS builds, or something)?

rnk added a comment.Sep 5 2018, 9:45 AM

When the code's more polished, would it be reasonable to also stick a verifier of this somewhere (that we maybe just run per-query or per-pass on EXPENSIVE_CHECKS builds, or something)?

We could do it if expensive checks are enabled, but I think checks in NDEBUG that don't affect the asymptotic performance are the most useful. Maybe before inserting an instruction, if the numbering was previously considered valid, we can assert that it actually was? This way, we do linear work the first time we create the ordering, and then do linear work when we invalidate it. We could also check when destroying the BB, since that also does linear work.

rnk updated this revision to Diff 164089.Sep 5 2018, 11:32 AM
  • add instruction order validation
  • invalidate order when splicing instructions
  • call transferNodesFromList for same-list transfers
dexonsmith added a reviewer: ahatanak.EditedSep 5 2018, 5:42 PM
dexonsmith added subscribers: ahatanak, bogner.

This is awesome to see!

@ahatanak, you had a prototype of something similar a couple of years ago. Can you take a look?

The downside is that Instruction grows from 56 bytes to 64 bytes,

What's the growth for 32-bit pointers?

and I
don't have a good way to measure what that costs in practice. As the one
who removed the vtable from Value, I will say that this is how I would
like to spend the 8 bytes that I saved a year ago in r303362.

The two memory-sensitive cases I'm aware of are embedded compilers and LTO (with debug info).

It would be good to get someone to comment on the former.

On the latter: we still support users of -flto=full, so it would be interesting to know how big a regression in peak memory this would be. In the past, I've gotten numbers by running a build with -flto, using -save-temps when invoking the ld64, and then running llc on the optimized (but not CodeGen'ed) bitcode file (since the peak is in CodeGen). If you're really curious, that still seems like a decent way to look at the impact. But I doubt this will even register a difference; Instructions are usually malloc-allocated, and malloc is often 16-byte aligned.

dmgreen added a subscriber: dmgreen.Sep 8 2018, 6:43 AM
rnk added a comment.Sep 12 2018, 2:04 PM

Any opinions on this? I'm eager to get it in so I can release it and get some faster builds, but I know it's a core data structure change.

FWIW, it seems like OrderedBB invalidation is causing bugs at least in LoopSafetyInfo (D50377) which @mkazantsev is currently working on fixing. There might be other places that get this wrong too, so having automatic invalidation seems like another good plus on top of the speedups.

This looks good to me, but like said, I don't have sufficient context to stamp it with great confidence.

Thanks again!

llvm/include/llvm/IR/BasicBlock.h
433

nit: I'd expect a function named this to assert(). Can we rename it to something that sounds more boolean-y like isInstrOrderingValid() (?) and/or add LLVM_NODISCARD?

433

nit * 2: I don't know what precedent is set elsewhere, but personally, I feel that if the intent is for this to only be used in NDEBUG builds, we should wrap this function decl in #ifndef NDEBUG.

Mostly because I'd prefer the compiler tells me to reconsider my life choices over silently getting potentially-incorrect results. If it turns out that there's a case where the latter is preferable, it's a ~4 line diff. :)

I've run a big corpus of fuzz tests on this patch and it passed OK. So the patch seems good to me in terms of stability. Unfortunately I don't have time to give a proper code review on that. :(

It also makes my work on ICF tracking much easier as @fhahn has mentioned. So I'd also be happy to see it checked in. :)

rnk added a comment.Sep 14 2018, 2:42 PM

@gbiv suggested that I test this on a large codebase, so I went ahead and built the Chrome unit_tests target with this, and the validation checks passed.

I'm going to address his comments, and otherwise I think we should go forward with this.

vsk added a subscriber: vsk.Sep 15 2018, 12:48 AM

I don't see any fundamental flaws in the algorithm, it looks pretty robust. I have some nit comments, otherwise it LGTM. (Note that I'm maybe not the most qualified person to approve changes in such fundamental components as BasicBlock and Instruction, but this change seems profitable).

llvm/unittests/IR/BasicBlockTest.cpp
152

EXPECT_TRUE/FALSE(BB->isInstrOrderValid()) before and after that to make sure that it works at all?

mkazantsev added inline comments.Sep 16 2018, 8:31 PM
llvm/include/llvm/IR/BasicBlock.h
418

I would rather use some named constant instead of 1; it is widespread use across the code and may be confusing for a reader. Just a suggestion.

433

Agreed here.

llvm/lib/IR/Instruction.cpp
102

Maybe also makes sense to assert that Parent is not nullptr (i.e. instructions not detached).

llvm/unittests/IR/BasicBlockTest.cpp
186

Do you mind adding the similar check for Instruction->removeFromParent and Instruction->eraseFromParent?

rnk updated this revision to Diff 166056.Sep 18 2018, 6:17 PM
rnk marked 2 inline comments as done.
  • Use private bitfield for subclass data
  • Change cached order invalidation
  • Add tests for removal and erasure
rnk added inline comments.Sep 18 2018, 6:22 PM
llvm/include/llvm/IR/BasicBlock.h
418

I rewrote this to use a bitfield. I think it's easier to understand now.

433

So, isInstrOrderingValid would conflict with isInstrOrderValid, which gives you the cached answer.

I'll just make it void and assert internally in !NDEBUG builds.

llvm/unittests/IR/BasicBlockTest.cpp
186

Done, but they don't invalidate ordering, so I check for that instead.

rnk updated this revision to Diff 166059.Sep 18 2018, 6:23 PM
  • Fix one last use of setValueSubclassData
kuhar added a subscriber: kuhar.Sep 20 2018, 7:34 AM
kuhar added inline comments.Sep 21 2018, 1:08 PM
llvm/include/llvm/IR/BasicBlock.h
455

Does it make sense to disable this when EXPENSIVE_CHECKS are set?

llvm/lib/Analysis/OrderedInstructions.cpp
32

Is the separate function localDominates still needed? Seems like the body is trivial and could be inlined here?

llvm/lib/IR/BasicBlock.cpp
496

Is it possible to use noncontiguous indices? If the indices are spread apart, you should be able to perform most insertions without renumbering instructions.

llvm/lib/IR/SymbolTableListTraitsImpl.h
96

Isn't it enough to invalidate only the indices of instructions that follow the first inserted one?

rnk updated this revision to Diff 166712.Sep 24 2018, 10:08 AM
rnk marked an inline comment as done.
  • remove localDominates
rnk added a comment.Sep 24 2018, 10:08 AM

Any other thoughts on this?

llvm/include/llvm/IR/BasicBlock.h
455

I would assume EXPENSIVE_CHECKS implies !NDEBUG, and these checks are on when assertions enabled.

llvm/lib/Analysis/OrderedInstructions.cpp
32

Sure, fixed.

llvm/lib/IR/BasicBlock.cpp
496

I want to put that out of scope of the initial change. We can do all kinds of fancy tricks here to avoid invalidating the ordering, but it's hard to provide meaningfully better algorithmic guarantees. And, the more complex code will require more complex testing, and it might have bugs. I'd rather come back and implement a more complex algorithm once profiling shows that there is a bottleneck, especially since it's often easier to remove these bottlenecks by delaying insertion.

llvm/lib/IR/SymbolTableListTraitsImpl.h
96

Yes, but recording that info and leveraging it is complex, and it doesn't change the asymptotic performance. We'd need more than a bit in BasicBlock to do it.

kuhar added inline comments.Sep 24 2018, 10:20 AM
llvm/include/llvm/IR/BasicBlock.h
455

I think you can use expensive checks independently of build type

llvm/lib/IR/BasicBlock.cpp
496

Sure, makes perfect sense. I'm not very familiar with the IR part of llvm, but I'd prefer to see a comment that explains that in a relevant place if you believe that this is a good future direction.

llvm/lib/IR/SymbolTableListTraitsImpl.h
96

Makes sense.
How expensive is it to add new data members to BasicBlock? Do you know of any attempts to stick some data inside and measure how it affects compilation times?

llvm/include/llvm/IR/BasicBlock.h
455

Sounds to me like a bug in EXPENSIVE_CHECKS if it can be used without assertions.

Personally, every time I've written #if EXPENSIVE_CHECKS code, I've assumed that assertions are on. Glancing at a handful of users, the most common case I can find by far is:

#ifdef EXPENSIVE_CHECKS
assert(...); // Or call a function that just does a lot of asserts
#endif

Moreover, I can't think of a case where I'd say "build an LLVM that spends as much time as it wants verifying itself, but not with assertions."

sanjoy added a subscriber: sanjoy.Sep 24 2018, 11:49 AM

The downside is that Instruction grows from 56 bytes to 64 bytes, and I don't have a good way to measure what that costs in practice.

In the commit that removed the table you said "Removing the virtual table pointer from Value saves 1% of RSS when doing LTO of llc on Linux."; so I'd expect the regression to be in the same ballpark?

rnk added a comment.Sep 24 2018, 2:22 PM

The downside is that Instruction grows from 56 bytes to 64 bytes, and I don't have a good way to measure what that costs in practice.

In the commit that removed the table you said "Removing the virtual table pointer from Value saves 1% of RSS when doing LTO of llc on Linux."; so I'd expect the regression to be in the same ballpark?

I've started running a full LTO step of llc, but it's taking quite a while (>20min). I recall I picked LTO of llc last time because it completed in a few minutes, so I could repeat the measurement a few times to build confidence that it wasn't noise. Something may have changed. :( We'll see what comes back soon, I guess.

If I get no results, at least 1% RSS is an upper bound on increased LTO memory usage. I'm happy to trade that for 40% shorter compile time of the slowest TUs in clang.

If I get no results, at least 1% RSS is an upper bound on increased LTO memory usage. I'm happy to trade that for 40% shorter compile time of the slowest TUs in clang.

That I agree with. :)

Though, as I said on the llvm-dev thread, we may be able to get the best of both worlds by using something like the waymarking algorithm.

rnk added a comment.Sep 24 2018, 4:00 PM

If I get no results, at least 1% RSS is an upper bound on increased LTO memory usage. I'm happy to trade that for 40% shorter compile time of the slowest TUs in clang.

That I agree with. :)

I'm only patient enough to get two runs, before and after, and max RSS before the patch was 4816464 kb, and after, 4867836 kb. That's an increase of 1.06%. In absolute terms, ~50MB of wasted memory for Instruction positions feels high.

Though, as I said on the llvm-dev thread, we may be able to get the best of both worlds by using something like the waymarking algorithm.

That's probably possible, but I'm a little afraid to try to steal bits from ilist next/prev pointers. The complexity cost of stealing those bits from ilist is likely to be more than is really worth it, and it's not clear if we want to use linked lists over the long term. I would rather get right complexity first, and then make a follow-up change to try to reduce the memory usage. I don't want to let the perfect be the enemy of the good.

As I mentioned on llvmdev, I'm strongly opposed to this patch without a significant amount of analysis and diligence applied to it. I'm concerned about both memory and compile time impact. We don't just add random caches to the core IR to speed up particular clients. We care a lot about sizeof(Instruction) and its subclasses, and this adds bloat to all of them. Similarly, you don't seem interested in evaluating "tricks" to make this efficient in practice (sparse numbering), nor have you evaluated other implementation choices that will probably also work with less impact. Let's continue discussing this on llvm-dev.

To be clear, I'm not necessarily saying that this patch is the wrong thing to do, I would just like more diligence and experimentation with alternate approaches. Thanks :-)

nikic added a subscriber: nikic.Oct 26 2018, 2:08 PM
nikic added inline comments.
llvm/include/llvm/IR/BasicBlock.h
458

This should probably be BasicBlock::validateInstrOrdering() rather than Instruction::validateInstrOrdering().

rnk updated this revision to Diff 180334.Jan 4 2019, 3:37 PM

rebase

rnk updated this revision to Diff 181660.Jan 14 2019, 3:47 PM
  • rebase

Hi Reid,

After giving you a hard time about this a few months ago, I've come around to believing that this is the right thing to do. Certain classes of algorithms really do benefit from having a lexicographic ordering comparison that is fast, and I think that this general approach is the best way to go.

-Chris

I haven't reviewed the patch in full detail, but the predicate "comesBefore" should probably be something like "isBeforeInBlock".

rnk added a subscriber: chandlerc.Jan 15 2019, 11:59 AM

After giving you a hard time about this a few months ago, I've come around to believing that this is the right thing to do. Certain classes of algorithms really do benefit from having a lexicographic ordering comparison that is fast, and I think that this general approach is the best way to go.

Thanks! I'll bring it up on the dev list. I do know that @chandlerc wants to see a version of this that uses out of line numbers in a hash table, similar to the way we maintain value names out of line in a symbol table, but still done as part of the IR. I wanted to prototype that and compare.

In D51664#1358377, @rnk wrote:

After giving you a hard time about this a few months ago, I've come around to believing that this is the right thing to do. Certain classes of algorithms really do benefit from having a lexicographic ordering comparison that is fast, and I think that this general approach is the best way to go.

Thanks! I'll bring it up on the dev list. I do know that @chandlerc wants to see a version of this that uses out of line numbers in a hash table, similar to the way we maintain value names out of line in a symbol table, but still done as part of the IR. I wanted to prototype that and compare.

To be super clear, I definitely want *some* solution here. I'm completely on board with this being a real problem and we should solve it.

And if even getting reasonable data proves to be tons of work, I think its fine to say that and move on. In our conversation I was worried we weren't even checking to see if a side table was an effective strategy. I don't have any reason to believe this is going to be so important that it is worth *tons* of effort to validate both alternatives, it just seems useful to try a quick prototype.

rnk updated this revision to Diff 183195.Jan 23 2019, 3:06 PM
  • rebase over r351992

Hi Reid,

Thanks for doing this.

Instead of changing the instruction and basic block classes, could we instead provide an enhanced version of ilist that does that?
E.g., something based on https://scholar.google.com/scholar?cluster=5225046542682967685&hl=en&as_sdt=0,5 (and we can add laziness on top if we want)

The rationale is that I wanted to do something similar in the Machine representation as well (and rework the SlotIndexes in the process) and was thinking that an improved version of ilist would do that for us while allowing to share the code in the middle-end and backend.

Bottom line, I was hoping this work would solve the dominance problem instead of basic block for the backend as well :).

Cheers,
-Quentin

Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2019, 4:15 PM
rnk updated this revision to Diff 188995.Mar 1 2019, 4:11 PM
  • rebase

For DSE, it seems quite straight-forward to preserve OrderedBB (we just remove instructions or replace existing ones with another one). I've added D59789 sketching that. This could be a stop-gap until this patch gets through.

vsk added a comment.Feb 13 2020, 5:26 PM

Is there still interest in pushing this forward? I like the approach taken here. We also just got a report of clang spending 21 minutes inside of DeadStoreElimination, and the lion's share of that time was within OrderedBasicBlock::comesBefore..

rnk added a comment.Feb 14 2020, 12:56 PM

I still think we should do this. I think @fhahn is reimplementing DSE using MemorySSA, so presumably the DSE cases won't be an issue soon, but setting all that aside, I still think it would be nice if we could say once and for all that Instruction::dominates(Instruction*) is amortized O(1) if you haven't modified the instruction stream. Otherwise this kind of pathology will pop up again. Putting the ordering on the IR saves clients from ferrying around and maintaining OrderedInstructions / OrderedBasicBlock data structures, and that seems like a win.

In D51664#1877029, @rnk wrote:

I still think we should do this. I think @fhahn is reimplementing DSE using MemorySSA, so presumably the DSE cases won't be an issue soon, but setting all that aside, I still think it would be nice if we could say once and for all that Instruction::dominates(Instruction*) is amortized O(1) if you haven't modified the instruction stream. Otherwise this kind of pathology will pop up again. Putting the ordering on the IR saves clients from ferrying around and maintaining OrderedInstructions / OrderedBasicBlock data structures, and that seems like a win.

I'm strongly in favour of this too. Is there any reviewer you're specifically waiting for? Or should the RFC be bumped to ensure there's consensus before you spend the time rebasing?

In D51664#1877032, @rnk wrote:

Yes, we've been using this in MLIR for at least a year now. There have been quite a few cases in MLIR, unrelated to DSE, that have really benefited from O(1) dominance checks readily available. There has only been one situation that I can recall where recomputing the block order showed up on a profile, but adding in some basic striding to the order assignment completely fixed it.

rnk edited the summary of this revision. (Show Details)Feb 14 2020, 1:36 PM
rnk updated this revision to Diff 244758.Feb 14 2020, 1:37 PM
  • rebase
rnk marked an inline comment as done.Feb 14 2020, 1:50 PM

I reposted the RFC to llvm-dev, and I think this time we'll reach a different consensus.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1265–1267

I don't have an equivalent API for this. I don't have a good way to profile to check if that is or is not the case.

lattner accepted this revision.Feb 15 2020, 11:58 AM

I haven't carefully reviewed the patch, but I think this is the right thing to do architecturally for the compiler. Thank you for driving this Reid. I'd appreciate it if someone could scrutinize the patch though!

This revision is now accepted and ready to land.Feb 15 2020, 11:58 AM
rnk added a comment.Feb 18 2020, 2:37 PM

Thanks! I was planning to wait a bit more after restarting the RFC to land this, but I think we have a lot of support here, and I don't have a lot of reasons to wait. I'm going to go ahead and push it, and if there are problems or objections, we can revert.

This revision was automatically updated to reflect the committed changes.

WOW! Thanks!

+1

I think that this is a really nice step forward and will make a lot of things easier in the future.

llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp