Page MenuHomePhabricator

[IR] Lazily number instructions for local dominance queries
Needs ReviewPublic

Authored by rnk on Sep 4 2018, 4:45 PM.

Details

Summary

Essentially, fold OrderedBasicBlock into BasicBlock, and make it
auto-invalidate the instruction ordering when new instructions are
added. Notably, we don't need to invalidate it when removing
instructions, which is helpful when a pass mostly delete dead
instructions rather than transforming them.

This dramatically speeds up clang compiling clang with debug info.
Compiling SemaChecking.cpp on my computer previously took 121s, and now
it takes 74s, for a 63% speedup.

The downside is that Instruction grows from 56 bytes to 64 bytes, and I
don't have a good way to measure what that costs in practice. As the one
who removed the vtable from Value, I will say that this is how I would
like to spend the 8 bytes that I saved a year ago in r303362.

The resulting LLVM code is substantially simpler and automatically
handles invalidation, which makes me think that this is the right speed
and size tradeoff. There's more low-hanging fruit in MemorySSA and DSE,
which maintain their own instruction orderings today.

The code isn't completely polished, I just wanted to post some numbers
and show people what this looks like. The important change is in
SymbolTableTraitsImpl.h, where the numbering is invalidated. Everything
else should be straightforward.

We could implement a fancier re-numbering scheme, but I think we should
wait and profile and see if we really need that complexity. If we do,
the APIs will stay the same anyway, so this is a good first step.

Fixes PR38829.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
rnk added inline comments.Sep 4 2018, 5:05 PM
llvm/include/llvm/IR/SymbolTableListTraits.h
61–64 ↗(On Diff #163948)

Oops, I'll revert this.

rnk updated this revision to Diff 163955.Sep 4 2018, 5:17 PM
  • remove stale changes

I have zero context on the various ways we can/do add instructions to a BB, so I can't immediately be helpful with the tricky part of this review. :) I've no complaints about the general direction of this patch, though.

When the code's more polished, would it be reasonable to also stick a verifier of this somewhere (that we maybe just run per-query or per-pass on EXPENSIVE_CHECKS builds, or something)?

rnk added a comment.Sep 5 2018, 9:45 AM

When the code's more polished, would it be reasonable to also stick a verifier of this somewhere (that we maybe just run per-query or per-pass on EXPENSIVE_CHECKS builds, or something)?

We could do it if expensive checks are enabled, but I think checks in NDEBUG that don't affect the asymptotic performance are the most useful. Maybe before inserting an instruction, if the numbering was previously considered valid, we can assert that it actually was? This way, we do linear work the first time we create the ordering, and then do linear work when we invalidate it. We could also check when destroying the BB, since that also does linear work.

rnk updated this revision to Diff 164089.Sep 5 2018, 11:32 AM
  • add instruction order validation
  • invalidate order when splicing instructions
  • call transferNodesFromList for same-list transfers
dexonsmith added a reviewer: ahatanak.EditedSep 5 2018, 5:42 PM
dexonsmith added subscribers: ahatanak, bogner.

This is awesome to see!

@ahatanak, you had a prototype of something similar a couple of years ago. Can you take a look?

The downside is that Instruction grows from 56 bytes to 64 bytes,

What's the growth for 32-bit pointers?

and I
don't have a good way to measure what that costs in practice. As the one
who removed the vtable from Value, I will say that this is how I would
like to spend the 8 bytes that I saved a year ago in r303362.

The two memory-sensitive cases I'm aware of are embedded compilers and LTO (with debug info).

It would be good to get someone to comment on the former.

On the latter: we still support users of -flto=full, so it would be interesting to know how big a regression in peak memory this would be. In the past, I've gotten numbers by running a build with -flto, using -save-temps when invoking the ld64, and then running llc on the optimized (but not CodeGen'ed) bitcode file (since the peak is in CodeGen). If you're really curious, that still seems like a decent way to look at the impact. But I doubt this will even register a difference; Instructions are usually malloc-allocated, and malloc is often 16-byte aligned.

dmgreen added a subscriber: dmgreen.Sep 8 2018, 6:43 AM
rnk added a comment.Sep 12 2018, 2:04 PM

Any opinions on this? I'm eager to get it in so I can release it and get some faster builds, but I know it's a core data structure change.

FWIW, it seems like OrderedBB invalidation is causing bugs at least in LoopSafetyInfo (D50377) which @mkazantsev is currently working on fixing. There might be other places that get this wrong too, so having automatic invalidation seems like another good plus on top of the speedups.

This looks good to me, but like said, I don't have sufficient context to stamp it with great confidence.

Thanks again!

llvm/include/llvm/IR/BasicBlock.h
437

nit: I'd expect a function named this to assert(). Can we rename it to something that sounds more boolean-y like isInstrOrderingValid() (?) and/or add LLVM_NODISCARD?

437

nit * 2: I don't know what precedent is set elsewhere, but personally, I feel that if the intent is for this to only be used in NDEBUG builds, we should wrap this function decl in #ifndef NDEBUG.

Mostly because I'd prefer the compiler tells me to reconsider my life choices over silently getting potentially-incorrect results. If it turns out that there's a case where the latter is preferable, it's a ~4 line diff. :)

I've run a big corpus of fuzz tests on this patch and it passed OK. So the patch seems good to me in terms of stability. Unfortunately I don't have time to give a proper code review on that. :(

It also makes my work on ICF tracking much easier as @fhahn has mentioned. So I'd also be happy to see it checked in. :)

rnk added a comment.Sep 14 2018, 2:42 PM

@gbiv suggested that I test this on a large codebase, so I went ahead and built the Chrome unit_tests target with this, and the validation checks passed.

I'm going to address his comments, and otherwise I think we should go forward with this.

vsk added a subscriber: vsk.Sep 15 2018, 12:48 AM

I don't see any fundamental flaws in the algorithm, it looks pretty robust. I have some nit comments, otherwise it LGTM. (Note that I'm maybe not the most qualified person to approve changes in such fundamental components as BasicBlock and Instruction, but this change seems profitable).

llvm/include/llvm/IR/BasicBlock.h
422

I would rather use some named constant instead of 1; it is widespread use across the code and may be confusing for a reader. Just a suggestion.

437

Agreed here.

llvm/lib/IR/Instruction.cpp
101

Maybe also makes sense to assert that Parent is not nullptr (i.e. instructions not detached).

llvm/unittests/IR/BasicBlockTest.cpp
150

EXPECT_TRUE/FALSE(BB->isInstrOrderValid()) before and after that to make sure that it works at all?

184

Do you mind adding the similar check for Instruction->removeFromParent and Instruction->eraseFromParent?

rnk updated this revision to Diff 166056.Sep 18 2018, 6:17 PM
rnk marked 2 inline comments as done.
  • Use private bitfield for subclass data
  • Change cached order invalidation
  • Add tests for removal and erasure
rnk added inline comments.Sep 18 2018, 6:22 PM
llvm/include/llvm/IR/BasicBlock.h
422

I rewrote this to use a bitfield. I think it's easier to understand now.

437

So, isInstrOrderingValid would conflict with isInstrOrderValid, which gives you the cached answer.

I'll just make it void and assert internally in !NDEBUG builds.

llvm/unittests/IR/BasicBlockTest.cpp
184

Done, but they don't invalidate ordering, so I check for that instead.

rnk updated this revision to Diff 166059.Sep 18 2018, 6:23 PM
  • Fix one last use of setValueSubclassData
kuhar added a subscriber: kuhar.Sep 20 2018, 7:34 AM
kuhar added inline comments.Sep 21 2018, 1:08 PM
llvm/include/llvm/IR/BasicBlock.h
491

Does it make sense to disable this when EXPENSIVE_CHECKS are set?

llvm/lib/Analysis/OrderedInstructions.cpp
31

Is the separate function localDominates still needed? Seems like the body is trivial and could be inlined here?

llvm/lib/IR/BasicBlock.cpp
504

Is it possible to use noncontiguous indices? If the indices are spread apart, you should be able to perform most insertions without renumbering instructions.

llvm/lib/IR/SymbolTableListTraitsImpl.h
100

Isn't it enough to invalidate only the indices of instructions that follow the first inserted one?

rnk updated this revision to Diff 166712.Sep 24 2018, 10:08 AM
rnk marked an inline comment as done.
  • remove localDominates
rnk added a comment.Sep 24 2018, 10:08 AM

Any other thoughts on this?

llvm/include/llvm/IR/BasicBlock.h
491

I would assume EXPENSIVE_CHECKS implies !NDEBUG, and these checks are on when assertions enabled.

llvm/lib/Analysis/OrderedInstructions.cpp
31

Sure, fixed.

llvm/lib/IR/BasicBlock.cpp
504

I want to put that out of scope of the initial change. We can do all kinds of fancy tricks here to avoid invalidating the ordering, but it's hard to provide meaningfully better algorithmic guarantees. And, the more complex code will require more complex testing, and it might have bugs. I'd rather come back and implement a more complex algorithm once profiling shows that there is a bottleneck, especially since it's often easier to remove these bottlenecks by delaying insertion.

llvm/lib/IR/SymbolTableListTraitsImpl.h
100

Yes, but recording that info and leveraging it is complex, and it doesn't change the asymptotic performance. We'd need more than a bit in BasicBlock to do it.

kuhar added inline comments.Sep 24 2018, 10:20 AM
llvm/include/llvm/IR/BasicBlock.h
491

I think you can use expensive checks independently of build type

llvm/lib/IR/BasicBlock.cpp
504

Sure, makes perfect sense. I'm not very familiar with the IR part of llvm, but I'd prefer to see a comment that explains that in a relevant place if you believe that this is a good future direction.

llvm/lib/IR/SymbolTableListTraitsImpl.h
100

Makes sense.
How expensive is it to add new data members to BasicBlock? Do you know of any attempts to stick some data inside and measure how it affects compilation times?

llvm/include/llvm/IR/BasicBlock.h
491

Sounds to me like a bug in EXPENSIVE_CHECKS if it can be used without assertions.

Personally, every time I've written #if EXPENSIVE_CHECKS code, I've assumed that assertions are on. Glancing at a handful of users, the most common case I can find by far is:

#ifdef EXPENSIVE_CHECKS
assert(...); // Or call a function that just does a lot of asserts
#endif

Moreover, I can't think of a case where I'd say "build an LLVM that spends as much time as it wants verifying itself, but not with assertions."

sanjoy added a subscriber: sanjoy.Sep 24 2018, 11:49 AM

The downside is that Instruction grows from 56 bytes to 64 bytes, and I don't have a good way to measure what that costs in practice.

In the commit that removed the table you said "Removing the virtual table pointer from Value saves 1% of RSS when doing LTO of llc on Linux."; so I'd expect the regression to be in the same ballpark?

rnk added a comment.Sep 24 2018, 2:22 PM

The downside is that Instruction grows from 56 bytes to 64 bytes, and I don't have a good way to measure what that costs in practice.

In the commit that removed the table you said "Removing the virtual table pointer from Value saves 1% of RSS when doing LTO of llc on Linux."; so I'd expect the regression to be in the same ballpark?

I've started running a full LTO step of llc, but it's taking quite a while (>20min). I recall I picked LTO of llc last time because it completed in a few minutes, so I could repeat the measurement a few times to build confidence that it wasn't noise. Something may have changed. :( We'll see what comes back soon, I guess.

If I get no results, at least 1% RSS is an upper bound on increased LTO memory usage. I'm happy to trade that for 40% shorter compile time of the slowest TUs in clang.

If I get no results, at least 1% RSS is an upper bound on increased LTO memory usage. I'm happy to trade that for 40% shorter compile time of the slowest TUs in clang.

That I agree with. :)

Though, as I said on the llvm-dev thread, we may be able to get the best of both worlds by using something like the waymarking algorithm.

rnk added a comment.Sep 24 2018, 4:00 PM

If I get no results, at least 1% RSS is an upper bound on increased LTO memory usage. I'm happy to trade that for 40% shorter compile time of the slowest TUs in clang.

That I agree with. :)

I'm only patient enough to get two runs, before and after, and max RSS before the patch was 4816464 kb, and after, 4867836 kb. That's an increase of 1.06%. In absolute terms, ~50MB of wasted memory for Instruction positions feels high.

Though, as I said on the llvm-dev thread, we may be able to get the best of both worlds by using something like the waymarking algorithm.

That's probably possible, but I'm a little afraid to try to steal bits from ilist next/prev pointers. The complexity cost of stealing those bits from ilist is likely to be more than is really worth it, and it's not clear if we want to use linked lists over the long term. I would rather get right complexity first, and then make a follow-up change to try to reduce the memory usage. I don't want to let the perfect be the enemy of the good.

As I mentioned on llvmdev, I'm strongly opposed to this patch without a significant amount of analysis and diligence applied to it. I'm concerned about both memory and compile time impact. We don't just add random caches to the core IR to speed up particular clients. We care a lot about sizeof(Instruction) and its subclasses, and this adds bloat to all of them. Similarly, you don't seem interested in evaluating "tricks" to make this efficient in practice (sparse numbering), nor have you evaluated other implementation choices that will probably also work with less impact. Let's continue discussing this on llvm-dev.

To be clear, I'm not necessarily saying that this patch is the wrong thing to do, I would just like more diligence and experimentation with alternate approaches. Thanks :-)

nikic added a subscriber: nikic.Oct 26 2018, 2:08 PM
nikic added inline comments.
llvm/include/llvm/IR/BasicBlock.h
494

This should probably be BasicBlock::validateInstrOrdering() rather than Instruction::validateInstrOrdering().

rnk updated this revision to Diff 180334.Jan 4 2019, 3:37 PM

rebase

rnk updated this revision to Diff 181660.Jan 14 2019, 3:47 PM
  • rebase

Hi Reid,

After giving you a hard time about this a few months ago, I've come around to believing that this is the right thing to do. Certain classes of algorithms really do benefit from having a lexicographic ordering comparison that is fast, and I think that this general approach is the best way to go.

-Chris

I haven't reviewed the patch in full detail, but the predicate "comesBefore" should probably be something like "isBeforeInBlock".

rnk added a subscriber: chandlerc.Jan 15 2019, 11:59 AM

After giving you a hard time about this a few months ago, I've come around to believing that this is the right thing to do. Certain classes of algorithms really do benefit from having a lexicographic ordering comparison that is fast, and I think that this general approach is the best way to go.

Thanks! I'll bring it up on the dev list. I do know that @chandlerc wants to see a version of this that uses out of line numbers in a hash table, similar to the way we maintain value names out of line in a symbol table, but still done as part of the IR. I wanted to prototype that and compare.

In D51664#1358377, @rnk wrote:

After giving you a hard time about this a few months ago, I've come around to believing that this is the right thing to do. Certain classes of algorithms really do benefit from having a lexicographic ordering comparison that is fast, and I think that this general approach is the best way to go.

Thanks! I'll bring it up on the dev list. I do know that @chandlerc wants to see a version of this that uses out of line numbers in a hash table, similar to the way we maintain value names out of line in a symbol table, but still done as part of the IR. I wanted to prototype that and compare.

To be super clear, I definitely want *some* solution here. I'm completely on board with this being a real problem and we should solve it.

And if even getting reasonable data proves to be tons of work, I think its fine to say that and move on. In our conversation I was worried we weren't even checking to see if a side table was an effective strategy. I don't have any reason to believe this is going to be so important that it is worth *tons* of effort to validate both alternatives, it just seems useful to try a quick prototype.

rnk updated this revision to Diff 183195.Jan 23 2019, 3:06 PM
  • rebase over r351992

Hi Reid,

Thanks for doing this.

Instead of changing the instruction and basic block classes, could we instead provide an enhanced version of ilist that does that?
E.g., something based on https://scholar.google.com/scholar?cluster=5225046542682967685&hl=en&as_sdt=0,5 (and we can add laziness on top if we want)

The rationale is that I wanted to do something similar in the Machine representation as well (and rework the SlotIndexes in the process) and was thinking that an improved version of ilist would do that for us while allowing to share the code in the middle-end and backend.

Bottom line, I was hoping this work would solve the dominance problem instead of basic block for the backend as well :).

Cheers,
-Quentin

Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2019, 4:15 PM
rnk updated this revision to Diff 188995.Mar 1 2019, 4:11 PM
  • rebase

For DSE, it seems quite straight-forward to preserve OrderedBB (we just remove instructions or replace existing ones with another one). I've added D59789 sketching that. This could be a stop-gap until this patch gets through.