Page MenuHomePhabricator

[IR] Lazily number instructions for local dominance queries
ClosedPublic

Authored by rnk on Sep 4 2018, 4:45 PM.

Details

Summary

Essentially, fold OrderedBasicBlock into BasicBlock, and make it
auto-invalidate the instruction ordering when new instructions are
added. Notably, we don't need to invalidate it when removing
instructions, which is helpful when a pass mostly delete dead
instructions rather than transforming them.

The downside is that Instruction grows from 56 bytes to 64 bytes. The
resulting LLVM code is substantially simpler and automatically handles
invalidation, which makes me think that this is the right speed and size
tradeoff. There's more low-hanging fruit in MemorySSA and DSE, which
maintain their own instruction orderings today.

The important change is in SymbolTableTraitsImpl.h, where the numbering
is invalidated. Everything else should be straightforward.

We probably want to implement a fancier re-numbering scheme so that
local updates don't invalidate the ordering, but I plan for that to be
future work, maybe for someone else.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
mkazantsev added inline comments.Sep 16 2018, 8:31 PM
llvm/include/llvm/IR/BasicBlock.h
444

I would rather use some named constant instead of 1; it is widespread use across the code and may be confusing for a reader. Just a suggestion.

459

Agreed here.

llvm/lib/IR/Instruction.cpp
101

Maybe also makes sense to assert that Parent is not nullptr (i.e. instructions not detached).

llvm/unittests/IR/BasicBlockTest.cpp
153

EXPECT_TRUE/FALSE(BB->isInstrOrderValid()) before and after that to make sure that it works at all?

187

Do you mind adding the similar check for Instruction->removeFromParent and Instruction->eraseFromParent?

rnk updated this revision to Diff 166056.Sep 18 2018, 6:17 PM
rnk marked 2 inline comments as done.
  • Use private bitfield for subclass data
  • Change cached order invalidation
  • Add tests for removal and erasure
rnk added inline comments.Sep 18 2018, 6:22 PM
llvm/include/llvm/IR/BasicBlock.h
444

I rewrote this to use a bitfield. I think it's easier to understand now.

459

So, isInstrOrderingValid would conflict with isInstrOrderValid, which gives you the cached answer.

I'll just make it void and assert internally in !NDEBUG builds.

llvm/unittests/IR/BasicBlockTest.cpp
187

Done, but they don't invalidate ordering, so I check for that instead.

rnk updated this revision to Diff 166059.Sep 18 2018, 6:23 PM
  • Fix one last use of setValueSubclassData
kuhar added a subscriber: kuhar.Sep 20 2018, 7:34 AM
kuhar added inline comments.Sep 21 2018, 1:08 PM
llvm/include/llvm/IR/BasicBlock.h
513

Does it make sense to disable this when EXPENSIVE_CHECKS are set?

llvm/lib/Analysis/OrderedInstructions.cpp
31

Is the separate function localDominates still needed? Seems like the body is trivial and could be inlined here?

llvm/lib/IR/BasicBlock.cpp
519

Is it possible to use noncontiguous indices? If the indices are spread apart, you should be able to perform most insertions without renumbering instructions.

llvm/lib/IR/SymbolTableListTraitsImpl.h
100

Isn't it enough to invalidate only the indices of instructions that follow the first inserted one?

rnk updated this revision to Diff 166712.Sep 24 2018, 10:08 AM
rnk marked an inline comment as done.
  • remove localDominates
rnk added a comment.Sep 24 2018, 10:08 AM

Any other thoughts on this?

llvm/include/llvm/IR/BasicBlock.h
513

I would assume EXPENSIVE_CHECKS implies !NDEBUG, and these checks are on when assertions enabled.

llvm/lib/Analysis/OrderedInstructions.cpp
31

Sure, fixed.

llvm/lib/IR/BasicBlock.cpp
519

I want to put that out of scope of the initial change. We can do all kinds of fancy tricks here to avoid invalidating the ordering, but it's hard to provide meaningfully better algorithmic guarantees. And, the more complex code will require more complex testing, and it might have bugs. I'd rather come back and implement a more complex algorithm once profiling shows that there is a bottleneck, especially since it's often easier to remove these bottlenecks by delaying insertion.

llvm/lib/IR/SymbolTableListTraitsImpl.h
100

Yes, but recording that info and leveraging it is complex, and it doesn't change the asymptotic performance. We'd need more than a bit in BasicBlock to do it.

kuhar added inline comments.Sep 24 2018, 10:20 AM
llvm/include/llvm/IR/BasicBlock.h
513

I think you can use expensive checks independently of build type

llvm/lib/IR/BasicBlock.cpp
519

Sure, makes perfect sense. I'm not very familiar with the IR part of llvm, but I'd prefer to see a comment that explains that in a relevant place if you believe that this is a good future direction.

llvm/lib/IR/SymbolTableListTraitsImpl.h
100

Makes sense.
How expensive is it to add new data members to BasicBlock? Do you know of any attempts to stick some data inside and measure how it affects compilation times?

llvm/include/llvm/IR/BasicBlock.h
513

Sounds to me like a bug in EXPENSIVE_CHECKS if it can be used without assertions.

Personally, every time I've written #if EXPENSIVE_CHECKS code, I've assumed that assertions are on. Glancing at a handful of users, the most common case I can find by far is:

#ifdef EXPENSIVE_CHECKS
assert(...); // Or call a function that just does a lot of asserts
#endif

Moreover, I can't think of a case where I'd say "build an LLVM that spends as much time as it wants verifying itself, but not with assertions."

sanjoy added a subscriber: sanjoy.Sep 24 2018, 11:49 AM

The downside is that Instruction grows from 56 bytes to 64 bytes, and I don't have a good way to measure what that costs in practice.

In the commit that removed the table you said "Removing the virtual table pointer from Value saves 1% of RSS when doing LTO of llc on Linux."; so I'd expect the regression to be in the same ballpark?

rnk added a comment.Sep 24 2018, 2:22 PM

The downside is that Instruction grows from 56 bytes to 64 bytes, and I don't have a good way to measure what that costs in practice.

In the commit that removed the table you said "Removing the virtual table pointer from Value saves 1% of RSS when doing LTO of llc on Linux."; so I'd expect the regression to be in the same ballpark?

I've started running a full LTO step of llc, but it's taking quite a while (>20min). I recall I picked LTO of llc last time because it completed in a few minutes, so I could repeat the measurement a few times to build confidence that it wasn't noise. Something may have changed. :( We'll see what comes back soon, I guess.

If I get no results, at least 1% RSS is an upper bound on increased LTO memory usage. I'm happy to trade that for 40% shorter compile time of the slowest TUs in clang.

If I get no results, at least 1% RSS is an upper bound on increased LTO memory usage. I'm happy to trade that for 40% shorter compile time of the slowest TUs in clang.

That I agree with. :)

Though, as I said on the llvm-dev thread, we may be able to get the best of both worlds by using something like the waymarking algorithm.

rnk added a comment.Sep 24 2018, 4:00 PM

If I get no results, at least 1% RSS is an upper bound on increased LTO memory usage. I'm happy to trade that for 40% shorter compile time of the slowest TUs in clang.

That I agree with. :)

I'm only patient enough to get two runs, before and after, and max RSS before the patch was 4816464 kb, and after, 4867836 kb. That's an increase of 1.06%. In absolute terms, ~50MB of wasted memory for Instruction positions feels high.

Though, as I said on the llvm-dev thread, we may be able to get the best of both worlds by using something like the waymarking algorithm.

That's probably possible, but I'm a little afraid to try to steal bits from ilist next/prev pointers. The complexity cost of stealing those bits from ilist is likely to be more than is really worth it, and it's not clear if we want to use linked lists over the long term. I would rather get right complexity first, and then make a follow-up change to try to reduce the memory usage. I don't want to let the perfect be the enemy of the good.

As I mentioned on llvmdev, I'm strongly opposed to this patch without a significant amount of analysis and diligence applied to it. I'm concerned about both memory and compile time impact. We don't just add random caches to the core IR to speed up particular clients. We care a lot about sizeof(Instruction) and its subclasses, and this adds bloat to all of them. Similarly, you don't seem interested in evaluating "tricks" to make this efficient in practice (sparse numbering), nor have you evaluated other implementation choices that will probably also work with less impact. Let's continue discussing this on llvm-dev.

To be clear, I'm not necessarily saying that this patch is the wrong thing to do, I would just like more diligence and experimentation with alternate approaches. Thanks :-)

nikic added a subscriber: nikic.Oct 26 2018, 2:08 PM
nikic added inline comments.
llvm/include/llvm/IR/BasicBlock.h
516

This should probably be BasicBlock::validateInstrOrdering() rather than Instruction::validateInstrOrdering().

rnk updated this revision to Diff 180334.Jan 4 2019, 3:37 PM

rebase

rnk updated this revision to Diff 181660.Jan 14 2019, 3:47 PM
  • rebase

Hi Reid,

After giving you a hard time about this a few months ago, I've come around to believing that this is the right thing to do. Certain classes of algorithms really do benefit from having a lexicographic ordering comparison that is fast, and I think that this general approach is the best way to go.

-Chris

I haven't reviewed the patch in full detail, but the predicate "comesBefore" should probably be something like "isBeforeInBlock".

rnk added a subscriber: chandlerc.Jan 15 2019, 11:59 AM

After giving you a hard time about this a few months ago, I've come around to believing that this is the right thing to do. Certain classes of algorithms really do benefit from having a lexicographic ordering comparison that is fast, and I think that this general approach is the best way to go.

Thanks! I'll bring it up on the dev list. I do know that @chandlerc wants to see a version of this that uses out of line numbers in a hash table, similar to the way we maintain value names out of line in a symbol table, but still done as part of the IR. I wanted to prototype that and compare.

In D51664#1358377, @rnk wrote:

After giving you a hard time about this a few months ago, I've come around to believing that this is the right thing to do. Certain classes of algorithms really do benefit from having a lexicographic ordering comparison that is fast, and I think that this general approach is the best way to go.

Thanks! I'll bring it up on the dev list. I do know that @chandlerc wants to see a version of this that uses out of line numbers in a hash table, similar to the way we maintain value names out of line in a symbol table, but still done as part of the IR. I wanted to prototype that and compare.

To be super clear, I definitely want *some* solution here. I'm completely on board with this being a real problem and we should solve it.

And if even getting reasonable data proves to be tons of work, I think its fine to say that and move on. In our conversation I was worried we weren't even checking to see if a side table was an effective strategy. I don't have any reason to believe this is going to be so important that it is worth *tons* of effort to validate both alternatives, it just seems useful to try a quick prototype.

rnk updated this revision to Diff 183195.Jan 23 2019, 3:06 PM
  • rebase over r351992

Hi Reid,

Thanks for doing this.

Instead of changing the instruction and basic block classes, could we instead provide an enhanced version of ilist that does that?
E.g., something based on https://scholar.google.com/scholar?cluster=5225046542682967685&hl=en&as_sdt=0,5 (and we can add laziness on top if we want)

The rationale is that I wanted to do something similar in the Machine representation as well (and rework the SlotIndexes in the process) and was thinking that an improved version of ilist would do that for us while allowing to share the code in the middle-end and backend.

Bottom line, I was hoping this work would solve the dominance problem instead of basic block for the backend as well :).

Cheers,
-Quentin

Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2019, 4:15 PM
rnk updated this revision to Diff 188995.Mar 1 2019, 4:11 PM
  • rebase

For DSE, it seems quite straight-forward to preserve OrderedBB (we just remove instructions or replace existing ones with another one). I've added D59789 sketching that. This could be a stop-gap until this patch gets through.

vsk added a comment.Feb 13 2020, 5:26 PM

Is there still interest in pushing this forward? I like the approach taken here. We also just got a report of clang spending 21 minutes inside of DeadStoreElimination, and the lion's share of that time was within OrderedBasicBlock::comesBefore..

rnk added a comment.Feb 14 2020, 12:56 PM

I still think we should do this. I think @fhahn is reimplementing DSE using MemorySSA, so presumably the DSE cases won't be an issue soon, but setting all that aside, I still think it would be nice if we could say once and for all that Instruction::dominates(Instruction*) is amortized O(1) if you haven't modified the instruction stream. Otherwise this kind of pathology will pop up again. Putting the ordering on the IR saves clients from ferrying around and maintaining OrderedInstructions / OrderedBasicBlock data structures, and that seems like a win.

In D51664#1877029, @rnk wrote:

I still think we should do this. I think @fhahn is reimplementing DSE using MemorySSA, so presumably the DSE cases won't be an issue soon, but setting all that aside, I still think it would be nice if we could say once and for all that Instruction::dominates(Instruction*) is amortized O(1) if you haven't modified the instruction stream. Otherwise this kind of pathology will pop up again. Putting the ordering on the IR saves clients from ferrying around and maintaining OrderedInstructions / OrderedBasicBlock data structures, and that seems like a win.

I'm strongly in favour of this too. Is there any reviewer you're specifically waiting for? Or should the RFC be bumped to ensure there's consensus before you spend the time rebasing?

In D51664#1877032, @rnk wrote:

Yes, we've been using this in MLIR for at least a year now. There have been quite a few cases in MLIR, unrelated to DSE, that have really benefited from O(1) dominance checks readily available. There has only been one situation that I can recall where recomputing the block order showed up on a profile, but adding in some basic striding to the order assignment completely fixed it.

rnk edited the summary of this revision. (Show Details)Feb 14 2020, 1:36 PM
rnk updated this revision to Diff 244758.Feb 14 2020, 1:37 PM
  • rebase
rnk marked an inline comment as done.Feb 14 2020, 1:50 PM

I reposted the RFC to llvm-dev, and I think this time we'll reach a different consensus.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1311

I don't have an equivalent API for this. I don't have a good way to profile to check if that is or is not the case.

lattner accepted this revision.Feb 15 2020, 11:58 AM

I haven't carefully reviewed the patch, but I think this is the right thing to do architecturally for the compiler. Thank you for driving this Reid. I'd appreciate it if someone could scrutinize the patch though!

This revision is now accepted and ready to land.Feb 15 2020, 11:58 AM
rnk added a comment.Feb 18 2020, 2:37 PM

Thanks! I was planning to wait a bit more after restarting the RFC to land this, but I think we have a lot of support here, and I don't have a lot of reasons to wait. I'm going to go ahead and push it, and if there are problems or objections, we can revert.

This revision was automatically updated to reflect the committed changes.

WOW! Thanks!

+1

I think that this is a really nice step forward and will make a lot of things easier in the future.