This is an archive of the discontinued LLVM Phabricator instance.

[Dominators] Convert existing passes and utils to use the DomTreeUpdater class
ClosedPublic

Authored by NutshellySima on Jul 5 2018, 6:34 AM.

Details

Summary

This patch is the second in a series of patches related to the RFC - A new dominator tree updater for LLVM.

It converts passes (e.g. adce/jump-threading) and various functions which currently accept DDT in local.cpp and BasicBlockUtils.cpp to use the new DomTreeUpdater class.
These converted functions in utils can accept DomTreeUpdater with either UpdateStrategy and can deal with both DT and PDT held by the DomTreeUpdater.

Diff Detail

Event Timeline

NutshellySima created this revision.Jul 5 2018, 6:34 AM

Hello @NutshellySima , my comments are inline below. These changes are a good real-world usage of the new DTU class and should help us find any bugs or deficiencies. I'm curious to know what kind of testing you've done on this code: ninja check, test-suite, and/or test-suite for compile time?

lib/Transforms/Scalar/JumpThreading.cpp
421–422

It's not obvious here the reasoning behind this statement. I recommend either changing this to DTU->flush() or adding a comment stating the call will force all updates to the internal DT/PDT trees.

998

So much cleaner!

lib/Transforms/Scalar/RewriteStatepointsForGC.cpp
2540

Same as other flush() comment.

lib/Transforms/Utils/BasicBlockUtils.cpp
160

It might be a good idea to add a comment here why you're only adding these updates for UpdateStrategy::Lazy. Even I can't remember why this is the case, and I did this change. :)

(BTW, this routine is one of the most difficult and I struggled with it for weeks. Keep an eye on changes in here.)

198

Why did you move the LI and MemDep calls up?

205

I'm seeing several statements of this long form. Maybe add query methods like isUpdateEager() and isUpdateLazy() to simplify long comparison statements like these?

222

Please add {} here. Even though it's only one statement the if section uses them and it's a long block. Putting a small amount of distance between the two next statements makes it clear that we always call deleteBB.

You might also want to add a comment here denoting this as the Lazy path. With the outer if/else it's a little hard to read.

lib/Transforms/Utils/Local.cpp
745

Have you verified this does not need to be done for the Eager strategy? From what I can tell no one in LLVM is calling to RemovePredecessorAndSimplify. I wouldn't have added support for DDT here unless I absolutely needed it so I suspect the changes to JumpThreading have removed the need for it. I am nervous about this change because I am uncertain it's correct.

1042

Why are you always zapping the block here instead of relying on the deleteBB routine and the knowledge of your current UpdateStrategy within the DTU class?

2007–2008

I have concerns about code coverage here (and the sole user of this routine, removeUnreachableBlocks below). JumpThreading no longer makes use of this routine and the sole user of it now is RewriteStatepointsForGC.cpp. I don't know how much actual testing of your new code paths is actually happening.

2247

Same question here as the similar zap blocks comment above.

NutshellySima added a comment.EditedJul 6 2018, 9:01 AM

Hello @NutshellySima , my comments are inline below. These changes are a good real-world usage of the new DTU class and should help us find any bugs or deficiencies. I'm curious to know what kind of testing you've done on this code: ninja check, test-suite, and/or test-suite for compile time?

I ran ninja check-llvm with assertions enabled.

lib/Transforms/Utils/BasicBlockUtils.cpp
198

Previously, this function accepts DT/DDT. In the DT case, it calls DT->eraseNode(BB) then call BB->eraseFromParent() after calling LI->removeBlock(BB); In the DDT case, it calls DDT.deleteBB(BB) before calling LI->removeBlock(BB).

Since outside users can pass a DTU working under Eager UpdateStrategy (that is to say, calling DTU.deleteBB(BB) will immediately do both DT/PDT.eraseNode(BB) and BB->eraseFromParent()) which obviously has conflicts with LI->removeBlock(BB), so I move them up.

lib/Transforms/Utils/Local.cpp
745

This part of the code reuses the DT handling code before to be the Eager UpdateStrategy ones and the DDT handling code to the Lazy ones. The only perhaps uncertain change is whether my PDT handling code under Eager strategy (L729-L737) is correct and whether the previous code of DDT can also handle the PDT case.
I modified the unittest of local.cpp to cover the change and it seems correct.

1042

I'm quite uncertain here because I have to call deleteBB after applyUpdates to make this utility works under Eager strategy. But it asserts when running the unittest, so I add these zapping block codes here.

I ran ninja check-llvm with assertions enabled.

That is definitely not sufficient. The .ll test cases are only a small subset of real-world code and people only add to them to verify a new feature or to prove a bug is fixed. I recommend running test-suite to get a better representation of code out in the world. If you are uncertain how to do so please e-mail me and I'll provide detailed instructions. I'm hoping we can avoid stressful bug report debugging and LLVM code reverts/reapplies.

lib/Transforms/Utils/BasicBlockUtils.cpp
198

That makes sense. Thanks!

lib/Transforms/Utils/Local.cpp
745

Understood. One way I verified coverage during testing was adding assert(0 && "tracking string 1") on code paths I was uncertain were being exercised. After running ninja check and test-suite I'd grep for the string.

1042

Your response makes me even less comfortable with this code. Users will not be aware this zapping is to prepare the code for DTU eager updates and it's also done on every block, even if a DTU isn't passed in which is unnecessary. I recommend debugging this more to see if you can find a way to hide this kind of BB manipulation and tie it to the eager strategy better.

NutshellySima marked 7 inline comments as done.Jul 7 2018, 4:32 AM
NutshellySima added inline comments.
lib/Transforms/Scalar/JumpThreading.cpp
421–422

I think DTU->flush() should only be called when people need either

  1. Flush all deleteBB awaiting deletion.
  2. Do queries on both DT and PDT.

In the future, if DTU holds both DT and PDT in JumpThreading, call DTU->getDomTree which only flushes DT can pend updates on PDT as lazy as possible.

lib/Transforms/Utils/BasicBlockUtils.cpp
160

OK. The reason why only these updates are only added for UpdateStrategy::Lazy is that I reuse the codes handling DT previously to handle the Eager case. If the performance impact is negligible, we can combine these two ways together into only using the incremental way in the future.

205

Sure. It can be added in another patch.

lib/Transforms/Utils/Local.cpp
1042

I understand why these zapping codes are needed. DTU.applyUpdates() do simple checks on whether the edge exists in the CFG. However, the TerminatorInst of delBB isn't removed, so the check just discard the update if applyUpdates is called before deleteBB. I'll modify this part of code into the logic of removing the TerminatorInst before calling applyUpdates.

  1. Add DTU to LoopUtils
  2. Address comments
  3. rebase
NutshellySima marked 5 inline comments as done.Jul 7 2018, 5:56 AM

Hello @NutshellySima , comments are inline.

lib/Transforms/Scalar/JumpThreading.cpp
421–422

I'm concerned the complexity of updates will become unmanageable if you decide to take this route. Some questions I have:

  • How do you preserve the list of updates for one vs the other? Two separate queues?
  • How do you decide to delete unnecessary updates from one and not the other?
  • How do you manage the list of delted BBs? Are they only really deleted when both DT and PDT are flushed?
  • How do you make sure you don't actually delete a BB early?
  • How do you handle Eager vs Lazy updates?
  • How do you handle *Relaxed() updates?

The longer you avoid flushing the further your updates become from the real CFG's shape. Debugging an assert where DT and PDT update at different times means the developer has to also carefully keep the state of DTU in mind instead of just the IR and the trees. I'm not sure if the possible saved overhead of updates to one or the other tree is worth all that added complexity.

lib/Transforms/Utils/BasicBlockUtils.cpp
160

I always prefer simpler code if the performance impact is negligible. You'll be glad you did when you're furiously debugging a problem and an LLVM release window is coming up. :)

205

Sounds good, thanks!

lib/Transforms/Utils/Local.cpp
1042

Sounds like a cleaner solution, thank you for looking into this. I'll review the code when you update the change.

kuhar added a comment.Jul 9 2018, 9:01 AM

A general high-level question/suggestion: would it be possible to split this patch such that we don't change all of the uses of DDT and plain updates at once? I'm afraid this can subtly break in many places, and the patch will have to be reverted a couple of times before you get them right. If you do it with more granularity, then it should be easier to land each part and test it in isolation.
Is it possible, or do you have a dependency on the utility functions from Local.cpp and other helper files?

lib/Transforms/Scalar/JumpThreading.cpp
364–366

Can DTU have no DomTree here? Maybe we should assert for that?

411

nit: I think formatting is a bit off here. Can you run clang-format on this snippet?

kuhar added inline comments.Jul 9 2018, 9:03 AM
lib/Transforms/Scalar/JumpThreading.cpp
411

Ah, it's because of the really long if condition...

NutshellySima added inline comments.Jul 9 2018, 10:07 AM
lib/Transforms/Scalar/JumpThreading.cpp
421–422

How do you preserve the list of updates for one vs the other? Two separate queues?

There is only one queue and two index to record the updates that have been applied and not applied yet.

Index0123
PendUpdatesoooo
DTUpdateIndex
PDTUpdateIndex

How do you decide to delete unnecessary updates from one and not the other?

If an update is applied by all available trees, it will be purged.

How do you manage the list of deleted BBs? Are they only really deleted when both DT and PDT are flushed?

Yes, more specifically DTUpdateIndex == PDTUpdateIndex == PendUpdates.size(). It's impossible to judge whether a specific DelBB is feasible to be deleted at a specific time point because some callers of delBB just

BasicBlock *BB0 = ...;
DTU.delBB (BB0);
// Then do some actions on BB0
... ...
DTU.flush(); // The caller expects BB0 be deleted now.

How do you make sure you don't actually delete a BB early?

Mentioned above.

How do you handle Eager vs Lazy updates?

applyUpdates:
Lazy/Eager+ForceRemoveDuplicates - Works the same as DDT, more specifically

  1. Deduplicate, which is the main functionality of DDT.
  2. Inspect whether an edge is presented in CFG and decide whether to discard it. It is a must because the update sequence is unbalanced.

Eager - Works the same as raw DT/PDT.
insertEdge/deleteEdge:
Eager: Works the same as raw DT/PDT and adds inspection whether the update is valid.
Lazy: then submits into the list of updates rather than apply it.
recalculate:
Lazy: flush() + raw recalculate()
Eager: raw recalculate()

How do you handle *Relaxed() updates?

It behaves the same as the ones in DDT. It discards invalid updates quietly.

The longer you avoid flushing the further your updates become from the real CFG's shape. Debugging an assert where DT and PDT update at different times means the developer has to also carefully keep the state of DTU in mind instead of just the IR and the trees. I'm not sure if the possible saved overhead of updates to one or the other tree is worth all that added complexity.

I don't think so. With current DDT, let me show an example

// Changing the state of CFG begins. 
... // Modify CFG
MergeBasicBlockIntoOnlyPred (..., &DDT); // Some error happens !
// (You cannot do verification after MergeBasicBlockIntoOnlyPred because the CFG and DT is unsynced)
... // Modify DomTree
... // Modify CFG
// Changing the state of CFG ends.
...
DDT.flush().verify(); // Asserts !

Now with DTU,

// Changing the state of CFG begins.
... // Modify CFG
MergeBasicBlockIntoOnlyPred (..., &DTU); // Some error happens !
//  (You cannot do verification after MergeBasicBlockIntoOnlyPred because the CFG and DT is unsynced)
... // Modify DomTree
... // Modify CFG
// Changing the state of CFG ends.
...
DTU.getDomTree().verify(); // Asserts !

I can't see the difference. They'll assert at the same location and all far away from where the bug happens.

Hello @NutshellySima , comments are inline.

lib/Transforms/Scalar/JumpThreading.cpp
421–422

Let's say you have a DTU using the Lazy strategy. The current set of internally queued pending updates are the following:

[0] {Insert, A, B}
[1] {Delete, B, C}
[2] {Insert, B, D}

And you have both a DT and PDT. Their Indicies are the following:

DTU->IndexDT = 0;
DTU->IndexPDT = 0;

And you flush just the DT:

DTU->getDomTree(); // flush DT only

Now your indicies are the following:

DTU->IndexDT = 3;
DTU0>IndexPDT = 0;

And now the following update comes along:

DTU->deleteEdgeRelaxed(A,B);

In the case of PDT you want nothing to happen: this update will remove the inverted update [0] from the queue and result in a no-op to PDT. But in the case of the DT you need the update to happen: it's a change past the flush. How do you reconcile this internally with only one queue?

NutshellySima added inline comments.Jul 9 2018, 9:50 PM
lib/Transforms/Scalar/JumpThreading.cpp
421–422

In the case of PDT you want nothing to happen: this update will remove the inverted update [0] from the queue and result in a no-op to PDT. But in the case of the DT you need the update to happen: it's a change past the flush. How do you reconcile this internally with only one queue?

In this case, DTU will only inspect updates after std::max(IndexDT, IndexPDT) and try to remove no-ops in that range.

Q: Why I choose this way, but not flush both trees when calling getDT/getPDT/flush?
A: In include/llvm/Support/GenericDomTree.h, comments of applyUpdates:

/// Batch updates should be generally faster when performing longer sequences
/// of updates than calling insertEdge/deleteEdge manually multiple times, as
/// it can reorder the updates and remove redundant ones internally.

It seems that there is performance gain. So I prefer to preserve a longer sequence of updates.

NutshellySima added inline comments.Jul 9 2018, 10:23 PM
lib/Transforms/Scalar/JumpThreading.cpp
421–422

More info added to the above comment:
I think the only thing users of both DDT and DTU need to be careful is that they mustn't submit any updates that is already in the CFG + already known by the DT/PDT before submitting updates.
An example:

// Lazy DTU/DDT
// Edge A->B already in the CFG + DT knows that
DTU.insertEdge(A, B); // Get queued
... // delete A->B in the CFG
DTU.deleteEdge(A, B); // Cancel with the above one

This behavior is not supposed to be a functionality at least in DTU now. I'm not sure whether there are people using the DDT APIs like this.

Hello @NutshellySima , comment are again inline.

lib/Transforms/Scalar/JumpThreading.cpp
421–422

Understood, you are pessimizing early removal of paired updates. I strongly recommend you testing your assumptions about performance of long updates using the CTMark option of test-suite to reduce the chance of reverts due to performance degradation:

cd path/to/cloned/test-suite
cmake ... \
    ... \
    -D TEST_SUITE_SUBDIRS=CTMark \
    -D TEST_SUITE_RUN_BENCHMARKS=0" \
    ... \
    $PWD

Note I recommend this in addition to regular test-suite runs designed to detect crashes and assert()s.

421–422

This behavior is not supposed to be a functionality at least in DTU now. I'm not sure whether there are people using the DDT APIs like this.

I can guarantee you this happens all over JumpThreading. The pass contains no information about prior state or previous failed attempts. I have seen cases where the same edge is inserted and removed 3 or more times across one complete run of the pass. There is a reason why I was forced to implement DDT with the default as Relaxed. :)

NutshellySima added inline comments.Jul 10 2018, 9:49 AM
lib/Transforms/Scalar/JumpThreading.cpp
421–422

I understand the case when trying to inspect the CFG to remove the not really happened ones in jump-threading because it won't be queued and affect following updates.
And I'm thinking whether there is anyone submitting already happened and DT/PDT also knew ones.
I'm confused with your reply. In the example I mentioned (DT already got notified before that A->B exists but we still submit one {Insert, A, B})

// Lazy DTU/DDT
// Edge A->B already in the CFG + DT knows that
DTU.insertEdge(A, B); // Get queued {Insert, A, B}
... // delete A->B in the CFG
DTU.deleteEdge(A, B); // Cancel with the above one {Delete, A, B} cancels {Insert, A, B}
assert (DTU.getDT().verify()); // Asserts! (DT doesn't know A->B is gone!)

It can cause the state of DT not matching the state of the CFG.

421–422

Thanks for your advice.
But I doubt whether this test can show the performance degradation in the *current* LLVM codebase because
first, this behavior only affects

  1. a DTU under Lazy UpdateStrategy with both DT and PDT
  2. trying to perform a long sequence of updates

second, in the current LLVM codebase, only adce incrementally updates PDT with an extremely small update vector.

brzycki added inline comments.Jul 10 2018, 10:47 AM
lib/Transforms/Scalar/JumpThreading.cpp
421–422

CTMark is a decent indicator of compile-time performance. It can help prevent patches being reverted due to compile-time benchmarks run by others. I understand that this specific case is not exercised by anything other than your test cases, but when you (or someone else) actually starts using it this will be invaluable to verifying your hypothesis.

421–422

The longer you defer a flush the higher the chance of seeing very strange updates and changes between the CFG and the state of DT. I cannot give you a specific example at this time but I do recall state changes just as you described. Here is one code path I am concerned about:

  • JumpThreading::ProcessBlock() decides to split A into A -> A' with a single unconditional branch between them.
  • The rest of ProcessBlock() is unable to thread the jump and exits with failure.
  • Another, unrelated block Q is evaluated and ProcessBlock() successfully calls ThreadEdge(). With the SSA updates they added a DDT->flush() inside ThreadEdge() and would flush to DT. This patch is currently reverted, but can come back. It's also possible anyone can add a patch to add DDT->flush() on another call path under ProcessBlock().
  • runImpl() comes back to processing A again and goes to the end of JumpThreading::runImpl(). It sees A has a single successor A' and calls TryToSimplifyUncondBranchFromEmptyBlock() which re-fuses A.

If you defer your updates across this entire set of changes you will see:

// CreateBB(A');
DTU.insertEdge(A, A');
// Give up on A
...
// Work on Q, call a flush() in ThreadEdge() or somehwere else under ProcessBlock()
...
// Back on A
DTU.deleteEdge(A, A');
DTU.deleteBB(A');

The interactions between JumpThreading, the state of the IR, the current state of the DT, and the state of pending DDT updates can lead to very complex states between them. You cannot safely assume no one will do updates that don't make sense or seem to be extra work. It's also why I added the Inversion check early on. For large functions detecting those early helped save a few percent of performance time. The more real world code you run through your algorithm will verify if your assumptions are valid.

421–422

Further comments on your hypothetical code:

// Lazy DTU/DDT
// Edge A->B already in the CFG + DT knows that
DTU.insertEdge(A, B); // Get queued {Insert, A, B}
... // delete A->B in the CFG
DTU.deleteEdge(A, B); // Cancel with the above one {Delete, A, B} cancels {Insert, A, B}
assert (DTU.getDT().verify()); // Asserts! (DT doesn't know A->B is gone!)

The reason I added the silent checks against the CFG were exactly because of cases like this. I don't remember if they were in ProcessBlock() or somewhere under Local.cpp (which is called less in the current implementation).

This is why applyUpdates checked things in a specific order:

  1. do not enqueue duplicates within the ArrayRef. The call to applyUpdate() is expensive and if we can avoid 1 or 2 calls it's worth checking first.
  2. Check edge state in the cfg and silent discard updates that don't reflect the current IR state
  3. Check for duplicated update: discard. The internal queue might have the update that the batch call did not.
  4. Check for the inverted update: discard both.

Every one of these checks I hit with code through JumpThreading and some of them are there to prevent assert() failures.

I hope this answers your concern and I apologize if I did not answer your question. I'm struggling a bit to understand why we may not be understanding each other.

NutshellySima added inline comments.Jul 10 2018, 8:49 PM
lib/Transforms/Scalar/JumpThreading.cpp
421–422

Thanks for your detailed explanation. I think now I understand your concern.
Now I understand it's safe do a full flush when aggressive pass like JumpThreading request flush, but it's not safe to do a partial flush because it's complex and it's hard to know whether there will be anyone doing updates that have side-effects and cancels with any update before flush(), which may cause a bug hard to find out and get fixed.
I'll look into whether there are ways to simplify the update method of DomTree in aggressive pass like JumpThreading these days.

NutshellySima marked 21 inline comments as done.Jul 11 2018, 10:32 AM

Marked as done because this problem should be addressed separately.
The problem:
Currently DTU assumes updates submitted after getDomTree()/getPostDomTree()/flush() won't affect the updates submitted before it.
More specifically, it assumes that the caller should not submit updates that already knew by the DT/PDT.
This problem can cause bugs when aggressive passes like jump-threading try to preserve PDT in the future.

  1. Address comments.
  2. Replace previous raw DT operations into incremental updates to simplify logic.
  3. Implement a simple unit-test for RemoveUnreachableBlocks.
NutshellySima marked 6 inline comments as done.Jul 24 2018, 7:14 AM
  1. Amend comments.
  2. Fix issues on clearing the successor list of DelBB.
NutshellySima marked an inline comment as done.Jul 24 2018, 11:08 AM

Fix minor issues.

Fix minor issues.

The patch converts passes and utils as the following
A. Pass Convertions

Eager (It's functionally equivalent)
DT/PDT.insertEdge/deleteEdge/applyUpdates -> DTU.insertEdge/deleteEdge/applyUpdates
Lazy
DDT.applyUpdates -> DTU.applyUpdates (because pass can control the UpdateStrategy it uses)
DDT.insert/deleteEdge -> DTU.insert/deleteEdgeRelaxed
DDT.pendingDeletedBB -> DTU.isBBPendingDeletion
DDT.flush -> DTU.getDomTree (DTU will auto-flush on destruction)
DDT.deleteBB -> DTU.deleteBB

B. Util Transforms
Despite transforms above,
a. DDT.applyUpdates is replaced with DTU.applyUpdates(..., /*ForceRemoveDuplicates*/ true) to make Eager Strategy DTU passed into can also do deduplication like DDT does.
b.

DDT.deleteBB(...);
DDT.applyUpdates(...);

is replaced with

... // Make sure the successor list of BB is cleared!
DTU.applyUpdates(..., /*ForceRemoveDuplicates*/ true);
DTU.deleteBB(BB);

c. Eager Strategy recalculation

DDT.deleteBB(BB);
DDT.recalculate(F);

is replaced with

DTU.deleteBB(BB, /*DisableNodeErasing*/)
DTU.recalculate(F);

d. All code not using the incremental update method is removed.
C. Unittest

  1. Modify the unittest to check MergeBasicBlockIntoOnlyPred in Local.cpp (12 [2*3*2] patterns)
UpdateStrategy
a. Eager UpdateStrategy
b. Lazy UpdateStrategy
DT/PDT
a. only DT
b. only PDT
c. DT+PDT
IR
a. replace the Entry BB
b. does not replace entry BB
  1. Modify the unittest for ConstantFoldTerminator to test both DT and PDT under Eager/Lazy UpdateStrategy.
  2. Add a unittest for RemoveUnreachableBlocks to test DT and PDT under Eager/Lazy UpdateStrategy

Now the utils after conversion is functionally equivalent with the current DDT implementation and I think the patch is ready.

Thanks for the changes, Chijun!

@brzycki Can you please run run you test suites on this patch and confirm you don't see any new regressions? Chijun has only 2 weeks left until his GSoC and it would be best commit this as soon as possible, unless we discover some problems. Right now, like mentioned above, the patch is functionally equivalent to DDT and shouldn't cause any disruption.

Chijun developed 2 alternative abstractions that make the update code much cleaner in JT, but there won't be enough time to fully test it and commit by the end of his internship. Because of that, we decided to first be done with the transition to the unified interface which is the main artifact of his project.

brzycki accepted this revision.Aug 2 2018, 10:30 AM

LGTM. I like how this revision of the patch is minimally invasive and most of the hard work is hidden in DTU. I strongly recommend running ninja check, ninja check-all, and a standard test-suite run on tip before committing.

Thank you for your hard work and patience on refactoring this interface @NutshellySima .

This revision is now accepted and ready to land.Aug 2 2018, 10:30 AM

LGTM. I like how this revision of the patch is minimally invasive and most of the hard work is hidden in DTU. I strongly recommend running ninja check, ninja check-all, and a standard test-suite run on tip before committing.

Thank you for your hard work and patience on refactoring this interface @NutshellySima .

Thanks for reviewing. I'll sure to test before committing.

Fix a nit in Local.cpp of dealing with the TerminatorInst.

Fix the IR that the entryBB cannot have any pred.

This revision was automatically updated to reflect the committed changes.