This is an archive of the discontinued LLVM Phabricator instance.

Codegen: Tail-duplicate during placement.
ClosedPublic

Authored by iteratee on Mar 16 2016, 2:42 PM.

Details

Summary

The tail duplication pass uses an assumed layout when making duplication
decisions. This is fine, but passes up duplication opportunities that
may arise when blocks are outlined. This change allows duplication after
layout decisions. Because we want the updated CFG to affect subsequent
placement decisions, duplication must occur during placement.

In order to achieve this goal, TailDuplicationPass is split into a
utility class, TailDuplicator, and the pass itself. The pass delegates
nearly everything to the TailDuplicator object, except for looping over
the blocks in a function. This allows the same code to be used for tail
duplication in both places.

This change, in concert with outlining optional branches, allows
triangle shaped code to perform much better, esepecially when the
taken/untaken branches are correlated, as it creates a second spine when
the tests are small enough.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
iteratee marked 2 inline comments as done.May 25 2016, 3:43 PM
iteratee updated this revision to Diff 58545.May 25 2016, 5:29 PM

Add a test and a fix for duplicating blocks that have un-analyzable fallthrough.
Note that the test doesn't currently fail, but D20505 will fail without the fix.

Are there any other concerns with this patch other than performance and size numbers from Quentin?

This comment was removed by iteratee.

Hi Kyle,

The performance results pretty look good so far! (Up to 4% improvement on some SPEC!)
Go ahead with the commit, we will adjust if things turn out to be bad for other workload, but I do not think it will happen.

The only actionable regression on your end is SingleSource/Benchmarks/Misc/ReedSolomon, which regressed by ~4% on x86_64 -O3. Other than that, this is a net win or the regression are within the noise!

Cheers,
-Quentin

iteratee updated this revision to Diff 62047.Jun 27 2016, 5:32 PM

Modified to work with tail-merging during layout.
Had to make tail-merging during layout less agressive so that it wouldn't undo tail-duplication.

iteratee updated this revision to Diff 62154.Jun 28 2016, 4:40 PM

clean up some spurious debugging.

Couple of inline comments. I think some of the overall design can be separated out and be made a bit simpler - especially as relates to ForcedLayoutPred etc, but we can worry about that as a cleanup that'll happen next.

Thanks!

-eric

lib/CodeGen/BranchFolding.cpp
652

Can you elaborate here? Perhaps with a FIXME or a TODO.

lib/CodeGen/MachineBlockPlacement.cpp
1940

The MachineFunction has an MMI so you don't need this argument. :)

davidxl added inline comments.
test/CodeGen/PowerPC/tail-dup-analyzable-fallthrough.ll
5 ↗(On Diff #62154)

Please simplify this test case and document what is the expected layout with tailDup. The test as it stands now is hard to read.

test/CodeGen/PowerPC/tail-dup-layout.ll
3

Same here : simplify the test case and document the expected output with description.

iteratee updated this revision to Diff 62517.Jul 1 2016, 11:31 AM

Added comments to tests to try and make them more readable

iteratee updated this revision to Diff 62529.Jul 1 2016, 1:18 PM

Added comment explaining why tail-merging can't use the unconditional branch heuristic when it is run during layout.

iteratee added inline comments.Jul 1 2016, 1:19 PM
lib/CodeGen/BranchFolding.cpp
652

I don't think that there is a way to fix it. The heuristic overcounts fractional tails where one block would end up with a jump, but another would not. It's probably fine to use it early, but during layout we have to be stricter about the thresholds.

test/CodeGen/PowerPC/tail-dup-analyzable-fallthrough.ll
5 ↗(On Diff #62154)

This test was reduced from an actual regression that I introduced and then fixed.

I've added comments to explain what went wrong.

test/CodeGen/PowerPC/tail-dup-layout.ll
3

I don't see that this test is that complex at all, except for the optional bodies. Unfortunately, they have to a certain size or they won't get outlined. I've added an easier to read expectation of layout, and a description of what the CHECK lines are looking for.

iteratee updated this revision to Diff 62531.Jul 1 2016, 1:37 PM

Fixed the last 2 tests.
One was minor,
For the WebAssembly test, I just disabled tail-duplication during placement, as previously instructed.

davidxl added inline comments.Jul 1 2016, 2:15 PM
test/CodeGen/PowerPC/tail-dup-analyzable-fallthrough.ll
6 ↗(On Diff #62531)

Kyle, thanks.

I think it is better get a minimal reproducible and use that as test case. for instance, I am not sure if the struct HashBucket etc and some other things are relevant here -- they are all distractions. I have pointed to an example of the minimalist test case.

test/CodeGen/PowerPC/tail-dup-layout.ll
4

The test case should be as simple as possible and remove all irrelevant pieces.

Regarding the size limit, you can use "outline-optional-threshold", option to specify a lower limit so that you can reduce the size of the optional body.

53

An example of a simple clean test case:

CodeGen/X86/code_placement_outline_optional_branches.ll

iteratee updated this revision to Diff 62551.Jul 1 2016, 3:24 PM

Simplify test cases.

iteratee marked 2 inline comments as done.Jul 1 2016, 3:26 PM

Please take a look at the tests again.

test/CodeGen/PowerPC/tail-dup-analyzable-fallthrough.ll
7 ↗(On Diff #62551)

I simplified the code quite a bit. The essence of what was happening was a linked-list walk was getting miscompiled, so I shrunk it to that.

test/CodeGen/PowerPC/tail-dup-layout.ll
5

OK, I made it look more like the other test that you pointed too, with repeated calls for bodies.

Very nice. I like the patch a lot! I will go though it one more time shortly.

thanks,

David

davidxl added inline comments.Jul 2 2016, 10:54 PM
lib/CodeGen/BranchFolding.cpp
606

Since you are here, please document arguments to this method: MBB1, MBB2, PredBB, SuccBB, AfterPlacement.

636

Expand the comment a little more: TailMerging invoked after block placement. Do not undo ...

664

Is there a better way to identify Pred BBs which are tailduplicated into during block placement? Looks to me we should avoid undoing any taildups done in layout stage.

lib/CodeGen/TailDuplicator.cpp
516

Document new parameters. While you are here, document IsSimple as well.

789

Same here -- document parameter.

876

Are there other places in taildup where blocks may get removed? It seems to me that tailDupllicator needs to have member to remember it is in 'layout' state, and the block deletor code should check that flag instead.

iteratee updated this revision to Diff 62995.Jul 6 2016, 4:42 PM
iteratee marked 5 inline comments as done.

Added comments, simplified flags to tail-duplicator

lib/CodeGen/BranchFolding.cpp
664

While we could pass a set of blocks around, the thresholds don't overlap by default. e.g. We tail-duplicate blocks of 2 instructions or fewer, and tail-merge blocks of 3 instructions or more. The problem is with fractional overlap. e.g. a block of 2 instructions was duplicated and in one of the resulting block an additional branch had to be added.

Would you prefer that I create a set of blocks and pass them into the branch optimizer? It would make it faster to skip those blocks altogether.

lib/CodeGen/TailDuplicator.cpp
516

New parameter removed.

876

No, nowhere else that blocks may get removed, and ultimately I want to not have different modes.

I decided that for duplication during layout, it doesn't make sense to try to delete the block. I've updated the code accordingly, and will fix D20505 to match.

iteratee marked an inline comment as done.Jul 6 2016, 4:44 PM
davidxl added inline comments.Jul 8 2016, 12:02 PM
lib/CodeGen/BranchFolding.cpp
664

Depending on the threshold difference can be fragile.

How about changing AfterPlacement into a callback/lambda function using function_ref<..>. It tells tailMerge if a block should be skipped.

iteratee updated this revision to Diff 63585.Jul 11 2016, 2:55 PM

Took a more principled stand on when to tail merge and when not to, so that tail-duplicating and tail merging now don't overlap.

davidxl added inline comments.Jul 11 2016, 5:35 PM
lib/CodeGen/BranchFolding.cpp
612

I think it is cleaner to check at the beginning if anyone of MBB1 or MBB2 need to be skipped (either be the taildupped block or a block tail-dupped into), and return false if so. The set of skipped blocks can be passed into the BF by block placement.

633

Is this change related?

lib/CodeGen/TailDuplicator.cpp
526

Expand the comments a little more.

877

nit: block placement does not expect any blocks to be deleted.

iteratee marked an inline comment as done.Jul 11 2016, 6:25 PM
iteratee added inline comments.
lib/CodeGen/BranchFolding.cpp
636

I changed this heuristic to avoid undoing valuable duplication. In fact, some of the cases where it doesn't merge are cases where we wanted to tail duplicate, but couldn't. See Hexagon/rdf-copy.ll

I'll split the changes here out into a separate patch.

664

Discussed this with chandlerc. He thinks that passing around the list of blocks that were tail-duplicated too tightly couples the passes, and that we should strive to have non-overlap. He suggested that a list of blocks that were duplicated might be a good debug measure to make sure they aren't being re-merged.

Overall I agree with him, and I made a change so there shouldn't be any overlap.
Do you want to see the set as a debugging measure?

I'll handle the 2 comment changes you asked for before I upload a new patch, but please take a look at the comments I added. I wrote them before I saw your comments, because I was thinking through some cases.

lib/CodeGen/BranchFolding.cpp
612

See below for the discussion about this, but I don't think passing in the set of blocks is the right choice.

633

Yes, but I plan to split it out into a separate patch.

davidxl added inline comments.Jul 11 2016, 6:43 PM
lib/CodeGen/BranchFolding.cpp
664

I don't agree. Without making it explicit, the the pass coupling does not disappear -- it is still there but in a more implicit but hard to maintain way.

For instance, can we guarantee this still work when TailMerger and TailDuplicator uses non-default size thresholds?

chandlerc added inline comments.Jul 11 2016, 6:55 PM
lib/CodeGen/BranchFolding.cpp
664

My thought was specifically that we should enforce an invariant that these are defined in a non-overlapping way, *and* check it in asserts builds precisely so that things like non-default size thresholds work in a sane way.

As long as there is the potential for overlap, we have the problem (IMO) that the innards of tail dup or tail merge can do things that user programs cannot. I would much prefer that we have a set of rules that are sufficiently precise and enforced that one pass doesn't have to have a back-channel to communicate to the other pass.

But I do agree that it is important that this is actually a guaranteed property and one we check that we don't violate.

iteratee updated this revision to Diff 67432.Aug 9 2016, 4:12 PM

Use single explicit threshold between tail-dup and tail-merge during layout.
Tail-duplicate indirect branches with a larger threshold during layout (Mirrors pre reg-alloc behavior).

iteratee updated this revision to Diff 67440.Aug 9 2016, 5:44 PM

Fix WebAssembly cfg stackify test.
Broke because we don't tail-duplicate loop latches early any more.

iteratee updated this revision to Diff 67607.Aug 10 2016, 2:34 PM

Fixed additional tests.
Early tail-duplication now respects loops.
Explicit threshold for merging/duplication during layout.

iteratee updated this revision to Diff 67644.Aug 10 2016, 6:05 PM

Factored out explicit threshold code.

iteratee added inline comments.Aug 11 2016, 2:25 PM
lib/CodeGen/BranchFolding.cpp
667

This should all be resolved now. See D23383 and D23390 for the explicit threshold. The changes to make merging not overlap during layout have already gone in.

lib/CodeGen/TailDuplicator.cpp
877–879

This is resolved by allowing the deletion to occur.

iteratee updated this revision to Diff 68634.Aug 18 2016, 4:38 PM

Updated tests. No other changes.

Quentin, can you re-run the benchmark suite? This patch has changed a fair bit since you last ran them.
If you could do D20505 as well, which depends on this, that would be awesome.

Thanks,
Kyle.

davidxl added inline comments.Aug 19 2016, 5:00 PM
include/llvm/CodeGen/TailDuplicator.h
56–57

Document the new parameters and behavior (zero default etc)

lib/CodeGen/BranchFolding.cpp
694

CurMPIter = HighestMPIter;

Maybe a separate cleanup patch

702

MBB1 can be computed in common path

704

On what platforms? Is there a better way to query it?

708

where does 21 magic number come from? Why not just define a special value to signal merging is disabled/skipped for such blocks?

lib/CodeGen/MachineBlockPlacement.cpp
139

This name is very confusing. Just name it as TailDupThresholdInLayout. Perhaps document that this is also used to 'force' the tail merge at the high threshold to avoid it undoing tailDup decisions.

1009

This makes the primary loop body too large. Outline it into its own method.

lib/CodeGen/TailDuplicator.cpp
543

Irrelevant change?

555

Irrelevant change?

558

Is this change needed (handled by previous taildup invocation)?

iteratee updated this revision to Diff 68942.Aug 22 2016, 5:33 PM
iteratee marked 3 inline comments as done.

Fixes from review.

I have some questions, but most of the suggestions are done.

lib/CodeGen/BranchFolding.cpp
704

I believe ARM marks the actual return as an indirect branch.
I don't see a better way. Indirect tail-calls will likely get caught here as well.
Since what we're looking for is jump tables within the function, checking the successor size seems reasonable.
Would you like me to re-write the comment so that it seems more intentional?

708

It's one more than 20, from tail-duplication of the same pattern. I'm open to other suggestions as to how to make these two cooperate.

lib/CodeGen/TailDuplicator.cpp
543

No. With tail-duplication occurring only once, we had to duplicate those blocks when we got the chance. Now with layout-duplication we can wait and duplicate them later.

555

No. Because this check is only necessary if you want to expand these blocks later in layout mode.
This change is required by the previous one above, because otherwise it breaks a test.

558

Yes, there are tests where the loop-latch is a switch. I presume it occurs in the real world as well.

iteratee added inline comments.Aug 23 2016, 12:10 PM
lib/CodeGen/TailDuplicator.cpp
543

This was related to a performance regression I found. I can probably pull this out into a separate patch that gets committed at the same time. Would that be better?

davidxl added inline comments.Aug 23 2016, 1:22 PM
lib/CodeGen/TailDuplicator.cpp
543

Yes this should be in a different patch with different review (and test cases etc).

iteratee updated this revision to Diff 69036.Aug 23 2016, 2:05 PM

Changes missed from previous upload

davidxl added inline comments.Aug 23 2016, 2:10 PM
lib/CodeGen/BranchFolding.cpp
704

How abut Block1.back().isIndirectBranch() && !Block1.back().isReturn() ?

708

I think this value should be passed in either via constructor or a separate API:

setMergeThresholdForBBwithIndirectBr(...).

In tailDup, the value 20 should also be controlled by an option.

lib/CodeGen/MachineBlockPlacement.cpp
1015

Why can't this part be folded into maybeTailDuplicateBlock? If not, the big chunk of code between if (Removed) { ..} belongs to its own method.

lib/CodeGen/TailDuplicator.cpp
555

I don't get it. This one looks like a heuristic change to not give block with return instruction (as indirect branch) a larger threshold for duplication, which is independent of this patch.

iteratee marked an inline comment as done.Aug 23 2016, 2:52 PM
iteratee added inline comments.
lib/CodeGen/TailDuplicator.cpp
555

I'll split these out, but they're all intertwined.

  1. Don't duplicate latches/pre-headers early.
  2. There are now tests that fail, so duplicate switches during layout.
  3. Because the indirect branch test is now post Reg-Alloc, it has to handle the actual return instructions.
davidxl added inline comments.Aug 23 2016, 3:29 PM
lib/CodeGen/TailDuplicator.cpp
555

thx for the expalnation. Getting Part-1 right is the key here.

iteratee updated this revision to Diff 69058.Aug 23 2016, 5:28 PM

Removed changes as requested. They'll show up in another patch shortly.

davidxl added inline comments.Aug 24 2016, 10:28 AM
lib/CodeGen/MachineBlockPlacement.cpp
1016

BB = *std::prev(Chain.end())

1844

early return if not removed to reduce code nesting level

1845

remove 'into'

1850

std::prev

1854

This does not look right the place to update. The successors of DupBB may also get more unscheduled predecessors. The update should be done in one place where the total number of Preds (BB is dupped into) is known.

1859

Can you give more explanation on the iterative tailDup?

1895

Should an early return be done here?
if (!IsSimple && BB->succ_size() == 1) return false;
if (!TailDup.shouldTailDuplicate(...) return false;

1900

should also skip if Pred is in the same chain as BB (e.g, when BB is a loop head of another loop, and Pred is that loop's latch)

1903

Need to count the number of other Pred that BB can tail duplicate into.

If that number is zero, should return false.

1915

To early to do the update. Wait until the tailDup actually happens. Actually, do we really need to update? When tailDup happens, Pred will become NewSucc's new predecessor, and Pred is still not scheduled.

1920

Early return false when '!CanDupToChain'

1944

It is cleaner to use RemoveList.erase(remove_if(...)..) pattern

Also bypass if RemBB's unscheduled pred number is not zero.

iteratee updated this revision to Diff 69153.Aug 24 2016, 12:23 PM
iteratee marked 8 inline comments as done.

Simplifications from review.

Applied some suggestions, answered others.

lib/CodeGen/MachineBlockPlacement.cpp
1850

Can't. I need to actually update the iterator so that I can check to see if there is an additional block in the chain.

1854

You're correct, It only needs to be done once because only one block is actually going from unscheduled to scheduled: BB.
That makes the function simpler overall and easier to understand.

1900

No, even if they're in the same chain, blocks that end up with both as predecessors still have an additional predecessor.

I think you're thinking of the check that I do below of NewChain vs PredChain.

1903

No. because although we don't have predecessors to worry about (Loop filter), there may be unfiltered predecessors that we can duplicate into, and now is the only time to do so.

1915

We can't do it any later. BB may disappear after tail duplication.

I could make tail duplication return a vector of blocks that were duplicated into, and save the successor list before duplication.
That may be easier to follow, but wouldn't be any more correct.

Yes, we need to update. For each duplication the successors now have an additional unscheduled predecessor

1920

That would be incorrect. There are plenty of cases where we can't duplicate into the layout predecessor where we still want to perform tail duplication.

iteratee updated this revision to Diff 69176.Aug 24 2016, 3:14 PM

Remove unused variable.

davidxl added inline comments.Aug 24 2016, 3:55 PM
lib/CodeGen/MachineBlockPlacement.cpp
1903

It is not correct to do. BB and Pred are in the same nested loop that already been laid out. You can not duplicate BB into Pred here.

LPred

|
|

BB <-----

|               \
..                 \
Pred              |
 \__________|
1915

I think a list of Preds that BB can be duplicated into need to be passed in.

The reason doing update here is not bullet proof is that if later if there are other heuristic to decide not to tailDup, then the early update will be wrong. It needs to be tied to the actual transformation

1920

There are two problems here

  1. many Pred (other than LPred) has been skipped as illegal candidates, but tailDupiicator does not know about will happily do tailDup -- as the tailDuplicator does not know about the new constraints from MBP
  2. if the total number of candidate predecessors that BB can tailDuped into is fewer than 2, there is no point do the tail duplication
iteratee added inline comments.Aug 24 2016, 4:49 PM
lib/CodeGen/MachineBlockPlacement.cpp
1903

I don't see why not. There's not a correctness issue, and if BB was the loop top of the inner loop, it was never considered to be duplicated. Now is the chance that we have to do it.

1915

The goal of this loop is not to find constraints.

I agree it would be cleaner to get the list out of tail-duplication. I'll do that.

1920

I think you misunderstand the point of this loop. The point is not to find "legal candidates", but to find the candidates for which duplicating will require update.

Because of that, we don't have an accurate count, and don't want one. We'll gladly let tailDuplicator do its job.

davidxl added inline comments.Aug 24 2016, 5:15 PM
lib/CodeGen/MachineBlockPlacement.cpp
1859

This one needs a test case to cover.

1920

Ok.

Do you see the possibility that some Pred simply can not be selected as dup target bb (for legality or performance reasons)? Overall, the code should be organizied in the way that the list of BBs that MBP thinks it can/should be tailDup should match the actual transformations.

iteratee added inline comments.Aug 24 2016, 5:21 PM
lib/CodeGen/MachineBlockPlacement.cpp
1920

I can see that we might want to limit the set of predecessors duplicate to. A good case would be to filter out predecessors that are too cold.

The performance results have been reasonable so far without that change, and It would certainly make this more complicated.
For now, I think it would be best to simply go with the cleaner version of getting a list of blocks back from tailDuplicateAndUpdate.

This will make it easy to add filters later, and to add them inside of tailDuplicator as well without worrying about breaking this loop.

davidxl added inline comments.Aug 24 2016, 5:26 PM
lib/CodeGen/MachineBlockPlacement.cpp
1920

I am fine with unfiltered solution for now -- but the code structure should make future tuning easy to do :)

iteratee updated this revision to Diff 69262.Aug 25 2016, 10:08 AM

Update unscheduled predecessors after tail duplication.

iteratee updated this revision to Diff 69293.Aug 25 2016, 2:35 PM
iteratee marked an inline comment as done.

Changes missed in previous patch. Get list of duplicated predecessors from TailDuplicator and use that list to update unscheduled predecessors.

iteratee updated this revision to Diff 69434.Aug 26 2016, 2:02 PM

Remove extra debugging.

iteratee marked 2 inline comments as done.Aug 30 2016, 3:03 PM
iteratee added inline comments.
lib/CodeGen/BranchFolding.cpp
704

I checked. There are platforms where the "return" instruction does not return true for isReturn after epilog lowering.
Mips, Hexagon, Thumb.

I think I'll re-write the comment so that it sounds less like a workaround.

iteratee updated this revision to Diff 69762.Aug 30 2016, 3:04 PM

Add test case for repeated tail-duplication.

iteratee marked an inline comment as done.Aug 30 2016, 5:14 PM
davidxl added inline comments.Sep 12 2016, 1:48 PM
test/CodeGen/X86/tail-dup-repeat.ll
7 ↗(On Diff #69762)

Can this function's declaration be simplified? If the parameters are not necessary, remove them

11 ↗(On Diff #69762)

Clean up the parameters (shorten names etc)). Remove unnecessary attribute

15 ↗(On Diff #69762)

Please also clean up variable names and label names.

For label names, make them more meaningful. For instance, if a block is duplicated, the label for the block can be 'dup:'. For iterative tail dup, also name them in sequence numbers dup1, dup2 etc. Others can be cleaned up suchas header, loopexit etc.

Some simple comments in the test case (where tail dup happens) is also helpful.

iteratee updated this revision to Diff 71247.Sep 13 2016, 3:21 PM
iteratee marked 3 inline comments as done.

Found new, shorter test case for repeated tail duplication.

davidxl added inline comments.Sep 13 2016, 3:51 PM
lib/CodeGen/MachineBlockPlacement.cpp
1883

Is this comment correct? The TailBB may be dup'ed in to Lpred, but still not removed as there are some other predecessors it can not be dup'ed into ?

1972

Should DuplicatedToLPred == CanDupToChain?

iteratee updated this revision to Diff 71274.Sep 13 2016, 5:04 PM

Simplified computation of whether a block was duplicated to the layout pred.

lib/CodeGen/MachineBlockPlacement.cpp
1883

It is currently correct.
The tail duplication code only copies the block into the layout predecessor if it can be copied into all other predecessors as well.

1972

Even better, since we now return the list of predecessors that received a copy of BB, we should check to see if LPred is in that list. Basically free since we loop over the list anyway.

davidxl added inline comments.Sep 13 2016, 6:56 PM
lib/CodeGen/MachineBlockPlacement.cpp
1016

The update of BB should be wrapped into 'repeatedlyTailDuplicateBlock'.

1847

Missing coma. ".. it was removed, markChainSuccessors .."

1857

Add a comment here that Chain End is updated when DupBB is removed by the removal call back.

There also seem to be a bug here: if TailDup happens, but DupBB is not removed, the loop will check the same BB again. Should it just break the loop when removed is not true?

1884

It is cleaner to make this interface return bool (either Removed or DuplicatedToLPred)

iteratee updated this revision to Diff 73330.Oct 3 2016, 1:20 PM
iteratee marked 2 inline comments as done.

Added comments.

Made changes from review:
Chain end is now updated in repeatedlyTailDuplicateBlock
MaybeTailDuplicate now returns Removed instead of using out parameter.

Please take a look. I cleaned up the issues you pointed out and added comments.

lib/CodeGen/MachineBlockPlacement.cpp
1847

I split the sentence into 2 and then added a comma. It's easier to read now.

1857

I added some comments.

Currently, duplication to layout predecessor only occurs if the block is removed. I added a comment about that as well.

davidxl accepted this revision.Oct 3 2016, 3:35 PM
davidxl edited edge metadata.

lgtm.

This is large change so it bounds to run into some corner case issues. Please test thoroughly and watch for any fallouts.

lib/CodeGen/MachineBlockPlacement.cpp
1873

Add an assert that Removed is true when DuplicatedToLPred is true

This revision was automatically updated to reflect the committed changes.