This is an archive of the discontinued LLVM Phabricator instance.

[ExecutionDepsFix] Improve clearance calculation for loops
ClosedPublic

Authored by loladiro on Jan 15 2017, 9:32 PM.

Details

Summary

In revision rL278321, ExecutionDepsFix learned how to pick a better
register for undef register reads, e.g. for instructions such as
vcvtsi2sdq. While this revision improved performance on a good number
of our benchmarks, it unfortunately also caused significant regressions
(up to 3x) on others. This regression turned out to be caused by loops
such as:

PH -> A -> B (xmm<Undef> -> xmm<Def>) -> C -> D -> EXIT
      ^                                  |
      +----------------------------------+

In the previous version of the clearance calculation, we would visit
the blocks in order, remembering for each whether there were any
incoming backedges from blocks that we hadn't processed yet and if
so queuing up the block to be re-processed. However, for loop structures
such as the above, this is clearly insufficient, since the block B
does not have any unknown backedges, so we do not see the false
dependency from the previous interation's Def of xmm registers in B.

To fix this, we need to consider all blocks that are part of the loop
and reprocess them one the correct clearance values are known. As
an optimization, we also want to avoid reprocessing any later blocks
that are not part of the loop.

In summary, the iteration order is as follows:
Before: PH A B C D A'
Corrected (Naive): PH A B C D A' B' C' D'
Corrected (w/ optimization): PH A B C A' B' C' D

To facilitate this optimization we introduce two new counters for each
basic block. The first counts how many of it's predecssors have
completed primary processing. The second counts how many of its
predecessors have completed all processing (we will call such a block
*done*. Now, the criteria to reprocess a block is as follows:

  • All Predecessors have completed primary processing
  • For x the number of predecessors that have completed primary processing *at the time of primary processing of this block*, the number of predecessors that are done has reached x.

The intuition behind this criterion is as follows:
We need to perform primary processing on all predecessors in order to
find out any direct defs in those predecessors. When predecessors are
done, we also know that we have information about indirect defs (e.g.
in block B though that were inherited through B->C->A->B). However,
we can't wait for all predecessors to be done, since that would
cause cyclic dependencies. However, it is guaranteed that all those
predecessors that are prior to us in reverse postorder will be done
before us. Since we iterate of the basic blocks in reverse postorder,
the number x above, is precisely the count of the number of predecessors
prior to us in reverse postorder.

Event Timeline

loladiro updated this revision to Diff 84518.Jan 15 2017, 9:32 PM
loladiro retitled this revision from to [ExecutionDepsFix] Improve clearance calculation for loops.
loladiro updated this object.
loladiro added reviewers: MatzeB, myatsina, mkuper, atrick.
loladiro added a subscriber: llvm-commits.
loladiro updated this object.Jan 15 2017, 9:33 PM
sanjoy added a subscriber: sanjoy.Jan 15 2017, 9:49 PM
myatsina added inline comments.Jan 16 2017, 8:01 AM
lib/CodeGen/ExecutionDepsFix.cpp
146

Not --> Note ?

416

Can you add an assertion message please?

418

Is OutRegs null when it's a back edge from a BB we haven't seen yet? Are there other cases where it can be null?
Please add a comment explaining the cases.

456–457

Some of the comments in this function need to be updated (we no longer have LiveOuts, we always change the defs to be ralive to the end of the block etc)

575

For readability purpose - how about changing "BlockDone" to "breakDependency" and add your comment regarding done blocks before the call to processDefs?

processDefs() will look like this:
if (breakDependency) {
// calc Pref
...
}

and processBasicBlock will look like this:

// If this block is not done, it makes little sense ...
bool breakDependency = isBlockDone(Done)
processDefs(MI, breakDependency, ...)

793

Do you need "Done" here? I don't see you using it.

803

Is there a point going over it when we're not isBlockDone?
processDefs pushes instructions into the undef read only when Done =true.
processUndefReads does break if the undef reads are empty, but perhaps for readability purpose it's worth writing this explicitly.

820

Why not check isBasicBlockDone on MBB?

823

At first glance, it wasn't clear to me why you need it here and why can't you just do one "primary" pass and then then process the basic blocks that are still not done.
If I understood correctly, you're doing it here for the optimization you've talked about (Making sure the order is: PH A B C A' B' C' D). Am I right?
I would add a comment here elaborating that.
I would even consider adding as a comment somewhere with the loop example from the description of this patch and the "optimized" order the algorithm visits the nodes. I think it is a great example and it will make the traverse order here much clearer.

876

Better use a constructor with default initialization like DomainValue does.

914

I would add a comment that IncomingProcessed and IncomingCompleted of this block were already updated during the processing of predecessor blocks.

loladiro updated this revision to Diff 85009.Jan 19 2017, 12:32 PM

Address review comments

loladiro updated this revision to Diff 85012.Jan 19 2017, 12:34 PM
loladiro marked 11 inline comments as done.

Fix small typo in assertion message

lib/CodeGen/ExecutionDepsFix.cpp
418

Yes, that's correct. If it's null it's a back edge. Will add a comment to this extent.

820

Just to avoid doing the calculation twice. Probably premature optimization. Will simplify.

823

If I understood correctly, you're doing it here for the optimization you've talked about (Making sure the order is: PH A B C A' B' C' D). Am I right?

Yes, that's correct. It's discussed below when going over the blocks that are not done. I'll add a small comment here. I will add the loop example from the commit message below and add a small comment here pointing people to the discussion below.

loladiro updated this revision to Diff 85013.Jan 19 2017, 12:37 PM

Forgot to run clang-format

myatsina accepted this revision.Jan 24 2017, 6:25 AM

Added a few minor comments.

LGTM once they are addressed.

lib/CodeGen/ExecutionDepsFix.cpp
418

Don't forget to add the comment :)

test/CodeGen/X86/break-false-dep.ll
313

xmm7 --> xmm6 ?

This revision is now accepted and ready to land.Jan 24 2017, 6:25 AM
This revision was automatically updated to reflect the committed changes.
mehdi_amini added inline comments.
llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp
837 ↗(On Diff #86362)

What is the limit on the depth of the stack?
We're seeing a crash because of stack explosion here, so I fear it can grow with the CFG (which wouldn't seem reasonable to me). Can you comment on this?

mehdi_amini added inline comments.Mar 7 2017, 10:49 PM
llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp
837 ↗(On Diff #86362)

Note: I haven't spent time figuring out what ExeDepsFix is doing, don't assume I have any context. The crash we're tracking is a ThinLTO bootstrap failure, we're still working on the exact reproducer.

loladiro added inline comments.Mar 7 2017, 10:56 PM
llvm/trunk/lib/CodeGen/ExecutionDepsFix.cpp
837 ↗(On Diff #86362)

Yes, I suppose it can grow with the number of nested loops. Must be quite a reproducer to cause this problem though. Unless of course there's something more fundamental wrong with the logic here (though for that I'd need the reproducer). In any case, it would be fine to change this to keep a working set in a SmallVector or something equivalent.

MatzeB edited edge metadata.Mar 17 2017, 10:31 PM

Yes, I suppose it can grow with the number of nested loops. Must be quite a reproducer to cause this problem though. Unless of course there's something more fundamental wrong with the logic here (though for that I'd need the reproducer). In any case, it would be fine to change this to keep a working set in a SmallVector or something equivalent.

This is a recursion over the control flow graph and you only need a single loop. Recursing over the structure of the program is a no-go in a compiler as your stack is always limited and comparatively small and the input can grow arbitrarily.

We are seeing this in a real stage2 build of clang and we have to fix the buildbot! If you need to have a reproducer update llvm to r298184, limit your stack to 512kb (ulimit -s 512) as that is what is currently used by ThinLTO and use this python snippet to make a reproducer:

n_blocks=4000

print '''
---
name: func
tracksRegLiveness: true
body: |
  bb.0:
    successors: %bb.1, %bb.{n_blocks}
    liveins: %xmm0
    NOOP implicit %xmm0
    JE_1 %bb.{n_blocks}, implicit undef %eflags
    JMP_1 %bb.1
'''.format(**locals())

for i in range(1, n_blocks):
    print '  bb.%s:' % (i)
    if i < n_blocks-1:
        print '    successors: %%bb.%s, %%bb.%s' % (i+1, n_blocks)
        print '    JE_1 %%bb.%s, implicit undef %%eflags' % (i+1)
    else:
        print '    successors: %%bb.%s' % (n_blocks)
    print '    JMP_1 %%bb.%s' % n_blocks

print '''
  bb.{n_blocks}:
    RETQ undef %eax
'''.format(**locals())

We are seeing this in a real stage2 build of clang and we have to fix the buildbot!

FYI this is the failing job right now : http://green.lab.llvm.org/green/view/Clang/job/clang-stage2-Rthinlto/

Can you try https://reviews.llvm.org/D31681? I wasn't able to reproduce the problem with your example, it kept crashing other parts of the compiler ;).

Can you try https://reviews.llvm.org/D31681? I wasn't able to reproduce the problem with your example, it kept crashing other parts of the compiler ;).

I think after we saw the stackoverflow in the pass on the build I did all my further testing with "llc -run-pass=x86-execution-deps-fix" only so I didn't see the other passes. Anyway the build seems to be back to green now. Thanks!