Add a fence elimination pass
Needs ReviewPublic

Authored by morisset on Oct 13 2014, 12:43 PM.

Details

Summary

Following the plan described on LLVM-dev (http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-September/076732.html), this patch provides a first version
of a new fence-elimination pass. It has several known restrictions/weaknesses,
but I would like some feedback early, hence this review.

The main problem is that there is no easy way to only run the pass on a
function if AtomicExpand introduced fences in that function. This is made
worse by the requirement on BlockFrequencyInfo and BreakCriticalEdges of
this pass:

  • There is a non-negligible cost incurred even on code with no atomics/fences
  • Because BreakCriticalEdges changes control-flow, it breaks a test (and running SimplifyCFG afterwards which would be nice would break even more tests).

I can see several ways to fix this:

  • Merge the pass in AtomicExpand. Beyond the ugliness, this would require a way of running BreakCriticalEdges and BlockFrequencyInfo from inside AtomicExpand which does not seem supported by the current PassManager infrastructure.
  • Break critical edges lazily inside this pass. This would still incurs costs for BlockFrequencyInfo; worse it would require updating the block frequency as well, and would add some significant complexity to the pass.
  • Have a flag for running this pass and set it to false for the tests this break. This is the easiest but would do nothing for the performance costs.

What would you suggest ?

Some other parts remain to be done, but can be added later, and I hope to first
fix the above issue. For example:

  • the pass is only enabled for Power, and not yet for ARM
  • the min-cut implementation is rather basic and not super optimized
  • the pass does not make use of information passed by AtomicExpand to optimize fences more agressively (for example it cannot sink the fence out of a spinloop yet), details on how I plan to do that later are in the proposal I sent to LLVM-dev some time ago.

Some more questions for reviewers:

  • should the min-cut implementation be cut in a separate file ?
  • I left lots of DEBUG() statements, should they be removed ?

Thanks for taking the time of reading this !

Diff Detail

morisset retitled this revision from to Add a fence elimination pass.Oct 13 2014, 12:43 PM
morisset updated this object.
morisset edited the test plan for this revision. (Show Details)
morisset added reviewers: jfb, echristo, rengolin.
morisset added a subscriber: Unknown Object (MLST).
jfb added a comment.Oct 16 2014, 10:26 AM

In the description, could you add a link to the LLVM-dev discussion for this?

What's missing to enable ARM, x86, and other architectures? Could you also use this pass on non-platform-specific fences (C++11 style fences instead)?

It would be nice to have more complex examples with bigger CFG, and with memory accesses (not just atomic ones).

As we discussed yesterday: do you think that it would be easy to generate dot graphs too, to make is easier to see what's going on?

Do you have a feel for the runtime of this pass on large codebases? I understand that there are overheads due to the required passes, but I want to make sure that you try profiling just this pass, to make sure there aren't glaring inefficiencies added.

include/llvm/Transforms/Scalar.h
44

I'm not a fan of default arguments, I think you shouldn't have them.

Can you add a description of what stronger/weaker mean, as well as FenceArgs?

std::function isn't in common use in the code base, can you see what similar functions use instead?

Overall you're trying to create a partial ordering for target-specific intrinsics for fences, and then manually adding this pass many times to a backend. Could you instead have the backend specify the ordering, and add the pass only once? The pass would handle each fence type that it's made aware of, in the proper strong->weak ordering.

lib/Target/PowerPC/PPCTargetMachine.cpp
20

Keep these sorted.

lib/Transforms/Scalar/FencesPRE.cpp
53

"before FencesPRE" is less odd. Pre-PRE sounds weird :)

59

clang-format the entire file.

95

You don't need to typedef struct A {} A; in C++, just struct A {}; is enough.

113

You don't need struct here (C versus C++).

116

You can use a delegating constructor to FlowGraphNode(Instruction * Inst), though it doesn't look like this constructor is used, so you could mark it as = delete instead.

171

Does this ever happen? It seems like it's an error and should be an assert.

180

Why 1 in all cases? That seems pretty low, especially since this is all stack allocated it's pretty cheap to start with 4 or 8, though real-world numbers are better.

210

What is this loop doing? Putting it in its own function could make things easier to understand.

428

I'd avoid the macros here and in other places. You have a typedef for the types, so it would be better to use std::numeric_limits.

Thanks for the review !

Enabling it for ARM/x86 is really easy: it is just a cut-&-paste of the code added by this patch for the Power backend, adapting the arguments to the pass constructor to give it the right fences. I did not do it in this patch to avoid having to change it all if the interface to the pass changes during review (which is looking likely).

I will try to add a bigger example, it is just time consuming and I wanted to get feedback as soon as possible.

For the interaction with dot, it would probably be conceptually easy but require a fair bit of plumbing, I will look at how the -view-isel-dags and friends options work.

The main difficulty with profiling this patch is that the pass will basically do nothing on code without atomics (well, break critical edges followed by block frequency info, followed by going through every instruction and testing them for seeing if they are fences). So I would need a codebase that is at the same time large, full of atomics, and with an easy to hack build system so that I could cross-compile it for power.... So I can test a few small testcases, but I don't see how to do large-scale profiling. If the codebase has no atomic, the cost of this pass is completely dominated by its requirements.

Answers to the inline comments below, I will update this patch to apply your suggestions soon.

include/llvm/Transforms/Scalar.h
44

OK for removing the default argument.

For the ordering, I will try to see if having the backend pass an ArrayRef<> of the different fences, sorted by strength works. The main weakness is that it would make it a total ordering which is fine for most architectures but may fail for some (only Alpha which is not supported by LLVM out of my head, but I don't know every architecture out there). How could the backend cleanly specify a partial ordering ?

lib/Target/PowerPC/PPCTargetMachine.cpp
20

OK

lib/Transforms/Scalar/FencesPRE.cpp
53

OK

59

OK

95

Thanks, I didn't know that.

113

Same

116

I will look into it, I think I had to add this constructor to silence a warning, it should indeed be unused.

171

OK.

180

My reasoning is that most function will have no fences, so 0 element for these, or just be a tiny helper with only one atomic access. But I can easily change it to 4 or 8 as it is so cheap.

210

OK, I will try to put it into its own function, I hoped the comments inside would be enough.

428

OK (sorry I forgot this, I remember you suggested it already).

morisset updated this object.Oct 16 2014, 11:21 AM

I missed your idea of running this pass for C++11 fences instead. In short: this is probably useless, as C++11 fences are expected to be rather rare (compared to atomic accesses that result in target-specific fences), and the semantics of C++11 fences are hairy enough I don't want to try without a clear benefit. So while the infrastructure could support it, I don't plan on doing it.

morisset updated this revision to Diff 15039.Oct 16 2014, 12:24 PM

Apply clang-format, fix small style issues pointed by jfb.
I haven't done the bigger modifications yet.

morisset added inline comments.Oct 17 2014, 2:30 PM
include/llvm/Transforms/Scalar.h
44

According to the LLVM programmer's manual, std::function is fine for closures if they are to be stored (like here). And they are used in a few places in the code.

An additional difficulty with using an ArrayRef, is that there are lots of information we may want to compare: the intrinsic ID, the operands to the intrinsic, and possibly metadata in the future. So it would be an ArrayRef of tuples of monstrous arity. Is it ok if I keep the current architecture at least for now ?

jfb added inline comments.Oct 17 2014, 2:58 PM
include/llvm/Transforms/Scalar.h
44

I'm convinced :)

morisset updated this revision to Diff 15195.Oct 21 2014, 10:59 AM

Add an option "-view-fences-pre-flow-graph" to visualize the graph being used,
per the request of jfb (it makes things much easier to debug).

Also some general cleanup.

Reviews most welcome !

morisset updated this revision to Diff 15211.Oct 21 2014, 2:41 PM

Initialize the FlowGraphNode::Next iterator more cleanly

Hi Hal,

Thanks for accepting at the LLVM conference to take a look at this patch !

dberlin added a subscriber: dberlin.Nov 3 2014, 6:49 PM

Wheee.
This brings me back to the world of speculative PRE.

There is a little known paper that makes this actually practical without doing min-cut work.
A person who was at IBM did some work on this.

The paper you want is:
http://link.springer.com/chapter/10.1007%2F11860990_22
A better version is in David J. Pereira's thesis
See https://dspace.library.uvic.ca/bitstream/handle/1828/292/Pereira_0336403_Thesis-Final_21Dec2007_.pdf?sequence=1&isAllowed=y

This should enable you to do what you want without min-cut, paying a small cost.

I know it's *actually* been implemented in production compilers.

jfb added a comment.Nov 5 2014, 8:07 AM

(Discussing this offline with Danny)

I'd like measurements for your approach. I'm not sure it'll be slow on big functions because your graph is a subset of all the IR in the function.

As we discussed, csmith should be able to generate big functions with lots of fences. Measuring runtime at different CFG size and fence+surrounding memory size would be useful, and basic profiling would show where overheads are in the current code.

I think that should inform whether another algorithm should be tried.

Hey, if you can get it to work fast with a min-cut formulation on programs
with lots of fences, that would be awesome :)
As I told JF, I haven't played with such formulations in about 7 years, so
it's entirely possible the world is a better place now.

They are certainly conceptually simpler to work with and reason about :)

Have you been able to get compile-time data for large functions yet? We should move this forward unless there are practical problems with the min-cut algorithm.

lib/Target/PowerPC/PPCTargetMachine.cpp
184

If doing this is generally a good thing, then we should always do it (and on what target would that not be true?).

Otherwise, the CFG simplification pass is mostly a wrapper around the SimplifyCFG utility function, and perhaps it should just be called directly from the FencesPRE pass?

Hi,

To dberlin: I looked at this article, but they explain well that the reason min-cut is so expensive for PRE is because it must be repeated for each computation in the function (of which there can be 10s of thousand in very large function) and must look at a potentially huge graph. In comparison we only run this twice: once for hwsync and one for lwsync. Furthermore, because the graph is stopped by any memory access (and not just use/kill of some very specific computation as in PRE), I expect each of these runs of min-cut to be quite cheap. I have not had the time to benchmark the compile-time cost of this pass (deadline tomorrow for PLDI..), but in summary I expect it to be small, even for large functions full of fences.

Thanks for the comments.

lib/Target/PowerPC/PPCTargetMachine.cpp
184

I agree it might be a good thing to run it anyway on all targets, but some tests (at least) on Power contain conditional jumps based on undef, and SimplifyCFG makes a complete mess of them.

The cleanup is mostly because of requiring BreakCriticalEdges (that I have not found how to do on demand while preserving BlockFrequencyInfo yet), so calling it directly from FencesPRE would not solve the issue.

jfb added a comment.Nov 12 2014, 7:02 AM

I agree it might be a good thing to run it anyway on all targets, but

some tests (at least) on Power contain conditional jumps based on undef,
and SimplifyCFG makes a complete mess of them.

Would it be worth fixing the tests and turning on SimplifyCFG in a first
pass? Or as you suggested having a disable flag for this (icky...).

In D5758#28, @jfb wrote:

I agree it might be a good thing to run it anyway on all targets, but

some tests (at least) on Power contain conditional jumps based on undef,
and SimplifyCFG makes a complete mess of them.

Would it be worth fixing the tests and turning on SimplifyCFG in a first
pass? Or as you suggested having a disable flag for this (icky...).

There might be no good solution here if we want to run SimplifyCFG generally -- these are bugpoint-generated tests, and thus have a lot of undefs, and will be sensitive to this kind of change. I think a flag is fine.

As noted below, however, a better solution is to have the new fences pass call the associated utility functions directly -- and as noted, this might require some amount of improvement to those utility functions to preserve BFI.

lib/Target/PowerPC/PPCTargetMachine.cpp
184

BreakCriticalEdges is a simple wrapper around the SplitCriticalEdge utility function. If we need to teach SplitCriticalEdge how to preserve BlockFrequencyInfo, then let's do that.

reames removed a reviewer: reames.Dec 29 2014, 6:20 PM
rengolin resigned from this revision.Mar 10 2015, 4:41 AM
rengolin removed a reviewer: rengolin.