This is an archive of the discontinued LLVM Phabricator instance.

[GVN] Rewrite IsValueFullyAvailableInBlock()
ClosedPublic

Authored by lebedev.ri on Jul 20 2020, 9:20 AM.

Details

Summary

While this doesn't appear to help with the perf issue being exposed by D84108,
the function as-is is very weird, convoluted, and what's worse, recursive.

There was no need for SpeculativelyAvaliableAndUsedForSpeculation,
tri-state choice is enough. We don't even ever check for that state.

The basic idea here is that we need to perform a depth-first traversal of the
predecessors of the basic block in question, either finding a preexisting
state for the block in a map, or inserting a "placeholder" SpeculativelyAvaliable,

If we encounter an Unavaliable block, then we need to give up search,
and back-propagate the Unavaliable state to the each successor of said block,
more specifically to the each SpeculativelyAvaliable we've just created.

However, if we have traversed entirety of the predecessors and have not
encountered an Unavaliable block, then it must mean the value is fully available.
We could update each inserted SpeculativelyAvaliable into a Avaliable,
but we don't need to, as assertion excersizes, because we can assume that if
we see an SpeculativelyAvaliable entry, it is actually Avaliable,
because during the time we've produced it, if we would have found that it
has an Unavaliable predecessor, we would have updated it's successors,
including this block, into Unavaliable

Diff Detail

Event Timeline

lebedev.ri created this revision.Jul 20 2020, 9:20 AM
xbolva00 added inline comments.
llvm/lib/Transforms/Scalar/GVN.cpp
669

Available

(Same typo below)

lebedev.ri marked 2 inline comments as done.

Fix typo Avaliable->Available everywhere in the patch.

llvm/lib/Transforms/Scalar/GVN.cpp
669

Ugh, thanks!

nikic added a comment.Jul 20 2020, 2:55 PM

While this doesn't appear to help with the perf issue being exposed by D84108, the function as-is is very weird, convoluted, and what's worse, recursive.

Would it be possible to share the pre-GVN IR for the problematic case? (Independently of this cleanup, just curious)

But i'm not sure instructions is *really* the right metric here, Because task-clock stat really improved, by -1.5% .. -5% all across the board.

This is just (very large) noise. I've added a disclaimer to the page that the task-clock/wall-time results are best ignored.

lattner resigned from this revision.Jul 20 2020, 10:02 PM
lebedev.ri edited the summary of this revision. (Show Details)Jul 21 2020, 2:14 AM

While this doesn't appear to help with the perf issue being exposed by D84108, the function as-is is very weird, convoluted, and what's worse, recursive.

Would it be possible to share the pre-GVN IR for the problematic case? (Independently of this cleanup, just curious)

I don't have any particular problematic case in mind, but here's pre-GVN (i disabled all passes in pipeline starting with GVN)
llvm-test-suite/MultiSource/Applications/JM/lencod/context_ini.c (produced with D84108)

But i'm not sure instructions is *really* the right metric here, Because task-clock stat really improved, by -1.5% .. -5% all across the board.

This is just (very large) noise. I've added a disclaimer to the page that the task-clock/wall-time results are best ignored.

And here's standalone results:

y$ perf stat --repeat=100 /builddirs/llvm-project/build-Clang10-Release/bin/opt-old /tmp/lencod-context_ini-pre-gvn.bc -o /dev/null -gvn ; perf stat --repeat=100 /builddirs/llvm-project/build-Clang10-Release/bin/opt /tmp/lencod-context_ini-pre-gvn.bc -o /dev/null -gvn

 Performance counter stats for '/builddirs/llvm-project/build-Clang10-Release/bin/opt-old /tmp/lencod-context_ini-pre-gvn.bc -o /dev/null -gvn' (100 runs):

            795.35 msec task-clock                #    0.999 CPUs utilized            ( +-  0.04% )
                 3      context-switches          #    0.004 K/sec                    ( +-  4.66% )
                 0      cpu-migrations            #    0.000 K/sec                  
              2723      page-faults               #    0.003 M/sec                    ( +-  0.23% )
        3186511786      cycles                    #    4.006 GHz                      ( +-  0.04% )  (83.20%)
         522856359      stalled-cycles-frontend   #   16.41% frontend cycles idle     ( +-  0.09% )  (83.36%)
         956532477      stalled-cycles-backend    #   30.02% backend cycles idle      ( +-  0.11% )  (33.44%)
        2668685815      instructions              #    0.84  insn per cycle         
                                                  #    0.36  stalled cycles per insn  ( +-  0.04% )  (50.05%)
         514286258      branches                  #  646.619 M/sec                    ( +-  0.03% )  (66.64%)
          19609714      branch-misses             #    3.81% of all branches          ( +-  0.06% )  (83.24%)

          0.795919 +- 0.000313 seconds time elapsed  ( +-  0.04% )


 Performance counter stats for '/builddirs/llvm-project/build-Clang10-Release/bin/opt /tmp/lencod-context_ini-pre-gvn.bc -o /dev/null -gvn' (100 runs):

            787.80 msec task-clock                #    0.999 CPUs utilized            ( +-  0.04% )
                 3      context-switches          #    0.003 K/sec                    ( +-  6.11% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +- 70.35% )
              2703      page-faults               #    0.003 M/sec                    ( +-  0.20% )
        3156293784      cycles                    #    4.006 GHz                      ( +-  0.04% )  (83.23%)
         520963102      stalled-cycles-frontend   #   16.51% frontend cycles idle     ( +-  0.09% )  (83.24%)
         931974714      stalled-cycles-backend    #   29.53% backend cycles idle      ( +-  0.14% )  (33.52%)
        2681503031      instructions              #    0.85  insn per cycle         
                                                  #    0.35  stalled cycles per insn  ( +-  0.05% )  (50.28%)
         515472885      branches                  #  654.318 M/sec                    ( +-  0.03% )  (66.96%)
          18427045      branch-misses             #    3.57% of all branches          ( +-  0.08% )  (83.43%)

          0.788365 +- 0.000320 seconds time elapsed  ( +-  0.04% )

So there it's an improvement of -0.949% +- 0.04%

Though I don't see obvious problems in the new algorithm, I cannot wrap my mind around the old one and can't say if they are doing the same thing. I'd suggest the following reliable way to make sure it is NFC:

  • Instead of changing logic of IsValueFullyAvailableInBlock, introduce a new version of it and use it.
  • In debug mode, also call the old one and assert that the results are the same.
  • Then, after it has been in for a long enough while and we are certain those algorithms do the same thing, remove the old one.

WDYT?

llvm/lib/Transforms/Scalar/GVN.cpp
686–689

This name only makes sense because it is used as DFS traversal root. In function signature, BB is more clear.

lebedev.ri added a subscriber: lattner.

Though I don't see obvious problems in the new algorithm, I cannot wrap my mind around the old one

Yeah, i had that feeling myself :/ That's why i've put @lattner as reviewer,
since it was him who seems to have wrote/modified it in rL60408/rL60588 ~12 years ago.

and can't say if they are doing the same thing.

Any particular part that is causing trouble?
The code essentially consists of two separate parts, "maze solving" and backpropagation.

I think, backpropagation is obviously identical at least?
We start at the first block we've found to be an unavailable.
We need to mark each successor block, that isn't known to be available, as unavailable.
Also, to not pay for what we don't use, we should only mark the blocks that we care about,
i.e. those that are already present in the FullyAvailableBlocks map.
So we put said block into worklist, and do pretty typical worklist dance,
take block from worklist, and query (with insert!) it's availability status in a map.
WARNING that in the old version, if the block wasn't in a map already, this adds it as unavailable!
Then, if we have found out that it was unavailable, we just go to the next entry in worklist.
If it wasn't unavailable, we mark it as such
(WARNING: in old version, it would also mark available blocks as unavailable, which is obviously bogus,
while we should only do that for speculatively available blocks),
and proceed to further backpropagate that information to it's successors (by putting them into worklist).

The "maze solving" seems pretty intuitive in hindsight, too:
We start at some block, we want to know if some value is available in that block.
So we see if the map already contains an entry for the block, then we know our answer,
it is either explicitly marked as unavailable, or it's available/speculatively available.
Otherwise, as a perf optimization, we think it will be available, so that added an entry to the map,
marking the block as speculatively available.
Now, we need to check if the value can live-in from each(!) predecessor.
The value can't live-in if there are no predecessors and the block is unavailable.
If there are predecessors, we need to visit each one and do the all over again.

As soon as we find out that the block is unavailable, we have a problem,
we may have marked blocks as speculatively available along the way to the unavailable block.
It is important to notice that we don't deviate from our exploration path,
so all the blocks we have just marked as speculatively available are successors of this block.

Otherwise, if we never encounter an unavailable block, then we're all good.
Since we would have retrospectively fixed all the blocks we have set as speculatively available
as unavailable, from now on, all these speculatively available blocks are available.

All this logic is verified with asserts in the end.

I'd suggest the following reliable way to make sure it is NFC:

  • Instead of changing logic of IsValueFullyAvailableInBlock, introduce a new version of it and use it.
  • In debug mode, also call the old one and assert that the results are the same.
  • Then, after it has been in for a long enough while and we are certain those algorithms do the same thing, remove the old one.

WDYT?

I'm afraid it's not that easy to do.

  1. We'd need to workaround the situation where the old code gave up due to the gvn-max-recurse-depth recursion limit
  2. We'll need to duplicate FullyAvailableBlocks so we pass the identical one into both the old version and the new one. Note that we can't equality-compare them afterwards, because the new code, unlike the old one, in backpropagation, doesn't insert new nodes
  3. The new code doesn't (erroneously) mark previously available successor nodes of an unavailable node unavailable. This technically i guess makes it a non-NFC patch,
fhahn added a comment.Jul 23 2020, 2:02 AM

This overall makes sense I think.

I'd suggest the following reliable way to make sure it is NFC:

  • Instead of changing logic of IsValueFullyAvailableInBlock, introduce a new version of it and use it.
  • In debug mode, also call the old one and assert that the results are the same.
  • Then, after it has been in for a long enough while and we are certain those algorithms do the same thing, remove the old one.

WDYT?

I'm afraid it's not that easy to do.

  1. We'd need to workaround the situation where the old code gave up due to the gvn-max-recurse-depth recursion limit

I think we also need some kind of limit on the worklist size for the iterative algorithm. Without it we might traverse the whole function in some cases, which could lead to huge compile-times in some degenerate cases I think.

  1. We'll need to duplicate FullyAvailableBlocks so we pass the identical one into both the old version and the new one. Note that we can't equality-compare them afterwards, because the new code, unlike the old one, in backpropagation, doesn't insert new nodes
  2. The new code doesn't (erroneously) mark previously available successor nodes of an unavailable node unavailable. This technically i guess makes it a non-NFC patch,

Yes I think this shouldn't be marked as NFC.

It would be interesting however, how often the answers diverge (with respect the how the result is used in GVN). For example, it would be interesting to know how often PRE triggers in GVN on some large programs (e.g. MultiSource/SPEC/LLVM bootstrap) and how many differences codegen differences there are between the old and new implementation. Hopefully that would be a very small number, which might lead to a test case for the new behavior.

llvm/lib/Transforms/Scalar/GVN.cpp
687

Not sure about the name change. It's not clear to me what fixpoint means here. More accurately, its the map of blocks with known states, right? Whatever name is chosen, the comment above needs updating and we should keep the name consistent at the call sites.

753

The place of the comment next to the return seems a bit strange to me. IMO it would make more sense to have a comment above Worklist.append, stating we queue the successors to continue back-propagating.

759

Again, it is not entirely clear what "non-fixpoint" refers to here. It only means blocks marked as SpeculativelyAvailable, right? Also, back-propagating seems a bit counter-intuitive here at a first glance, if you think of the CFG as a directed graph (edges directed from BB to its successors).

Saying something like

// Okay, we have encountered an "unavailable" block.
// Mark SpeculativelAvailable blocks reachable from UnavailableBB as unavailable as well. Paths are terminated when they reach blocks no in Fixpoints or they are not marked as SpeculativelAvailable.
779

nit: could this just return !UnabilableBB?

lebedev.ri marked 5 inline comments as done.
lebedev.ri retitled this revision from [NFC][GVN] Rewrite IsValueFullyAvailableInBlock() to [GVN] Rewrite IsValueFullyAvailableInBlock().

This overall makes sense I think.

I'd suggest the following reliable way to make sure it is NFC:

  • Instead of changing logic of IsValueFullyAvailableInBlock, introduce a new version of it and use it.
  • In debug mode, also call the old one and assert that the results are the same.
  • Then, after it has been in for a long enough while and we are certain those algorithms do the same thing, remove the old one.

WDYT?

I'm afraid it's not that easy to do.

  1. We'd need to workaround the situation where the old code gave up due to the gvn-max-recurse-depth recursion limit

I think we also need some kind of limit on the worklist size for the iterative algorithm. Without it we might traverse the whole function in some cases, which could lead to huge compile-times in some degenerate cases I think.

Okay, added one. Right now it's based on the number of times we find a previously unknown block,
mark it as speculatively available, and recurse into it. I think this is the correct way,
i wouldn't think we should just count every block queried, even if we'd found an entry in map?

I don't have SPEC, and didn't try it on whole LLVM, but on vanilla test-suite + RawSpeed + darktable,
said IsValueFullyAvailableInBlockNumSpeculationsMax stat is at max 300, so i've picked a limit of 600.

  1. We'll need to duplicate FullyAvailableBlocks so we pass the identical one into both the old version and the new one. Note that we can't equality-compare them afterwards, because the new code, unlike the old one, in backpropagation, doesn't insert new nodes
  2. The new code doesn't (erroneously) mark previously available successor nodes of an unavailable node unavailable. This technically i guess makes it a non-NFC patch,

Yes I think this shouldn't be marked as NFC.

It would be interesting however, how often the answers diverge (with respect the how the result is used in GVN). For example, it would be interesting to know how often PRE triggers in GVN on some large programs (e.g. MultiSource/SPEC/LLVM bootstrap) and how many differences codegen differences there are between the old and new implementation. Hopefully that would be a very small number, which might lead to a test case for the new behavior.

Let's see if i can catch such a case..

This overall makes sense I think.

I'd suggest the following reliable way to make sure it is NFC:

  • Instead of changing logic of IsValueFullyAvailableInBlock, introduce a new version of it and use it.
  • In debug mode, also call the old one and assert that the results are the same.
  • Then, after it has been in for a long enough while and we are certain those algorithms do the same thing, remove the old one.

WDYT?

I'm afraid it's not that easy to do.

  1. We'd need to workaround the situation where the old code gave up due to the gvn-max-recurse-depth recursion limit

I think we also need some kind of limit on the worklist size for the iterative algorithm. Without it we might traverse the whole function in some cases, which could lead to huge compile-times in some degenerate cases I think.

Okay, added one. Right now it's based on the number of times we find a previously unknown block,
mark it as speculatively available, and recurse into it. I think this is the correct way,
i wouldn't think we should just count every block queried, even if we'd found an entry in map?

I don't have SPEC, and didn't try it on whole LLVM, but on vanilla test-suite + RawSpeed + darktable,
said IsValueFullyAvailableInBlockNumSpeculationsMax stat is at max 300, so i've picked a limit of 600.

  1. We'll need to duplicate FullyAvailableBlocks so we pass the identical one into both the old version and the new one. Note that we can't equality-compare them afterwards, because the new code, unlike the old one, in backpropagation, doesn't insert new nodes
  2. The new code doesn't (erroneously) mark previously available successor nodes of an unavailable node unavailable. This technically i guess makes it a non-NFC patch,

Yes I think this shouldn't be marked as NFC.

It would be interesting however, how often the answers diverge (with respect the how the result is used in GVN). For example, it would be interesting to know how often PRE triggers in GVN on some large programs (e.g. MultiSource/SPEC/LLVM bootstrap) and how many differences codegen differences there are between the old and new implementation. Hopefully that would be a very small number, which might lead to a test case for the new behavior.

Let's see if i can catch such a case..

Done. It's rather horrible, but better than nothing i guess.

fhahn added inline comments.Jul 24 2020, 1:37 AM
llvm/test/Transforms/GVN/loadpre-missed-opportunity.ll
4 ↗(On Diff #280271)

those should not be needed.

7 ↗(On Diff #280271)

Not used currently? I think the test below would benefit from having some actual users of %i & i7 that cannot be folded away.

66 ↗(On Diff #280271)

nit: unnecessary attribute/dso_local. Same for _Z2axv

lebedev.ri marked 4 inline comments as done.

@fhahn thanks for taking a look!
Indeed, the test is bad.

llvm/test/Transforms/GVN/loadpre-missed-opportunity.ll
7 ↗(On Diff #280271)

It would benefit indeed, but that breaks the test :/
It's overreduced, but i'm not sure what extra interestingness tests i should have used.

lattner removed a subscriber: lattner.Jul 24 2020, 8:31 AM
fhahn added a comment.Jul 24 2020, 9:22 AM

@fhahn thanks for taking a look!
Indeed, the test is bad.

Perhaps the problem was a missing read none on the use calls?

E.g. something like the following seems to work as expected. Might still be possible to remove some parts of the cfg.

define i32 @loadpre_opportunity(i32** %arg, i1 %arg1, i1 %arg2, i1 %arg3) {
bb:
  %i = load i32*, i32** %arg, align 8
  %i4 = getelementptr inbounds i32, i32* %i, i64 0
  br label %bb5

bb5:
  %v.1 = call i32 @use(i32* %i4)
  br label %bb9

bb6:
  %i7 = load i32*, i32** %arg, align 8
  %i8 = getelementptr inbounds i32, i32* %i7, i64 0
  %v.2 = call i32 @use(i32* %i8)
  br label %bb9

bb9:
  %p = phi i32 [ %v.1, %bb5 ], [ %v.2, %bb6]
  br i1 %arg1, label %bb6, label %bb10

bb10:
  call void @somecall()
  br i1 %arg2, label %bb12, label %bb15

bb12:
  br label %bb13

bb13:
  br i1 %arg3, label %bb14, label %bb13

bb14:
  br label %bb15

bb15:
  %c = call i1 @cond()
  br i1 %c, label %bb6, label %exit

exit:
  ret i32 %p
}

declare void @somecall()
declare i32 @use(i32*) readnone
declare i1 @cond() readnone

@fhahn thanks! that is indeed better.

fhahn added a comment.Jul 27 2020, 3:13 AM

Thanks for the test case! I think this makes sense and is in the spirit of the original implementation, with some improvements. Some more comments, mostly related to wording inline.

llvm/lib/Transforms/Scalar/GVN.cpp
102

For consistency, I would suggest updating the wording to use a similar style to above, e.g. Number of blocks speculated as available in IsVal... Same for the second stat.

116

Is the default not 600?

707

nit: use try_emplace to skip std::make_pair? (if that compiles)

756–757

nit: maybe adjust with something like as fixpoint yet (the Unavailable and Available states are fixpoints) to make clear what we mean with fixpoints.

772

As mentioned earlier, it seems odd to me to call propagating to successors as 'back-propagating'. I guess it makes sense if you think about it as back-propagating to the starting node. Might be good to clarify the comment. Might just say Queue successors for further processing.

llvm/test/Transforms/GVN/loadpre-missed-opportunity.ll
2 ↗(On Diff #280514)

also add a line with the limit set so we don't perform the optimization, to check that the limit works?

lebedev.ri marked 6 inline comments as done.

Thank you for taking a look!

Thanks for the test case! I think this makes sense and is in the spirit of the original implementation, with some improvements. Some more comments, mostly related to wording inline.

Nits addressed.

fhahn accepted this revision.Jul 27 2020, 4:10 AM

LGTM, thanks. Might be good to wait a day with committing, in case there are any additional concerns.

llvm/lib/Transforms/Scalar/GVN.cpp
757

nit: add period at end of sentence

This revision is now accepted and ready to land.Jul 27 2020, 4:10 AM

LGTM, thanks. Might be good to wait a day with committing, in case there are any additional concerns.

Thank you for the review!
I'll wait a bit.

This revision was landed with ongoing or failed builds.Jul 28 2020, 12:27 AM
This revision was automatically updated to reflect the committed changes.