This is an archive of the discontinued LLVM Phabricator instance.

[Coroutines] Improve rematerialization stage
ClosedPublic

Authored by dstuttard on Jan 26 2023, 5:41 AM.

Details

Summary

As originally implemented, the rematerialization of valid instructions across
the suspend point would iterate 4 times, meaning that up to 4 instructions could
be rematerialized.

This implementation changes that approach to instead build a graph of
rematerializable instructions, then move all of them. This is faster than the
original approach and is not limited to an arbitrary limit.

Diff Detail

Event Timeline

dstuttard created this revision.Jan 26 2023, 5:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 26 2023, 5:41 AM
dstuttard requested review of this revision.Jan 26 2023, 5:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 26 2023, 5:41 AM
sebastian-ne accepted this revision.Jan 27 2023, 10:55 AM

Looks good to me, but please give it a few days in case someone else has a comment.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
366–372

Maybe it makes sense to use a Set for the Worklist?

This revision is now accepted and ready to land.Jan 27 2023, 10:55 AM

LGTM -- one inline comment on style.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
360

Maybe the indentation here can be reduced a bit with early exiting out of the outermost if, moving the initialization of D out of the if, merging the two ifs above, and early exiting here as well?

jsilvanus accepted this revision.Jan 30 2023, 12:51 AM
ChuanqiXu requested changes to this revision.Jan 30 2023, 10:18 PM

Thanks for working on this! But a problem I found is that it is expensive to construct ReversePostOrderTraversal and this patch tries to construct it in a loop. So it looks not so good to me.

And I am wondering if it is necessary to have such a complex structure and algorithm. What I had in mind is that we can use a worklist to store the materialized instructions and we can operate on that list. So we can avoid duplicate and meaningless iterations. I feel this is easier to implement and it looks not bad.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
326

What does Children mean here?

387–388

Are the 2 methods used?

2227–2228

The comment looks not precise after we land this patch.

2251

It is expensive to create ReversePostOrderTraversal. So it looks not good to construct it in a loop.

2252

I feel it is not so necessary and helpful to declare the type for the iterator.

2902–2908

This is not good. It may cause the the behavior become inconsistent after we materialize DVI instructions. See https://github.com/llvm/llvm-project/issues/55276 for an example.

2929
2931–2934

We may prefer such styles to shorten the indentation.

2933–2934

nit: We can use auto if we can see the type in the right hand side.

2939–2941

It looks not bad to use auto in this case.

2953–2956

We can construct the IRBuilder in rewriteMaterializableInstructions and we don't need to clear the Spills clearly.

This revision now requires changes to proceed.Jan 30 2023, 10:18 PM
jsilvanus added inline comments.Jan 31 2023, 12:41 AM
llvm/lib/Transforms/Coroutines/CoroFrame.cpp
2251

Pedantically speaking, I'm not sure constructing the ReversePostOrderTraversal in a loop here is an issue: It being "expensive" just means it does the graph traversal in the constructor, so its run time is linear in the size of the graph.
But here we are using it to traverse *different* graphs, all of which have been constructed before, so the runtime can be amortized into the construction of those graphs, or also into the traversal that is done later.

What we should not do is re-creating ReversePostOrderTraversal iterator objects for the same graph in a loop, because that wastes runtime.

Still, one might argue that constructing all those graphs with overlapping nodes, i.e. possibly multiple graphs having a node for the same Instruction*, is a fundamental runtime issue. Not sure if that really can become an issue?

ChuanqiXu added inline comments.Jan 31 2023, 1:11 AM
llvm/lib/Transforms/Coroutines/CoroFrame.cpp
2251

Yeah, the key point here is that how many overlapping nodes there is. Have you measured the compile-time, run time performance or memory usages? Then we can have a better feeling. For example, we can decide if we want to limit the depth of the graph then.

dstuttard updated this revision to Diff 495212.Feb 6 2023, 10:39 AM
dstuttard marked 9 inline comments as done.

Addressing reviewer comment

Thanks for the feedback - see the comments and the udpated patch(es)

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
326

I just needed a reasonable name for the next nodes - they are defined as being one edge further away from the root of the graph, so seemed like a reasonable name to use.
Do you think something else would be better?

360

I think I get what you mean - I've updated it with less indenting.

366–372

Maybe - are you thinking that a set would remove the need to check for duplicates? I'm not sure it makes things much better - maybe it removes the needs to iterate the worklist, I can't remember if there's a requirement to do this in order though.

387–388

No - it appears they aren't!
Based on the examples for using the RPOT template I thought they were.

2227–2228

I'm not sure that the result of this patch is any different from what happened before - other than you might get more than 4 dependent instructions rematerialized.
What do you think needs changing here?

2251

This work was done to speed up materialization. We needed a lot of rematerialization to happen, and initially just increased the number of iterations from 4 to a larger number.
This didn't work very well and was extremely slow - hence this re-work.

I haven't done timings for smaller amounts of remat, but I can do that if you think it is useful.

I did wonder though if limiting the depth with an option might be useful - we want as much as possible, but that's probably not true for all applications.

I'm not sure about the overlapping nodes actually being an issue here - I did attempt to create a test case that demonstrated this, but I'm not sure I was entirely successful (all the tests ended up with the minimum set for the instructions being rematerialized).

2902–2908

I think this is here because I created the original patch on an older version of CoroFrame which did this.
Is removing this the right approach?

ChuanqiXu added inline comments.Feb 6 2023, 6:31 PM
llvm/lib/Transforms/Coroutines/CoroFrame.cpp
324

Since RematNode is only used in RematGraph. I feel slightly better to make RematNode a private class definition for RematGraph.

326

I guess operands or reversed_successor or something similar to that may be better. What's more important here is that we lack a lot of comments here for RematNode and RematGraph. Otherwise the code readers can't understand what they mean.

339

I feel slightly better to add an assertion here.

360

We can still improve this. For example:

if (Remats.count(N->Node)) 
    return;
if (Remats.count(D)) {
          // Already have this in the graph
          N->Children.push_back(Remats[D].get());
          continue;
}
366–372

Given the graph should be a DAG, the order should not be important here. So a set here may be better.

388

nit: It looks slightly better to provide a in-class definition for dump(). So we can reduce one !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP).

2227–2228

The comment is For every use of "the value". However, rewriteMaterializableInstructions don't have the value from the signature. Also the reader can't know what is RematGraph. So it may be hard for users to understand.

2251

I haven't done timings for smaller amounts of remat, but I can do that if you think it is useful.

I did wonder though if limiting the depth with an option might be useful - we want as much as possible, but that's probably not true for all applications.

I am not sure if it is a good idea to limit the depth too. I mean we need more data to make the decision. Since this change is not a pure win theoretically. There may be edge cases or everything would be fine. I am not blocking this. I just say we need more things to convince ourselves that this is a good change generally. Specially, folly may be a good choice to have a test. Or any other libraries that have a lot of coroutines.

2902–2908

Yes, we should remove it.

dstuttard updated this revision to Diff 495564.Feb 7 2023, 8:44 AM
dstuttard marked 11 inline comments as done.

More changes based on feedback

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
324

Defining the GraphTraits is harder if RematNode is a private class definition for RematGraph - but I agree there's no reason that RematNode shouldn't be a class definition in RematGraph, so I've done that.

326

I think I prefer Operands, so I've changed it to that.
I've also added some more comments explaining RematGraph and RematNode.

366–372

I don't think there's much of an advantage to doing this, so I've left it as a deque for now. I'm open to being convinced that a set is better, but I'd rather not rework this if possible.

2227–2228

I've reworded this. Hopefully it's clearer now.

2251

I tried using folly to test this - not sure my methodology is sound though:

Compile clang and clang++ with/without these changes
Set up to build folly by setting:
CC=/just/built/clang
CXX=/just/built/clang++
CXXFLAGS="-std=c++20"
CCACHE_DISABLE=1 (to allow for multiple build runs)

I think that should be sufficient to enable it?

Also ran folly tests
(Verified from build output that the correct compiler is being used).

Results:
Build time WITH changes (tried it a couple of times):
Run 1
real 1m58.627s
user 39m49.991s
sys 1m27.272s

Run 2
real 1m54.844s
user 39m55.198s
sys 1m28.330s

Test time WITH changes (multiple runs):
32.28 secs
41.39 secs
34.91 secs
36.18 secs
35.94 secs

Build time WITHOUT changes (multiple runs):
Run 1
real 1m55.352s
user 39m33.938s
sys 1m25.716s

Run 2
real 1m58.287s
user 39m30.488s
sys 1m26.247s

Run 3
real 1m54.915s
user 39m31.783s
sys 1m25.420s

Test time WITHOUT changes (multiple runs):
42.23 secs
36.24 secs
41.64 secs
40.92 secs

If this is valid - then it doesn't seem to make much difference. Arguably the test time is slightly better with the changes, but there seems to be more variation run-to-run than there is with/without.

ChuanqiXu accepted this revision.Feb 7 2023, 5:59 PM

LGTM. Thanks.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
2251

Good enough to know that this won't make things worse.

This revision is now accepted and ready to land.Feb 7 2023, 5:59 PM
dstuttard updated this revision to Diff 496088.Feb 9 2023, 4:42 AM

Removed assert that was incorrect (and causing build-bot pre-checkin failures)

dstuttard updated this revision to Diff 496091.Feb 9 2023, 4:45 AM

clang-format change

This revision was landed with ongoing or failed builds.Feb 13 2023, 3:06 AM
This revision was automatically updated to reflect the committed changes.