This is an archive of the discontinued LLVM Phabricator instance.

[MachineLateInstrsCleanup] Improve compile time for huge functions.
ClosedPublic

Authored by jonpa on Apr 4 2023, 6:56 AM.

Details

Summary

It was discovered that this pass can be slow on huge functions, meaning 20% compile time instead of the usual ~0.5% (with a test case spending ~19 mins on SystemZ just in the backend with llc).

The problem relates to the necessary clearing of the kill flags when a redundant instruction is removed. This was made by scanning backwards over the instructions until a use(/def) was found (I tried simply clearing all kill flags on the involved registers, but that did change a few cases of later optimizations, so it seems that this should not be simplified like that).

This patch replaces the scanning of instructions with a map from register to the last seen use (kill).

Compile time on the huge file has been remedied fully (-> 0.4%), but I see on SystemZ/SPEC a slight average compile time increase (0.46% -> 0.52%). This is small enough that it seems wise to go ahead and use the map to help the huge functions (or..?).

Some alternate experiments with this has already been made at https://reviews.llvm.org/D146810, but this turned out to be the simpler version with the same performance.

Fixes: https://github.com/llvm/llvm-project/issues/61397

Diff Detail

Event Timeline

jonpa created this revision.Apr 4 2023, 6:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 4 2023, 6:56 AM
jonpa requested review of this revision.Apr 4 2023, 6:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 4 2023, 6:56 AM
Herald added a subscriber: wdng. · View Herald Transcript
vpykhtin added inline comments.Apr 5 2023, 9:27 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
46

If the number of map pairs is usually small I would use SmallDenseMap as it has pre-allocated storage, especially when you're placing it in a vector

53

Capitalize mi to match coding style

craig.topper added inline comments.Apr 5 2023, 4:04 PM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
47

this just std::map::lookup?

54

return mi && mi->isIdenticalTo(*MI)

craig.topper added inline comments.Apr 5 2023, 4:07 PM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
47

Nevermind. I guess lookup is unique to DenseMap.

jonpa updated this revision to Diff 511345.Apr 6 2023, 2:59 AM
jonpa marked 5 inline comments as done.

Updated per review -- see inline comment about the map type.

llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
46

I tried replacing for SmallDenseMap here again, but no change in compile time at all that I could see.

vpykhtin accepted this revision.Apr 6 2023, 9:52 PM

LGTM, thank you!

This revision is now accepted and ready to land.Apr 6 2023, 9:52 PM
jonpa added a comment.Apr 7 2023, 3:43 AM

LGTM, thank you!

Thanks, Valery!

I'll wait a few days before committing in case there are any comments from anyone else on this...

arsenm added inline comments.Apr 7 2023, 4:15 PM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
46

I thought subclassing STL types was discouraged. I'd also default to using the LLVM map types if there's no observable difference anyway

jonpa added inline comments.Apr 8 2023, 2:29 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
46

My viewpoint was that if we use an LLVM map it looks like it's faster than STL, but if it isn't, we could use STL for now until we find a better way, and in the future it will be clear that this has not been improved upon yet.

I am fine either way, so I could change it if that's what the rest of you want...

jonpa updated this revision to Diff 513606.Apr 14 2023, 8:29 AM

Use SmallDenseMap instead.

Iteration change necessary due to this (at "// Clear any entries in map that MI clobbers."), so you might as well take another look...

jonpa requested review of this revision.Apr 14 2023, 8:30 AM

Everybody happy with this?

vpykhtin added inline comments.Apr 25 2023, 12:12 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
240

Sorry, I didn't notice this change first.

You can still use erase by iterator (it's slightly faster than another lookup) because you're using make_early_inc_range.

vpykhtin added inline comments.Apr 25 2023, 12:20 AM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
240

bad advice - DefI is already incremented after Register Reg = DefI.first.

May be you can leave previous variant of iteration, you can increment the iterator like this:

MBBDefs.erase(DefI++) instead of DefI = MBBDefs.erase(DefI)

jonpa updated this revision to Diff 517978.Apr 28 2023, 10:18 AM
jonpa marked 2 inline comments as done.

Sorry for delay (vaccation). Patch updated -- see inline comments.

llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
240

Yeah - I went for readability at first here as I did not see any change in compile time.

240

I did not find anything to convince me 100% that this use of erase with iterator is legal, so I will have to trust you on this ;-)

I think the readability is ok now as well with this.

vpykhtin added inline comments.Apr 29 2023, 7:39 PM
llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
240

Now you're incrementing DefI after erase and in general this is incorrect for STL containers (though it's not the case here) because erase invalidates iterators to erased elements.

MBBDefs.erase(DefI++) works because arguments are fully evaluated before function call, so this is equivalent to:

auto DefIBefore = DefI;
++DefI; // Now DefI is valid after erase
MBBDefs.erase(DefIBefore);

jonpa marked an inline comment as done.Apr 30 2023, 4:58 AM

It seems that this would result in this loop:

for (auto DefI = MBBDefs.begin(); DefI != MBBDefs.end();) {
  Register Reg = DefI->first;
  if (MI.modifiesRegister(Reg, TRI)) {
    MBBDefs.erase(DefI++);
    MBBKills.erase(Reg);
    continue;
  } else if (MI.findRegisterUseOperandIdx(Reg, false /*isKill*/, TRI) != -1)
    // Keep track of the last use seen so far.
    MBBKills[Reg] = &MI;
  DefI++;
}

I find that acceptable, but the iteration logic is slightly complicated. That's why I chose to ignore the extra lookup when calling erase(Reg) in the first place:

for (auto DefI : llvm::make_early_inc_range(MBBDefs)) {
  Register Reg = DefI.first;
  if (MI.modifiesRegister(Reg, TRI)) {
    MBBDefs.erase(Reg);
    MBBKills.erase(Reg);
  } else if (MI.findRegisterUseOperandIdx(Reg, false /*isKill*/, TRI) != -1)
    // Keep track of the last use seen so far.
    MBBKills[Reg] = &MI;
}

I compared both these versions just now on a z15 with your test case, but found no difference:

      ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
itr:   4.8783 (  0.4%)   0.0011 (  0.0%)   4.8794 (  0.4%)   4.8833 (  0.4%)  Machine Late Instructions Cleanup Pass
reg:   4.8384 (  0.4%)   0.0009 (  0.0%)   4.8393 (  0.4%)   4.8439 (  0.4%)  Machine Late Instructions Cleanup Pass

I guess this could mean that this test case does a lot of reuse of the earlier Candidate, so it doesn't call erase() here that much, maybe.

On SPEC/SystemZ they both average 0.52% compile time across all files, so there is no difference there either.

The situation may be different on other systems, I suppose. I was using GCC version used to build clang: 12.2.1. Do you see any compile time difference at all on your platform?

I would personally vote for the readability (ignore the erase(Reg) potential overhead), but I will let you the reviewer(s) decide :)

llvm/lib/CodeGen/MachineLateInstrsCleanup.cpp
240

Sorry,I thought you were implying that increment after erase() was fine with this SmallDenseMap...

jonpa marked an inline comment as done.Apr 30 2023, 4:58 AM
vpykhtin accepted this revision.May 3 2023, 12:50 AM

Ok, since this isn't critical I don't have strong preference to erase by the iterator, I'm fine with shorter and nicer version with make_early_inc_range :) Thank you :)

This revision is now accepted and ready to land.May 3 2023, 12:50 AM
This revision was landed with ongoing or failed builds.May 8 2023, 12:05 AM
This revision was automatically updated to reflect the committed changes.
jonpa added a comment.May 8 2023, 8:16 AM

Ok, since this isn't critical I don't have strong preference to erase by the iterator, I'm fine with shorter and nicer version with make_early_inc_range :) Thank you :)

Thanks for review!

I pushed it, but ran into a buildbot failure, which I fixed with 10f0158.