This is an archive of the discontinued LLVM Phabricator instance.

[thinlto] Basic thinlto fdo heuristic
ClosedPublic

Authored by Prazek on Sep 15 2016, 3:31 PM.

Details

Summary

This patch improves thinlto importer
by importing 3x larger functions that are called from hot block.

I compared performance with the trunk on spec, and there
were about 2% on povray and 3.33% on milc. These results seems
to be consistant and match the results Teresa got with her simple
heuristic. Some benchmarks got slower but I think they are just
noisy (mcf, xalancbmki, omnetpp)- running the benchmarks again with
more iterations to confirm. Geomean of all benchmarks including the noisy ones
were about +0.02%.

I see much better improvement on google branch with Easwaran patch
for pgo callsite inlining (the inliner actually inline those big functions)
Over all I see +0.5% improvement, and I get +8.65% on povray.
So I guess we will see much bigger change when Easwaran patch will land
(it depends on new pass manager), but it is still worth putting this to trunk
before it.

Implementation details changes:

  • Removed CallsiteCount.
  • ProfileCount got replaced by Hotness
  • hot-import-multiplier is set to 3.0 for now,

didn't have time to tune it up, but I see that we get most of the interesting
functions with 3, so there is no much performance difference with higher, and
binary size doesn't grow as much as with 10.0.

Diff Detail

Repository
rL LLVM

Event Timeline

Prazek updated this revision to Diff 71571.Sep 15 2016, 3:31 PM
Prazek retitled this revision from to [thinlto] Basic thinlto fdo heuristic.
Prazek updated this object.
Prazek added reviewers: tejohnson, eraman, mehdi_amini.
Prazek added a subscriber: llvm-commits.
tejohnson edited edge metadata.Sep 16 2016, 7:40 AM

Thanks!

I haven't gone through your new test cases yet, but there are a few comments below so far (one correctness issue with handling old versions).

lib/Analysis/ModuleSummaryAnalysis.cpp
66 ↗(On Diff #71571)

Add blank line before

lib/Bitcode/Reader/BitcodeReader.cpp
6220 ↗(On Diff #71571)

Use something like OldProfileFormat to be more explicit. That way the name of this variable won't need changing when we bump the version again.

6303 ↗(On Diff #71571)

Only skipping 2 fields if HasProfile, otherwise just skip 1 (callsite_count). Can you make sure there is a test that has the old format bitcode (both with and without old profile)? I.e. you would commit old bitcode - there are some existing tests that do this (look at the .bc files committed in test/Bitcode for examples).

6362 ↗(On Diff #71571)

Record descriptions in include/llvm/Bitcode/LLVMBitCodes.h need to be updated accordingly.

6393 ↗(On Diff #71571)

Ditto here.

lib/LTO/ThinLTOCodeGenerator.cpp
382 ↗(On Diff #71571)

The PSI is not too useful without a BFI - perhaps construct one here as we do e.g. in PartialInlinerImpl::unswitchFunction? Ok as a follow-on though. Otherwise change PSI to be passed as an optional pointer.

lib/Transforms/IPO/FunctionImport.cpp
287 ↗(On Diff #71571)

Add a FIXME about using the Code hotness to reduce threshold

Prazek updated this revision to Diff 71672.Sep 16 2016, 10:30 AM
Prazek marked 4 inline comments as done.
Prazek edited edge metadata.
  • small fixes
Prazek added inline comments.Sep 16 2016, 10:35 AM
lib/Bitcode/Reader/BitcodeReader.cpp
6303 ↗(On Diff #71571)

good catch. I didn't know how to properly test it :(

mehdi_amini edited edge metadata.Sep 16 2016, 10:43 AM

Great to see this!

Can you add a test with three functions, hot/cold/unknown, and show the impact of the importing with a fixed -import-instr-limit and varying -import-hot-multiplier?

lib/Analysis/ModuleSummaryAnalysis.cpp
113 ↗(On Diff #71672)

Can you split the statement:

auto Hotness = ScaledCount ? getHotness(ScaledCount.getValue(), PSI)
                        : CalleeInfo::HotnessType::Unknown;
CallGraphEdges[CalleeId].updateHotness(Hotness);

Same number of lines, but the flow is easier to read.

lib/LTO/ThinLTOCodeGenerator.cpp
382 ↗(On Diff #71672)

At this point, not having PSI (i.e. making it optional is fine).

Great to see this!

Can you add a test with three functions, hot/cold/unknown, and show the impact of the importing with a fixed -import-instr-limit and varying -import-hot-multiplier?

Good point, I forgot to test this.

lib/Bitcode/Reader/BitcodeReader.cpp
6303 ↗(On Diff #71672)

Do you have some idea how this test should look like?

I was thinking about saving bc files from old version, and then reading them and see what happens.
Is there a way to just read bc file and then write it again? I could then compare the output with some existing test.

lib/LTO/ThinLTOCodeGenerator.cpp
382 ↗(On Diff #71672)

Will look into your example in a bit, but it seems to add additional handling of not having PSI, where PSI already have this logic inside.

Prazek added inline comments.Sep 16 2016, 4:56 PM
lib/Bitcode/Reader/BitcodeReader.cpp
6303 ↗(On Diff #71672)

I guess this would be good
llvm-dis < %s.bc -o - | llvm-as | llvm-dis | FileCheck

mehdi_amini added inline comments.Sep 16 2016, 4:58 PM
lib/Bitcode/Reader/BitcodeReader.cpp
6303 ↗(On Diff #71672)

This does not read summaries.
Try llvm-lto -thinlto-index-stats %s.bc

Is it in some recent patch right?

Prazek updated this revision to Diff 72220.Sep 22 2016, 3:39 PM
Prazek edited edge metadata.
  • small fixes
  • Added regression test
  • something
  • fixes
  • added test
Prazek marked 7 inline comments as done.Sep 22 2016, 3:41 PM
tejohnson added inline comments.Sep 23 2016, 10:36 AM
lib/Bitcode/Reader/BitcodeReader.cpp
6457 ↗(On Diff #72220)

Blank line above (clang-format should fix?), and document

6464 ↗(On Diff #72220)

I think this block would be clearer if restructured like:

if (IsOldProfileFormat) {

I += 1; // Skip old callsitecount field
if (HasProfile)
   I += 1; // Skip old profilecount field

} else

Hotness = static_cast<CalleeInfo::HotnessType>(Record[++I]);

...

(Note the ++I change in the Record access too)

lib/LTO/ThinLTOCodeGenerator.cpp
382 ↗(On Diff #72220)

Not sure I follow. The example was to show how to also create a BFI, so that the PSI is useful. But instead you could follow Mehdi's suggestion and make the PFI argument optional.

lib/Transforms/IPO/FunctionImport.cpp
288 ↗(On Diff #72220)

s/Fixme/FIXME/

test/Transforms/FunctionImport/hotness_based_import.ll
14 ↗(On Diff #72220)

Also check that hot3 not imported?

19 ↗(On Diff #72220)

s/threat/treat/

27 ↗(On Diff #72220)

hot3?

30 ↗(On Diff #72220)

s/form/from/

38 ↗(On Diff #72220)

hot3?

53 ↗(On Diff #72220)

why is hot2 here?

54 ↗(On Diff #72220)

ditto for none1

Prazek added inline comments.Sep 23 2016, 12:59 PM
lib/Bitcode/Reader/BitcodeReader.cpp
6464 ↗(On Diff #72220)

what do you mean with "(Note the ++I change in the Record access too)"
I agree that your structure is better.

lib/LTO/ThinLTOCodeGenerator.cpp
382 ↗(On Diff #72220)

I will look into that, but my point is that PSI already have handling of not having the profile data, so when asked if count is hot/cold it returns always false.
changing it to optional will require checking if PSI is set. It won't be big husstle, but I guess it will make it more clean that the PSI does not always
returns useful information

test/Transforms/FunctionImport/hotness_based_import.ll
14 ↗(On Diff #72220)

oh yes

53 ↗(On Diff #72220)

to check if it takes max of the hotness of block. It is also in hot block.

54 ↗(On Diff #72220)

same

Prazek marked 9 inline comments as done.Sep 23 2016, 5:51 PM
Prazek updated this revision to Diff 72384.Sep 23 2016, 5:54 PM
  • [thinlto] replace profile count with hotness
  • fixes
Prazek updated this revision to Diff 72549.Sep 26 2016, 1:16 PM
  • fixes
Prazek updated this revision to Diff 72551.Sep 26 2016, 1:25 PM
  • fixes
tejohnson accepted this revision.Sep 26 2016, 1:25 PM
tejohnson edited edge metadata.

LGTM thanks!

lib/Bitcode/Reader/BitcodeReader.cpp
6464 ↗(On Diff #72220)

what do you mean with "(Note the ++I change in the Record access too)"

I just was calling attention to the fact that this line was different than what your old structure required. Looks like you adapted this change fine.

This revision is now accepted and ready to land.Sep 26 2016, 1:25 PM
This revision was automatically updated to reflect the committed changes.

I'm not sure this is correct, see inline.

llvm/trunk/lib/Transforms/IPO/FunctionImport.cpp
291

This does not seem correct to me. The multiplier should be apply only on the initial "seed" of the call chain, not at every chain.

368

Here the ImportInstrLimit can benefit of the multiplier.

tejohnson added inline comments.Sep 27 2016, 1:53 PM
llvm/trunk/lib/Transforms/IPO/FunctionImport.cpp
291

The multiplier is applied to each call independently based on whether it is marked hot. E.g. if we have:
A -> B (hot call)
and then B has 2 calls:
B -> C1 (hot call)
and
B -> C2 (cold call)

I assume you are referring to A as the "seed" of the call chain? We want to treat the two different calls from B differently, it doesn't matter that the call from A -> B is hot.

368

No since the calls from FuncSummary will be treated appropriately if they are hot.

Aggree what Teresa said. In other example:

A->B->C
where B and C are external, and C is called from hot block and B is called from normal block.

Threshold for B will be 100, and for C will be 70*3 (100*0.7*bonus)

mehdi_amini added inline comments.Sep 27 2016, 6:17 PM
llvm/trunk/lib/Transforms/IPO/FunctionImport.cpp
291

Ok I can see the logic. But I'm not convinced by the multiplier effect.
i.e. right now as you get further from the original function, the threshold would *increase* since if I follow correctly we multiply by 3 here and by 0.7 later?
I may miss something here.

i.e. with a sequence of A->B->C->D (all hot) the threshold evolves this way:

  • call to B: 100
  • call to C: 100*3*0.7 = 210
  • call to D: 210*3*0.7 = 441

I wonder if a Bonus wouldn't be more appropriate.

tejohnson added inline comments.Sep 28 2016, 6:21 AM
llvm/trunk/lib/Transforms/IPO/FunctionImport.cpp
291

Ok I can see the logic. But I'm not convinced by the multiplier effect.
i.e. right now as you get further from the original function, the threshold would *increase* since if I follow correctly
we multiply by 3 here and by 0.7 later?
I may miss something here.

i.e. with a sequence of A->B->C->D (all hot) the threshold evolves this way:

call to B: 100
call to C: 100*3*0.7 = 210
call to D: 210*3*0.7 = 441
I wonder if a Bonus wouldn't be more appropriate.

This isn't the case since (original) Threshold, not NewThreshold, is pushed into the Worklist. So the next level of callees again start with the original threshold, which is then decayed before being passed in here and multiplied. So in your example above, unless I am missing something, we get:

call to B: 100*3 = 300
call to C: 100*0.7*3 = 210
call to D: 100*0.7*0.7*3 = 147

mehdi_amini added inline comments.Sep 28 2016, 8:17 AM
llvm/trunk/lib/Transforms/IPO/FunctionImport.cpp
291

OK, you're right, I got confused by the name NewThreshold.

I still wonder if we shouldn't push another threshold with bonus on the stack.
And in fast this kind of what D24976 attempt to do.