Page MenuHomePhabricator

[CSSPGO] Introducing distribution factor for pseudo probe.
AcceptedPublic

Authored by hoy on Dec 14 2020, 6:39 PM.

Details

Summary

Sample re-annotation is required in LTO time to achieve a reasonable post-inline profile quality. However, we have seen that such LTO-time re-annotation degrades profile quality. This is mainly caused by preLTO code duplication that is done by passes such as loop unrolling, jump threading, indirect call promotion etc, where samples corresponding to a source location are aggregated multiple times due to the duplicates. In this change we are introducing a concept of distribution factor for pseudo probes so that samples can be distributed for duplicated probes scaled by a factor. We hope that optimizations duplicating code well-maintain the branch frequency information (BFI) based on which probe distribution factors are calculated. Distribution factors are updated at the end of preLTO pipeline to reflect an estimated portion of the real execution count.

This change also introduces a pseudo probe verifier that can be run after each IR passes to detect duplicated pseudo probes.

A saturated distribution factor stands for 1.0. A pesudo probe will carry a factor with the value ranged from 0.0 to 1.0. A 64-bit integral distribution factor field that represents [0.0, 1.0] is associated to each block probe. Unfortunately this cannot be done for callsite probes due to the size limitation of a 32-bit Dwarf discriminator. A 7-bit distribution factor is used instead.

Changes are also needed to the sample profile inliner to deal with prorated callsite counts. Call sites duplicated by PreLTO passes, when later on inlined in LTO time, should have the callees’s probe prorated based on the Prelink-computed distribution factors. The distribution factors should also be taken into account when computing hotness for inline candidates. Also, Indirect call promotion results in multiple callisites. The original samples should be distributed across them. This is fixed by adjusting the callisites' distribution factors.

Diff Detail

Event Timeline

hoy created this revision.Dec 14 2020, 6:39 PM
hoy requested review of this revision.Dec 14 2020, 6:39 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptDec 14 2020, 6:39 PM
hoy retitled this revision from [CSSPGO] Introducing distribution factor for pseudo probe to [CSSPGO] Introducing distribution factor for pseudo probe..Dec 14 2020, 6:46 PM
hoy edited the summary of this revision. (Show Details)
hoy updated this revision to Diff 312339.Dec 16 2020, 5:07 PM

Rebasing.

wmi added inline comments.Tue, Jan 12, 11:07 AM
llvm/include/llvm/IR/PseudoProbe.h
41

The bits in discriminator is a scare resource. Have you considered using less bits to represent probe distribution factor? I guess it is possible that using a little more coarse grain distribution factor won't affect performance.

llvm/include/llvm/Passes/StandardInstrumentations.h
273

Before PseudoProbeUpdate pass, there is no need to verify because PseudoProbeUpdate will make distribution factor consistent. PseudoProbeUpdate run in a late stage in the lto/thinlto prelink pipeline, and no many passes need the verification, so what is the major usage of PseudoProbeVerifier?

llvm/lib/Transforms/IPO/SampleProfileProbe.cpp
133–134

Why not issue warning/error message when verification fails? That will make enabling the verification in release compiler possible.

hoy added inline comments.Tue, Jan 12, 12:25 PM
llvm/include/llvm/IR/PseudoProbe.h
41

That's a good point. We are using seven bits to represent [0, 100] so that integral numbers can be distinguished. Yes, we could use fewer bits to represent, say 4 bits to represent only even numbers. We could also not use any bits here but instead use the distribution factor of the outer block probes when the competition of those bits are high. I can do an experiment to see how well that works.

llvm/include/llvm/Passes/StandardInstrumentations.h
273

Yeah, there's no need to verify intermediate passes. The verifier pass is just a handy utility that tracks those passes that do code duplication for debugging. Perhaps I should give it a better name like PseudoCloningTracker?

llvm/lib/Transforms/IPO/SampleProfileProbe.cpp
133–134

The verifier is for debugging only. It doesn't really do any verification. It just helps to track code duplication. Sorry for the naming confusion.

hoy added inline comments.Wed, Jan 13, 10:12 PM
llvm/include/llvm/IR/PseudoProbe.h
41

On a second thought, using the distribution factor of block probes for call probe may not work well since a callsite may be surrounded by more than one block probes.

We could use also fewer bits like 6 bits to encode even numbers in the range [0, 100], or 5 bits to encoding multiples of 3 in [0, 100]. I did a profile quality measurement with the even number encoding. It's OK overall except for two SPEC benchmarks. I guess it's a trade-off we'll have to take when there's a competition on those bits.

hoy updated this revision to Diff 316567.Wed, Jan 13, 10:13 PM

Adding support in the priority-based inliner.

hoy edited the summary of this revision. (Show Details)Wed, Jan 13, 10:16 PM
wmi added inline comments.Wed, Jan 13, 10:40 PM
llvm/include/llvm/IR/PseudoProbe.h
41

Could you elaborate a little bit about the case that a callsite is surrounded by more than one block probe? Is it because bb merge like in cfg simplification?

hoy added inline comments.Wed, Jan 13, 10:58 PM
llvm/include/llvm/IR/PseudoProbe.h
41

Yes, block merge in cfg simplification is a good example. Inlining can also end up with callee code and caller code in one block. Jump threading or other cfg optimizations that convert a conditional jump into an unconditional jump can result in block merge too.

So far our way to track block weight for blocks with multiple probes is to take the maximum count out of those probes. When it comes to tracking callsite count, it is handy and accurate to attach a dedicated distribution factor for each individual call. For example, when a call is inlined, the inlinee's probes will be cloned into the caller, and they will be prorated based on the callsite's dedicated distribution factor.

wmi added inline comments.Thu, Jan 14, 11:11 AM
llvm/include/llvm/IR/PseudoProbe.h
41

Actually, I think we may be able to extend Discriminator and PseudoProbeDwarfDiscriminator. To emit Discriminator into Dwarf, we need to follow Dwarf standard about how many bits Discrminator is going to occupy. But inside compiler, Discriminator is represented as MetaData so it hasn't to be 32bits. For example, we can extend Discriminator MetaData to be 64bits or even larger and specify only lower 32bits will be actually emitted into Dwarf section. For intermediate information like distribution factors, we can put it into the higher bits.

hoy added inline comments.Thu, Jan 14, 11:49 AM
llvm/include/llvm/IR/PseudoProbe.h
41

That's a good idea, I like that. Actually we thought about that int the past and our concern was about memory cost since the discriminator filed in DILexicalBlockFile metadata is not optional. It is probably OK for pseudo probe since discriminators are only used for callsites. It might be a problem with -fdebug-info-for-profiling where discriminators can be used more often.

It sounds to me extending the size of discriminator is desirable for pseudo probes and potentially FS-AFDO. It might be worth evaluating the cost at some time. What do you think?

wmi added inline comments.Thu, Jan 14, 10:58 PM
llvm/include/llvm/IR/PseudoProbe.h
41

Yes, it is worth evaluating the cost. It is only about intermediate data in compiler and it won't affect the binary and profile output, therefore it won't introduce backward compatibility issue. I think it is up to you to choose whether to evaluate it now or later.

llvm/lib/IR/PseudoProbe.cpp
65

Add assertion message.

llvm/lib/Transforms/IPO/SampleProfile.cpp
367–370

CallsiteCount will be the count before being prorated or after if CallsiteDistribution is not 1.0?

hoy added inline comments.Thu, Jan 14, 11:15 PM
llvm/include/llvm/IR/PseudoProbe.h
41

Sounds good. Will do a measurement for both -fpseudo-probe-for-profiling and -fdebug-info-for-profiling later.

llvm/lib/IR/PseudoProbe.cpp
65

Will do.

llvm/lib/Transforms/IPO/SampleProfile.cpp
367–370

It is the count after prorated. The prorated count will be used to guide inlining. For example, if a callsite is duplicated in LTO prelink, then in LTO postlink the two copies will get their own distribution factors and their prorated counts are used to decide if they should be inlined independently.

wmi accepted this revision.Fri, Jan 15, 9:53 AM

LGTM. Thanks.

llvm/lib/Transforms/IPO/SampleProfile.cpp
367–370

Ok, better comment it.

This revision is now accepted and ready to land.Fri, Jan 15, 9:53 AM
hoy updated this revision to Diff 317726.Tue, Jan 19, 4:52 PM
hoy marked an inline comment as done.

Addressing Wei's feedbacks.

Also prorating indirect callsite target count annotation.