Page MenuHomePhabricator

[CSSPGO] Pseudo probe instrumentation for basic blocks.
Needs ReviewPublic

Authored by hoy on Aug 18 2020, 9:46 PM.

Details

Reviewers
wenlei
davidxl
wmi
Summary

This change introduces a low-cost instrumentation technique for AutoFDO, namely pseudo probe. Please see RFC here: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s

Being able to profile production binaries is a key advantage of AutoFDO over Instrumentation PGO, but it also comes with a big challenge. While using line number and discriminator as anchor for profile mapping incurs zero run time overhead for AutoFDO, it’s not as accurate as instrumented probes. This is because the instrumented probes are part of the IR, rather than metadata attached to the IR like !dbg. That has two implications: 1) it’s easier to maintain IR than metadata for optimization passes; 2) probe blocks some CFG transformations that can mess up profile correlation.
 
With the proposed pseudo instrumentation, we can achieve most of the benefit of instrumentation PGO in little runtime overhead. We instrument each basic block with a pseudo probe associated with the block Id. Unlike in PGO instrumentation where a counter is implemented as a persisting operation such as atomic read/write or runtime helper call, a pseudo probe is implemented as a dedicated intrinsic call with the IntrInaccessibleMemOnly attribute. The intrinsic comes with most of the semantics of a PGO counter but is much less optimization-intrusive. 
 
The pseudo probe intrinsic calls are on the IR throughout the optimization and code generation pipeline and are materialized as a piece of binary data stored in a separate .pseudo_probe data section. The section is then used to map binary samples back to blocks of CFG during profile generation. There are also no real machine instructions generated for a pseudo probe and the.pseudo_probe section won’t be loaded into memory at runtime, therefore they should incur very little runtime overhead.

Let's now look at an example. Given the following LLVM IR:

define internal void @foo2(i32 %x, void (i32)* %f) !dbg !4 {
bb0:
  %cmp = icmp eq i32 %x, 0
   br i1 %cmp, label %bb1, label %bb2
bb1:
   br label %bb3
bb2:
   br label %bb3
bb3:
   ret void
}

The instrumented IR will look like below. Note that each llvm.pseudoprobe intrinsic call represents a pseudo probe at a block, of which the first parameter is the GUID of the probe’s owner function and the second parameter is the probe’s ID.

define internal void @foo2(i32 %x, void (i32)* %f) !dbg !4 {
bb0:
   %cmp = icmp eq i32 %x, 0
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 1)
   br i1 %cmp, label %bb1, label %bb2
bb1:                                             
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 2)
   br label %bb3
bb2:                                              
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 3)
   br label %bb3
bb3:                                              
   call void @llvm.pseudoprobe(i64 837061429793323041, i64 4)
   ret void
}

One implication from pseudo-probe instrumentation is that the profile is now sensitive to CFG changes. We perform the pseudo instrumentation very early in the pre-LTO pipeline, before any CFG transformation. This ensures that the CFG instrumented and annotated is stable. We added SampleProfileProber that performs the pseudo instrumentation and runs independent of profile annotation.

An llvm.pseudoprobe intrinsic call will be lowered into a target-independent operation named PSEUDO_PROBE. The MIR shown below corresponds to the previous example. Note that block bb3 is duplicated into bb1 and bb2 where its probe is duplicated too. This allows for an accurate execution count to be collected for bb3, which is basically the sum of the counts of bb1 and bb2.

bb.0.bb0:
   frame-setup PUSH64r undef $rax, implicit-def $rsp, implicit $rsp
   TEST32rr killed renamable $edi, renamable $edi, implicit-def $eflags
   PSEUDO_PROBE 837061429793323041, 1, 0
   $edi = MOV32ri 1, debug-location !13; test.c:0
   JCC_1 %bb.1, 4, implicit $eflags

bb.2.bb2:
   PSEUDO_PROBE 837061429793323041, 3, 0
   PSEUDO_PROBE 837061429793323041, 4, 0
   $rax = frame-destroy POP64r implicit-def $rsp, implicit $rsp
   RETQ

bb.1.bb1:
   PSEUDO_PROBE 837061429793323041, 2, 0
   PSEUDO_PROBE 837061429793323041, 4, 0
   $rax = frame-destroy POP64r implicit-def $rsp, implicit $rsp
   RETQ

The target op PSEUDO_PROBE will be converted into a piece of binary data by the object emitter with no machine instructions generated.

As a starter patch of the whole pseudo probe work, this change focus on the block instrumentation part. The callsite instrumentation and the materialization/encoding of probes will come in separate changes.

A new clang switch -fpseudo-probe-for-profiling is added to enable AutoFDO with pseudo instrumentation, similar to -fdebug-info-for-profiling for AutoFDO.

Diff Detail

Event Timeline

hoy created this revision.Aug 18 2020, 9:46 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 18 2020, 9:46 PM
hoy requested review of this revision.Aug 18 2020, 9:46 PM
hoy updated this revision to Diff 286476.Aug 18 2020, 10:06 PM

Updating D86193: [CSSPGO] Pseudo probe instrumentation for basic blocks

hoy edited the summary of this revision. (Show Details)Aug 18 2020, 10:07 PM
hoy edited the summary of this revision. (Show Details)Aug 18 2020, 10:11 PM
hoy edited the summary of this revision. (Show Details)Aug 18 2020, 10:16 PM
hoy retitled this revision from [CSSPGO] Pseudo probe instrumentation for basic blocks to [CSSPGO] Pseudo probe instrumentation for basic blocks..
hoy added a reviewer: wenlei.
ychen added a subscriber: ychen.Aug 18 2020, 10:30 PM
hoy edited the summary of this revision. (Show Details)Aug 18 2020, 10:51 PM

A heads up -- I won't be able to review patch until mid Sept. Hope this is fine.

hoy updated this revision to Diff 286664.Aug 19 2020, 2:18 PM
hoy edited the summary of this revision. (Show Details)

Updating D86193: [CSSPGO] Pseudo probe instrumentation for basic blocks.

hoy added a comment.Aug 19 2020, 2:18 PM

A heads up -- I won't be able to review patch until mid Sept. Hope this is fine.

Thanks for the heads-up. That's fine. We can wait for your input.

wmi added a comment.Aug 23 2020, 11:35 AM

Thanks for the patch! A few questions:

probe blocks some CFG transformations that can mess up profile correlation.

Can you enumerate some CFG transformations which be blocked? Is it possible that some CFG transformations being blocked are actually beneficial for later optimizations?

Are the intrinsic probes counted when computing bb size and function size?

And could you split the patches into small parts for easier review. For example, Add the intrinsic support in IR and MIR. SampleProfileProbe pass. -fpseudo-probe-for-profiling support. changes in various passes.

hoy added a comment.Aug 24 2020, 11:20 AM
In D86193#2232609, @wmi wrote:

Thanks for the patch! A few questions:

probe blocks some CFG transformations that can mess up profile correlation.

Can you enumerate some CFG transformations which be blocked? Is it possible that some CFG transformations being blocked are actually beneficial for later optimizations?

There are some optimizations such as if-convert, tail call elimination, that were initially blocked by the pseudo probe intrinsic but is now unblocked by fixes included in this change. With the current change we do not see perf degradation out of SPEC and one of our internal large services.

The main optimizations left blocked intentionally are those that merge blocks for smaller code size, such as tail merge which is the opposite of jump threading. We believe that those optimizations are not very beneficial for performance and AutoFDO. But if things are changed we can always unblock them.

Are the intrinsic probes counted when computing bb size and function size?

That's a good question. On the IR level, pseudo probe intrinsics are treated in a similar way of the debug intrinsics and the side-effect intrinsics. On the MIR level, pseudo probe intrinsics are implemented as a StandardPseudoInstruction. So they should not be counted towards real code size.

And could you split the patches into small parts for easier review. For example, Add the intrinsic support in IR and MIR. SampleProfileProbe pass. -fpseudo-probe-for-profiling support. changes in various passes.

Thanks for the suggestion. Agreed the current patch is too big to review. Will come up with a list of breakdowns.

wmi added a comment.Aug 26 2020, 2:43 PM

There are some optimizations such as if-convert, tail call elimination, that were initially blocked by the pseudo probe intrinsic but is now unblocked by fixes included in this change. With the current change we do not see perf degradation out of SPEC and one of our internal large services.
The main optimizations left blocked intentionally are those that merge blocks for smaller code size, such as tail merge which is the opposite of jump threading. We believe that those optimizations are not very beneficial for performance and AutoFDO.

If the optimizations are not very beneficial for performance and AutoFDO and should be blocked, it may be better to block them in a more general way and not depend on pseudo probe, because blocking them may also be beneficial for debug info based AutoFDO.

Another reason is that pseudo probe looks pretty much like debug information to me. They are used to annotate the IR but shouldn't affect the transformation. Binaries built w/wo debug information are required to be identical in LLVM. I think that requirement could be applied on pseudo probe as well. It is even better to have some test to enforce it so that no change in the future could break the requirement.

hoy added a comment.Aug 26 2020, 4:39 PM
In D86193#2240353, @wmi wrote:

There are some optimizations such as if-convert, tail call elimination, that were initially blocked by the pseudo probe intrinsic but is now unblocked by fixes included in this change. With the current change we do not see perf degradation out of SPEC and one of our internal large services.
The main optimizations left blocked intentionally are those that merge blocks for smaller code size, such as tail merge which is the opposite of jump threading. We believe that those optimizations are not very beneficial for performance and AutoFDO.

If the optimizations are not very beneficial for performance and AutoFDO and should be blocked, it may be better to block them in a more general way and not depend on pseudo probe, because blocking them may also be beneficial for debug info based AutoFDO.

In theory, yes, we should have a black list of transforms (mainly related to block merge) that are not needed by AutoFDO and block them. In reality it might take quite some efforts to figure them out. Pseudo probe, on the other hand, starts with blocking those transforms in the first place and relax the ones that might actually help AutoFDO.

Another reason is that pseudo probe looks pretty much like debug information to me. They are used to annotate the IR but shouldn't affect the transformation. Binaries built w/wo debug information are required to be identical in LLVM. I think that requirement could be applied on pseudo probe as well. It is even better to have some test to enforce it so that no change in the future could break the requirement.

Good point! Yes, pseudo probe is implemented in a similar way with the debug intrinsics. However they are not guaranteed to not affect the codegen since its main purpose is to achieve an accurate profile correlation with low cost. Regarding the cost, it sits somewhere between the debug intrinsics and the PGO instrumentation and close to a zero cost in practice. Agreed that it would be better to have tests protect the pseudo probe cost from going too high, but not sure which optimizations we should start with. Maybe to start with some critical optimizations like inlining, vectorization?

wmi added a comment.Aug 26 2020, 5:12 PM
In D86193#2240502, @hoy wrote:
In D86193#2240353, @wmi wrote:

There are some optimizations such as if-convert, tail call elimination, that were initially blocked by the pseudo probe intrinsic but is now unblocked by fixes included in this change. With the current change we do not see perf degradation out of SPEC and one of our internal large services.
The main optimizations left blocked intentionally are those that merge blocks for smaller code size, such as tail merge which is the opposite of jump threading. We believe that those optimizations are not very beneficial for performance and AutoFDO.

If the optimizations are not very beneficial for performance and AutoFDO and should be blocked, it may be better to block them in a more general way and not depend on pseudo probe, because blocking them may also be beneficial for debug info based AutoFDO.

In theory, yes, we should have a black list of transforms (mainly related to block merge) that are not needed by AutoFDO and block them. In reality it might take quite some efforts to figure them out. Pseudo probe, on the other hand, starts with blocking those transforms in the first place and relax the ones that might actually help AutoFDO.

Another reason is that pseudo probe looks pretty much like debug information to me. They are used to annotate the IR but shouldn't affect the transformation. Binaries built w/wo debug information are required to be identical in LLVM. I think that requirement could be applied on pseudo probe as well. It is even better to have some test to enforce it so that no change in the future could break the requirement.

Good point! Yes, pseudo probe is implemented in a similar way with the debug intrinsics. However they are not guaranteed to not affect the codegen since its main purpose is to achieve an accurate profile correlation with low cost. Regarding the cost, it sits somewhere between the debug intrinsics and the PGO instrumentation and close to a zero cost in practice.

I see. It makes sense to fix up some important transformations to achieve the goal of low cost. To achieve the goal of not affecting codegen needs a lot more effort to test and fix up all over the pipeline. I don't mean to have it ready in the patch, but I think it maybe something worthy to strive for later on.

Agreed that it would be better to have tests protect the pseudo probe cost from going too high, but not sure which optimizations we should start with. Maybe to start with some critical optimizations like inlining, vectorization?

The test I have in my mind comes from debug info. It is to bootstrap llvm with and without debug information. The test is to check whether the binaries built after stripping the debug information are identical. I am thinking pseudo probe can have such test setup somewhere sometime in the future. Same as above, it doesn't have to be ready currently.

hoy added a comment.Aug 26 2020, 5:26 PM
In D86193#2240596, @wmi wrote:
In D86193#2240502, @hoy wrote:
In D86193#2240353, @wmi wrote:

There are some optimizations such as if-convert, tail call elimination, that were initially blocked by the pseudo probe intrinsic but is now unblocked by fixes included in this change. With the current change we do not see perf degradation out of SPEC and one of our internal large services.
The main optimizations left blocked intentionally are those that merge blocks for smaller code size, such as tail merge which is the opposite of jump threading. We believe that those optimizations are not very beneficial for performance and AutoFDO.

If the optimizations are not very beneficial for performance and AutoFDO and should be blocked, it may be better to block them in a more general way and not depend on pseudo probe, because blocking them may also be beneficial for debug info based AutoFDO.

In theory, yes, we should have a black list of transforms (mainly related to block merge) that are not needed by AutoFDO and block them. In reality it might take quite some efforts to figure them out. Pseudo probe, on the other hand, starts with blocking those transforms in the first place and relax the ones that might actually help AutoFDO.

Another reason is that pseudo probe looks pretty much like debug information to me. They are used to annotate the IR but shouldn't affect the transformation. Binaries built w/wo debug information are required to be identical in LLVM. I think that requirement could be applied on pseudo probe as well. It is even better to have some test to enforce it so that no change in the future could break the requirement.

Good point! Yes, pseudo probe is implemented in a similar way with the debug intrinsics. However they are not guaranteed to not affect the codegen since its main purpose is to achieve an accurate profile correlation with low cost. Regarding the cost, it sits somewhere between the debug intrinsics and the PGO instrumentation and close to a zero cost in practice.

I see. It makes sense to fix up some important transformations to achieve the goal of low cost. To achieve the goal of not affecting codegen needs a lot more effort to test and fix up all over the pipeline. I don't mean to have it ready in the patch, but I think it maybe something worthy to strive for later on.

Sounds good, we will be accumulating a list of AutoFDO-unfriendly transforms over time.

Agreed that it would be better to have tests protect the pseudo probe cost from going too high, but not sure which optimizations we should start with. Maybe to start with some critical optimizations like inlining, vectorization?

The test I have in my mind comes from debug info. It is to bootstrap llvm with and without debug information. The test is to check whether the binaries built after stripping the debug information are identical. I am thinking pseudo probe can have such test setup somewhere sometime in the future. Same as above, it doesn't have to be ready currently.

I like the idea. It would catch a regression on pseudo probe with new optimization changes. Let me think about it. Thanks!

wmi added a comment.Sep 21 2020, 11:35 AM

The patches split from the main one look good to me. Please see if David has further comments.

hoy added a comment.Sep 23 2020, 11:44 AM

@davidxl I'm wondering if it is a good time for you to start reviewing the patches. Please let me know if you need more time. Thanks!