This is an archive of the discontinued LLVM Phabricator instance.

[AddDiscriminators] Assign discriminators to memset/memcpy/memmove intrinsic calls.
ClosedPublic

Authored by andreadb on Apr 10 2017, 11:33 AM.

Details

Summary

Hi,

Currently, pass AddDiscriminators always skips intrinsic calls.
This approach makes sense for DbgInfoIntrinsic calls (and potentially many other classes of intrinsics). However, it may be problematic in some cases for MemIntrinsic calls as demonstrated by the following example.

;;
%struct.B = type { %struct.A, i32 }
%struct.A = type { i32, i16 }

@g_b = external global %struct.B, align 4

define i32 @foo(i32 %cond) #0 !dbg !5 {
entry:

%g_b.coerce = alloca { i64, i32 }, align 4
%tobool = icmp ne i32 %cond, 0, !dbg !7
br i1 %tobool, label %cond.true, label %cond.end, !dbg !7

cond.true:

%0 = bitcast { i64, i32 }* %g_b.coerce to i8*, !dbg !8
%1 = bitcast %struct.B* @g_b to i8*, !dbg !8
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 12, i32 4, i1 false), !dbg !8
%2 = getelementptr inbounds { i64, i32 }, { i64, i32 }* %g_b.coerce, i32 0, i32 0, !dbg !8
%3 = load i64, i64* %2, align 4, !dbg !8
%4 = getelementptr inbounds { i64, i32 }, { i64, i32 }* %g_b.coerce, i32 0, i32 1, !dbg !8
%5 = load i32, i32* %4, align 4, !dbg !8
%call = call i32 @bar(i64 %3, i32 %5, i32 33), !dbg !8
br label %cond.end, !dbg !7

cond.end: ; preds = %entry, %cond.true

%cond1 = phi i32 [ %call, %cond.true ], [ 42, %entry ], !dbg !7
ret i32 %cond1, !dbg !9

}
;;

Where:
!6 = !DISubroutineType(types: !2)
!7 = !DILocation(line: 16, column: 16, scope: !5)
!8 = !DILocation(line: 16, column: 23, scope: !5)
!9 = !DILocation(line: 17, column: 3, scope: !5)
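For context, the IR above plausibly comes from C++ code along the following lines. This is a guessed reconstruction, not the original source: the type layouts and function names match the IR, but the initializer of g_b and the body of bar are made up for illustration.

```cpp
#include <cassert>

// Layouts matching the IR: %struct.A = { i32, i16 }, %struct.B = { %struct.A, i32 }.
struct A { int X; short Y; };
struct B { A Inner; int Z; };

B g_b = {{1, 2}, 3}; // made-up initializer; the IR declares g_b as external

// Hypothetical callee: the IR passes g_b by value, coerced to { i64, i32 },
// which is what forces the stack copy (the memcpy) in the caller.
int bar(B Val, int K) { return Val.Inner.X + Val.Z + K; }

// The whole conditional expression sits on a single source line (line 16),
// which is why discriminators are needed to tell its parts apart.
int foo(int Cond) { return Cond ? bar(g_b, 33) : 42; }
```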

Here, global variable g_b is passed by copy to function bar. That copy is located on the stack (alloca %g_b.coerce), and it is initialized by a memcpy call. It turns out that the copy can be fully bypassed, and each load from %g_b.coerce can be safely rewritten as a load from @g_b.

This is what SROA would do in this example. Our original alloca %g_b.coerce is first split into two smaller, disjoint slices: slice [0,8) and slice [8,12). Then users of the original alloca are rewritten accordingly.

The initializing memcpy intrinsic call is therefore rewritten as two load/store pairs (one for each new slice of the alloca). Each load/store pair inherits the debug location from the original memcpy call. A later mem2reg pass bypasses stack loads %3 and %5 by propagating the new loads generated during the alloca partitioning stage.

However, pass AddDiscriminators (which runs before SROA) did not assign a discriminator to the memcpy intrinsic call, so the loads obtained from partitioning the alloca are attributed to a debug location that lacks a discriminator.

Test memcpy-discriminator.ll is a minimal reproducer obtained from an internal codebase.
In the original code, the "then" portion of the select statement was extremely cold according to the profile. However, SampleProfile ended up attributing a very large weight to the "then" block because the load instructions that survived SROA carried no discriminator.
As a consequence, the branch weight metadata for the conditional branch reported a far too large number for the "then" block (later on, this led to a sub-optimal block placement).

This patch fixes the problem by adding an extra check for memory intrinsic calls.
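A minimal sketch of such a check, using mock instruction kinds rather than the real LLVM class hierarchy (in LLVM terms the predicate would read something like `!isa<IntrinsicInst>(I) || isa<MemIntrinsic>(I)`):

```cpp
#include <cassert>

// Mock instruction kinds standing in for the LLVM class hierarchy.
enum class InstKind { Ordinary, DbgInfoIntrinsic, MemIntrinsic, OtherIntrinsic };

// Sketch of the predicate: skip intrinsic calls in general, but let memory
// intrinsics (memset/memcpy/memmove) receive a discriminator, because SROA
// may later expand them into loads/stores that inherit their debug location.
bool shouldHaveDiscriminator(InstKind K) {
  return K == InstKind::Ordinary || K == InstKind::MemIntrinsic;
}
```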

As a side note:
Currently, SampleProfile skips intrinsic calls when computing basic block weights. I guess that is the main reason why we also want to skip intrinsic calls in AddDiscriminators. I wonder if that approach is a bit too conservative and whether we should sometimes allow discriminators for intrinsic calls. There are obvious cases where we shouldn't ever assign discriminators. For example, we don't need a discriminator for debug info intrinsic calls, instrumented PGO intrinsic calls, lifetime markers, etc. However, we also have intrinsics generated from the expansion of target-specific builtin calls. By skipping those, don't we risk losing a bit of coverage in the sample profile?

Please let me know if this is okay to commit.

Thanks,
Andrea

Diff Detail

Repository
rL LLVM

Event Timeline

andreadb created this revision.Apr 10 2017, 11:33 AM
danielcdh edited edge metadata.Apr 10 2017, 7:19 PM

I don't quite understand why you want to set a new discriminator for the memcpy builtin. What optimization would it enable? For normal function calls, we need discriminators to distinguish call sites in the same BB so that we can annotate the inlined callee correctly. But for the memcpy case, it looks like adding a new discriminator does not help downstream optimizations like inlining?

> I don't quite understand why you want to set a new discriminator for the memcpy builtin. What optimization would it enable? For normal function calls, we need discriminators to distinguish call sites in the same BB so that we can annotate the inlined callee correctly. But for the memcpy case, it looks like adding a new discriminator does not help downstream optimizations like inlining?

Hi Dehao,

This patch is not about enabling more optimizations (or more inlining). It is about improving the debug locations of the loads and stores introduced by SROA when rewriting memory intrinsics.

Basically, SROA knows how to rewrite a memcpy intrinsic as load/store pairs. Each new load/store would obtain the debug location from the original memcpy intrinsic.
However, AddDiscriminators doesn't add a discriminator to intrinsic calls, so the debug location of those loads/stores would not have a discriminator.

Normally, most of those loads and stores don't survive mem2reg. However, there are cases (see memcpy-discriminator.ll) where some loads survive mem2reg.

This is problematic for pass SampleProfile, which uses the debug locations of instructions to compute block weights.
Essentially, we risk attributing the wrong block weight to blocks that contain loads/stores expanded from memory intrinsics.

We have seen this manifest in a relatively hot function of a quite large internal codebase. It led to a suboptimal block placement.

-Andrea

Thanks for explaining. My understanding is that memcpy will introduce a new basic block, which, without this patch, will share the same discriminator with other basic blocks. As a result, that basic block is incorrectly annotated, so the block placement is suboptimal. If my understanding is correct, could you include that in the unittest?

The reason I'm careful about adding new discriminators is that, as we encode discriminators, we have limited bits to represent the base discriminator. We only want to assign a new discriminator when it can improve profile quality, which we can validate with a test showing how this patch improves profile quality.
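The scarcity of base discriminators can be illustrated with a toy packing scheme. The field widths below are assumptions for illustration only, not LLVM's actual discriminator encoding:

```cpp
#include <cassert>
#include <cstdint>

// Toy packing: three fields share one 32-bit discriminator, so only a few
// bits remain for the base value. BaseBits is a made-up width.
constexpr unsigned BaseBits = 8;

uint32_t packDiscriminator(uint32_t Base, uint32_t DupFactor, uint32_t CopyId) {
  assert(Base < (1u << BaseBits) && "base discriminator space exhausted");
  return Base | (DupFactor << BaseBits) | (CopyId << (2 * BaseBits));
}

uint32_t baseDiscriminator(uint32_t D) { return D & ((1u << BaseBits) - 1); }
```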

> Thanks for explaining. My understanding is that memcpy will introduce a new basic block, which, without this patch, will share the same discriminator with other basic blocks. As a result, that basic block is incorrectly annotated, so the block placement is suboptimal. If my understanding is correct, could you include that in the unittest?

SROA would expand the memcpy call into a straight sequence of loads and stores since the size of each new alloca slice is known statically.
So, no new basic block is created during the process; each new load/store pair lives in the same basic block as the original memcpy call.

We can see this by running sroa on test memcpy-discriminator.ll.

SROA partitions the alloca %g_b.coerce into two smaller allocas (one for slice [0,8) and another for slice [8,12)).

Slices of alloca:   %g_b.coerce = alloca { i64, i32 }, align 4
  [0,12) slice #0 (splittable)
    used by:   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 12, i32 4, i1 false), !dbg !10
  [0,8) slice #1 (splittable)
    used by:   %3 = load i64, i64* %2, align 4, !dbg !8
  [8,12) slice #2 (splittable)
    used by:   %5 = load i32, i32* %4, align 4, !dbg !8

Each user of the original alloca is rewritten. The memcpy call is one of those users, so it is split into two load/store pairs (one pair for each slice).

Later on, SROA uses mem2reg to promote the newly created alloca slices to registers. In this example, the promotion is successful, and loads %3 and %5 are made redundant by the loads materialized as a result of the memcpy expansion.

This is the IR at the end of SROA:

define i32 @foo(i32 %cond) !dbg !5 {
entry:
  %tobool = icmp ne i32 %cond, 0, !dbg !7
  br i1 %tobool, label %cond.true, label %cond.end, !dbg !7

cond.true:                                        ; preds = %entry
  %g_b.coerce.sroa.0.0.copyload = load i64, i64* bitcast (%struct.B* @g_b to i64*), align 4, !dbg !8    <<< NO DISCRIMINATOR
  %g_b.coerce.sroa.2.0.copyload = load i32, i32* getelementptr inbounds (%struct.B, %struct.B* @g_b, i64 0, i32 1), align 4, !dbg !8 <<< NO DISCRIMINATOR
  %call = call i32 @bar(i64 %g_b.coerce.sroa.0.0.copyload, i32 %g_b.coerce.sroa.2.0.copyload, i32 33), !dbg !9 
  br label %cond.end, !dbg !11

cond.end:                                         ; preds = %cond.true, %entry
  %cond1 = phi i32 [ %call, %cond.true ], [ 42, %entry ], !dbg !12
  ret i32 %cond1, !dbg !14
}

Where:
!7 = !DILocation(line: 16, column: 16, scope: !5)
!8 = !DILocation(line: 16, column: 23, scope: !5)
!9 = !DILocation(line: 16, column: 23, scope: !10)
!10 = !DILexicalBlockFile(scope: !5, file: !1, discriminator: 2)
!14 = !DILocation(line: 17, column: 3, scope: !5)

As you can see, the two loads are in the same basic block as the original memcpy call. However, those loads now reference a debug location that lacks a discriminator.

> The reason I'm careful about adding new discriminators is that, as we encode discriminators, we have limited bits to represent the base discriminator. We only want to assign a new discriminator when it can improve profile quality, which we can validate with a test showing how this patch improves profile quality.

I understand that we don't want to waste discriminators, and I totally agree with you that we should avoid consuming discriminators as much as possible.

I guess I could add a simple unit test that shows how the branch weight metadata is incorrect for the terminator of the %entry block.

For example, we could have a profile like this:

foo:10000:10000
 16: 10000
 16.1: 0
 17: 10000

SampleProfile would compute a wrong block weight of 10000 for block %cond.true because of the two loads referencing line 16. That weight would then be propagated to the edge from %entry to %cond.true, so we would end up with incorrect branch weights:

br i1 %tobool, label %cond.true, label %cond.end, !dbg !7 !prof !36
!36 = !{!"branch_weights", i32 10002, i32 1}
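The effect on block weights can be sketched with a toy model. It assumes, as an approximation of SampleProfile's behavior (not the actual implementation), that a block's weight is the maximum sample count over the "line" or "line.discriminator" keys of its instructions:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model: a profile maps "line" or "line.discriminator" keys to sample
// counts; a block's weight is the maximum count over its instructions' keys.
using Profile = std::map<std::string, unsigned>;

unsigned blockWeight(const std::vector<std::string> &InstLocs, const Profile &P) {
  unsigned Weight = 0;
  for (const std::string &Loc : InstLocs) {
    auto It = P.find(Loc);
    if (It != P.end())
      Weight = std::max(Weight, It->second);
  }
  return Weight;
}
```

With the profile above, loads attributed to plain line 16 pull %cond.true up to a weight of 10000, whereas loads carrying discriminator 16.1 leave it at 0.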

Without this patch, this would result in a sub-optimal placement (see x86 assembly below):

foo:                                    # @foo
        testl   %edi, %edi
        je      .LBB0_1
# BB#2:                                 # %cond.true
        movq    g_b(%rip), %rdi
        movl    g_b+8(%rip), %esi
        movl    $33, %edx
        jmp     bar                     # TAILCALL
.LBB0_1:                                # %cond.end
        movl    $42, %eax
        retq

With this patch, the cold block %cond.true is correctly pushed to the bottom of the function.

foo:                                    # @foo
        testl   %edi, %edi
        jne     .LBB0_2
# BB#1:                                 # %cond.end
        movl    $42, %eax
        retq
.LBB0_2:                                # %cond.true
        movq    g_b(%rip), %rdi
        movl    g_b+8(%rip), %esi
        movl    $33, %edx
        jmp     bar                     # TAILCALL

That being said, I am under the impression that memcpy-discriminator.ll is already enough to show the problem with discriminators, and I am not sure a unit test would add extra value in this case.

-Andrea

Thanks for the explanation. This makes sense. The reasons that we try to avoid IntrinsicInst are twofold:

  1. avoid non-deterministic discriminator assignment at different debug levels
  2. minimize the number of base discriminators

Apparently, bypassing all IntrinsicInst is overkill for solving #1; in fact, it causes the bug shown in the unittest, which your patch fixes. For #2, however, your patch introduces extra base discriminators that are unnecessary. I suggest keeping only the first call site of shouldHaveDiscriminator, and reverting the 2nd call site to just check IntrinsicInst. Please add comments at both places to explain the motivation (the first place, i.e. the call to shouldHaveDiscriminator, addresses #1; the 2nd place, i.e. checking IntrinsicInst, addresses #1 and #2). Also, in the unittest, please check explicitly that the discriminator for the memcpy intrinsic is the same as for the other instructions in that BB.

Thanks,
Dehao

> Thanks for the explanation. This makes sense. The reasons that we try to avoid IntrinsicInst are twofold:
>
>   1. avoid non-deterministic discriminator assignment at different debug levels
>   2. minimize the number of base discriminators
>
> Apparently, bypassing all IntrinsicInst is overkill for solving #1; in fact, it causes the bug shown in the unittest, which your patch fixes. For #2, however, your patch introduces extra base discriminators that are unnecessary. I suggest keeping only the first call site of shouldHaveDiscriminator, and reverting the 2nd call site to just check IntrinsicInst. Please add comments at both places to explain the motivation (the first place, i.e. the call to shouldHaveDiscriminator, addresses #1; the 2nd place, i.e. checking IntrinsicInst, addresses #1 and #2). Also, in the unittest, please check explicitly that the discriminator for the memcpy intrinsic is the same as for the other instructions in that BB.

Thanks for the feedback.
I will add the comments you have requested, and I will restore the 2nd call site to the check for IntrinsicInst.

Do you want me to keep test memcpy-discriminator.ll in the final patch?

About the unit test: is there a template that I can use for this particular case? I never had to write an llvm unittest before, so I don't know where to look for examples.

Thanks,
Andrea

Just to avoid misunderstandings, when you say unittest do you mean my original memcpy-discriminator.ll?

You can look at test/Transforms/AddDiscriminators/call.ll as an example.

> Just to avoid misunderstandings, when you say unittest do you mean my original memcpy-discriminator.ll?

Sorry for the confusion. I meant your original memcpy-discriminator.ll.

Thanks,
Dehao

andreadb updated this revision to Diff 94862.Apr 11 2017, 11:24 AM

Rebase and address code review comments.

Removed the 2nd call to shouldHaveDiscriminator(), and added code comments as suggested by Dehao.

Improved test memcpy-discriminator.ll. Now we also check that the same discriminator is used by all instructions within basic block %cond.true.

Please let me know if this is okay to commit.

Thanks,
Andrea

danielcdh accepted this revision.Apr 11 2017, 11:30 AM
This revision is now accepted and ready to land.Apr 11 2017, 11:30 AM
This revision was automatically updated to reflect the committed changes.