Hi,
Currently, pass AddDiscriminators always skips intrinsic calls.
This approach makes sense for DbgInfoIntrinsic calls (and potentially many other classes of intrinsics). However, it may be problematic in some cases for MemIntrinsic calls, as demonstrated by the following example.
;;
%struct.B = type { %struct.A, i32 }
%struct.A = type { i32, i16 }
@g_b = external global %struct.B, align 4
define i32 @foo(i32 %cond) #0 !dbg !5 {
entry:
  %g_b.coerce = alloca { i64, i32 }, align 4
  %tobool = icmp ne i32 %cond, 0, !dbg !7
  br i1 %tobool, label %cond.true, label %cond.end, !dbg !7
cond.true:
  %0 = bitcast { i64, i32 }* %g_b.coerce to i8*, !dbg !8
  %1 = bitcast %struct.B* @g_b to i8*, !dbg !8
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 12, i32 4, i1 false), !dbg !8
  %2 = getelementptr inbounds { i64, i32 }, { i64, i32 }* %g_b.coerce, i32 0, i32 0, !dbg !8
  %3 = load i64, i64* %2, align 4, !dbg !8
  %4 = getelementptr inbounds { i64, i32 }, { i64, i32 }* %g_b.coerce, i32 0, i32 1, !dbg !8
  %5 = load i32, i32* %4, align 4, !dbg !8
  %call = call i32 @bar(i64 %3, i32 %5, i32 33), !dbg !8
  br label %cond.end, !dbg !7
cond.end: ; preds = %entry, %cond.true
  %cond1 = phi i32 [ %call, %cond.true ], [ 42, %entry ], !dbg !7
  ret i32 %cond1, !dbg !9
}
;;
Where:
!6 = !DISubroutineType(types: !2)
!7 = !DILocation(line: 16, column: 16, scope: !5)
!8 = !DILocation(line: 16, column: 23, scope: !5)
!9 = !DILocation(line: 17, column: 3, scope: !5)
Here, global variable @g_b is passed by copy to function @bar. That copy lives on the stack (alloca %g_b.coerce) and is initialized by a memcpy call. It turns out that the copy can be fully bypassed: each load from %g_b.coerce can be safely rewritten as a load from @g_b.
This is what SROA does in this example. The original alloca %g_b.coerce is first split into two smaller, disjoint slices: slice [0, 8) and slice [8, 12). Users of the original alloca are then rewritten accordingly.
The initializing memcpy intrinsic call is therefore rewritten as two load/store pairs (one per new slice of the alloca). Each load/store pair inherits the debug location from the original memcpy call. A later mem2reg pass then bypasses stack loads %3 and %5 by forwarding the new loads generated during the alloca partitioning stage.
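To make this concrete, here is a sketch of what the relevant block roughly looks like after SROA and mem2reg (value names are illustrative, not actual pass output):
;;
cond.true:
  ; slice [0, 8): the leading %struct.A member, loaded as a single i64
  %g_b.sroa.0.copyload = load i64, i64* bitcast (%struct.B* @g_b to i64*), align 4, !dbg !8
  ; slice [8, 12): the trailing i32 member of %struct.B
  %g_b.sroa.1.copyload = load i32, i32* getelementptr inbounds (%struct.B, %struct.B* @g_b, i32 0, i32 1), align 4, !dbg !8
  %call = call i32 @bar(i64 %g_b.sroa.0.copyload, i32 %g_b.sroa.1.copyload, i32 33), !dbg !8
  br label %cond.end, !dbg !7
;;
Both loads still carry !dbg !8 inherited from the memcpy, and !8 has no discriminator to separate this block from the line-16 instructions in %entry.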
However, pass AddDiscriminators (which runs before SROA) never assigned a discriminator to the memcpy intrinsic call, so the loads obtained from partitioning the alloca end up attributed to a debug location that was never updated with a discriminator.
Test memcpy-discriminator-2.ll is a minimal reproducer obtained from an internal codebase.
In the original code, the "then" arm of the select statement was extremely cold according to the profile. However, SampleProfile ended up attributing a very large weight to the "then" block due to the absence of a discriminator on the load instructions produced by SROA.
As a consequence, the branch weight metadata on the conditional branch reported an incorrectly large weight for the "then" block, which later led to sub-optimal block placement.
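For reference, the symptom shows up in the !prof metadata attached to the conditional branch; with purely hypothetical weights, it looks something like this, where the first operand (the "then" weight) is far larger than the profile justifies:
;;
br i1 %tobool, label %cond.true, label %cond.end, !dbg !7, !prof !10

!10 = !{!"branch_weights", i32 5000, i32 10} ; hypothetical numbers
;;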
This patch fixes the problem by adding an extra check so that memory intrinsic (MemIntrinsic) calls are no longer skipped by AddDiscriminators.
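With the extra check in place, the memcpy call receives its own location whose scope carries a discriminator, along these lines (metadata numbering and the file node !1 are hypothetical):
;;
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 12, i32 4, i1 false), !dbg !11

!11 = !DILocation(line: 16, column: 23, scope: !12)
!12 = !DILexicalBlockFile(scope: !5, file: !1, discriminator: 1)
;;
The loads produced by SROA then inherit !11, and SampleProfile can tell them apart from the line-16 samples in the entry block.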
As a side note:
Currently, SampleProfile skips intrinsic calls when computing basic block weights. I guess that is the main reason why we also skip intrinsic calls in AddDiscriminators. I wonder whether that approach is a bit too conservative and whether we should sometimes allow discriminators on intrinsic calls. There are obvious cases where we should never assign discriminators: for example, debug info intrinsic calls, instrumented PGO intrinsic calls, lifetime markers, etc. However, we also have intrinsics generated from the expansion of target-specific builtin calls. By skipping those, don't we risk losing a bit of coverage in the sample profile?
Please let me know if this is okay to commit.
Thanks,
Andrea