This is an archive of the discontinued LLVM Phabricator instance.

Differential D60086

[SampleProfile] Check entry count instead of total count to decide if inlined callsite is hot.
Needs Review · Public

Authored by twoh on Apr 1 2019, 11:55 AM.


Details

Reviewers

danielcdh
wmi

Summary

As we're interested in the hotness of the "callsite", not the hotness of the inlined function itself, it makes more sense to check the entry sample count rather than the total sample count. Also, considering that the PSI histogram is based on the execution count of each individual line, I think PSI->isHotCount should be invoked with a single instruction/line profile count, not with an aggregated count for a function.
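The distinction in the summary can be illustrated with a minimal sketch. This is not the LLVM implementation: `InlineInstance`, `isHotCount`, `hotByTotal`, `hotByEntry`, and the cutoff of 1000 are hypothetical stand-ins chosen for illustration. The numbers mirror the `bar` inline instance in cold-callsite.prof from this diff (header total 50000, first body line 100); treating the first body line's count as the entry count is an assumption of this sketch.

```cpp
#include <cstdint>

// Hypothetical stand-in for the profile of one inlined callsite.
struct InlineInstance {
  std::uint64_t TotalSamples; // aggregated count for the whole inlined body
  std::uint64_t EntrySamples; // count attributed to the entry line only
};

// Assumed hot cutoff. In LLVM the cutoff is derived from the profile
// summary; 1000 is an arbitrary value for this sketch.
constexpr std::uint64_t HotCutoff = 1000;

constexpr bool isHotCount(std::uint64_t Count) { return Count >= HotCutoff; }

// Behaviour before the patch: compare the aggregated count to the cutoff.
constexpr bool hotByTotal(const InlineInstance &I) {
  return isHotCount(I.TotalSamples);
}

// Behaviour proposed by the patch: compare the entry count to the cutoff.
constexpr bool hotByEntry(const InlineInstance &I) {
  return isHotCount(I.EntrySamples);
}

// Values taken from the bar instance in cold-callsite.prof.
constexpr InlineInstance Bar{50000, 100};
```

Under these assumptions `hotByTotal(Bar)` is true while `hotByEntry(Bar)` is false, which is the situation the new cold-callsite.ll test exercises: the callsite is entered rarely, but the loop in the callee accumulates a large total.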

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 29924
Build 29923: arc lint + arc unit

Event Timeline

twoh created this revision. Apr 1 2019, 11:55 AM

Herald added a project: Restricted Project. Apr 1 2019, 11:55 AM

Herald added subscribers: jdoerfert, hiraditya, eraman, javed.absar.

Harbormaster completed remote builds in B29924: Diff 193145. Apr 1 2019, 11:55 AM

Theoretically, as you said, checking the entry sample count rather than the total sample count makes more sense. However, the entry sample count of a callsite is not as precise as the function entry sample count, which is obtained directly from LBR. For a callsite, the entry sample count can be very wrong because of debug information lost during optimization. That is why we chose the total sample count instead of the entry sample count when evaluating the hotness of a callsite.

@wmi Thanks for the reply! I can totally understand that entry count is not as precise as total count, but still don't think current implementation is the right way to address the issue. As I mentioned in the summary it compares two different things (instruction level counter vs function level counter), opens up a possibility for optimizing against wrong function (e.g. long and cold function), and makes it hard to find the root cause of the performance issue.

If we can't have a precise entry count, the right way to address the issue would be not using PSI based heuristic but using a heuristic that actually considers a total count of the function.

In D60086#1450766, @twoh wrote:

@wmi Thanks for the reply! I can totally understand that entry count is not as precise as total count, but still don't think current implementation is the right way to address the issue. As I mentioned in the summary it compares two different things (instruction level counter vs function level counter), opens up a possibility for optimizing against wrong function (e.g. long and cold function), and makes it hard to find the root cause of the performance issue.

If we can't have a precise entry count, the right way to address the issue would be not using PSI based heuristic but using a heuristic that actually considers a total count of the function.

Thanks Taewook. I understand your concern and I want to add some extra explanation to my last reply. In addition to the imprecision of the entry count of an inline instance, another reason we chose the total sample count instead of the entry count is that it is a more conservative way to avoid missing inlining in the sample profile load phase. The total sample count is always equal to or larger than the entry count, which means we are less likely to miss a function that should have been inlined. I want to emphasize that the main purpose of inlining inside of AutoFDO is not to reduce call overhead or expose other optimization opportunities, but to enable context sensitive profile annotation. Suppose there is an inline instance with a cold entry but a hot spot inside of it: although we will not reduce call overhead by inlining it, inlining it is still important so that the hot spot can get a context sensitive profile during profile annotation.

As you said, one case we want to prevent is when the inline instance is long and doesn't have any hot spot inside of it. The only reason the total sample count looks hot is that the body is large enough. For such a case, we may be able to prevent it by adding a simple heuristic in callsiteIsHot: in addition to the total sample count being larger than the hot threshold, the max sample count inside of the inline instance must also be larger than a certain threshold, i.e., there must be at least one hot spot inside for the inline instance to be considered hot.

What do you think?
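The mitigation suggested in this comment could be sketched roughly as follows. This is only an illustration of the idea, not code from the patch or from LLVM: the function names, the array-based body representation, and both thresholds (`HotTotalCutoff`, `HotSpotCutoff`) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical thresholds for this sketch.
constexpr std::uint64_t HotTotalCutoff = 1000; // cutoff for the aggregated count
constexpr std::uint64_t HotSpotCutoff = 500;   // cutoff for a single line

// Sum of the per-line sample counts of an inline instance.
template <std::size_t N>
constexpr std::uint64_t totalSamples(const std::uint64_t (&Body)[N]) {
  std::uint64_t Sum = 0;
  for (std::size_t I = 0; I < N; ++I)
    Sum += Body[I];
  return Sum;
}

// Largest per-line sample count of an inline instance.
template <std::size_t N>
constexpr std::uint64_t maxSample(const std::uint64_t (&Body)[N]) {
  std::uint64_t Max = 0;
  for (std::size_t I = 0; I < N; ++I)
    if (Body[I] > Max)
      Max = Body[I];
  return Max;
}

// Hot only if the total is above the cutoff AND at least one line is a
// hot spot on its own.
template <std::size_t N>
constexpr bool callsiteIsHotWithHotSpot(const std::uint64_t (&Body)[N]) {
  return totalSamples(Body) >= HotTotalCutoff &&
         maxSample(Body) >= HotSpotCutoff;
}

// A long, uniformly lukewarm body: large total, no hot spot.
constexpr std::uint64_t FlatBody[] = {150, 150, 150, 150, 150, 150, 150, 150};
// A short body with one hot line.
constexpr std::uint64_t SpikyBody[] = {10, 10, 1200, 10};
```

With these made-up numbers, `FlatBody` passes the total-count check (1200) but is rejected because its hottest line has only 150 samples, while `SpikyBody` is accepted. The second threshold is the extra tuning knob this approach introduces.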

@wmi Thank you for the detailed explanation!

I want to emphasize that the main purpose of inlining inside of AutoFDO is not to reduce call overhead or expose other optimization opportunities, but to enable context sensitive profile annotation. Suppose there is an inline instance with a cold entry but a hot spot inside of it: although we will not reduce call overhead by inlining it, inlining it is still important so that the hot spot can get a context sensitive profile during profile annotation.

I can understand the value of context-sensitive profile information, but wouldn't it be only valuable if the callsite is actually worth inlining? IIUC, LLVM doesn't propagate the context-sensitive profile information collected from the inlined callsite back to the original standalone function. For me it seems that the current implementation drives the wrong inlining decision to collect the information useful only under the wrong decision.

As you said, one case we want to prevent is when the inline instance is long and doesn't have any hot spot inside of it. The only reason the total sample count looks hot is that the body is large enough. For such a case, we may be able to prevent it by adding a simple heuristic in callsiteIsHot: in addition to the total sample count being larger than the hot threshold, the max sample count inside of the inline instance must also be larger than a certain threshold, i.e., there must be at least one hot spot inside for the inline instance to be considered hot.

I think this will mitigate the problem, but still don't think it is the right way to solve the problem. My biggest concern with the current implementation is that we're comparing function total count against PSI threshold values. If we really cannot trust function entry count for sample profile data we may try function total count based heuristic, but still comparing total count against PSI threshold, which is computed from the instruction count histogram, seems wrong to me.

I can understand the value of context-sensitive profile information, but wouldn't it be only valuable if the callsite is actually worth inlining?

I think an example may be helpful to illustrate why inlining for better profile annotation is important even if the inlining doesn't remove call overhead because of entry being cold.

foo() {
  for (i = 0; i < N; i++) {
    if (condition = getcond()) {
      ...
    } else {
      ...
    }
  }
}

goo() {
  ...
  call foo  // condition is true in 99% of the iterations of the loop in foo.
  ...
}

hoo() {
  ...
  call foo  // condition is false in 99% of the iterations of the loop in foo.
  ...
}

Suppose foo is only called twice, once in goo and once in hoo, so it is very cold at the function entry. But the loop inside of foo is very hot (N is very large). Suppose that in the non-FDO build, both callsites of foo are inlined although they are cold, because the size cost is lower than the threshold (we can reduce size by inlining foo). So we will have context sensitive profiles for foo in the profile generated from the non-FDO binary.

If we inline the two callsites in the sample profile loader, then after profile annotation we know the branch probability of the if statement inlined into goo will be 99% vs 1%, while the branch probability in hoo will be 1% vs 99%.

If we don't inline the above two callsites in the sample profile loader, then either we don't get any profile for foo because all calls to foo are inlined in the non-FDO binary, or we get a profile that is not context sensitive, so the branch probability of the if statement in foo will be regarded as 50% vs 50%.
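The arithmetic behind the 99%/1% versus 50%/50% figures can be checked with a small sketch. The absolute counts below (10000 loop iterations per caller) are invented for illustration, and `BranchCounts`, `takenPercent`, and `mergeContexts` are hypothetical names; only the ratios come from the example above.

```cpp
#include <cstdint>

// Hypothetical per-branch sample counts for the if statement in foo.
struct BranchCounts {
  std::uint64_t Taken;    // samples on the "condition is true" side
  std::uint64_t NotTaken; // samples on the "condition is false" side
};

// Integer percentage of samples on the taken side.
// Assumes the branch was sampled at least once (non-zero total).
constexpr std::uint64_t takenPercent(BranchCounts C) {
  return C.Taken * 100 / (C.Taken + C.NotTaken);
}

// What a context-insensitive profile sees: both callers' counts summed.
constexpr BranchCounts mergeContexts(BranchCounts A, BranchCounts B) {
  return BranchCounts{A.Taken + B.Taken, A.NotTaken + B.NotTaken};
}

// Assumed counts: condition true in 99% of iterations when called from goo,
// false in 99% of iterations when called from hoo.
constexpr BranchCounts InGoo{9900, 100};
constexpr BranchCounts InHoo{100, 9900};
```

Kept separate, the two contexts give 99% and 1% taken; merged, they give 50%, so the strongly biased behaviour of each inlined copy is lost.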

@wmi Thank you for the concrete example! I think what we need for your example is context-sensitive profiling and function specialization, not inlining. Admittedly we don't have an infrastructure in LLVM to support context-sensitive profiling for non-inlined case and we don't perform context sensitive function specialization...

Again, my bigger concern is with using PSI->isHotCount to check the function hotness. If we want to stay with the function total count based heuristic, wouldn't it make more sense if we have something like ProfileSummaryInfo::isFunctionHotInCallGraph, which actually checks the hotness of the function itself, and use it?

In D60086#1452234, @twoh wrote:

@wmi Thank you for the concrete example! I think what we need for your example is context-sensitive profiling and function specialization, not inlining. Admittedly we don't have an infrastructure in LLVM to support context-sensitive profiling for non-inlined case and we don't perform context sensitive function specialization...

Again, my bigger concern is with using PSI->isHotCount to check the function hotness. If we want to stay with the function total count based heuristic, wouldn't it make more sense if we have something like ProfileSummaryInfo::isFunctionHotInCallGraph, which actually checks the hotness of the function itself, and use it?

I agree with that. We can create ProfileSummaryInfo::isFunctionHotInCallGraph instead of simply using PSI->isHotCount. Comparing against PSI->isHotCount is actually the simplest implementation of ProfileSummaryInfo::isFunctionHotInCallGraph. The mitigating strategy mentioned above could also be added there. You can definitely come up with a better way to model the global hotness of an inline instance. Another benefit of splitting up the logic for comparing global hotness in different scenarios is that it may make performance tuning easier.

wenlei added a subscriber: wenlei. Apr 3 2019, 5:59 PM

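Revision Contents

Path                                                                   Size
llvm/lib/Transforms/IPO/SampleProfile.cpp                              4 lines
llvm/test/Transforms/SampleProfile/Inputs/cold-callsite.prof           10 lines
llvm/test/Transforms/SampleProfile/Inputs/inline.compactbinary.afdo
llvm/test/Transforms/SampleProfile/Inputs/inline.prof                  6 lines
llvm/test/Transforms/SampleProfile/Inputs/warm-inline-instance.prof    6 lines
llvm/test/Transforms/SampleProfile/cold-callsite.ll                    55 lines
llvm/test/Transforms/SampleProfile/compact-binary-profile.ll           2 lines
llvm/test/Transforms/SampleProfile/warm-inline-instance.ll             2 lines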

Diff 193145

llvm/lib/Transforms/IPO/SampleProfile.cpp

[... 381 lines not shown ...]
 /// sample count with the hot cutoff computed by ProfileSummaryInfo, it is
 /// regarded as hot if the count is above the cutoff value.
 static bool callsiteIsHot(const FunctionSamples *CallsiteFS,
                           ProfileSummaryInfo *PSI) {
   if (!CallsiteFS)
     return false; // The callsite was not inlined in the original binary.

   assert(PSI && "PSI is expected to be non null");
-  uint64_t CallsiteTotalSamples = CallsiteFS->getTotalSamples();
-  return PSI->isHotCount(CallsiteTotalSamples);
+  uint64_t CallsiteEntrySamples = CallsiteFS->getEntrySamples();
+  return PSI->isHotCount(CallsiteEntrySamples);
 }

 /// Mark as used the sample record for the given function samples at
 /// (LineOffset, Discriminator).
 ///
 /// \returns true if this is the first time we mark the given record.
 bool SampleCoverageTracker::markSamplesUsed(const FunctionSamples *FS,
                                             uint32_t LineOffset,
[... 1,306 lines not shown ...]

llvm/test/Transforms/SampleProfile/Inputs/cold-callsite.prof

This file was added.

foo:99900:1
 1: 100
 2: 49900
 3: 49900
 4: 49900
 1: bar:50000
  1: 100
  2: 49900
  3: 49900
  4: 49900

llvm/test/Transforms/SampleProfile/Inputs/inline.compactbinary.afdo

llvm/test/Transforms/SampleProfile/Inputs/inline.prof

 main:225715:0
  2.1: 5553
  3: 5391
  3.1: _Z3sumii:5860
-  0: 5279
-  1: 5279
-  2: 5279
+  0: 5392
+  1: 5392
+  2: 5392

llvm/test/Transforms/SampleProfile/Inputs/warm-inline-instance.prof

 main:2257150:0
  2.1: 5553
  3: 5391
  3.1: foo:5860
-  0: 5279
-  1: 5279
-  2: 5279
+  0: 5392
+  1: 5392
+  2: 5392
  4.1: goo:60
   0: 20
   1: 20
   2: 20

llvm/test/Transforms/SampleProfile/cold-callsite.ll

This file was added.

; RUN: opt < %s -sample-profile -sample-profile-file=%S/Inputs/cold-callsite.prof -S | FileCheck %s

define i32 @foo(i32 %a) !dbg !3 {
; CHECK: define i32 @foo
; CHECK: call i32 @bar
entry:
  %call = call i32 @bar(i32 %a), !dbg !4
  br label %while.body

while.body:
  %cnt = phi i32 [ 0, %entry ], [ %cnt1, %while.body ]
  %add = phi i32 [ %call, %entry ], [ %add1, %while.body ], !dbg !5
  %cnt1 = add nsw i32 %cnt, 1, !dbg !6
  %add1 = add nsw i32 %add, 1, !dbg !7
  %cmp1 = icmp slt i32 %add, 5000
  br i1 %cmp1, label %while.body, label %ret

ret:
  ret i32 %add1
}

define i32 @bar(i32 %a) !dbg !8 {
entry:
  %init = add nsw i32 %a, 10, !dbg !9
  br label %while.body

while.body:
  %cnt = phi i32 [ 0, %entry ], [ %cnt1, %while.body ]
  %add = phi i32 [ %init, %entry ], [ %add1, %while.body ], !dbg !10
  %cnt1 = add nsw i32 %cnt, 1, !dbg !11
  %add1 = add nsw i32 %add, 1, !dbg !12
  %cmp1 = icmp slt i32 %cnt, 5000
  br i1 %cmp1, label %while.body, label %ret

ret:
  ret i32 %add1
}


!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!2}

!0 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus, file: !1)
!1 = !DIFile(filename: "foo.cc", directory: "/")
!2 = !{i32 2, !"Debug Info Version", i32 3}
!3 = distinct !DISubprogram(name: "foo", scope: !1, file: !1, line: 4, unit: !0)
!4 = !DILocation(line: 5, scope: !3)
!5 = !DILocation(line: 6, scope: !3)
!6 = !DILocation(line: 7, scope: !3)
!7 = !DILocation(line: 8, scope: !3)
!8 = distinct !DISubprogram(name: "bar", scope: !1, file: !1, line: 10, unit: !0)
!9 = !DILocation(line: 11, scope: !8)
!10 = !DILocation(line: 12, scope: !8)
!11 = !DILocation(line: 13, scope: !8)
!12 = !DILocation(line: 14, scope: !8)

llvm/test/Transforms/SampleProfile/compact-binary-profile.ll

[... 23 lines not shown ...]
 ; Check sample-profile phase using compactbinary format profile will annotate
 ; the IR with exactly the same result as using text format.
 ; CHECK: br i1 %cmp, label %while.body, label %while.end{{.*}} !prof ![[IDX1:[0-9]*]]
 ; CHECK: br i1 %cmp1, label %if.then, label %if.else{{.*}} !prof ![[IDX2:[0-9]*]]
 ; CHECK: call i32 (i8*, ...) @printf{{.*}} !prof ![[IDX3:[0-9]*]]
 ; CHECK: = !{!"TotalCount", i64 10944}
 ; CHECK: = !{!"MaxCount", i64 5553}
 ; CHECK: ![[IDX1]] = !{!"branch_weights", i32 5392, i32 163}
-; CHECK: ![[IDX2]] = !{!"branch_weights", i32 5280, i32 113}
+; CHECK: ![[IDX2]] = !{!"branch_weights", i32 5393, i32 1}
 ; CHECK: ![[IDX3]] = !{!"branch_weights", i32 1}

 ; Function Attrs: nounwind uwtable
 define i32 @_Z3sumii(i32 %x, i32 %y) !dbg !4 {
 entry:
   %x.addr = alloca i32, align 4
   %y.addr = alloca i32, align 4
   store i32 %x, i32* %x.addr, align 4
[... 81 lines not shown ...]

llvm/test/Transforms/SampleProfile/warm-inline-instance.ll

[... 48 lines not shown ...]
 while.body: ; preds = %while.cond
   %t1 = load i32, i32* %i, align 4, !dbg !16
   %cmp1 = icmp ne i32 %t1, 100, !dbg !16
   br i1 %cmp1, label %if.then, label %if.else, !dbg !16

 if.then: ; preds = %while.body
   %t2 = load i32, i32* %i, align 4, !dbg !18
   %t3 = load i32, i32* %s, align 4, !dbg !18
 ; Although the ratio of total samples of @foo vs total samples of @main is
-; small, since the total samples count is larger than hot cutoff computed by
+; small, since the entry samples count is larger than hot cutoff computed by
 ; ProfileSummaryInfo, we will still regard the callsite of foo as hot and
 ; early inlining will inline it.
 ; CHECK-LABEL: @main(
 ; CHECK-NOT: call i32 @foo(i32 %t2, i32 %t3)
   %call1 = call i32 @foo(i32 %t2, i32 %t3), !dbg !18
   store i32 %call1, i32* %s, align 4, !dbg !18
   br label %if.end, !dbg !18

[... 50 lines not shown ...]