This is an archive of the discontinued LLVM Phabricator instance.

Differential D90688

[CaptureTracking] Avoid overly restrictive dominates check
Abandoned · Public

Authored by anna on Nov 3 2020, 8:26 AM.


Details

Reviewers

jdoerfert
apilipenko
reames
hfinkel
nikic

Commits

rG15694fd6ad95: [CaptureTracking] Avoid overly restrictive dominates check

Summary

The CapturesBefore tracker has an overly restrictive dominates check when
BeforeHere and the capture point are in different basic blocks.
All we need to check is that there is no path from the capture point
to BeforeHere (which is less strict than the dominates check).
See the added test case in one of the users of CapturesBefore.
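
The difference between the two checks can be illustrated with a minimal, self-contained sketch on a toy block-level CFG. This is not LLVM code: `CFG`, `hasPath`, `dominates`, `oldCanPrune`, `newCanPrune`, and `callslotCFG` are hypothetical names introduced only for illustration, and the real check operates on instructions through the dominator tree and isPotentiallyReachable. The toy CFG is assumed to mirror the shape of the added test: an entry block that branches to a `captures` block and to a `nocaptures` block containing BeforeHere.

```cpp
#include <vector>

// Toy CFG (an assumption for illustration): block i has the successor
// indices G[i]. Block 0 is the entry block.
using CFG = std::vector<std::vector<int>>;

// Returns true if a path of one or more edges leads from From to To.
bool hasPath(const CFG &G, int From, int To) {
  std::vector<bool> Seen(G.size(), false);
  std::vector<int> Worklist(G[From].begin(), G[From].end());
  while (!Worklist.empty()) {
    int BB = Worklist.back();
    Worklist.pop_back();
    if (BB == To)
      return true;
    if (Seen[BB])
      continue;
    Seen[BB] = true;
    Worklist.insert(Worklist.end(), G[BB].begin(), G[BB].end());
  }
  return false;
}

// Naive dominance: A dominates B if B cannot be reached from the entry
// block once A is removed from the graph.
bool dominates(const CFG &G, int A, int B) {
  if (A == B || A == 0)
    return true;
  std::vector<bool> Seen(G.size(), false);
  std::vector<int> Worklist{0};
  while (!Worklist.empty()) {
    int BB = Worklist.back();
    Worklist.pop_back();
    if (BB == A || Seen[BB])
      continue;
    if (BB == B)
      return false;
    Seen[BB] = true;
    Worklist.insert(Worklist.end(), G[BB].begin(), G[BB].end());
  }
  return true;
}

// Block-level analogue of the check before this patch.
bool oldCanPrune(const CFG &G, int BeforeHere, int Capture) {
  return dominates(G, BeforeHere, Capture) && !hasPath(G, Capture, BeforeHere);
}

// Block-level analogue of the check after this patch.
bool newCanPrune(const CFG &G, int BeforeHere, int Capture) {
  return !hasPath(G, Capture, BeforeHere);
}

// entry (0) -> captures (1), entry (0) -> nocaptures (2).
CFG callslotCFG() { return {{1, 2}, {}, {}}; }
```

On this CFG, `oldCanPrune(G, 2, 1)` is false because `nocaptures` does not dominate `captures`, while `newCanPrune(G, 2, 1)` is true because no path leads from `captures` back to `nocaptures`.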

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

anna created this revision. Nov 3 2020, 8:26 AM

Herald added a project: Restricted Project. Nov 3 2020, 8:26 AM

Herald added subscribers: dantrushin, hiraditya.

anna requested review of this revision. Nov 3 2020, 8:26 AM

LGTM. Attributor does the same with an "optimistic" version of reachability via AAReachability. I guess that is why no test there changed.

This revision is now accepted and ready to land. Nov 3 2020, 9:01 AM

Harbormaster completed remote builds in B77413: Diff 302589. Nov 3 2020, 9:06 AM

In D90688#2371455, @jdoerfert wrote:

LGTM. Attributor does the same with an "optimistic" version of reachability via AAReachability. I guess that is why no test there changed.

Thanks Johannes. We see these limitations in clients that use this analysis downstream and I was surprised we didn't have a TODO testcase upstream either (i.e. no test cases failing).

This revision was landed with ongoing or failed builds. Nov 5 2020, 8:39 AM

Closed by commit rG15694fd6ad95: [CaptureTracking] Avoid overly restrictive dominates check (authored by anna).

This revision was automatically updated to reflect the committed changes.

anna added a commit: rG15694fd6ad95: [CaptureTracking] Avoid overly restrictive dominates check.

This seems to break tests: http://45.33.8.238/linux/32182/step_7.txt

anna added a reverting change: rGafe92642cc73: Revert "[CaptureTracking] Avoid overly restrictive dominates check". Nov 5 2020, 9:27 AM

nikic reopened this revision. Nov 5 2020, 9:29 AM

This revision is now accepted and ready to land. Nov 5 2020, 9:29 AM

In D90688#2376333, @thakis wrote:

This seems to break tests: http://45.33.8.238/linux/32182/step_7.txt

Thanks for the heads-up. I've reverted the change and am checking the test case offline.

This change causes a large compile-time regression on some benchmarks: https://llvm-compile-time-tracker.com/compare.php?from=79d16764dd29aeddb7e6400e6b2d89d31653886c&to=15694fd6ad955c6a16b446a6324364111a49ae8b&stat=instructions

tramp3d-v4 is particularly badly affected, with a 25% instruction count regression in the ThinLTO configuration.

This revision now requires changes to proceed. Nov 5 2020, 9:31 AM

In D90688#2376444, @nikic wrote:

This change causes a large compile-time regression on some benchmarks: https://llvm-compile-time-tracker.com/compare.php?from=79d16764dd29aeddb7e6400e6b2d89d31653886c&to=15694fd6ad955c6a16b446a6324364111a49ae8b&stat=instructions

tramp3d-v4 is particularly badly affected, with a 25% instruction count regression in the ThinLTO configuration.

Thanks for pointing that out. I was under the impression the path check is more expensive than the dominates check (and the dominates was done to prevent compile time increase) and our bailouts are also higher (32 basic blocks for the path check).
It looks like this limitation of checking for dominates exists to avoid compile time regressions of this sort.
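
The bailout mentioned above can be sketched as a bounded worklist walk that answers conservatively once its budget runs out. This is a simplified, block-level illustration and not the LLVM implementation: `CFG`, `mayReachBounded`, and `makeChainWithIsolatedBlock` are hypothetical names, and the default of 32 is taken from the figure quoted in the comment.

```cpp
#include <vector>

// Toy CFG (an assumption for illustration): block i has successors G[i].
using CFG = std::vector<std::vector<int>>;

// Returns false only if the walk proves that To is unreachable from From.
// If more than Limit blocks would have to be explored, the walk gives up
// and conservatively answers true ("potentially reachable").
bool mayReachBounded(const CFG &G, int From, int To, unsigned Limit = 32) {
  std::vector<bool> Seen(G.size(), false);
  std::vector<int> Worklist(G[From].begin(), G[From].end());
  while (!Worklist.empty()) {
    int BB = Worklist.back();
    Worklist.pop_back();
    if (Seen[BB])
      continue;
    Seen[BB] = true;
    if (BB == To)
      return true;
    if (Limit == 0)
      return true; // Budget exhausted: nothing was proven either way.
    --Limit;
    Worklist.insert(Worklist.end(), G[BB].begin(), G[BB].end());
  }
  return false;
}

// Builds a straight-line chain 0 -> 1 -> ... -> N-1 plus one isolated
// block with index N that nothing branches to.
CFG makeChainWithIsolatedBlock(int N) {
  CFG G(N + 1);
  for (int I = 0; I + 1 < N; ++I)
    G[I].push_back(I + 1);
  return G;
}
```

With a 40-block chain, a query for the isolated block returns true under the default budget even though no path exists, and returns false once the budget is large enough to cover the whole chain. A precise "no path" answer therefore costs a full walk, which is where the compile time goes when the query runs for many uses.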

In D90688#2376495, @anna wrote:

In D90688#2376444, @nikic wrote:

This change causes a large compile-time regression on some benchmarks: https://llvm-compile-time-tracker.com/compare.php?from=79d16764dd29aeddb7e6400e6b2d89d31653886c&to=15694fd6ad955c6a16b446a6324364111a49ae8b&stat=instructions

tramp3d-v4 is particularly badly affected, with a 25% instruction count regression in the ThinLTO configuration.

Thanks for pointing that out. I was under the impression the path check is more expensive than the dominates check (and the dominates was done to prevent compile time increase) and our bailouts are also higher (32 basic blocks for the path check).
It looks like this limitation of checking for dominates exists to avoid compile time regressions of this sort.

Right. It may be possible to salvage this though. One very simple thing to try would be to cache the reachability query. This should make it much cheaper if there are many uses in the same block.
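
A minimal sketch of that caching idea, assuming a fixed BeforeHere block and an unchanged CFG for the lifetime of the cache. It is not the patch that was benchmarked: `CFG`, `CachedReachability`, `canReachTarget`, and `numWalks` are hypothetical names used only for illustration.

```cpp
#include <unordered_map>
#include <vector>

// Toy CFG (an assumption for illustration): block i has successors G[i].
using CFG = std::vector<std::vector<int>>;

// Memoizes "can block From reach the target block?" per source block, so
// many uses located in the same block share a single graph walk.
class CachedReachability {
public:
  CachedReachability(const CFG &G, int Target) : G(G), Target(Target) {}

  bool canReachTarget(int From) {
    auto It = Cache.find(From);
    if (It != Cache.end())
      return It->second; // Cache hit: no walk needed.
    ++NumWalks;
    bool Result = walk(From);
    Cache.emplace(From, Result);
    return Result;
  }

  // Number of actual graph walks performed so far.
  unsigned numWalks() const { return NumWalks; }

private:
  // Plain worklist search over the successors of From.
  bool walk(int From) const {
    std::vector<bool> Seen(G.size(), false);
    std::vector<int> Worklist(G[From].begin(), G[From].end());
    while (!Worklist.empty()) {
      int BB = Worklist.back();
      Worklist.pop_back();
      if (BB == Target)
        return true;
      if (Seen[BB])
        continue;
      Seen[BB] = true;
      Worklist.insert(Worklist.end(), G[BB].begin(), G[BB].end());
    }
    return false;
  }

  const CFG &G;
  int Target;
  std::unordered_map<int, bool> Cache;
  unsigned NumWalks = 0;
};
```

Repeated queries from the same block cost one walk in total. The cache only helps when uses cluster in a few blocks, and it must be discarded if the CFG or the target changes.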

One thing I thought about: Do we have a test where the "nocapture use" has a second operand that captures?

I've applied a couple of fixes to CaptureTracking, the main ones are: https://github.com/llvm/llvm-project/commit/d35366bccae0016418660337ce94e3d7d0ff391e is a potential correctness fix, https://github.com/llvm/llvm-project/commit/f63ab188c63be12871da75bfc5801a7fc752769b is a compile-time improvement.

I also gave caching the reachability query a try: https://llvm-compile-time-tracker.com/compare.php?from=5225c102649b5469d61385e598b744ac8f3dd1da&to=6d1d73dec948a0cbbf0dd6261d5664c4bfce92b1&stat=instructions The result is better, but we still have a 6% regression on tramp3d-v4 with ThinLTO (down from 25%).

One remaining problem in the current CaptureTracking implementation is that the use limit is not fully enforced: It's currently a limit on the maximum number of direct uses, not transitive uses. If the value has 20 GEP uses, each of which has 20 uses itself, we'll happily explore all 400 of them.

Hi @nikic, some comments below.

In D90688#2381609, @nikic wrote:

I've applied a couple of fixes to CaptureTracking, the main ones are: https://github.com/llvm/llvm-project/commit/d35366bccae0016418660337ce94e3d7d0ff391e is a potential correctness fix, https://github.com/llvm/llvm-project/commit/f63ab188c63be12871da75bfc5801a7fc752769b is a compile-time improvement.

Ah, I think your correctness fix is what Johannes was implying by the "multiple args captured" in a single call.

I also gave caching the reachability query a try: https://llvm-compile-time-tracker.com/compare.php?from=5225c102649b5469d61385e598b744ac8f3dd1da&to=6d1d73dec948a0cbbf0dd6261d5664c4bfce92b1&stat=instructions The result is better, but we still have a 6% regression on tramp3d-v4 with ThinLTO (down from 25%).

Wow, this is great. Thank you for working on this! I'm curious whether you have already landed the patch or have it up for review? Perhaps further improvements will help reduce the compile-time regression here.

One remaining problem in the current CaptureTracking implementation is that the use limit is not fully enforced: It's currently a limit on the maximum number of direct uses, not transitive uses. If the value has 20 GEP uses, each of which has 20 uses itself, we'll happily explore all 400 of them.

Yeah, you're completely right. We should add a TODO/FIXME at the point we're capturing the direct uses for the worklist (there's a chance this triggers runtime degradations by fixing the "transitive uses" bug).
We also have a comment stating that this max uses restriction can be completely lifted if we cache the capture tracking analysis. The potential downside is that keeping track/adding incremental updates can be a source of bugs.
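
The use-limit gap discussed above (a limit on direct uses rather than on transitive uses) can be illustrated with a small sketch. It is a toy model, not the CaptureTracking code: `UseGraph`, `WalkResult`, `walkPerValueLimit`, `walkWithBudget`, and `buildFanOut` are hypothetical names, the limit of 20 is chosen only to match the example in the comment, and the graph is assumed to be a tree so that no visited set is needed.

```cpp
#include <vector>

// Toy def-use graph (an assumption for illustration): Users[v] lists the
// values that use value v. Value 0 is the pointer being tracked.
using UseGraph = std::vector<std::vector<int>>;

struct WalkResult {
  bool GaveUp;      // True if the walk bailed out because of the limit.
  unsigned Visited; // Number of uses visited before finishing or bailing.
};

// The limit is checked against each value's own use count only.
WalkResult walkPerValueLimit(const UseGraph &G, int Root, unsigned MaxUses) {
  unsigned Visited = 0;
  std::vector<int> Worklist{Root};
  while (!Worklist.empty()) {
    int V = Worklist.back();
    Worklist.pop_back();
    if (G[V].size() > MaxUses)
      return {true, Visited};
    for (int User : G[V]) {
      ++Visited;
      Worklist.push_back(User);
    }
  }
  return {false, Visited};
}

// The limit is a single budget shared by all transitive uses.
WalkResult walkWithBudget(const UseGraph &G, int Root, unsigned MaxUses) {
  unsigned Visited = 0;
  std::vector<int> Worklist{Root};
  while (!Worklist.empty()) {
    int V = Worklist.back();
    Worklist.pop_back();
    for (int User : G[V]) {
      if (Visited == MaxUses)
        return {true, Visited};
      ++Visited;
      Worklist.push_back(User);
    }
  }
  return {false, Visited};
}

// Root with Width users (the "GEPs"), each of which has Width users itself.
UseGraph buildFanOut(int Width) {
  UseGraph G(1);
  for (int I = 0; I < Width; ++I) {
    int Gep = static_cast<int>(G.size());
    G.emplace_back();
    G[0].push_back(Gep);
    for (int J = 0; J < Width; ++J) {
      int Leaf = static_cast<int>(G.size());
      G.emplace_back();
      G[Gep].push_back(Leaf);
    }
  }
  return G;
}
```

For `buildFanOut(20)` and a limit of 20, the per-value variant never bails out and visits 420 uses (the 20 GEP uses plus the 400 uses behind them), while the shared-budget variant gives up after 20. As noted in the reply, enforcing the transitive limit may make the analysis give up in cases it currently resolves.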

nikic mentioned this in rGaaf5fd431691: [MemCpyOpt] Add test for unreachable capture (NFC). May 16 2021, 1:50 AM

nikic mentioned this in rG6b8b43e7af30: [CaptureTracking] Do not check domination.

nikic mentioned this in rG656296b1c2ec: Reapply [CaptureTracking] Do not check domination. May 16 2021, 6:49 AM

This has since landed (see above comments from @nikic).

Revision Contents

Path                                          Size
llvm/lib/Analysis/CaptureTracking.cpp         8 lines
llvm/test/Transforms/MemCpyOpt/callslot.ll    33 lines

Diff 303131

llvm/lib/Analysis/CaptureTracking.cpp

[127 lines not shown] bool isSafeToPrune(Instruction *I) {
             !BB->getTerminator()->getNumSuccessors())
           return true;
 
         SmallVector<BasicBlock*, 32> Worklist;
         Worklist.append(succ_begin(BB), succ_end(BB));
         return !isPotentiallyReachableFromMany(Worklist, BB, nullptr, DT);
       }
 
-      // If the value is defined in the same basic block as use and BeforeHere,
-      // there is no need to explore the use if BeforeHere dominates use.
-      // Check whether there is a path from I to BeforeHere.
-      if (BeforeHere != I && DT->dominates(BeforeHere, I) &&
+      // If the value is defined in a different basic block than BeforeHere,
+      // there is no need to explore the use if there is no path from I to
+      // BeforeHere.
+      if (BeforeHere != I &&
           !isPotentiallyReachable(I, BeforeHere, nullptr, DT))
         return true;
 
       return false;
     }
 
     bool shouldExplore(const Use *U) override {
       Instruction *I = cast<Instruction>(U->getUser());
[291 lines not shown]

llvm/test/Transforms/MemCpyOpt/callslot.ll

[180 lines not shown]
 ;
   %dest.i8 = bitcast [16 x i8]* %dest to i8*
   %src.i8 = bitcast [16 x i8]* %src to i8*
   call void @accept_ptr(i8* %dest.i8) ; capture
   call void @accept_ptr(i8* %src.i8) argmemonly
   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dest.i8, i8* %src.i8, i64 16, i1 false)
   ret void
 }
 
+; There is no path from the capture back to the memcpy.
+; So we can perform the call slot optimization.
+define void @capture_nopath_call_argmemonly(i1 %cond) {
+; CHECK-LABEL: @capture_nopath_call_argmemonly(
+; CHECK-NEXT: [[DEST:%.*]] = alloca [16 x i8], align 1
+; CHECK-NEXT: [[SRC:%.*]] = alloca [16 x i8], align 1
+; CHECK-NEXT: [[DEST_I8:%.*]] = bitcast [16 x i8]* [[DEST]] to i8*
+; CHECK-NEXT: [[SRC_I8:%.*]] = bitcast [16 x i8]* [[SRC]] to i8*
+; CHECK-NEXT: br i1 [[COND:%.*]], label [[CAPTURES:%.*]], label [[NOCAPTURES:%.*]]
+; CHECK: captures:
+; CHECK-NEXT: call void @accept_ptr(i8* [[DEST_I8]])
+; CHECK-NEXT: ret void
+; CHECK: nocaptures:
+; CHECK-NEXT: [[DEST1:%.*]] = bitcast [16 x i8]* [[DEST]] to i8*
+; CHECK-NEXT: call void @accept_ptr(i8* [[DEST1]]) [[ATTR5:#.*]]
+; CHECK-NEXT: ret void
+;
+  %dest = alloca [16 x i8]
+  %src = alloca [16 x i8]
+  %dest.i8 = bitcast [16 x i8]* %dest to i8*
+  %src.i8 = bitcast [16 x i8]* %src to i8*
+  br i1 %cond, label %captures, label %nocaptures
+
+captures:
+  call void @accept_ptr(i8* %dest.i8) ; capture
+  ret void
+
+nocaptures:
+  call void @accept_ptr(i8* %src.i8) argmemonly nounwind
+  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dest.i8, i8* %src.i8, i64 16, i1 false)
+  ret void
+}
+
 define void @capture_before_call_argmemonly_nounwind() {
 ; CHECK-LABEL: @capture_before_call_argmemonly_nounwind(
 ; CHECK-NEXT: [[DEST:%.*]] = alloca [16 x i8], align 1
 ; CHECK-NEXT: [[SRC:%.*]] = alloca [16 x i8], align 1
 ; CHECK-NEXT: [[DEST_I8:%.*]] = bitcast [16 x i8]* [[DEST]] to i8*
 ; CHECK-NEXT: [[SRC_I8:%.*]] = bitcast [16 x i8]* [[SRC]] to i8*
 ; CHECK-NEXT: call void @accept_ptr(i8* [[DEST_I8]])
 ; CHECK-NEXT: [[DEST1:%.*]] = bitcast [16 x i8]* [[DEST]] to i8*
[39 lines not shown]