This is an archive of the discontinued LLVM Phabricator instance.

[DA][NFC] Update publication - add remarks
Closed, Public

Authored by simoll on Nov 18 2021, 1:08 AM.

Details

Reviewers

nhaehnle
sameerds
arsenm
alex-t

Commits

rG56db1c072c92: [DA][NFC] Update publication - add remarks

Summary

Update the reference publication for the SyncDependenceAnalysis and Divergence Analysis. Fix phrasing, formatting. Add comments on reducible loop limitation.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

simoll created this revision. Nov 18 2021, 1:08 AM

Herald added subscribers: bmahjour, hiraditya. Nov 18 2021, 1:08 AM

simoll requested review of this revision. Nov 18 2021, 1:08 AM

Herald added a project: Restricted Project. Nov 18 2021, 1:08 AM

Herald added subscribers: llvm-commits, wdng.

simoll retitled this revision from "[DA] Update publication - add remarks" to "[DA][NFC] Update publication - add remarks". Nov 18 2021, 1:09 AM

Harbormaster completed remote builds in B134852: Diff 388124. Nov 18 2021, 1:55 AM

bmahjour removed a subscriber: bmahjour. Nov 18 2021, 5:48 AM

Not related to the changes proposed in this review, but it seems the PDT is not used by the implementation. It might be good to take out that dependency.

llvm/lib/Analysis/SyncDependenceAnalysis.cpp
Line 103 (sameerds): Good to see both these points spelled out. We are currently working on an implementation that works with irreducible control flow. It's still a work in progress, but involves the new "CycleInfo" being introduced in D112696. I do believe that the single pass of DFA is a strength that need not be lost when handling irreducible control flow. CycleInfo provides a predictable way to work around the lack of a unique header. One just needs to take extra care about entering the same irreducible loop multiple times when constructing ModifiedPO.
Lines 149–151 (sameerds): Is it correct to say that this is not critical for correctness? From what I understood, if other blocks (beyond the loop exit) got interleaved, the implementation will work, but unnecessarily visit and skip the interleaved blocks when walking backwards on the ModifiedPO.

Not related to the changes proposed in this review, but it seems the PDT is not used by the implementation. It might be good to take out that dependency.

All good points. PDT and the comment on strict loop compactness are leftovers. I will remove PDT in a followup patch.

llvm/lib/Analysis/SyncDependenceAnalysis.cpp
Line 103 (simoll): Regarding irreducibility, the part that was always unclear to me is whether the analysis result (detected joins, synchronization points) in IR still represents the synchronization we will end up getting with the final binary on actual hardware. So the issue is not really getting some result but making sure the entire stack agrees on synchronization. D85603 should help here. However, if a kernel is irreducible and doesn't use the tools of that patch, the documentation says that reconvergence should be maximal, as early as possible. This may be ambiguous in irreducible control flow (IIUC it is implementation-defined with cycles). This is less of an issue with reducible loops, as there is a (mostly unspoken) mutual understanding of where synchronization happens, and so all transformations will abide by that. This is starting to fade, however, as more recent hardware breaks with the traditions on synchronization.
Lines 149–151 (simoll): Correct. Exit blocks must have a lower number than the loop header in the traversal (to account for the virtual edge from header to exit). The PO should take care of all other dependences. Strict loop compactness is a relic from the earlier implementation that recursed on the loop tree during propagation. It may help with efficiency though.
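
The two properties distinguished in the reply above can be sketched as standalone checks. This is a hypothetical illustration, not code from the patch: blocks are plain integers, `PONumber` is an assumed map from block to its position in the modified post-order, and `LoopDesc` is an invented stand-in for LLVM's loop structure. The sketch assumes each block has a unique number and appears at most once in `Blocks`.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for a loop: its header, all member blocks
// (including the header), and the exit blocks outside the loop.
struct LoopDesc {
  int Header;
  std::vector<int> Blocks;
  std::vector<int> Exits;
};

// The ordering constraint named as required in the discussion: every exit
// block has a lower post-order number than the loop header.
bool exitsPrecedeHeader(const LoopDesc &L, const std::map<int, int> &PONumber) {
  for (int Exit : L.Exits)
    if (PONumber.at(Exit) >= PONumber.at(L.Header))
      return false;
  return true;
}

// Strict loop compactness: the numbers assigned to the blocks of the loop
// form a contiguous interval. Assumes unique numbers per block.
bool isLoopCompact(const LoopDesc &L, const std::map<int, int> &PONumber) {
  assert(!L.Blocks.empty() && "a loop contains at least its header");
  int Min = PONumber.at(L.Blocks.front());
  int Max = Min;
  for (int B : L.Blocks) {
    Min = std::min(Min, PONumber.at(B));
    Max = std::max(Max, PONumber.at(B));
  }
  return Max - Min + 1 == static_cast<int>(L.Blocks.size());
}
```

An enumeration can satisfy `exitsPrecedeHeader` while failing `isLoopCompact`, which is the situation the question about interleaved blocks describes.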

In D114146#3142269, @simoll wrote:

All good points. PDT and the comment on strict loop compactness are leftovers. I will remove PDT in a followup patch.

So will you be rephrasing the compactness in this patch? While I am convinced that it does not affect correctness, I have not thought about whether it affects accuracy, i.e., whether the lack of compactness may result in some places being marked divergent when they aren't.

llvm/lib/Analysis/SyncDependenceAnalysis.cpp
Line 103 (sameerds): Indeed, D85603 is a big part of the work being done for the AMDGPU backend. As part of its cleanup, I am hoping to make some conservative statements about the default behaviour when the intrinsics are not being used. The minimal goal is to gracefully handle irreducible control flow, unlike the current DA implementation where we just give up and assume everything is divergent. More as soon as I am in a confident place to make specific statements.

In D114146#3144499, @sameerds wrote:

In D114146#3142269, @simoll wrote:

All good points. PDT and the comment on strict loop compactness are leftovers. I will remove PDT in a followup patch.

So will you be rephrasing the compactness in this patch? While I am convinced that it does not affect correctness, I have not thought about whether it affects accuracy, i.e., whether the lack of compactness may result in some places being marked divergent when they aren't.

I won't touch the comment on loop compactness in this patch. The change does modify how we think about this, which shouldn't happen in the same commit as an updated reference.
Giving up loop compactness (not the order constraints on the loop header/exits though!) shouldn't even change the result of the analysis tbh.

sameerds accepted this revision. Nov 22 2021, 2:17 AM

This revision is now accepted and ready to land. Nov 22 2021, 2:17 AM

This revision was landed with ongoing or failed builds. Nov 22 2021, 3:59 AM

Closed by commit rG56db1c072c92: [DA][NFC] Update publication - add remarks (authored by simoll).

This revision was automatically updated to reflect the committed changes.

simoll added a commit: rG56db1c072c92: [DA][NFC] Update publication - add remarks.

Revision Contents

Path                                            Size
llvm/lib/Analysis/DivergenceAnalysis.cpp        10 lines
llvm/lib/Analysis/SyncDependenceAnalysis.cpp    59 lines

Diff 388868

llvm/lib/Analysis/DivergenceAnalysis.cpp

	// they re-converge.			// they re-converge.
	//			//
	// Due to this execution model, some optimizations such as jump			// Due to this execution model, some optimizations such as jump
	// threading and loop unswitching can interfere with thread re-convergence.			// threading and loop unswitching can interfere with thread re-convergence.
	// Therefore, an analysis that computes which branches in a GPU program are			// Therefore, an analysis that computes which branches in a GPU program are
	// divergent can help the compiler to selectively run these optimizations.			// divergent can help the compiler to selectively run these optimizations.
	//			//
	// This implementation is derived from the Vectorization Analysis of the			// This implementation is derived from the Vectorization Analysis of the
	// Region Vectorizer (RV). That implementation in turn is based on the approach			// Region Vectorizer (RV). The analysis is based on the approach described in
	// described in
	//			//
	// Improving Performance of OpenCL on CPUs			// An abstract interpretation for SPMD divergence
	// Ralf Karrenberg and Sebastian Hack			// on reducible control flow graphs.
	// CC '12			// Julian Rosemann, Simon Moll and Sebastian Hack
				// POPL '21
	//			//
	// This implementation is generic in the sense that it does			// This implementation is generic in the sense that it does
	// not itself identify original sources of divergence.			// not itself identify original sources of divergence.
	// Instead specialized adapter classes, (LoopDivergenceAnalysis) for loops and			// Instead specialized adapter classes, (LoopDivergenceAnalysis) for loops and
	// (DivergenceAnalysis) for functions, identify the sources of divergence			// (DivergenceAnalysis) for functions, identify the sources of divergence
	// (e.g., special variables that hold the thread ID or the iteration variable).			// (e.g., special variables that hold the thread ID or the iteration variable).
	//			//
	// The generic implementation propagates divergence to variables that are data			// The generic implementation propagates divergence to variables that are data

llvm/lib/Analysis/SyncDependenceAnalysis.cpp

	// the set of basic blocks whose phi nodes become divergent due to divergent			// the set of basic blocks whose phi nodes become divergent due to divergent
	// control. These are the blocks that are reachable by two disjoint paths from			// control. These are the blocks that are reachable by two disjoint paths from
	// the branch or loop exits that have a reaching path that is disjoint from a			// the branch or loop exits that have a reaching path that is disjoint from a
	// path to the loop latch.			// path to the loop latch.
	//			//
	// The SyncDependenceAnalysis is used in the DivergenceAnalysis to model			// The SyncDependenceAnalysis is used in the DivergenceAnalysis to model
	// control-induced divergence in phi nodes.			// control-induced divergence in phi nodes.
	//			//
	// -- Summary --			//
	// The SyncDependenceAnalysis lazily computes sync dependences [3].			// -- Reference --
	// The analysis evaluates the disjoint path criterion [2] by a reduction			// The algorithm is presented in Section 5 of
	// to SSA construction. The SSA construction algorithm is implemented as			//
	// a simple data-flow analysis [1].			// An abstract interpretation for SPMD divergence
	//			// on reducible control flow graphs.
	// [1] "A Simple, Fast Dominance Algorithm", SPI '01, Cooper, Harvey and Kennedy			// Julian Rosemann, Simon Moll and Sebastian Hack
	// [2] "Efficiently Computing Static Single Assignment Form			// POPL '21
	// and the Control Dependence Graph", TOPLAS '91,			//
	// Cytron, Ferrante, Rosen, Wegman and Zadeck
	// [3] "Improving Performance of OpenCL on CPUs", CC '12, Karrenberg and Hack
	// [4] "Divergence Analysis", TOPLAS '13, Sampaio, Souza, Collange and Pereira
	//			//
	// -- Sync dependence --			// -- Sync dependence --
	// Sync dependence [4] characterizes the control flow aspect of the			// Sync dependence characterizes the control flow aspect of the
	// propagation of branch divergence. For example,			// propagation of branch divergence. For example,
	//			//
	// %cond = icmp slt i32 %tid, 10			// %cond = icmp slt i32 %tid, 10
	// br i1 %cond, label %then, label %else			// br i1 %cond, label %then, label %else
	// then:			// then:
	// br label %merge			// br label %merge
	// else:			// else:
	// br label %merge			// br label %merge
	// merge:			// merge:
	// %a = phi i32 [ 0, %then ], [ 1, %else ]			// %a = phi i32 [ 0, %then ], [ 1, %else ]
	//			//
	// Suppose %tid holds the thread ID. Although %a is not data dependent on %tid			// Suppose %tid holds the thread ID. Although %a is not data dependent on %tid
	// because %tid is not on its use-def chains, %a is sync dependent on %tid			// because %tid is not on its use-def chains, %a is sync dependent on %tid
	// because the branch "br i1 %cond" depends on %tid and affects which value %a			// because the branch "br i1 %cond" depends on %tid and affects which value %a
	// is assigned to.			// is assigned to.
	//			//
				//
	// -- Reduction to SSA construction --			// -- Reduction to SSA construction --
	// There are two disjoint paths from A to X, if a certain variant of SSA			// There are two disjoint paths from A to X, if a certain variant of SSA
	// construction places a phi node in X under the following set-up scheme [2].			// construction places a phi node in X under the following set-up scheme.
	//			//
	// This variant of SSA construction ignores incoming undef values.			// This variant of SSA construction ignores incoming undef values.
	// That is paths from the entry without a definition do not result in			// That is paths from the entry without a definition do not result in
	// phi nodes.			// phi nodes.
	//			//
	// entry			// entry
	// / \			// / \
	// A \			// A \
	// / \ Y			// / \ Y
	// B C /			// B C /
	// \ / \ /			// \ / \ /
	// D E			// D E
	// \ /			// \ /
	// F			// F
				//
	// Assume that A contains a divergent branch. We are interested			// Assume that A contains a divergent branch. We are interested
	// in the set of all blocks where each block is reachable from A			// in the set of all blocks where each block is reachable from A
	// via two disjoint paths. This would be the set {D, F} in this			// via two disjoint paths. This would be the set {D, F} in this
	// case.			// case.
	// To generally reduce this query to SSA construction we introduce			// To generally reduce this query to SSA construction we introduce
	// a virtual variable x and assign to x different values in each			// a virtual variable x and assign to x different values in each
	// successor block of A.			// successor block of A.
				//
	// entry			// entry
	// / \			// / \
	// A \			// A \
	// / \ Y			// / \ Y
	// x = 0 x = 1 /			// x = 0 x = 1 /
	// \ / \ /			// \ / \ /
	// D E			// D E
	// \ /			// \ /
	// F			// F
				//
	// Our flavor of SSA construction for x will construct the following			// Our flavor of SSA construction for x will construct the following
				//
	// entry			// entry
	// / \			// / \
	// A \			// A \
	// / \ Y			// / \ Y
	// x0 = 0 x1 = 1 /			// x0 = 0 x1 = 1 /
	// \ / \ /			// \ / \ /
	// x2=phi E			// x2 = phi E
	// \ /			// \ /
	// x3=phi			// x3 = phi
				//
	// The blocks D and F contain phi nodes and are thus each reachable			// The blocks D and F contain phi nodes and are thus each reachable
	// by two disjoins paths from A.			// by two disjoins paths from A.
	//			//
	// -- Remarks --			// -- Remarks --
	// In case of loop exits we need to check the disjoint path criterion for loops			// * In case of loop exits we need to check the disjoint path criterion for loops.
	// [2]. To this end, we check whether the definition of x differs between the			// To this end, we check whether the definition of x differs between the
	// loop exit and the loop header (_after_ SSA construction).			// loop exit and the loop header (_after_ SSA construction).
	//			//
				// -- Known Limitations & Future Work --
				[Inline comments by sameerds and simoll on this line; the full text appears in the discussion above under line 103.]
				// * The algorithm requires reducible loops because the implementation
				// implicitly performs a single iteration of the underlying data flow analysis.
				// This was done for pragmatism, simplicity and speed.
				//
				// Relevant related work for extending the algorithm to irreducible control:
				// A simple algorithm for global data flow analysis problems.
				// Matthew S. Hecht and Jeffrey D. Ullman.
				// SIAM Journal on Computing, 4(4):519–532, December 1975.
				//
				// * Another reason for requiring reducible loops is that points of
				// synchronization in irreducible loops aren't 'obvious' - there is no unique
				// header where threads 'should' synchronize when entering or coming back
				// around from the latch.
				//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	#include "llvm/Analysis/SyncDependenceAnalysis.h"			#include "llvm/Analysis/SyncDependenceAnalysis.h"
	#include "llvm/ADT/PostOrderIterator.h"			#include "llvm/ADT/PostOrderIterator.h"
	#include "llvm/ADT/SmallPtrSet.h"			#include "llvm/ADT/SmallPtrSet.h"
	#include "llvm/Analysis/PostDominators.h"			#include "llvm/Analysis/PostDominators.h"
	#include "llvm/IR/BasicBlock.h"			#include "llvm/IR/BasicBlock.h"
	#include "llvm/IR/CFG.h"			#include "llvm/IR/CFG.h"
	#include "llvm/IR/Dominators.h"			#include "llvm/IR/Dominators.h"
	[... 15 lines not shown ...]
	// pretend it had different successors.			// pretend it had different successors.
	namespace {			namespace {
	using namespace llvm;			using namespace llvm;

	// Custom Post-Order Traveral			// Custom Post-Order Traveral
	//			//
	// We cannot use the vanilla (R)PO computation of LLVM because:			// We cannot use the vanilla (R)PO computation of LLVM because:
	// * We (virtually) modify the CFG.			// * We (virtually) modify the CFG.
	// * We want a loop-compact block enumeration, that is the numbers assigned by			// * We want a loop-compact block enumeration, that is the numbers assigned to
	// the traveral to the blocks of a loop are an interval.			// blocks of a loop form an interval
				//
				[Inline comments by sameerds and simoll on this line; the full text appears in the discussion above under lines 149–151.]
	using POCB = std::function<void(const BasicBlock &)>;			using POCB = std::function<void(const BasicBlock &)>;
	using VisitedSet = std::set<const BasicBlock *>;			using VisitedSet = std::set<const BasicBlock *>;
	using BlockStack = std::vector<const BasicBlock *>;			using BlockStack = std::vector<const BasicBlock *>;

	// forward			// forward
	static void computeLoopPO(const LoopInfo &LI, Loop &Loop, POCB CallBack,			static void computeLoopPO(const LoopInfo &LI, Loop &Loop, POCB CallBack,
	VisitedSet &Finalized);			VisitedSet &Finalized);
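
The "Reduction to SSA construction" described in the SyncDependenceAnalysis.cpp header comment of this diff can be sketched for the acyclic case as follows. This is a minimal illustration under stated assumptions, not the LLVM implementation: blocks are strings, `CFG`, `Block` and `computeJoins` are hypothetical names, the graph must be acyclic, and the caller supplies the blocks in reverse post-order. Loops, loop exits and the modified post-order are not handled.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using Block = std::string;
using CFG = std::map<Block, std::vector<Block>>; // block -> successors

// Returns the blocks in which SSA construction for a virtual variable x
// would place a phi node, when each successor of Branch assigns a distinct
// value to x. Paths that carry no definition of x (undef) are ignored.
std::set<Block> computeJoins(const CFG &Succs, const std::vector<Block> &RPO,
                             const Block &Branch) {
  assert(Succs.count(Branch) && "branch block must have successors");
  std::map<Block, Block> Def; // block -> block that defines the reaching x
  std::set<Block> Joins;

  // Each successor of the divergent branch defines its own value of x.
  for (const Block &S : Succs.at(Branch))
    Def[S] = S;

  // One pass in reverse post-order suffices on an acyclic graph because
  // every predecessor is visited before its successors.
  for (const Block &B : RPO) {
    auto It = Def.find(B);
    if (It == Def.end())
      continue; // no definition of x reaches B
    const Block Label = It->second;
    auto SuccIt = Succs.find(B);
    if (SuccIt == Succs.end())
      continue; // B has no successors
    for (const Block &S : SuccIt->second) {
      auto Res = Def.emplace(S, Label);
      if (!Res.second && Res.first->second != Label) {
        // Two different definitions meet in S: a phi node is placed there,
        // and the phi acts as a new definition of x.
        Joins.insert(S);
        Res.first->second = S;
      }
    }
  }
  return Joins;
}
```

For the example graph in the header comment (edges entry->A, entry->Y, A->B, A->C, B->D, C->D, C->E, Y->E, D->F, E->F) with the divergent branch in A, the result is the set {D, F}, matching the comment.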
