This is an archive of the discontinued LLVM Phabricator instance.

[Polly] Allow alloca instructions in the SCoP
Needs ReviewPublic

Authored by jdoerfert on Oct 7 2015, 1:49 PM.

Download Raw Diff

Details

Reviewers

Summary

Alloca instructions are no different to other instructions and can
therefore be allowed in the SCoP.

Diff Detail

Event Timeline

jdoerfert updated this revision to Diff 36791.Oct 7 2015, 1:49 PM

jdoerfert retitled this revision from to [Polly] Allow alloca instructions in the SCoP.

jdoerfert added reviewers: grosser, Meinersbur.

jdoerfert updated this object.

jdoerfert added a subscriber: Restricted Project.

alloca is most definitely not like any other instruction. When placed in a loop, the stack will grow in every iteration an eventually lead to a stack overflow. Such code is more than evil. How did you even manage to make Clang output such code? It should also have added stacksave/stackrestore intrinsics.

The question is, whether it matters for Polly. How does it handle the situation that %B refers to different memory in every iteration? Are there infinitely many arrays in the scop, since there are infinitely many base addresses? Are base addresses that are variant within the scop even valid?

Since the semantics of allocas in loops are dangerous and there is probably no valid use for them without stackrestore, I strongly recommend against this.

Pretty much, there is a lot of problem that arise when you put alloca in random places.

I am also surprised 'alloc' is not different to any other instruction.

We can possibly support it in some way, but this requires some additional thoughts similar to what Michael suggested. Will the varying base pointer cause troubles, .../. Can you dead-code-eliminate allocs? Can you reorder allocs?

Do you actually have a use case that produces such kind of code and that makes it worth to support this case? As deadalnix pointed out, allocas outside of the entry block have a lot of surprising behaviors (and are e.g. not even considered by mem2reg for good reasons).

Best,
Tobias

jdoerfert added a comment.Oct 8 2015, 3:11 AM

This comment was removed by jdoerfert.

Some comments from me. If anybody has an example (code or situation) that might be a problem, please feel free to share it.

In D13529#262273, @Meinersbur wrote:

alloca is most definitely not like any other instruction. When placed in a loop, the stack will grow in every iteration an eventually lead to a stack overflow. Such code is more than evil. How did you even manage to make Clang output such code? It should also have added stacksave/stackrestore intrinsics.

Yes alloca has a side effect but none that needs to be modeled, hence my commit message. We will not change/increase the domain of an alloca, thus if it was executed N times before and allocated stack space each time it will do so after Polly too. "Evil code" stays evil, good code stays good.
Clan generates allocas in regions (conditionals/loops) sometimes, this output was placed by hand but in the llvm-test suite + some benchmark suites I know we have alloca's in loops.
It would probably add livetime intrinsics not stacksave. The former we handle by ignoring them.

The question is, whether it matters for Polly. How does it handle the situation that %B refers to different memory in every iteration? Are there infinitely many arrays in the scop, since there are infinitely many base addresses? Are base addresses that are variant within the scop even valid?

This is the interesting question, thanks for asking. With this commit we will not model that %B = alloca will ever change, hence we will model less locations. However:

When you execute %B = alloca in a loop you loose your handle to %B each iteration and can only access the old stack space via a pointer that escaped in a phi node. The base address for those will be the phi. Nevertheless, this would again model less locations than there actually are.
With this patch we model the locations only once (maybe a small number of times if the alloca escapes in a phi, see 1)) it remains therefor to argue that this is safe. I argue it can be. (There might be something missing here but so far this patch only deleted code ;))

Since the semantics of allocas in loops are dangerous and there is probably no valid use for them without stackrestore, I strongly recommend against this.

3 out of 4 things we do have tricky semantics. I would not call allocas dangerous.
As mentioned before, LLVM-Test suite as well as benchmarks (e.g., alignment in BOTS) have allocas in loops. It happens and for the later we could do some nice optimizations.

Btw, stacksave/restore is only used/needed for variable sized arrays, I guess you mean the lifetime markers.

In D13529#262273, @deadalnix wrote:

Pretty much, there is a lot of problem that arise when you put alloca in random places.

True, but we don't put anything in random places. Dependences are computed for the allocated locations as well as the alloca itself and they are obeyed during scheduling/code generation. Additionally, the domain (number of executions) of the alloca is not changed.

In D13529#262273, @grosser wrote:

I am also surprised 'alloc' is not different to any other instruction.

At least not if you don't want to benefit from the semantics. A sound over-approximation should be to just model them as this patch does.
Later we might want to consolidate allocas or model the different memory locations (and the lifetime markers) to get rid of dependences that do not exist.

We can possibly support it in some way, but this requires some additional thoughts similar to what Michael suggested. Will the varying base pointer cause troubles, .../. Can you dead-code-eliminate allocs? Can you reorder allocs?

I don't think anything that is enabled by default will cause trouples. dead-iteration-elimination might, I will come back to you on this. Reordering is fine as the semantics of IR does not gurantee a stack layout.

Do you actually have a use case that produces such kind of code and that makes it worth to support this case? As deadalnix pointed out, allocas outside of the entry block have a lot of surprising behaviors (and are e.g. not even considered by mem2reg for good reasons).!

My use case for now are the LLVM-Test suite regions with allocas but later benchmarks like alignment in BOTS. I intend to model lifetime markers too at some
point to enable transformations that are prevented now by spurious dependences.
The reason mem2reg doesn't support them is unknown to me, if you have any insights please share them so we can discuss if it makes problems for us.

In D13529#262640, @jdoerfert wrote:

Some comments from me. If anybody has an example (code or situation) that might be a problem, please feel free to share it.

At the moment I can think of three situations where this breaks:

Invariant Load hoisting. Because the index is constant (but not the BaseAddrs), invariant load hoisting might move the load to before the scop. This yields invalid code because the alloca is not dominating it.
My De-LICM approach will reuse elements in arrays when known to be overwritten. If the base address of an array changes, the element will point to some new location, effectively forgetting what was written in that element.
Such alloca's size might depend on ivs/values computed within the scop. Delinirization might interpret it as index. If not, the size expression suddenly depends on instructions within the scop. It is not even an affine term anymore, although the index is.

Polly was written with the assumption that base addresses are defined and fixed during the scop's execution. IMHO it is not understood enough to remove the check with a single test case. It changes the meaning of a base address from an array in the sense of the polyhedral model to maybe "pointer to memory that currently does not alias with anything else".

Yes alloca has a side effect but none that needs to be modeled, hence my commit message.

The commit message is still wrong. "Not necessary to be modeled" is not "no different to other instructions"

We will not change/increase the domain of an alloca, thus if it was executed N times before and allocated stack space each time it will do so after Polly too. "Evil code" stays evil, good code stays good.

IMHO there is no need to optimize code that should not have been written in the first place.

Clan generates allocas in regions (conditionals/loops) sometimes, this output was placed by hand but in the llvm-test suite + some benchmark suites I know we have alloca's in loops.

What is "sometimes"? To my best knowledge Clang will always put allocas in the entry block if the the size is known in advance. The expections are C99 VLAs which are put between stacksave and stackrestore - and manual calls to the alloca builtins.

It would probably add livetime intrinsics not stacksave. The former we handle by ignoring them.

Nope. But maybe you have an actual example when clang generates an an alloca within a loop without stacksave to prove me wrong. Lifetime intrinsics don't matter here.

The question is, whether it matters for Polly. How does it handle the situation that %B refers to different memory in every iteration? Are there infinitely many arrays in the scop, since there are infinitely many base addresses? Are base addresses that are variant within the scop even valid?

This is the interesting question, thanks for asking. With this commit we will not model that %B = alloca will ever change, hence we will model less locations. However:

When you execute %B = alloca in a loop you loose your handle to %B each iteration and can only access the old stack space via a pointer that escaped in a phi node. The base address for those will be the phi. Nevertheless, this would again model less locations than there actually are.

With this patch we model the locations only once (maybe a small number of times if the alloca escapes in a phi, see 1)) it remains therefor to argue that this is safe. I argue it can be. (There might be something missing here but so far this patch only deleted code ;))

Will there be correct RAW/WAW/WAR dependencies if it escapes through a phi? Explain how.

Since the semantics of allocas in loops are dangerous and there is probably no valid use for them without stackrestore, I strongly recommend against this.

3 out of 4 things we do have tricky semantics. I would not call allocas dangerous.

allocas in loops without stackrestore.

Your test code is "designed" to crash after at most 26215 iterations. (assuming 1 MB stack size). It might work because the actual number of iterations is constant.

As mentioned before, LLVM-Test suite as well as benchmarks (e.g., alignment in BOTS) have allocas in loops. It happens and for the later we could do some nice optimizations.

Can you show what IR generates from what C source?

Btw, stacksave/restore is only used/needed for variable sized arrays, I guess you mean the lifetime markers.

Yes, VLAs, and I really mean stacksave/stackrestore. I even tried out bofore posting my first comment.

If not a VLA, clang puts it into the entry BB.

etherzhhb added a subscriber: etherzhhb.Feb 15 2016, 7:59 AM

This comment was removed by etherzhhb.

etherzhhb added a comment.Feb 15 2016, 8:00 AM

This comment was removed by etherzhhb.

We should move this discussion to the patch page D13529 once phab is
available again.

Here we are.

You can deal with the inlineing once we cross that bridge. However, just
as a first idea (which has nothing to do with D13529):

An alloca that is inlined and was in the function entry of the callee
(which is the case most of the time) [or even at any other block that
was not part of a loop] can be placed in the function entry of the
caller, thus not increasing the maximal stack usage. We can even place
appropriate llvm.lifetime.start/stop markers to help futher analyses.

Let's focus on one thing at a time (which would be a simple alloca).

This was what I suggesting as well (at the email's end). But there is no need for modeling of allocas in SCoPs in these simple cases. Hongbin's original email started trying that.

You mention stacksave/stackrestore a lot but I am still not sure why.

stacksave/stackrestore is essential for correctly handling alloca in the general case. Clang will insert them for variable length variables (CodeGenFunction::EmitAutoVarAlloca), all others will be in the entry block. The only other possibility I know to get such an alloca (without stacksave/stackrestore) is __builtin_alloca()/alloca() that are hopefully not used in loops or will likely trigger undefined behaviour. We should ignore such cases.

Are those benchmarks in which you encountered them of any importance for Polly? (Barcelona OpenMP Task Suite is not part of LLVM's test suite and there is no alloca() in its alignment test. I could find one in in OpenMP loop in nqueens and floorplan each, which i evil enough but OpenMP compilation should put them into separate functions).

In test-suite, I found that oggenc.c contains such allocas, but to allocate array elements:

for(i=0;i<vi1->channels;i++)
  lappcm[i]=alloca(sizeof(**lappcm)*n1);

That's something we cannot loop-optimize.

Another occurance is in gawk's builtin.c with a comment excusing its use. (And memcpy call in the loop, so Polly won't optimize it either)

For inlining, blacklisting stacksave/stackrestore in ScopDetection is not enough. Handled naively, we would compile

void foo(int m) {
	float A[m];
}

void bar(int n) {
	for (int i = 0; i < n; i += 1) {
		foo(i);
	}
}

into

define void @bar(i32 %n) #0 {
entry:
  %n.addr = alloca i32, align 4
  %i = alloca i32, align 4
  store i32 %n, i32* %n.addr, align 4
  store i32 0, i32* %i, align 4
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %0 = load i32, i32* %i, align 4
  %1 = load i32, i32* %n.addr, align 4
  %cmp = icmp slt i32 %0, %1
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %2 = load i32, i32* %i, align 4
  %3 = zext i32 %2 to i64
  %vla = alloca float, i64 %3, align 16
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %6 = load i32, i32* %i, align 4
  %add = add nsw i32 %6, 1
  store i32 %add, i32* %i, align 4
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret void
}

which will stack overflow for n high large enough where the source program did not, ie. this is a miscompile.

Allocas in ramdon location. We can discuss this case form the perspective of correctness and efficiency.

For correctness, I assume that allocas in any location can be hoist to the entry of the function without changing the behavior the function (exclude causing a stack overflow ... we assume the stack is infinite here), otherwise, it is the function source code itself who has the problem (e.g. accessing uninitialized memory). Hence moving allocas to the entry of the function (including from the entry of an inlined callee to the entry of the caller) is a correct transformation. (Maybe recursive function call will break this assumption, and make it incorrect to move alloca to the entry of the function?)

This is not correct, even for non-recursive functions. First, the size may depend on an induction variable or something computed in the loop. Second, this would collapse multiple allocations into one, creating aliasing where none has been before (The %alloca of a previous loop iteration might still be available through a phi or storing it into a temporary location such as the oggenc.c example)

Hi Johannes,

it seems the discussion seems to have been stalled and this patch is currently not be worked on. I will drop out as an reviewer for now to clean up my review queue. Feel free to add me back in in case you want to pick up this discussion.

Best,
Tobias

PS: I am aware that some of your reviews may have been dropped due to inefficiencies in our review process. One motivation of this cleanup is to make the review process more efficiently. This is not a sign that I am not interested in this work, but rather a move to make sure I am aware of the top reviews.

"I am aware of the top reviews" -> "I am aware of the currently active reviews"

Actually dropping out myself.

Revision Contents

Path

Size

include/

polly/

ScopDetectionDiagnostic.h

22 lines

lib/

Analysis/

ScopDetection.cpp

8 lines

ScopDetectionDiagnostic.cpp

18 lines

test/

Isl/

CodeGen/

alloca-in-region.ll

106 lines

Diff 36791

include/polly/ScopDetectionDiagnostic.h

Show First 20 Lines • Show All 85 Lines • ▼ Show 20 Lines	enum RejectReasonKind {

rrkAlias,		rrkAlias,

rrkSimpleLoop,		rrkSimpleLoop,

// Other		// Other
rrkOther,		rrkOther,
rrkIntToPtr,		rrkIntToPtr,
rrkAlloca,
rrkUnknownInst,		rrkUnknownInst,
rrkEntry,		rrkEntry,
rrkUnprofitable,		rrkUnprofitable,
rrkLastOther		rrkLastOther
};		};

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
/// @brief Base class of all reject reasons found during Scop detection.		/// @brief Base class of all reject reasons found during Scop detection.
▲ Show 20 Lines • Show All 570 Lines • ▼ Show 20 Lines	public:
/// @name RejectReason interface		/// @name RejectReason interface
//@{		//@{
virtual std::string getMessage() const override;		virtual std::string getMessage() const override;
virtual const DebugLoc &getDebugLoc() const override;		virtual const DebugLoc &getDebugLoc() const override;
//@}		//@}
};		};

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
/// @brief Captures errors with alloca instructions.
class ReportAlloca : public ReportOther {
//===--------------------------------------------------------------------===//
Instruction *Inst;

public:
ReportAlloca(Instruction *Inst);

/// @name LLVM-RTTI interface
//@{
static bool classof(const RejectReason *RR);
//@}

/// @name RejectReason interface
//@{
virtual std::string getMessage() const override;
virtual const DebugLoc &getDebugLoc() const override;
//@}
};

//===----------------------------------------------------------------------===//
/// @brief Captures errors with unknown instructions.		/// @brief Captures errors with unknown instructions.
class ReportUnknownInst : public ReportOther {		class ReportUnknownInst : public ReportOther {
//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
Instruction *Inst;		Instruction *Inst;

public:		public:
ReportUnknownInst(Instruction *Inst);		ReportUnknownInst(Instruction *Inst);

▲ Show 20 Lines • Show All 58 Lines • Show Last 20 Lines

lib/Analysis/ScopDetection.cpp

Show First 20 Lines • Show All 726 Lines • ▼ Show 20 Lines	bool ScopDetection::isValidInstruction(Instruction &Inst,
// We only check the call instruction but not invoke instruction.		// We only check the call instruction but not invoke instruction.
if (CallInst *CI = dyn_cast<CallInst>(&Inst)) {		if (CallInst *CI = dyn_cast<CallInst>(&Inst)) {
if (isValidCallInst(*CI))		if (isValidCallInst(*CI))
return true;		return true;

return invalid<ReportFuncCall>(Context, /Assert=/true, &Inst);		return invalid<ReportFuncCall>(Context, /Assert=/true, &Inst);
}		}

if (!Inst.mayWriteToMemory() && !Inst.mayReadFromMemory()) {		if (!Inst.mayWriteToMemory() && !Inst.mayReadFromMemory())
if (!isa<AllocaInst>(Inst))
return true;		return true;

return invalid<ReportAlloca>(Context, /Assert=/true, &Inst);
}

// Check the access function.		// Check the access function.
if (isa<LoadInst>(Inst) \|\| isa<StoreInst>(Inst)) {		if (isa<LoadInst>(Inst) \|\| isa<StoreInst>(Inst)) {
Context.hasStores \|= isa<StoreInst>(Inst);		Context.hasStores \|= isa<StoreInst>(Inst);
Context.hasLoads \|= isa<LoadInst>(Inst);		Context.hasLoads \|= isa<LoadInst>(Inst);
return isValidMemoryAccess(Inst, Context);		return isValidMemoryAccess(Inst, Context);
}		}

// We do not know this instruction, therefore we assume it is invalid.		// We do not know this instruction, therefore we assume it is invalid.
▲ Show 20 Lines • Show All 457 Lines • Show Last 20 Lines

lib/Analysis/ScopDetectionDiagnostic.cpp

Show First 20 Lines • Show All 446 Lines • ▼ Show 20 Lines	const DebugLoc &ReportIntToPtr::getDebugLoc() const {
return BaseValue->getDebugLoc();		return BaseValue->getDebugLoc();
}		}

bool ReportIntToPtr::classof(const RejectReason *RR) {		bool ReportIntToPtr::classof(const RejectReason *RR) {
return RR->getKind() == rrkIntToPtr;		return RR->getKind() == rrkIntToPtr;
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// ReportAlloca.

ReportAlloca::ReportAlloca(Instruction *Inst)
: ReportOther(rrkAlloca), Inst(Inst) {}

std::string ReportAlloca::getMessage() const {
return "Alloca instruction: " + *Inst;
}

const DebugLoc &ReportAlloca::getDebugLoc() const {
return Inst->getDebugLoc();
}

bool ReportAlloca::classof(const RejectReason *RR) {
return RR->getKind() == rrkAlloca;
}

//===----------------------------------------------------------------------===//
// ReportUnknownInst.		// ReportUnknownInst.

ReportUnknownInst::ReportUnknownInst(Instruction *Inst)		ReportUnknownInst::ReportUnknownInst(Instruction *Inst)
: ReportOther(rrkUnknownInst), Inst(Inst) {}		: ReportOther(rrkUnknownInst), Inst(Inst) {}

std::string ReportUnknownInst::getMessage() const {		std::string ReportUnknownInst::getMessage() const {
return "Unknown instruction: " + *Inst;		return "Unknown instruction: " + *Inst;
}		}
▲ Show 20 Lines • Show All 51 Lines • Show Last 20 Lines

test/Isl/CodeGen/alloca-in-region.ll

This file was added.

				; RUN: opt %loadPolly -polly-codegen -mem2reg -S < %s \| FileCheck %s
				;
				; void f(int *A) {
				; for (int i = 0; i < 10; i++) {
				; int B[10];
				; for (int j = 0; j < 10; j++)
				; B[j] = 0;
				; for (int j = 0; j < i; j++)
				; B[j] += A[i];
				; for (int j = 0; j < i; j++)
				; A[i] += B[j];
				; }
				; }
				;
				;
				; CHECK: polly.stmt.for.body:
				; CHECK-NEXT: %p_B = alloca [10 x i32], align 16
				;
				; CHECK: polly.stmt.for.body.3:
				; CHECK-NEXT: %p_arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* %p_B, i64 0, i64 %polly.indvar4
				;
				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				define void @f(i32* %A) {
				entry:
				br label %for.cond

				for.cond: ; preds = %for.inc.27, %entry
				%indvars.iv12 = phi i64 [ %indvars.iv.next13, %for.inc.27 ], [ 0, %entry ]
				%indvars.iv8 = phi i32 [ %indvars.iv.next9, %for.inc.27 ], [ 0, %entry ]
				%exitcond14 = icmp ne i64 %indvars.iv12, 10
				br i1 %exitcond14, label %for.body, label %for.end.29

				for.body: ; preds = %for.cond
				%B = alloca [10 x i32], align 16
				br label %for.cond.1

				for.cond.1: ; preds = %for.inc, %for.body
				%indvars.iv = phi i64 [ %indvars.iv.next, %for.inc ], [ 0, %for.body ]
				%exitcond = icmp ne i64 %indvars.iv, 10
				br i1 %exitcond, label %for.body.3, label %for.end

				for.body.3: ; preds = %for.cond.1
				%arrayidx = getelementptr inbounds [10 x i32], [10 x i32]* %B, i64 0, i64 %indvars.iv
				store i32 0, i32* %arrayidx, align 4
				br label %for.inc

				for.inc: ; preds = %for.body.3
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				br label %for.cond.1

				for.end: ; preds = %for.cond.1
				br label %for.cond.5

				for.cond.5: ; preds = %for.inc.12, %for.end
				%indvars.iv1 = phi i64 [ %indvars.iv.next2, %for.inc.12 ], [ 0, %for.end ]
				%lftr.wideiv = trunc i64 %indvars.iv1 to i32
				%exitcond5 = icmp ne i32 %lftr.wideiv, %indvars.iv8
				br i1 %exitcond5, label %for.body.7, label %for.end.14

				for.body.7: ; preds = %for.cond.5
				%arrayidx9 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv12
				%tmp = load i32, i32* %arrayidx9, align 4
				%arrayidx11 = getelementptr inbounds [10 x i32], [10 x i32]* %B, i64 0, i64 %indvars.iv1
				%tmp15 = load i32, i32* %arrayidx11, align 4
				%add = add nsw i32 %tmp15, %tmp
				store i32 %add, i32* %arrayidx11, align 4
				br label %for.inc.12

				for.inc.12: ; preds = %for.body.7
				%indvars.iv.next2 = add nuw nsw i64 %indvars.iv1, 1
				br label %for.cond.5

				for.end.14: ; preds = %for.cond.5
				br label %for.cond.16

				for.cond.16: ; preds = %for.inc.24, %for.end.14
				%indvars.iv6 = phi i64 [ %indvars.iv.next7, %for.inc.24 ], [ 0, %for.end.14 ]
				%lftr.wideiv10 = trunc i64 %indvars.iv6 to i32
				%exitcond11 = icmp ne i32 %lftr.wideiv10, %indvars.iv8
				br i1 %exitcond11, label %for.body.18, label %for.end.26

				for.body.18: ; preds = %for.cond.16
				%arrayidx20 = getelementptr inbounds [10 x i32], [10 x i32]* %B, i64 0, i64 %indvars.iv6
				%tmp16 = load i32, i32* %arrayidx20, align 4
				%arrayidx22 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv12
				%tmp17 = load i32, i32* %arrayidx22, align 4
				%add23 = add nsw i32 %tmp17, %tmp16
				store i32 %add23, i32* %arrayidx22, align 4
				br label %for.inc.24

				for.inc.24: ; preds = %for.body.18
				%indvars.iv.next7 = add nuw nsw i64 %indvars.iv6, 1
				br label %for.cond.16

				for.end.26: ; preds = %for.cond.16
				br label %for.inc.27

				for.inc.27: ; preds = %for.end.26
				%indvars.iv.next13 = add nuw nsw i64 %indvars.iv12, 1
				%indvars.iv.next9 = add nuw nsw i32 %indvars.iv8, 1
				br label %for.cond

				for.end.29: ; preds = %for.cond
				ret void
				}