This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
include/llvm/Analysis/
-
llvm/
-
Analysis/
-
TargetTransformInfo.h
-
lib/
-
Analysis/
15/24
TargetTransformInfo.cpp
-
CodeGen/
4/10
HardwareLoops.cpp
-
test/CodeGen/PowerPC/
-
CodeGen/
-
PowerPC/
-
loop-instr-prep-non-const-increasement.ll
-
mma-phi-accs.ll

Differential D109676

[HardwareLoops] put +1 for loop count before zero extension
Needs ReviewPublic

Authored by shchenz on Sep 13 2021, 2:47 AM.

Download Raw Diff

Details

Reviewers

samparker
JanekvO
nikic
jsji
dmgreen
reames

Summary

Putting the +1 before the zero-extend will allow scalar evolution to fold the expression in some cases.

This issue is first tried in https://reviews.llvm.org/D91724 and seems there is some issue with that patch. This patch fixes the issues in that patch.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

shchenz created this revision.Sep 13 2021, 2:47 AM

Herald added subscribers: hiraditya, nemanjai. · View Herald TranscriptSep 13 2021, 2:47 AM

shchenz requested review of this revision.Sep 13 2021, 2:47 AM

Herald added a project: Restricted Project. · View Herald TranscriptSep 13 2021, 2:47 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B123634: Diff 372191.Sep 13 2021, 3:36 AM

reames added inline comments.Sep 13 2021, 10:03 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
170	You're using a variable called TripCount to store an ExitCount (e.g. BTC) before adding the appropriate offset. I strongly suggest using two variables.
178	There appears to be an unasserted assumption that CountType is of a wider type than the exitcount's type. The code structure does not make it clear this is true, at a minimum, please assert.
178	I'd suggest turning the pointer type check into an early return false unless you have a motivating case for the complexity. Pointer typed trip counts should be rare.

reames added inline comments.Sep 13 2021, 10:03 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
185	Please use a comparison of the exit count against maximum unsigned value instead. The optimizer will probably turn it into the same, but the form you've written has potential interaction with SCEVs flag inference which is best avoided.
193	If you use getNoopOrZeroExtend, you don't need the explicit else case.

reames requested changes to this revision.Sep 13 2021, 10:03 AM

This revision now requires changes to proceed.Sep 13 2021, 10:03 AM

address comments

@reames Thanks for your review. Updated accordingly.

llvm/lib/Analysis/TargetTransformInfo.cpp
170	Good catch
178	The code already make sure that `TripCount->getType()` not bigger than `CountType`. See line 124 ~ 125 in the same function `isHardwareLoopCandidate`. But yeah, an assertion will be more clear here.
185	This was my first thought too. We can get the maximum num for `ExitCount` and if it is less than the maximum unsigned value for that type, we can safely do the +1. Unfortunately, for the case changes, they are all not such cases. One case I was looking at is: define dso_local signext i32 @testNestedPHI(i32 signext %cond, i32 signext %count, <512 x i1>* nocapture %ptr, <16 x i8> %vc) #1 { for.body: ; preds = %for.body.preheader, %for.body %i.011 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ] %vq.110 = phi <512 x i1> [ %1, %for.body ], [ %vq.0, %for.body.preheader ] %1 = tail call <512 x i1> @llvm.ppc.mma.xvf32gernp(<512 x i1> %vq.110, <16 x i8> %vc, <16 x i8> %vc) %inc = add nuw nsw i32 %i.011, 1 %exitcond.not = icmp eq i32 %inc, %count br i1 %exitcond.not, label %for.cond.cleanup, label %for.body } SCEV for the `ExitCount` is `%count - 1`, bacause there is no wrap flag, so `getRangeRef` will return full-set as the range for `ExitCount`. For `ExitCount`, I think we should add `nuw` flag for the above SCEV expression? But that should be out of this patch's scope?
193	Good idea.

Harbormaster completed remote builds in B123775: Diff 372385.Sep 13 2021, 7:58 PM

rebase and fix Lint warnings

Harbormaster completed remote builds in B123792: Diff 372405.Sep 14 2021, 12:10 AM

reames requested changes to this revision.Sep 16 2021, 9:48 AM

reames added inline comments.

llvm/lib/Analysis/TargetTransformInfo.cpp
178	This is wrong. Please return false here instead, pointer typed trip counts should never happen.
185	To make sure I'm following you, you're saying that we manage to prove ExitCount != 0 in this example? That's interesting, I actually won't expect us to get that. I'll drop the point about not checking for zero. Just be aware there is some risk there. The good news is that we're rapidly reducing that risk via work on pr51817.
llvm/lib/CodeGen/HardwareLoops.cpp
195	Please remove the white space diff. If you want to reflow the code, do so in a separate change.
211	Same.
389–390	Hm, hadn't noticed this before. An actual trip count is never zero. This appears to have been trying to handle the overflow in BTC to TC conversion. I haven't looked at this to see what it's doing and why, but this is likely dead code now and you should understand why this was insufficient previously.

This revision now requires changes to proceed.Sep 16 2021, 9:48 AM

address comments

llvm/lib/Analysis/TargetTransformInfo.cpp
178	Thanks. I am not aware that loop exit count can not be a pointer type. I change this to assert instead of returning false.
185	Yes, the case is from the below test changes. Now we can put `+1` before the zero extend.
llvm/lib/CodeGen/HardwareLoops.cpp
195	I was doing this on purpose. Some Lint warnings are emitted for this patch after I rename `ExitCount` to `TripCount`. Should I fix the Lint warnings in this patch?
389–390	I don't think it is used to detect overflow in the original patch. It should be used to detect the possibility that the loop count can be zero. If it can not be a zero(for example, explicitly check the loop count and constant zero in the IR), some targets like ARM can use 'test and set' form loop count intrinsic, so that we can delete the explicitly check instruction in the IR. See one example in `test/Transforms/HardwareLoops/loop-guards.ll` `@test6()`. After the change, instruction `%cmp = icmp ne i32 %N, 0` has no user. For the previous bug in https://reviews.llvm.org/D91724, this check does not help. On PowerPC, there is no `test and set` form loop count intrinsic, so no matter what the result of `isLoopEntryGuardedByCond` is, `UseLoopGuard` is always false. And the loop count value may still be changed from umax_32 + 1(if we extend to u64) to 0 after D91724. At last, yes, after this patch, this check should not be done two times. I have updated the patch.

samparker added inline comments.Sep 17 2021, 12:30 AM

llvm/lib/CodeGen/HardwareLoops.cpp
389–390	There's a comment of line 417 about this. I was trying to familiarise, again, with this code the other day and it confused me... SCEV didn't seem to help much here and so we use pattern matching in CanGenerateTest to achieve what we want. But yes, looks like dead code now.

Harbormaster completed remote builds in B124340: Diff 373150.Sep 17 2021, 12:42 AM

fix lint warning

Harbormaster completed remote builds in B125468: Diff 374715.Sep 23 2021, 8:05 PM

reames mentioned this in D110587: [SCEV] Extend trip count to avoid overflow by default.Sep 27 2021, 1:59 PM

I posted a SCEV cleanup change inspired by this review. It shouldn't block this one. See https://reviews.llvm.org/D110587

llvm/lib/CodeGen/HardwareLoops.cpp
389–390	I'm not following this comment. My best guess is that you believe that the trip count of a loop which is not entered is zero. This is not necessarily true. A trip count (as returned by SCEV) is the number of times the loop header is executed if the loop is reached. As such, a trip count is always 1 or greater.

shchenz added inline comments.Sep 27 2021, 5:53 PM

llvm/lib/CodeGen/HardwareLoops.cpp
389–390	Sorry for the confusion. I was trying to explain why even with the guard `isLoopEntryGuardedByCond`, we still fail after committing D91724. The check `isLoopEntryGuardedByCond` here will not stop the hardware loop pass to convert a loop to a hardware loop. And the trip count changing from umax_32 + 1(if we extend to u64) to 0 after D91724 is not impacted by this check. So that's why we still hit the bug on PowerPC. We need to do this check when we put +1 before the extension.

Hi @reames , do you have further comments for this patch? Thanks.

thopre added a subscriber: thopre.Oct 7 2021, 4:13 PM

In D109676#3039130, @shchenz wrote:

Hi @reames , do you have further comments for this patch? Thanks.

Sorry, your last response didn't parse for me. I know the SCEV side of this, but the client pieces are going over my head.

My general suggestion is to rebase over D110587 (which I'm about to land), and cleanup any downstream handling of zero trip counts separately.

We've about hit the limit of my ability to provide useful review on this, JFYI.

reames mentioned this in rG7f55209cee55: [SCEV] Extend trip count to avoid overflow by default.Oct 11 2021, 9:56 AM

In D109676#3055431, @reames wrote:

In D109676#3039130, @shchenz wrote:

Hi @reames , do you have further comments for this patch? Thanks.

Sorry, your last response didn't parse for me. I know the SCEV side of this, but the client pieces are going over my head.

My general suggestion is to rebase over D110587 (which I'm about to land), and cleanup any downstream handling of zero trip counts separately.

We've about hit the limit of my ability to provide useful review on this, JFYI.

Hi @reames , seems for this client side, Modified getTripCountFromExitCount in D110587 can not be used. getTripCountFromExitCount always return a wider type(1 bit larger than ExitCount->type size). Because of hardware loop counter register length limitation, the ExitCount->type must be not larger than loop counter register length(For example on 64-bit PowerPC, loop counter register length is 64-bit).
So without calling the API in D110587, we can use a i64 type ExitCount, if we can prove ExitCount + 1 does not overflow by calling isLoopEntryGuardedByCond. But if we call getTripCountFromExitCount with parameter Extend=true, the i64 exitcount/tripcount will be extended to i65 which is not a right loop counter type, so we can not use this i64 type ExitCount even it will not overflow after +1.

Thoughs? Please take your time.

remove the dead code

shchenz added inline comments.Oct 13 2021, 1:58 AM

llvm/test/Transforms/HardwareLoops/scalar-while.ll
311 ↗	(On Diff #379298)	The test changes is due to `isLoopEntryGuardedByCond` can not determine that SCEV `(1 + (-1 * %N) + %i)` is at least 1 before. But now, we assume that the TripCount is at least 1 if we enter the loop. And we also check that TripCount will not overflow when we set TripCount.

Can't you teach SCEV::getAddExpr() that? (to fold zext(x)+C to zext(x+C) iff that doesn't alter the result: https://alive2.llvm.org/ce/z/5ZQ_X_)

Harbormaster completed remote builds in B128559: Diff 379298.Oct 13 2021, 2:48 AM

In D109676#3060551, @shchenz wrote:

Thoughs? Please take your time.

Proving that EC + 1 does not overflow is a proof that extension is not needed. The method has a boolean parameter for exactly that fact. It was written with exactly this case in mind.

In D109676#3060697, @lebedev.ri wrote:

Can't you teach SCEV::getAddExpr() that? (to fold zext(x)+C to zext(x+C) iff that doesn't alter the result: https://alive2.llvm.org/ce/z/5ZQ_X_)

To prove zext(x)+C not overflow, we need to pass a parameter for a specific loop to SCEV::getAddExpr(), will it be a little strange? AddExpr should not be connected with any loop.

If you still think this is a requirement, I'd be happy to do this refactor in another patch.

rebased based on D110587

shchenz added inline comments.Oct 14 2021, 6:06 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
190	Legacy code also does not consider the overflow when `ExitCount->getType()` is same with `CountType`, since it works well on all platforms, so I think it should be OK to treat it as not overflow when `SE.getTypeSizeInBits(ExitCount->getType()) == CountType->getBitWidth()`. if (!ExitCount->getType()->isPointerTy() && ExitCount->getType() != CountType) ExitCount = SE.getZeroExtendExpr(ExitCount, CountType); ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType));

Harbormaster completed remote builds in B128844: Diff 379687.Oct 14 2021, 6:47 AM

Removing self as reviewer. The SCEV change has been made and looks reasonable, the rest of the review requires context for the hardware loop detection feature I don't have.

In D109676#3065312, @reames wrote:

Removing self as reviewer. The SCEV change has been made and looks reasonable, the rest of the review requires context for the hardware loop detection feature I don't have.

Thanks very much for your review about the SCEV part @reames

@samparker Hi Sam, do you think the change in HardwareLoop pass is reasonable especially the changes in the hardware loop test cases. Thanks in advance.

samparker added inline comments.Oct 15 2021, 3:11 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
187	This comment is making me uneasy... I'm don't think target specific stuff should be referenced in this generic layer. But either way, why would a count register ever have a value of -1, especially when we're expending quite some effort to test against zero in this transform?
llvm/lib/CodeGen/HardwareLoops.cpp
421	Let's now remove this comment about isLoopEntryGuardedByCond.
llvm/test/Transforms/HardwareLoops/scalar-while.ll
311 ↗	(On Diff #379298)	These are minor regressions, but they're inline with the other previous sub-optimal cases so I think it's okay.

shchenz added inline comments.Oct 15 2021, 7:50 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
187	PowerPC don't have test and set form. On PowerPC, if the count register is 0 before being decremented, it will be wrapped to -1(all bits are 1) afterward, so we can handle the overflow case when `ExitCount` type is same with `CountType`. The total loop count is u64_MAX + 1 But on other targets like ARM, the hardware loops pass handles the overflow case for `ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType));` when `ExitCount` type is same with `CountType` with test and set form? If this is the case, yes, the comments is not right, we need to update it for different targets. But for all targets, making `SE.getAddExpr(ExitCount, SE.getOne(CountType));` as not overflow when `SE.getTypeSizeInBits(ExitCount->getType()) == CountType->getBitWidth()` should be OK. Right?
llvm/lib/CodeGen/HardwareLoops.cpp
421	Thanks, will do this later.
llvm/test/Transforms/HardwareLoops/scalar-while.ll
311 ↗	(On Diff #379298)	Thanks for confirmation.

address comments

Harbormaster completed remote builds in B129261: Diff 380273.Oct 17 2021, 9:07 PM

samparker added inline comments.Oct 18 2021, 1:43 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
187	Regardless of test.set support, all the intrinsics which set the counter use the trip count. So, if this is zero, as you're suggesting, the loop should never execute and so your counter should never be decremented to -1. What am I missing here? Could you point me at a test for an example? The test.set forms, for ARM, are just so we can try to map on to 'while' loops and we can't presume anything in this method about test.set being successfully generated. But for all targets, making SE.getAddExpr(ExitCount, SE.getOne(CountType)); as not overflow when SE.getTypeSizeInBits(ExitCount->getType()) == CountType->getBitWidth() should be OK. Right? Yes, this sounds fine and I think this would be the only condition needed for the overflow check.

shchenz added inline comments.Oct 18 2021, 6:06 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
187	Hi, let me describe the issue a little bit. Sorry for the long comment and appreciate your patience. @samparker Without any change: if (!ExitCount->getType()->isPointerTy() && ExitCount->getType() != CountType) ExitCount = SE.getZeroExtendExpr(ExitCount, CountType); ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType)); Suppose `CountType` is `i64` For ExitCount type smaller than `i64`, it will be zero-extended to i64, so there will be no overflow issue; For ExitCount type same with `i64`, it will add one directly without checking overflow. This should be functionality right since no issue was found so far. Technically speaking, `SE.getAddExpr(ExitCount, SE.getOne(CountType));` could be overflow if `ExitCount` is u64_max before adding one. This case can work right on PowerPC, because when TripCount(after adding one and overflow) is zero, the loop execution count is u64_max + 1 according to PowerPC count register semantic. So I guess other targets like ARM should have the same hardware behavior or there is some other place to handle the overflow zero case? With the change in D91724: TripCount = SE.getAddExpr(EC, SE.getOne(EC->getType())); if (!EC->getType()->isPointerTy() && EC->getType() != CountType) TripCount = SE.getZeroExtendExpr(TripCount, CountType); This is not right as we found regression pr51714 and the root cause was identified by Philip: For narrower type, like i32, if `ExitCount` is u32_max before adding one, before the change(zero extend first and the add one), the loop count is u32_max + 1 (represented in i64), but after the change, the loop count is 0(overflow on type i32). patch D91724 is not right. Now with this patch D109676: For `ExitCount` type smaller than `i64`, if we can prove that `ExitCount + 1` will not overflow, we can put +1 before zero-extension. Otherwise, we have to first do zero-extension and then do +1; For `ExitCount` type same with `i64`, according to the legacy logic, we can treat it as not-overflow even though it can be overflow technically. But like the above comments for `Without any change`, all targets can handle the overflow with some mechanism. For your question: So, if this is zero, as you're suggesting, the loop should never execute and so your counter should never be decremented to -1. What am I missing here? Could you point me at a test for an example? I think the TripCount is zero if `ExitCount` is u64_max before adding one, so hardware loop pass will set hardware count register to 0 as the result of overflow on type i64, and on PowerPC this is a valid loop, the loop execution count is u64_max + 1. No test case in hand, I did some experiments on PowerPC, set loop count register to 0 and after the first decrement (`bdnz` instruction), the loop count register is changed to u64_max(0xffffffffffffffff, -1). I think this would be the only condition needed for the overflow check. Do you mean we don't need the `isLoopEntryGuardedByCond` for the overflow check? I think `isLoopEntryGuardedByCond` is still needed for types smaller than i64, otherwise, all smaller types are treated as `WillNotOverflow` and we will not zero extend the type in `getTripCountFromExitCount`, we may still meet the same issue with what we met in patch `D91724`?

samparker added inline comments.Oct 20 2021, 4:27 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
187	Sigh, sorry, I forgot (again) that we haven't checked against ExitCount == UMAX . There's the static function 'mustBeFiniteCountedLoop' in PlaceSafePoints, maybe this could be moved somewhere so we could use it? Or has that kind of check already been tried? Sorry for my confusion and continued questioning.

shchenz added inline comments.Oct 20 2021, 4:46 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
187	Haha, don't worry about your continued questions. They are making this patch be more reasonable and clear. Here, we wouldn't get infinite loop error at least on PowerPC target. When the initial loop count register is 0, the loop execution count is u64_max + 1. A typical pattern for a hardware loop on PowerPC is like this: mtctr $gpr ; move the loop count from $gpr to loop count regsiter CTR loop_start: ; start of loop header ... ; loop body ... bdnz loop_start ; loop latch The semantic of the PowerPC `bdnz` instructions is "Decrement CTR and branch if it is still nonzero". So if the initial value is 0, after the first decrement, the loop count register is changed to `0xffffffffffffffff`. and then after decrement 0xffffffffffffffff times, it will be changed to zero and then the loop is exiting. Do we meet infinite loop issue on ARM? I think here we only care about overflow.

samparker added inline comments.Oct 20 2021, 7:27 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
187	Okay, so I've tried going through the Arm spec and it's not immediately clear to me, so I have asked for some clarification. I suspect our behaviour is the same, but I will confirm once I know.

samparker added inline comments.Oct 21 2021, 12:08 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
187	So, unfortunately it looks like the behaviour isn't the same as PPC. We only decrement the counter, and perform the backwards branch, if the counter has a value > 1. So setting our loop counter to a value of either zero or one, will result in a single trip and no backedge taken.

I would like to start a top-level discussion so everything isn't lost within the review comments, and pull in the Arm people who still work on this... @dmgreen @samtebbs

It appears as though PPC and Arm have different semantics for their loop backedge control. To summarize @shchenz, PPC will 'Decrement CTR and branch if it is still nonzero'. On Arm, AFAICT, we don't perform the decrement when the counter (LR) is <= 1.

So, do we need to tighten the semantics of the loop intrinsics for their overflow behaviour? Arm and PPC use different intrinsics so I can't see why defining them slightly differently would anyone.

shchenz added inline comments.Oct 21 2021, 1:29 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
187	If so, would it be a potential issue even without this patch? if (!ExitCount->getType()->isPointerTy() && ExitCount->getType() != CountType) ExitCount = SE.getZeroExtendExpr(ExitCount, CountType); ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType)); If `ExitCount` type is same with `CountType` and `ExitCount` is UMAX before +1, after +1, it will be overflow and set to 0. For this case, does current behavior match ARM's hardware loop's expection?

samparker added inline comments.Oct 21 2021, 3:13 AM

llvm/lib/Analysis/TargetTransformInfo.cpp
187	Yes, it will cause us problems, which is why I think we probably need some better intrinsic semantics and varying logic here. With the current representation in the loop, we can't enter the loop with the counter (trip count) set to zero. This sounds, to me, rather sensible. I thought (and seemingly keep forgetting that it's not true) that this transform found 'countable' loops which would terminate, so that a trip count could have a max value of UMAX but ExitCount (BTC) could only ever reach UMAX - 1.

Everything tends to be i32's in Arm, so we only rarely see extends in the induction variables in the Hardware loop pass. Even if the IV was previously in i8/i16 it will likely already have been extended to a i32 by indvarsimplify.

In D109676#3077185, @samparker wrote:

I would like to start a top-level discussion so everything isn't lost within the review comments, and pull in the Arm people who still work on this... @dmgreen @samtebbs

It appears as though PPC and Arm have different semantics for their loop backedge control. To summarize @shchenz, PPC will 'Decrement CTR and branch if it is still nonzero'. On Arm, AFAICT, we don't perform the decrement when the counter (LR) is <= 1.

So, do we need to tighten the semantics of the loop intrinsics for their overflow behaviour? Arm and PPC use different intrinsics so I can't see why defining them slightly differently would anyone.

It's true we don't need to add 1 to the loop counter (if I'm understanding this correctly). When running our downstream benchmarks the only changes I saw from this patch is instructions that were in the preheader are now in the pre-preheader (the guard block). That makes things better or worse by ~1%, depending on the benchmark.

llvm/test/Transforms/HardwareLoops/scalar-while.ll
311 ↗	(On Diff #379298)	It's only small, but it does mean we execute more instructions if the branch is taken..

In D109676#3083540, @dmgreen wrote:

Everything tends to be i32's in Arm, so we only rarely see extends in the induction variables in the Hardware loop pass. Even if the IV was previously in i8/i16 it will likely already have been extended to a i32 by indvarsimplify.

Thanks, that is the possible reason why we don't see any regressions after patch D91724 on ARM.

It's true we don't need to add 1 to the loop counter (if I'm understanding this correctly). When running our downstream benchmarks the only changes I saw from this patch is instructions that were in the preheader are now in the pre-preheader (the guard block). That makes things better or worse by ~1%, depending on the benchmark.

Thanks for your test on ARM. The performance impact should be small as it only affects the loop count expansion outside of the loop.

So for this patch, we need to wait for ARM experts find out the solution for overflow issue when ExitCount type is same with CountType?

llvm/test/Transforms/HardwareLoops/scalar-while.ll
311 ↗	(On Diff #379298)	Putting the loop count in pre-preheader is required by `UseLoopGuard = true`? If so, this should be positive, because now we use more loop guards?

shchenz added a reviewer: dmgreen.Oct 25 2021, 6:49 AM

So for this patch, we need to wait for ARM experts find out the solution for the overflow issue when ExitCount type is same with CountType?

Not sure if there is any solution for the above issue? @samparker @dmgreen .

This patch is required by the PowerPC as it can generate better instructions shown in the test changes. And this should not introduce any regression on ARM, because the codes before this patch also have the potential issue. See https://reviews.llvm.org/D109676#3077240

So, is it possible to get this in first? Thanks in advance.

gentle ping

As I said previously, the behaviour of the intrinsics should better defined and then this transform should adhere to those semantics. We're already using different pairs of intrinsics for each backend, so this shouldn't be a problem; however, we may have to ensure the Arm backend rejects the 'old' set.loop.iterations just to be safe.

Revision Contents

Path

Size

llvm/

include/

llvm/

Analysis/

TargetTransformInfo.h

2 lines

lib/

Analysis/

TargetTransformInfo.cpp

21 lines

CodeGen/

HardwareLoops.cpp

25 lines

test/

CodeGen/

PowerPC/

loop-instr-prep-non-const-increasement.ll

24 lines

mma-phi-accs.ll

8 lines

Diff 372191

llvm/include/llvm/Analysis/TargetTransformInfo.h

	Show First 20 Lines • Show All 92 Lines • ▼ Show 20 Lines

	/// Attributes of a target dependent hardware loop.			/// Attributes of a target dependent hardware loop.
	struct HardwareLoopInfo {			struct HardwareLoopInfo {
	HardwareLoopInfo() = delete;			HardwareLoopInfo() = delete;
	HardwareLoopInfo(Loop *L) : L(L) {}			HardwareLoopInfo(Loop *L) : L(L) {}
	Loop *L = nullptr;			Loop *L = nullptr;
	BasicBlock *ExitBlock = nullptr;			BasicBlock *ExitBlock = nullptr;
	BranchInst *ExitBranch = nullptr;			BranchInst *ExitBranch = nullptr;
	const SCEV *ExitCount = nullptr;			const SCEV *TripCount = nullptr;
	IntegerType *CountType = nullptr;			IntegerType *CountType = nullptr;
	Value *LoopDecrement = nullptr; // Decrement the loop counter by this			Value *LoopDecrement = nullptr; // Decrement the loop counter by this
	// value in every iteration.			// value in every iteration.
	bool IsNestingLegal = false; // Can a hardware loop be a parent to			bool IsNestingLegal = false; // Can a hardware loop be a parent to
	// another hardware loop?			// another hardware loop?
	bool CounterInReg = false; // Should loop counter be updated in			bool CounterInReg = false; // Should loop counter be updated in
	// the loop via a phi?			// the loop via a phi?
	bool PerformEntryTest = false; // Generate the intrinsic which also performs			bool PerformEntryTest = false; // Generate the intrinsic which also performs
	▲ Show 20 Lines • Show All 2,306 Lines • Show Last 20 Lines

llvm/lib/Analysis/TargetTransformInfo.cpp

Show First 20 Lines • Show All 161 Lines • ▼ Show 20 Lines	if (BranchInst *BI = dyn_cast<BranchInst>(TI)) {

ExitBranch = BI;		ExitBranch = BI;
} else		} else
continue;		continue;

// Note that this block may not be the loop latch block, even if the loop		// Note that this block may not be the loop latch block, even if the loop
// has a latch block.		// has a latch block.
ExitBlock = BB;		ExitBlock = BB;
ExitCount = EC;		TripCount = EC;
		reamesUnsubmitted Done Reply Inline Actions You're using a variable called TripCount to store an ExitCount (e.g. BTC) before adding the appropriate offset. I strongly suggest using two variables. reames: You're using a variable called TripCount to store an ExitCount (e.g. BTC) before adding the…
		shchenzAuthorUnsubmitted Done Reply Inline Actions Good catch shchenz: Good catch
break;		break;
}		}

if (!ExitBlock)		if (!ExitBlock)
return false;		return false;

		bool NeedExtend =
		!TripCount->getType()->isPointerTy() && TripCount->getType() != CountType;
		reamesUnsubmitted Not Done Reply Inline Actions There appears to be an unasserted assumption that CountType is of a wider type than the exitcount's type. The code structure does not make it clear this is true, at a minimum, please assert. reames: There appears to be an unasserted assumption that CountType is of a wider type than the…
		shchenzAuthorUnsubmitted Done Reply Inline Actions The code already make sure that `TripCount->getType()` not bigger than `CountType`. See line 124 ~ 125 in the same function `isHardwareLoopCandidate`. But yeah, an assertion will be more clear here. shchenz: The code already make sure that `TripCount->getType()` not bigger than `CountType`. See line…
		reamesUnsubmitted Done Reply Inline Actions I'd suggest turning the pointer type check into an early return false unless you have a motivating case for the complexity. Pointer typed trip counts should be rare. reames: I'd suggest turning the pointer type check into an early return false unless you have a…
		reamesUnsubmitted Not Done Reply Inline Actions This is wrong. Please return false here instead, pointer typed trip counts should never happen. reames: This is wrong. Please return false here instead, pointer typed trip counts should never happen.
		shchenzAuthorUnsubmitted Done Reply Inline Actions Thanks. I am not aware that loop exit count can not be a pointer type. I change this to assert instead of returning false. shchenz: Thanks. I am not aware that loop exit count can not be a pointer type. I change this to assert…
		// If we are going to extend the TripCount and we can prove that TripCount + 1
		// doesn't overflow, we can first do TripCount + 1 to do some folding related
		// to -1 and +1.
		if (NeedExtend &&
		SE.isLoopEntryGuardedByCond(
		L, ICmpInst::ICMP_NE,
		SE.getAddExpr(TripCount, SE.getOne(TripCount->getType())),
		reamesUnsubmitted Not Done Reply Inline Actions Please use a comparison of the exit count against maximum unsigned value instead. The optimizer will probably turn it into the same, but the form you've written has potential interaction with SCEVs flag inference which is best avoided. reames: Please use a comparison of the exit count against maximum unsigned value instead. The…
		shchenzAuthorUnsubmitted Done Reply Inline Actions This was my first thought too. We can get the maximum num for `ExitCount` and if it is less than the maximum unsigned value for that type, we can safely do the +1. Unfortunately, for the case changes, they are all not such cases. One case I was looking at is: define dso_local signext i32 @testNestedPHI(i32 signext %cond, i32 signext %count, <512 x i1>* nocapture %ptr, <16 x i8> %vc) #1 { for.body: ; preds = %for.body.preheader, %for.body %i.011 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ] %vq.110 = phi <512 x i1> [ %1, %for.body ], [ %vq.0, %for.body.preheader ] %1 = tail call <512 x i1> @llvm.ppc.mma.xvf32gernp(<512 x i1> %vq.110, <16 x i8> %vc, <16 x i8> %vc) %inc = add nuw nsw i32 %i.011, 1 %exitcond.not = icmp eq i32 %inc, %count br i1 %exitcond.not, label %for.cond.cleanup, label %for.body } SCEV for the `ExitCount` is `%count - 1`, bacause there is no wrap flag, so `getRangeRef` will return full-set as the range for `ExitCount`. For `ExitCount`, I think we should add `nuw` flag for the above SCEV expression? But that should be out of this patch's scope? shchenz: This was my first thought too. We can get the maximum num for `ExitCount` and if it is less…
		reamesUnsubmitted Done Reply Inline Actions To make sure I'm following you, you're saying that we manage to prove ExitCount != 0 in this example? That's interesting, I actually won't expect us to get that. I'll drop the point about not checking for zero. Just be aware there is some risk there. The good news is that we're rapidly reducing that risk via work on pr51817. reames: To make sure I'm following you, you're saying that we manage to prove ExitCount != 0 in this…
		shchenzAuthorUnsubmitted Done Reply Inline Actions Yes, the case is from the below test changes. Now we can put `+1` before the zero extend. shchenz: Yes, the case is from the below test changes. Now we can put `+1` before the zero extend.
		SE.getZero(TripCount->getType())))
		TripCount = SE.getZeroExtendExpr(
		samparkerUnsubmitted Not Done Reply Inline Actions This comment is making me uneasy... I'm don't think target specific stuff should be referenced in this generic layer. But either way, why would a count register ever have a value of -1, especially when we're expending quite some effort to test against zero in this transform? samparker: This comment is making me uneasy... I'm don't think target specific stuff should be referenced…
		shchenzAuthorUnsubmitted Done Reply Inline Actions PowerPC don't have test and set form. On PowerPC, if the count register is 0 before being decremented, it will be wrapped to -1(all bits are 1) afterward, so we can handle the overflow case when `ExitCount` type is same with `CountType`. The total loop count is u64_MAX + 1 But on other targets like ARM, the hardware loops pass handles the overflow case for `ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType));` when `ExitCount` type is same with `CountType` with test and set form? If this is the case, yes, the comments is not right, we need to update it for different targets. But for all targets, making `SE.getAddExpr(ExitCount, SE.getOne(CountType));` as not overflow when `SE.getTypeSizeInBits(ExitCount->getType()) == CountType->getBitWidth()` should be OK. Right? shchenz: PowerPC don't have test and set form. On PowerPC, if the count register is 0 before being…
		samparkerUnsubmitted Not Done Reply Inline Actions Regardless of test.set support, all the intrinsics which set the counter use the trip count. So, if this is zero, as you're suggesting, the loop should never execute and so your counter should never be decremented to -1. What am I missing here? Could you point me at a test for an example? The test.set forms, for ARM, are just so we can try to map on to 'while' loops and we can't presume anything in this method about test.set being successfully generated. But for all targets, making SE.getAddExpr(ExitCount, SE.getOne(CountType)); as not overflow when SE.getTypeSizeInBits(ExitCount->getType()) == CountType->getBitWidth() should be OK. Right? Yes, this sounds fine and I think this would be the only condition needed for the overflow check. samparker: Regardless of test.set support, all the intrinsics which set the counter use the trip count. So…
		shchenzAuthorUnsubmitted Done Reply Inline Actions Hi, let me describe the issue a little bit. Sorry for the long comment and appreciate your patience. @samparker Without any change: if (!ExitCount->getType()->isPointerTy() && ExitCount->getType() != CountType) ExitCount = SE.getZeroExtendExpr(ExitCount, CountType); ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType)); Suppose `CountType` is `i64` For ExitCount type smaller than `i64`, it will be zero-extended to i64, so there will be no overflow issue; For ExitCount type same with `i64`, it will add one directly without checking overflow. This should be functionality right since no issue was found so far. Technically speaking, `SE.getAddExpr(ExitCount, SE.getOne(CountType));` could be overflow if `ExitCount` is u64_max before adding one. This case can work right on PowerPC, because when TripCount(after adding one and overflow) is zero, the loop execution count is u64_max + 1 according to PowerPC count register semantic. So I guess other targets like ARM should have the same hardware behavior or there is some other place to handle the overflow zero case? With the change in D91724: TripCount = SE.getAddExpr(EC, SE.getOne(EC->getType())); if (!EC->getType()->isPointerTy() && EC->getType() != CountType) TripCount = SE.getZeroExtendExpr(TripCount, CountType); This is not right as we found regression pr51714 and the root cause was identified by Philip: For narrower type, like i32, if `ExitCount` is u32_max before adding one, before the change(zero extend first and the add one), the loop count is u32_max + 1 (represented in i64), but after the change, the loop count is 0(overflow on type i32). patch D91724 is not right. Now with this patch D109676: For `ExitCount` type smaller than `i64`, if we can prove that `ExitCount + 1` will not overflow, we can put +1 before zero-extension. Otherwise, we have to first do zero-extension and then do +1; For `ExitCount` type same with `i64`, according to the legacy logic, we can treat it as not-overflow even though it can be overflow technically. But like the above comments for `Without any change`, all targets can handle the overflow with some mechanism. For your question: So, if this is zero, as you're suggesting, the loop should never execute and so your counter should never be decremented to -1. What am I missing here? Could you point me at a test for an example? I think the TripCount is zero if `ExitCount` is u64_max before adding one, so hardware loop pass will set hardware count register to 0 as the result of overflow on type i64, and on PowerPC this is a valid loop, the loop execution count is u64_max + 1. No test case in hand, I did some experiments on PowerPC, set loop count register to 0 and after the first decrement (`bdnz` instruction), the loop count register is changed to u64_max(0xffffffffffffffff, -1). I think this would be the only condition needed for the overflow check. Do you mean we don't need the `isLoopEntryGuardedByCond` for the overflow check? I think `isLoopEntryGuardedByCond` is still needed for types smaller than i64, otherwise, all smaller types are treated as `WillNotOverflow` and we will not zero extend the type in `getTripCountFromExitCount`, we may still meet the same issue with what we met in patch `D91724`? shchenz: Hi, let me describe the issue a little bit. Sorry for the long comment and appreciate your…
		samparkerUnsubmitted Not Done Reply Inline Actions Sigh, sorry, I forgot (again) that we haven't checked against ExitCount == UMAX . There's the static function 'mustBeFiniteCountedLoop' in PlaceSafePoints, maybe this could be moved somewhere so we could use it? Or has that kind of check already been tried? Sorry for my confusion and continued questioning. samparker: Sigh, sorry, I forgot (again) that we haven't checked against ExitCount == UMAX . There's the…
		shchenzAuthorUnsubmitted Done Reply Inline Actions Haha, don't worry about your continued questions. They are making this patch be more reasonable and clear. Here, we wouldn't get infinite loop error at least on PowerPC target. When the initial loop count register is 0, the loop execution count is u64_max + 1. A typical pattern for a hardware loop on PowerPC is like this: mtctr $gpr ; move the loop count from $gpr to loop count regsiter CTR loop_start: ; start of loop header ... ; loop body ... bdnz loop_start ; loop latch The semantic of the PowerPC `bdnz` instructions is "Decrement CTR and branch if it is still nonzero". So if the initial value is 0, after the first decrement, the loop count register is changed to `0xffffffffffffffff`. and then after decrement 0xffffffffffffffff times, it will be changed to zero and then the loop is exiting. Do we meet infinite loop issue on ARM? I think here we only care about overflow. shchenz: Haha, don't worry about your continued questions. They are making this patch be more reasonable…
		samparkerUnsubmitted Not Done Reply Inline Actions Okay, so I've tried going through the Arm spec and it's not immediately clear to me, so I have asked for some clarification. I suspect our behaviour is the same, but I will confirm once I know. samparker: Okay, so I've tried going through the Arm spec and it's not immediately clear to me, so I have…
		samparkerUnsubmitted Not Done Reply Inline Actions So, unfortunately it looks like the behaviour isn't the same as PPC. We only decrement the counter, and perform the backwards branch, if the counter has a value > 1. So setting our loop counter to a value of either zero or one, will result in a single trip and no backedge taken. samparker: So, unfortunately it looks like the behaviour isn't the same as PPC. We only decrement the…
		shchenzAuthorUnsubmitted Done Reply Inline Actions If so, would it be a potential issue even without this patch? if (!ExitCount->getType()->isPointerTy() && ExitCount->getType() != CountType) ExitCount = SE.getZeroExtendExpr(ExitCount, CountType); ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType)); If `ExitCount` type is same with `CountType` and `ExitCount` is UMAX before +1, after +1, it will be overflow and set to 0. For this case, does current behavior match ARM's hardware loop's expection? shchenz: If so, would it be a potential issue even without this patch? ``` if (!ExitCount->getType()…
		samparkerUnsubmitted Not Done Reply Inline Actions Yes, it will cause us problems, which is why I think we probably need some better intrinsic semantics and varying logic here. With the current representation in the loop, we can't enter the loop with the counter (trip count) set to zero. This sounds, to me, rather sensible. I thought (and seemingly keep forgetting that it's not true) that this transform found 'countable' loops which would terminate, so that a trip count could have a max value of UMAX but ExitCount (BTC) could only ever reach UMAX - 1. samparker: Yes, it will cause us problems, which is why I think we probably need some better intrinsic…
		SE.getAddExpr(TripCount, SE.getOne(TripCount->getType())), CountType);
		else if (NeedExtend)
		TripCount = SE.getAddExpr(SE.getZeroExtendExpr(TripCount, CountType),
		shchenzAuthorUnsubmitted Done Reply Inline Actions Legacy code also does not consider the overflow when `ExitCount->getType()` is same with `CountType`, since it works well on all platforms, so I think it should be OK to treat it as not overflow when `SE.getTypeSizeInBits(ExitCount->getType()) == CountType->getBitWidth()`. if (!ExitCount->getType()->isPointerTy() && ExitCount->getType() != CountType) ExitCount = SE.getZeroExtendExpr(ExitCount, CountType); ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType)); shchenz: Legacy code also does not consider the overflow when `ExitCount->getType()` is same with…
		SE.getOne(CountType));
		else
		TripCount = SE.getAddExpr(TripCount, SE.getOne(TripCount->getType()));
		reamesUnsubmitted Done Reply Inline Actions If you use getNoopOrZeroExtend, you don't need the explicit else case. reames: If you use getNoopOrZeroExtend, you don't need the explicit else case.
		shchenzAuthorUnsubmitted Done Reply Inline Actions Good idea. shchenz: Good idea.

return true;		return true;
}		}

TargetTransformInfo::TargetTransformInfo(const DataLayout &DL)		TargetTransformInfo::TargetTransformInfo(const DataLayout &DL)
: TTIImpl(new Model<NoTTIImpl>(NoTTIImpl(DL))) {}		: TTIImpl(new Model<NoTTIImpl>(NoTTIImpl(DL))) {}

TargetTransformInfo::~TargetTransformInfo() {}		TargetTransformInfo::~TargetTransformInfo() {}

▲ Show 20 Lines • Show All 990 Lines • Show Last 20 Lines

llvm/lib/CodeGen/HardwareLoops.cpp

Show First 20 Lines • Show All 178 Lines • ▼ Show 20 Lines	class HardwareLoop {
PHINode InsertPHICounter(Value NumElts, Value *EltsRem);		PHINode InsertPHICounter(Value NumElts, Value *EltsRem);

// Create a new cmp, that checks the returned value of loop_decrement*,		// Create a new cmp, that checks the returned value of loop_decrement*,
// and update the exit branch to use it.		// and update the exit branch to use it.
void UpdateBranch(Value *EltsRem);		void UpdateBranch(Value *EltsRem);

public:		public:
HardwareLoop(HardwareLoopInfo &Info, ScalarEvolution &SE,		HardwareLoop(HardwareLoopInfo &Info, ScalarEvolution &SE,
const DataLayout &DL,		const DataLayout &DL,
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - const DataLayout &DL, - OptimizationRemarkEmitter ORE) : - SE(SE), DL(DL), ORE(ORE), L(Info.L), M(L->getHeader()->getModule()), - TripCount(Info.TripCount), - CountType(Info.CountType), - ExitBranch(Info.ExitBranch), - LoopDecrement(Info.LoopDecrement), - UsePHICounter(Info.CounterInReg), - UseLoopGuard(Info.PerformEntryTest) { } + const DataLayout &DL, OptimizationRemarkEmitter ORE) 5 diff lines are omitted. See full path. Lint: Pre-merge checks: clang-format: please reformat the code ``` - const DataLayout &DL…
OptimizationRemarkEmitter *ORE) :		OptimizationRemarkEmitter *ORE) :
SE(SE), DL(DL), ORE(ORE), L(Info.L), M(L->getHeader()->getModule()),		SE(SE), DL(DL), ORE(ORE), L(Info.L), M(L->getHeader()->getModule()),
ExitCount(Info.ExitCount),		TripCount(Info.TripCount),
CountType(Info.CountType),		CountType(Info.CountType),
ExitBranch(Info.ExitBranch),		ExitBranch(Info.ExitBranch),
LoopDecrement(Info.LoopDecrement),		LoopDecrement(Info.LoopDecrement),
UsePHICounter(Info.CounterInReg),		UsePHICounter(Info.CounterInReg),
UseLoopGuard(Info.PerformEntryTest) { }		UseLoopGuard(Info.PerformEntryTest) { }
		reamesUnsubmitted Not Done Reply Inline Actions Please remove the white space diff. If you want to reflow the code, do so in a separate change. reames: Please remove the white space diff. If you want to reflow the code, do so in a separate change.
		shchenzAuthorUnsubmitted Done Reply Inline Actions I was doing this on purpose. Some Lint warnings are emitted for this patch after I rename `ExitCount` to `TripCount`. Should I fix the Lint warnings in this patch? shchenz: I was doing this on purpose. Some Lint warnings are emitted for this patch after I rename…

void Create();		void Create();

private:		private:
ScalarEvolution &SE;		ScalarEvolution &SE;
const DataLayout &DL;		const DataLayout &DL;
OptimizationRemarkEmitter *ORE = nullptr;		OptimizationRemarkEmitter *ORE = nullptr;
Loop *L = nullptr;		Loop *L = nullptr;
Module *M = nullptr;		Module *M = nullptr;
const SCEV *ExitCount = nullptr;		const SCEV *TripCount = nullptr;
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - const SCEV TripCount = nullptr; + const SCEV TripCount = nullptr; Lint: Pre-merge checks: clang-format: please reformat the code ``` - const SCEV *TripCount = nullptr; + const…
Type *CountType = nullptr;		Type *CountType = nullptr;
BranchInst *ExitBranch = nullptr;		BranchInst *ExitBranch = nullptr;
Value *LoopDecrement = nullptr;		Value *LoopDecrement = nullptr;
bool UsePHICounter = false;		bool UsePHICounter = false;
bool UseLoopGuard = false;		bool UseLoopGuard = false;
BasicBlock *BeginBB = nullptr;		BasicBlock *BeginBB = nullptr;
		reamesUnsubmitted Not Done Reply Inline Actions Same. reames: Same.
};		};
}		}

char HardwareLoops::ID = 0;		char HardwareLoops::ID = 0;

bool HardwareLoops::runOnFunction(Function &F) {		bool HardwareLoops::runOnFunction(Function &F) {
if (skipFunction(F))		if (skipFunction(F))
return false;		return false;
▲ Show 20 Lines • Show All 71 Lines • ▼ Show 20 Lines	if (!HWLoopInfo.isHardwareLoopCandidate(SE, LI, *DT, ForceNestedLoop,
// TODO: there can be many reasons a loop is not considered a		// TODO: there can be many reasons a loop is not considered a
// candidate, so we should let isHardwareLoopCandidate fill in the		// candidate, so we should let isHardwareLoopCandidate fill in the
// reason and then report a better message here.		// reason and then report a better message here.
reportHWLoopFailure("loop is not a candidate", "HWLoopNoCandidate", ORE, L);		reportHWLoopFailure("loop is not a candidate", "HWLoopNoCandidate", ORE, L);
return false;		return false;
}		}

assert(		assert(
(HWLoopInfo.ExitBlock && HWLoopInfo.ExitBranch && HWLoopInfo.ExitCount) &&		(HWLoopInfo.ExitBlock && HWLoopInfo.ExitBranch && HWLoopInfo.TripCount) &&
"Hardware Loop must have set exit info.");		"Hardware Loop must have set exit info.");

BasicBlock *Preheader = L->getLoopPreheader();		BasicBlock *Preheader = L->getLoopPreheader();

// If we don't have a preheader, then insert one.		// If we don't have a preheader, then insert one.
if (!Preheader)		if (!Preheader)
Preheader = InsertPreheaderForLoop(L, DT, LI, nullptr, PreserveLCSSA);		Preheader = InsertPreheaderForLoop(L, DT, LI, nullptr, PreserveLCSSA);
if (!Preheader)		if (!Preheader)
▲ Show 20 Lines • Show All 68 Lines • ▼ Show 20 Lines
}		}

Value *HardwareLoop::InitLoopCount() {		Value *HardwareLoop::InitLoopCount() {
LLVM_DEBUG(dbgs() << "HWLoops: Initialising loop counter value:\n");		LLVM_DEBUG(dbgs() << "HWLoops: Initialising loop counter value:\n");
// Can we replace a conditional branch with an intrinsic that sets the		// Can we replace a conditional branch with an intrinsic that sets the
// loop counter and tests that is not zero?		// loop counter and tests that is not zero?

SCEVExpander SCEVE(SE, DL, "loopcnt");		SCEVExpander SCEVE(SE, DL, "loopcnt");
if (!ExitCount->getType()->isPointerTy() &&
ExitCount->getType() != CountType)
ExitCount = SE.getZeroExtendExpr(ExitCount, CountType);

ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType));

// If we're trying to use the 'test and set' form of the intrinsic, we need		// If we're trying to use the 'test and set' form of the intrinsic, we need
// to replace a conditional branch that is controlling entry to the loop. It		// to replace a conditional branch that is controlling entry to the loop. It
// is likely (guaranteed?) that the preheader has an unconditional branch to		// is likely (guaranteed?) that the preheader has an unconditional branch to
// the loop header, so also check if it has a single predecessor.		// the loop header, so also check if it has a single predecessor.
if (SE.isLoopEntryGuardedByCond(L, ICmpInst::ICMP_NE, ExitCount,		if (SE.isLoopEntryGuardedByCond(L, ICmpInst::ICMP_NE, TripCount,
SE.getZero(ExitCount->getType()))) {		SE.getZero(TripCount->getType()))) {
		reamesUnsubmitted Not Done Reply Inline Actions Hm, hadn't noticed this before. An actual trip count is never zero. This appears to have been trying to handle the overflow in BTC to TC conversion. I haven't looked at this to see what it's doing and why, but this is likely dead code now and you should understand why this was insufficient previously. reames: Hm, hadn't noticed this before. An actual trip count is never zero. This appears to have been…
		shchenzAuthorUnsubmitted Done Reply Inline Actions I don't think it is used to detect overflow in the original patch. It should be used to detect the possibility that the loop count can be zero. If it can not be a zero(for example, explicitly check the loop count and constant zero in the IR), some targets like ARM can use 'test and set' form loop count intrinsic, so that we can delete the explicitly check instruction in the IR. See one example in `test/Transforms/HardwareLoops/loop-guards.ll` `@test6()`. After the change, instruction `%cmp = icmp ne i32 %N, 0` has no user. For the previous bug in https://reviews.llvm.org/D91724, this check does not help. On PowerPC, there is no `test and set` form loop count intrinsic, so no matter what the result of `isLoopEntryGuardedByCond` is, `UseLoopGuard` is always false. And the loop count value may still be changed from umax_32 + 1(if we extend to u64) to 0 after D91724. At last, yes, after this patch, this check should not be done two times. I have updated the patch. shchenz: I don't think it is used to detect overflow in the original patch. It should be used to detect…
		samparkerUnsubmitted Not Done Reply Inline Actions There's a comment of line 417 about this. I was trying to familiarise, again, with this code the other day and it confused me... SCEV didn't seem to help much here and so we use pattern matching in CanGenerateTest to achieve what we want. But yes, looks like dead code now. samparker: There's a comment of line 417 about this. I was trying to familiarise, again, with this code…
		reamesUnsubmitted Not Done Reply Inline Actions I'm not following this comment. My best guess is that you believe that the trip count of a loop which is not entered is zero. This is not necessarily true. A trip count (as returned by SCEV) is the number of times the loop header is executed if the loop is reached. As such, a trip count is always 1 or greater. reames: I'm not following this comment. My best guess is that you believe that the trip count of a…
		shchenzAuthorUnsubmitted Done Reply Inline Actions Sorry for the confusion. I was trying to explain why even with the guard `isLoopEntryGuardedByCond`, we still fail after committing D91724. The check `isLoopEntryGuardedByCond` here will not stop the hardware loop pass to convert a loop to a hardware loop. And the trip count changing from umax_32 + 1(if we extend to u64) to 0 after D91724 is not impacted by this check. So that's why we still hit the bug on PowerPC. We need to do this check when we put +1 before the extension. shchenz: Sorry for the confusion. I was trying to explain why even with the guard…
LLVM_DEBUG(dbgs() << " - Attempting to use test.set counter.\n");		LLVM_DEBUG(dbgs() << " - Attempting to use test.set counter.\n");
UseLoopGuard \|= ForceGuardLoopEntry;		UseLoopGuard \|= ForceGuardLoopEntry;
} else		} else
UseLoopGuard = false;		UseLoopGuard = false;

BasicBlock *BB = L->getLoopPreheader();		BasicBlock *BB = L->getLoopPreheader();
if (UseLoopGuard && BB->getSinglePredecessor() &&		if (UseLoopGuard && BB->getSinglePredecessor() &&
cast<BranchInst>(BB->getTerminator())->isUnconditional()) {		cast<BranchInst>(BB->getTerminator())->isUnconditional()) {
BasicBlock *Predecessor = BB->getSinglePredecessor();		BasicBlock *Predecessor = BB->getSinglePredecessor();
// If it's not safe to create a while loop then don't force it and create a		// If it's not safe to create a while loop then don't force it and create a
// do-while loop instead		// do-while loop instead
if (!isSafeToExpandAt(ExitCount, Predecessor->getTerminator(), SE))		if (!isSafeToExpandAt(TripCount, Predecessor->getTerminator(), SE))
UseLoopGuard = false;		UseLoopGuard = false;
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - UseLoopGuard = false; + UseLoopGuard = false; Lint: Pre-merge checks: clang-format: please reformat the code ``` - UseLoopGuard = false; + UseLoopGuard =…
else		else
BB = Predecessor;		BB = Predecessor;
}		}

if (!isSafeToExpandAt(ExitCount, BB->getTerminator(), SE)) {		if (!isSafeToExpandAt(TripCount, BB->getTerminator(), SE)) {
LLVM_DEBUG(dbgs() << "- Bailing, unsafe to expand ExitCount "		LLVM_DEBUG(dbgs() << "- Bailing, unsafe to expand TripCount "
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - LLVM_DEBUG(dbgs() << "- Bailing, unsafe to expand TripCount " - << TripCount << "\n"); + LLVM_DEBUG(dbgs() << "- Bailing, unsafe to expand TripCount " << TripCount + << "\n"); Lint: Pre-merge checks: clang-format: please reformat the code ``` - LLVM_DEBUG(dbgs() << "- Bailing, unsafe to…
<< *ExitCount << "\n");		<< *TripCount << "\n");
return nullptr;		return nullptr;
}		}

Value *Count = SCEVE.expandCodeFor(ExitCount, CountType,		Value *Count = SCEVE.expandCodeFor(TripCount, CountType,
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - Value Count = SCEVE.expandCodeFor(TripCount, CountType, - BB->getTerminator()); + Value Count = SCEVE.expandCodeFor(TripCount, CountType, BB->getTerminator()); Lint: Pre-merge checks: clang-format: please reformat the code ``` - Value *Count = SCEVE.expandCodeFor(TripCount…
BB->getTerminator());		BB->getTerminator());

// FIXME: We've expanded Count where we hope to insert the counter setting		// FIXME: We've expanded Count where we hope to insert the counter setting
// intrinsic. But, in the case of the 'test and set' form, we may fallback to		// intrinsic. But, in the case of the 'test and set' form, we may fallback to
// the just 'set' form and in which case the insertion block is most likely		// the just 'set' form and in which case the insertion block is most likely
// different. It means there will be instruction(s) in a block that possibly		// different. It means there will be instruction(s) in a block that possibly
// aren't needed. The isLoopEntryGuardedByCond is trying to avoid this issue,		// aren't needed. The isLoopEntryGuardedByCond is trying to avoid this issue,
		samparkerUnsubmitted Not Done Reply Inline Actions Let's now remove this comment about isLoopEntryGuardedByCond. samparker: Let's now remove this comment about isLoopEntryGuardedByCond.
		shchenzAuthorUnsubmitted Done Reply Inline Actions Thanks, will do this later. shchenz: Thanks, will do this later.
// but it's doesn't appear to work in all cases.		// but it's doesn't appear to work in all cases.

UseLoopGuard = UseLoopGuard && CanGenerateTest(L, Count);		UseLoopGuard = UseLoopGuard && CanGenerateTest(L, Count);
BeginBB = UseLoopGuard ? BB : L->getLoopPreheader();		BeginBB = UseLoopGuard ? BB : L->getLoopPreheader();
LLVM_DEBUG(dbgs() << " - Loop Count: " << *Count << "\n"		LLVM_DEBUG(dbgs() << " - Loop Count: " << *Count << "\n"
<< " - Expanded Count in " << BB->getName() << "\n"		<< " - Expanded Count in " << BB->getName() << "\n"
<< " - Will insert set counter intrinsic into: "		<< " - Will insert set counter intrinsic into: "
<< BeginBB->getName() << "\n");		<< BeginBB->getName() << "\n");
▲ Show 20 Lines • Show All 106 Lines • Show Last 20 Lines

llvm/test/CodeGen/PowerPC/loop-instr-prep-non-const-increasement.ll

	Show All 16 Lines
	; }			; }

	define i64 @foo(i8* %p, i32 signext %n, i32 signext %count) {			define i64 @foo(i8* %p, i32 signext %n, i32 signext %count) {
	; CHECK-LABEL: foo:			; CHECK-LABEL: foo:
	; CHECK: # %bb.0: # %entry			; CHECK: # %bb.0: # %entry
	; CHECK-NEXT: cmpwi r4, 1			; CHECK-NEXT: cmpwi r4, 1
	; CHECK-NEXT: blt cr0, .LBB0_4			; CHECK-NEXT: blt cr0, .LBB0_4
	; CHECK-NEXT: # %bb.1: # %for.body.preheader			; CHECK-NEXT: # %bb.1: # %for.body.preheader
	; CHECK-NEXT: addi r4, r4, -1			; CHECK-NEXT: clrldi r4, r4, 32
	; CHECK-NEXT: extsw r5, r5			; CHECK-NEXT: extsw r5, r5
	; CHECK-NEXT: li r6, 0			; CHECK-NEXT: li r6, 0
	; CHECK-NEXT: li r7, 5			; CHECK-NEXT: li r7, 5
	; CHECK-NEXT: li r8, 9
	; CHECK-NEXT: clrldi r4, r4, 32
	; CHECK-NEXT: addi r4, r4, 1
	; CHECK-NEXT: mtctr r4			; CHECK-NEXT: mtctr r4
				; CHECK-NEXT: li r8, 9
	; CHECK-NEXT: li r4, 0			; CHECK-NEXT: li r4, 0
	; CHECK-NEXT: .p2align 5			; CHECK-NEXT: .p2align 5
	; CHECK-NEXT: .LBB0_2: # %for.body			; CHECK-NEXT: .LBB0_2: # %for.body
	; CHECK-NEXT: #			; CHECK-NEXT: #
	; CHECK-NEXT: add r9, r3, r6			; CHECK-NEXT: add r9, r3, r6
	; CHECK-NEXT: add r6, r6, r5			; CHECK-NEXT: add r6, r6, r5
	; CHECK-NEXT: ldx r10, r9, r7			; CHECK-NEXT: ldx r10, r9, r7
	; CHECK-NEXT: ldx r9, r9, r8			; CHECK-NEXT: ldx r9, r9, r8
	▲ Show 20 Lines • Show All 49 Lines • ▼ Show 20 Lines
	; }			; }

	define zeroext i8 @foo1(i8* %p, i32 signext %n, i32 signext %count) {			define zeroext i8 @foo1(i8* %p, i32 signext %n, i32 signext %count) {
	; CHECK-LABEL: foo1:			; CHECK-LABEL: foo1:
	; CHECK: # %bb.0: # %entry			; CHECK: # %bb.0: # %entry
	; CHECK-NEXT: cmpwi r4, 1			; CHECK-NEXT: cmpwi r4, 1
	; CHECK-NEXT: blt cr0, .LBB1_4			; CHECK-NEXT: blt cr0, .LBB1_4
	; CHECK-NEXT: # %bb.1: # %for.body.preheader			; CHECK-NEXT: # %bb.1: # %for.body.preheader
	; CHECK-NEXT: addi r4, r4, -1			; CHECK-NEXT: addi r6, r3, 1000
	; CHECK-NEXT: addi r3, r3, 1000			; CHECK-NEXT: clrldi r3, r4, 32
	; CHECK-NEXT: extsw r5, r5			; CHECK-NEXT: extsw r5, r5
	; CHECK-NEXT: li r6, 0
	; CHECK-NEXT: clrldi r4, r4, 32
	; CHECK-NEXT: addi r4, r4, 1
	; CHECK-NEXT: mtctr r4
	; CHECK-NEXT: li r4, 0			; CHECK-NEXT: li r4, 0
				; CHECK-NEXT: mtctr r3
				; CHECK-NEXT: li r3, 0
	; CHECK-NEXT: .p2align 4			; CHECK-NEXT: .p2align 4
	; CHECK-NEXT: .LBB1_2: # %for.body			; CHECK-NEXT: .LBB1_2: # %for.body
	; CHECK-NEXT: #			; CHECK-NEXT: #
	; CHECK-NEXT: lbzx r7, r3, r6			; CHECK-NEXT: lbzx r7, r6, r4
	; CHECK-NEXT: add r6, r6, r5			; CHECK-NEXT: add r4, r4, r5
	; CHECK-NEXT: add r4, r7, r4			; CHECK-NEXT: add r3, r7, r3
	; CHECK-NEXT: bdnz .LBB1_2			; CHECK-NEXT: bdnz .LBB1_2
	; CHECK-NEXT: # %bb.3: # %for.cond.cleanup			; CHECK-NEXT: # %bb.3: # %for.cond.cleanup
	; CHECK-NEXT: clrldi r3, r4, 56			; CHECK-NEXT: clrldi r3, r3, 56
	; CHECK-NEXT: blr			; CHECK-NEXT: blr
	; CHECK-NEXT: .LBB1_4:			; CHECK-NEXT: .LBB1_4:
	; CHECK-NEXT: li r3, 0			; CHECK-NEXT: li r3, 0
	; CHECK-NEXT: blr			; CHECK-NEXT: blr
	entry:			entry:
	%cmp10 = icmp sgt i32 %n, 0			%cmp10 = icmp sgt i32 %n, 0
	br i1 %cmp10, label %for.body.preheader, label %for.cond.cleanup			br i1 %cmp10, label %for.body.preheader, label %for.cond.cleanup

	Show All 21 Lines

llvm/test/CodeGen/PowerPC/mma-phi-accs.ll

	Show First 20 Lines • Show All 257 Lines • ▼ Show 20 Lines
	; CHECK-NEXT: cmpwi r4, 1			; CHECK-NEXT: cmpwi r4, 1
	; CHECK-NEXT: bge cr0, .LBB3_3			; CHECK-NEXT: bge cr0, .LBB3_3
	; CHECK-NEXT: b .LBB3_5			; CHECK-NEXT: b .LBB3_5
	; CHECK-NEXT: .LBB3_2:			; CHECK-NEXT: .LBB3_2:
	; CHECK-NEXT: # implicit-def: $acc0			; CHECK-NEXT: # implicit-def: $acc0
	; CHECK-NEXT: cmpwi r4, 1			; CHECK-NEXT: cmpwi r4, 1
	; CHECK-NEXT: blt cr0, .LBB3_5			; CHECK-NEXT: blt cr0, .LBB3_5
	; CHECK-NEXT: .LBB3_3: # %for.body.preheader			; CHECK-NEXT: .LBB3_3: # %for.body.preheader
	; CHECK-NEXT: addi r3, r4, -1			; CHECK-NEXT: clrldi r3, r4, 32
	; CHECK-NEXT: clrldi r3, r3, 32
	; CHECK-NEXT: addi r3, r3, 1
	; CHECK-NEXT: mtctr r3			; CHECK-NEXT: mtctr r3
	; CHECK-NEXT: .p2align 4			; CHECK-NEXT: .p2align 4
	; CHECK-NEXT: .LBB3_4: # %for.body			; CHECK-NEXT: .LBB3_4: # %for.body
	; CHECK-NEXT: #			; CHECK-NEXT: #
	; CHECK-NEXT: xvf32gernp acc0, v2, v2			; CHECK-NEXT: xvf32gernp acc0, v2, v2
	; CHECK-NEXT: bdnz .LBB3_4			; CHECK-NEXT: bdnz .LBB3_4
	; CHECK-NEXT: .LBB3_5: # %for.cond.cleanup			; CHECK-NEXT: .LBB3_5: # %for.cond.cleanup
	; CHECK-NEXT: li r3, 0			; CHECK-NEXT: li r3, 0
	Show All 13 Lines
	; CHECK-BE-NEXT: cmpwi r4, 1			; CHECK-BE-NEXT: cmpwi r4, 1
	; CHECK-BE-NEXT: bge cr0, .LBB3_3			; CHECK-BE-NEXT: bge cr0, .LBB3_3
	; CHECK-BE-NEXT: b .LBB3_5			; CHECK-BE-NEXT: b .LBB3_5
	; CHECK-BE-NEXT: .LBB3_2:			; CHECK-BE-NEXT: .LBB3_2:
	; CHECK-BE-NEXT: # implicit-def: $acc0			; CHECK-BE-NEXT: # implicit-def: $acc0
	; CHECK-BE-NEXT: cmpwi r4, 1			; CHECK-BE-NEXT: cmpwi r4, 1
	; CHECK-BE-NEXT: blt cr0, .LBB3_5			; CHECK-BE-NEXT: blt cr0, .LBB3_5
	; CHECK-BE-NEXT: .LBB3_3: # %for.body.preheader			; CHECK-BE-NEXT: .LBB3_3: # %for.body.preheader
	; CHECK-BE-NEXT: addi r3, r4, -1			; CHECK-BE-NEXT: clrldi r3, r4, 32
	; CHECK-BE-NEXT: clrldi r3, r3, 32
	; CHECK-BE-NEXT: addi r3, r3, 1
	; CHECK-BE-NEXT: mtctr r3			; CHECK-BE-NEXT: mtctr r3
	; CHECK-BE-NEXT: .p2align 4			; CHECK-BE-NEXT: .p2align 4
	; CHECK-BE-NEXT: .LBB3_4: # %for.body			; CHECK-BE-NEXT: .LBB3_4: # %for.body
	; CHECK-BE-NEXT: #			; CHECK-BE-NEXT: #
	; CHECK-BE-NEXT: xvf32gernp acc0, v2, v2			; CHECK-BE-NEXT: xvf32gernp acc0, v2, v2
	; CHECK-BE-NEXT: bdnz .LBB3_4			; CHECK-BE-NEXT: bdnz .LBB3_4
	; CHECK-BE-NEXT: .LBB3_5: # %for.cond.cleanup			; CHECK-BE-NEXT: .LBB3_5: # %for.cond.cleanup
	; CHECK-BE-NEXT: li r3, 0			; CHECK-BE-NEXT: li r3, 0
	Show All 32 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[HardwareLoops] put +1 for loop count before zero extensionNeeds ReviewPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 372191

llvm/include/llvm/Analysis/TargetTransformInfo.h

llvm/lib/Analysis/TargetTransformInfo.cpp

llvm/lib/CodeGen/HardwareLoops.cpp

llvm/test/CodeGen/PowerPC/loop-instr-prep-non-const-increasement.ll

llvm/test/CodeGen/PowerPC/mma-phi-accs.ll

[HardwareLoops] put +1 for loop count before zero extension
Needs ReviewPublic