This is an archive of the discontinued LLVM Phabricator instance.

[HardwareLoops] put +1 for loop count before zero extension
Needs ReviewPublic

Authored by shchenz on Sep 13 2021, 2:47 AM.

Details

Summary

Putting the +1 before the zero-extend will allow scalar evolution to fold the expression in some cases.

This issue is first tried in https://reviews.llvm.org/D91724 and seems there is some issue with that patch. This patch fixes the issues in that patch.

Diff Detail

Event Timeline

shchenz created this revision.Sep 13 2021, 2:47 AM
shchenz requested review of this revision.Sep 13 2021, 2:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 13 2021, 2:47 AM
reames added inline comments.Sep 13 2021, 10:03 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
170

You're using a variable called TripCount to store an ExitCount (e.g. BTC) before adding the appropriate offset. I strongly suggest using two variables.

178

There appears to be an unasserted assumption that CountType is of a wider type than the exitcount's type. The code structure does not make it clear this is true, at a minimum, please assert.

178

I'd suggest turning the pointer type check into an early return false unless you have a motivating case for the complexity. Pointer typed trip counts should be rare.

reames added inline comments.Sep 13 2021, 10:03 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
185

Please use a comparison of the exit count against maximum unsigned value instead. The optimizer will probably turn it into the same, but the form you've written has potential interaction with SCEVs flag inference which is best avoided.

193

If you use getNoopOrZeroExtend, you don't need the explicit else case.

reames requested changes to this revision.Sep 13 2021, 10:03 AM
This revision now requires changes to proceed.Sep 13 2021, 10:03 AM
shchenz updated this revision to Diff 372385.Sep 13 2021, 7:20 PM
shchenz marked 3 inline comments as done.

address comments

@reames Thanks for your review. Updated accordingly.

llvm/lib/Analysis/TargetTransformInfo.cpp
170

Good catch

178

The code already make sure that TripCount->getType() not bigger than CountType. See line 124 ~ 125 in the same function isHardwareLoopCandidate. But yeah, an assertion will be more clear here.

185

This was my first thought too. We can get the maximum num for ExitCount and if it is less than the maximum unsigned value for that type, we can safely do the +1. Unfortunately, for the case changes, they are all not such cases.

One case I was looking at is:

define dso_local signext i32 @testNestedPHI(i32 signext %cond, i32 signext %count, <512 x i1>* nocapture %ptr, <16 x i8> %vc) #1 {                     

for.body:                                         ; preds = %for.body.preheader, %for.body
  %i.011 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]              
  %vq.110 = phi <512 x i1> [ %1, %for.body ], [ %vq.0, %for.body.preheader ]    
  %1 = tail call <512 x i1> @llvm.ppc.mma.xvf32gernp(<512 x i1> %vq.110, <16 x i8> %vc, <16 x i8> %vc)
  %inc = add nuw nsw i32 %i.011, 1                                              
  %exitcond.not = icmp eq i32 %inc, %count                                      
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body                 
}

SCEV for the ExitCount is %count - 1, bacause there is no wrap flag, so getRangeRef will return full-set as the range for ExitCount.

For ExitCount, I think we should add nuw flag for the above SCEV expression? But that should be out of this patch's scope?

193

Good idea.

shchenz updated this revision to Diff 372405.Sep 13 2021, 10:44 PM

rebase and fix Lint warnings

reames requested changes to this revision.Sep 16 2021, 9:48 AM
reames added inline comments.
llvm/lib/Analysis/TargetTransformInfo.cpp
178

This is wrong. Please return false here instead, pointer typed trip counts should never happen.

185

To make sure I'm following you, you're saying that we manage to prove ExitCount != 0 in this example? That's interesting, I actually won't expect us to get that.

I'll drop the point about not checking for zero. Just be aware there is some risk there. The good news is that we're rapidly reducing that risk via work on pr51817.

llvm/lib/CodeGen/HardwareLoops.cpp
195

Please remove the white space diff. If you want to reflow the code, do so in a separate change.

211

Same.

389–390

Hm, hadn't noticed this before.

An actual trip count is never zero. This appears to have been trying to handle the overflow in BTC to TC conversion. I haven't looked at this to see what it's doing and why, but this is likely dead code now and you should understand why this was insufficient previously.

This revision now requires changes to proceed.Sep 16 2021, 9:48 AM
shchenz updated this revision to Diff 373150.Sep 16 2021, 11:59 PM
shchenz marked an inline comment as done.

address comments

llvm/lib/Analysis/TargetTransformInfo.cpp
178

Thanks. I am not aware that loop exit count can not be a pointer type. I change this to assert instead of returning false.

185

Yes, the case is from the below test changes. Now we can put +1 before the zero extend.

llvm/lib/CodeGen/HardwareLoops.cpp
195

I was doing this on purpose. Some Lint warnings are emitted for this patch after I rename ExitCount to TripCount. Should I fix the Lint warnings in this patch?

389–390

I don't think it is used to detect overflow in the original patch. It should be used to detect the possibility that the loop count can be zero. If it can not be a zero(for example, explicitly check the loop count and constant zero in the IR), some targets like ARM can use 'test and set' form loop count intrinsic, so that we can delete the explicitly check instruction in the IR. See one example in test/Transforms/HardwareLoops/loop-guards.ll @test6(). After the change, instruction %cmp = icmp ne i32 %N, 0 has no user.

For the previous bug in https://reviews.llvm.org/D91724, this check does not help. On PowerPC, there is no test and set form loop count intrinsic, so no matter what the result of isLoopEntryGuardedByCond is, UseLoopGuard is always false. And the loop count value may still be changed from umax_32 + 1(if we extend to u64) to 0 after D91724.

At last, yes, after this patch, this check should not be done two times. I have updated the patch.

samparker added inline comments.Sep 17 2021, 12:30 AM
llvm/lib/CodeGen/HardwareLoops.cpp
389–390

There's a comment of line 417 about this. I was trying to familiarise, again, with this code the other day and it confused me... SCEV didn't seem to help much here and so we use pattern matching in CanGenerateTest to achieve what we want. But yes, looks like dead code now.

shchenz updated this revision to Diff 374715.Sep 23 2021, 8:04 PM

fix lint warning

I posted a SCEV cleanup change inspired by this review. It shouldn't block this one. See https://reviews.llvm.org/D110587

llvm/lib/CodeGen/HardwareLoops.cpp
389–390

I'm not following this comment.

My best guess is that you believe that the trip count of a loop which is not entered is zero. This is not necessarily true. A trip count (as returned by SCEV) is the number of times the loop header is executed if the loop is reached. As such, a trip count is always 1 or greater.

shchenz added inline comments.Sep 27 2021, 5:53 PM
llvm/lib/CodeGen/HardwareLoops.cpp
389–390

Sorry for the confusion.

I was trying to explain why even with the guard isLoopEntryGuardedByCond, we still fail after committing D91724.

The check isLoopEntryGuardedByCond here will not stop the hardware loop pass to convert a loop to a hardware loop. And the trip count changing from umax_32 + 1(if we extend to u64) to 0 after D91724 is not impacted by this check. So that's why we still hit the bug on PowerPC.

We need to do this check when we put +1 before the extension.

Hi @reames , do you have further comments for this patch? Thanks.

thopre added a subscriber: thopre.Oct 7 2021, 4:13 PM

Hi @reames , do you have further comments for this patch? Thanks.

Sorry, your last response didn't parse for me. I know the SCEV side of this, but the client pieces are going over my head.

My general suggestion is to rebase over D110587 (which I'm about to land), and cleanup any downstream handling of zero trip counts separately.

We've about hit the limit of my ability to provide useful review on this, JFYI.

shchenz added a comment.EditedOct 13 2021, 12:27 AM

Hi @reames , do you have further comments for this patch? Thanks.

Sorry, your last response didn't parse for me. I know the SCEV side of this, but the client pieces are going over my head.

My general suggestion is to rebase over D110587 (which I'm about to land), and cleanup any downstream handling of zero trip counts separately.

We've about hit the limit of my ability to provide useful review on this, JFYI.

Hi @reames , seems for this client side, Modified getTripCountFromExitCount in D110587 can not be used. getTripCountFromExitCount always return a wider type(1 bit larger than ExitCount->type size). Because of hardware loop counter register length limitation, the ExitCount->type must be not larger than loop counter register length(For example on 64-bit PowerPC, loop counter register length is 64-bit).
So without calling the API in D110587, we can use a i64 type ExitCount, if we can prove ExitCount + 1 does not overflow by calling isLoopEntryGuardedByCond. But if we call getTripCountFromExitCount with parameter Extend=true, the i64 exitcount/tripcount will be extended to i65 which is not a right loop counter type, so we can not use this i64 type ExitCount even it will not overflow after +1.

Thoughs? Please take your time.

shchenz updated this revision to Diff 379298.Oct 13 2021, 1:52 AM

remove the dead code

shchenz added inline comments.Oct 13 2021, 1:58 AM
llvm/test/Transforms/HardwareLoops/scalar-while.ll
311 ↗(On Diff #379298)

The test changes is due to isLoopEntryGuardedByCond can not determine that SCEV (1 + (-1 * %N) + %i) is at least 1 before. But now, we assume that the TripCount is at least 1 if we enter the loop. And we also check that TripCount will not overflow when we set TripCount.

Can't you teach SCEV::getAddExpr() that? (to fold zext(x)+C to zext(x+C) iff that doesn't alter the result: https://alive2.llvm.org/ce/z/5ZQ_X_)

Thoughs? Please take your time.

Proving that EC + 1 does not overflow is a proof that extension is not needed. The method has a boolean parameter for exactly that fact. It was written with exactly this case in mind.

Can't you teach SCEV::getAddExpr() that? (to fold zext(x)+C to zext(x+C) iff that doesn't alter the result: https://alive2.llvm.org/ce/z/5ZQ_X_)

To prove zext(x)+C not overflow, we need to pass a parameter for a specific loop to SCEV::getAddExpr(), will it be a little strange? AddExpr should not be connected with any loop.

If you still think this is a requirement, I'd be happy to do this refactor in another patch.

shchenz updated this revision to Diff 379687.Oct 14 2021, 6:02 AM

rebased based on D110587

shchenz added inline comments.Oct 14 2021, 6:06 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
190

Legacy code also does not consider the overflow when ExitCount->getType() is same with CountType, since it works well on all platforms, so I think it should be OK to treat it as not overflow when SE.getTypeSizeInBits(ExitCount->getType()) == CountType->getBitWidth().

if (!ExitCount->getType()->isPointerTy() &&
    ExitCount->getType() != CountType)
  ExitCount = SE.getZeroExtendExpr(ExitCount, CountType);

ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType));
reames resigned from this revision.Oct 14 2021, 2:03 PM

Removing self as reviewer. The SCEV change has been made and looks reasonable, the rest of the review requires context for the hardware loop detection feature I don't have.

shchenz added a subscriber: reames.Oct 14 2021, 5:36 PM

Removing self as reviewer. The SCEV change has been made and looks reasonable, the rest of the review requires context for the hardware loop detection feature I don't have.

Thanks very much for your review about the SCEV part @reames

@samparker Hi Sam, do you think the change in HardwareLoop pass is reasonable especially the changes in the hardware loop test cases. Thanks in advance.

samparker added inline comments.Oct 15 2021, 3:11 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
187

This comment is making me uneasy... I'm don't think target specific stuff should be referenced in this generic layer. But either way, why would a count register ever have a value of -1, especially when we're expending quite some effort to test against zero in this transform?

llvm/lib/CodeGen/HardwareLoops.cpp
421

Let's now remove this comment about isLoopEntryGuardedByCond.

llvm/test/Transforms/HardwareLoops/scalar-while.ll
311 ↗(On Diff #379298)

These are minor regressions, but they're inline with the other previous sub-optimal cases so I think it's okay.

shchenz added inline comments.Oct 15 2021, 7:50 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
187

PowerPC don't have test and set form. On PowerPC, if the count register is 0 before being decremented, it will be wrapped to -1(all bits are 1) afterward, so we can handle the overflow case when ExitCount type is same with CountType. The total loop count is u64_MAX + 1

But on other targets like ARM, the hardware loops pass handles the overflow case for ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType)); when ExitCount type is same with CountType with test and set form? If this is the case, yes, the comments is not right, we need to update it for different targets.

But for all targets, making SE.getAddExpr(ExitCount, SE.getOne(CountType)); as not overflow when SE.getTypeSizeInBits(ExitCount->getType()) == CountType->getBitWidth() should be OK. Right?

llvm/lib/CodeGen/HardwareLoops.cpp
421

Thanks, will do this later.

llvm/test/Transforms/HardwareLoops/scalar-while.ll
311 ↗(On Diff #379298)

Thanks for confirmation.

shchenz updated this revision to Diff 380273.Oct 17 2021, 8:23 PM

address comments

samparker added inline comments.Oct 18 2021, 1:43 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
187

Regardless of test.set support, all the intrinsics which set the counter use the trip count. So, if this is zero, as you're suggesting, the loop should never execute and so your counter should never be decremented to -1. What am I missing here? Could you point me at a test for an example?

The test.set forms, for ARM, are just so we can try to map on to 'while' loops and we can't presume anything in this method about test.set being successfully generated.

But for all targets, making SE.getAddExpr(ExitCount, SE.getOne(CountType)); as not overflow when SE.getTypeSizeInBits(ExitCount->getType()) == CountType->getBitWidth() should be OK. Right?

Yes, this sounds fine and I think this would be the only condition needed for the overflow check.

shchenz added inline comments.Oct 18 2021, 6:06 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
187

Hi, let me describe the issue a little bit. Sorry for the long comment and appreciate your patience. @samparker

  1. Without any change:
if (!ExitCount->getType()->isPointerTy() &&
    ExitCount->getType() != CountType)
  ExitCount = SE.getZeroExtendExpr(ExitCount, CountType);

ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType));

Suppose CountType is i64

For ExitCount type smaller than i64, it will be zero-extended to i64, so there will be no overflow issue;
For ExitCount type same with i64, it will add one directly without checking overflow. This should be functionality right since no issue was found so far. Technically speaking, SE.getAddExpr(ExitCount, SE.getOne(CountType)); could be overflow if ExitCount is u64_max before adding one. This case can work right on PowerPC, because when TripCount(after adding one and overflow) is zero, the loop execution count is u64_max + 1 according to PowerPC count register semantic. So I guess other targets like ARM should have the same hardware behavior or there is some other place to handle the overflow zero case?

  1. With the change in D91724:
TripCount = SE.getAddExpr(EC, SE.getOne(EC->getType()));

if (!EC->getType()->isPointerTy() && EC->getType() != CountType)
  TripCount = SE.getZeroExtendExpr(TripCount, CountType);

This is not right as we found regression pr51714 and the root cause was identified by Philip:
For narrower type, like i32, if ExitCount is u32_max before adding one, before the change(zero extend first and the add one), the loop count is u32_max + 1 (represented in i64), but after the change, the loop count is 0(overflow on type i32).
patch D91724 is not right.

  1. Now with this patch D109676:

For ExitCount type smaller than i64, if we can prove that ExitCount + 1 will not overflow, we can put +1 before zero-extension. Otherwise, we have to first do zero-extension and then do +1;
For ExitCount type same with i64, according to the legacy logic, we can treat it as not-overflow even though it can be overflow technically. But like the above comments for Without any change, all targets can handle the overflow with some mechanism.

For your question:

So, if this is zero, as you're suggesting, the loop should never execute and so your counter should never be decremented to -1. What am I missing here? Could you point me at a test for an example?

I think the TripCount is zero if ExitCount is u64_max before adding one, so hardware loop pass will set hardware count register to 0 as the result of overflow on type i64, and on PowerPC this is a valid loop, the loop execution count is u64_max + 1. No test case in hand, I did some experiments on PowerPC, set loop count register to 0 and after the first decrement (bdnz instruction), the loop count register is changed to u64_max(0xffffffffffffffff, -1).

I think this would be the only condition needed for the overflow check.

Do you mean we don't need the isLoopEntryGuardedByCond for the overflow check? I think isLoopEntryGuardedByCond is still needed for types smaller than i64, otherwise, all smaller types are treated as WillNotOverflow and we will not zero extend the type in getTripCountFromExitCount, we may still meet the same issue with what we met in patch D91724?

samparker added inline comments.Oct 20 2021, 4:27 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
187

Sigh, sorry, I forgot (again) that we haven't checked against ExitCount == UMAX . There's the static function 'mustBeFiniteCountedLoop' in PlaceSafePoints, maybe this could be moved somewhere so we could use it? Or has that kind of check already been tried? Sorry for my confusion and continued questioning.

shchenz added inline comments.Oct 20 2021, 4:46 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
187

Haha, don't worry about your continued questions. They are making this patch be more reasonable and clear.

Here, we wouldn't get infinite loop error at least on PowerPC target. When the initial loop count register is 0, the loop execution count is u64_max + 1.

A typical pattern for a hardware loop on PowerPC is like this:

mtctr $gpr  ; move the loop count from $gpr to loop count regsiter CTR
loop_start:  ; start of loop header 
...   ; loop body
...
bdnz loop_start  ; loop latch

The semantic of the PowerPC bdnz instructions is "Decrement CTR and branch if it is still nonzero". So if the initial value is 0, after the first decrement, the loop count register is changed to 0xffffffffffffffff. and then after decrement 0xffffffffffffffff times, it will be changed to zero and then the loop is exiting.

Do we meet infinite loop issue on ARM? I think here we only care about overflow.

samparker added inline comments.Oct 20 2021, 7:27 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
187

Okay, so I've tried going through the Arm spec and it's not immediately clear to me, so I have asked for some clarification. I suspect our behaviour is the same, but I will confirm once I know.

samparker added inline comments.Oct 21 2021, 12:08 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
187

So, unfortunately it looks like the behaviour isn't the same as PPC. We only decrement the counter, and perform the backwards branch, if the counter has a value > 1. So setting our loop counter to a value of either zero or one, will result in a single trip and no backedge taken.

I would like to start a top-level discussion so everything isn't lost within the review comments, and pull in the Arm people who still work on this... @dmgreen @samtebbs

It appears as though PPC and Arm have different semantics for their loop backedge control. To summarize @shchenz, PPC will 'Decrement CTR and branch if it is still nonzero'. On Arm, AFAICT, we don't perform the decrement when the counter (LR) is <= 1.

So, do we need to tighten the semantics of the loop intrinsics for their overflow behaviour? Arm and PPC use different intrinsics so I can't see why defining them slightly differently would anyone.

shchenz added inline comments.Oct 21 2021, 1:29 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
187

If so, would it be a potential issue even without this patch?

if (!ExitCount->getType()->isPointerTy() &&
    ExitCount->getType() != CountType)
  ExitCount = SE.getZeroExtendExpr(ExitCount, CountType);

ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType));

If ExitCount type is same with CountType and ExitCount is UMAX before +1, after +1, it will be overflow and set to 0. For this case, does current behavior match ARM's hardware loop's expection?

samparker added inline comments.Oct 21 2021, 3:13 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
187

Yes, it will cause us problems, which is why I think we probably need some better intrinsic semantics and varying logic here. With the current representation in the loop, we can't enter the loop with the counter (trip count) set to zero. This sounds, to me, rather sensible. I thought (and seemingly keep forgetting that it's not true) that this transform found 'countable' loops which would terminate, so that a trip count could have a max value of UMAX but ExitCount (BTC) could only ever reach UMAX - 1.

Everything tends to be i32's in Arm, so we only rarely see extends in the induction variables in the Hardware loop pass. Even if the IV was previously in i8/i16 it will likely already have been extended to a i32 by indvarsimplify.

I would like to start a top-level discussion so everything isn't lost within the review comments, and pull in the Arm people who still work on this... @dmgreen @samtebbs

It appears as though PPC and Arm have different semantics for their loop backedge control. To summarize @shchenz, PPC will 'Decrement CTR and branch if it is still nonzero'. On Arm, AFAICT, we don't perform the decrement when the counter (LR) is <= 1.

So, do we need to tighten the semantics of the loop intrinsics for their overflow behaviour? Arm and PPC use different intrinsics so I can't see why defining them slightly differently would anyone.

It's true we don't need to add 1 to the loop counter (if I'm understanding this correctly). When running our downstream benchmarks the only changes I saw from this patch is instructions that were in the preheader are now in the pre-preheader (the guard block). That makes things better or worse by ~1%, depending on the benchmark.

llvm/test/Transforms/HardwareLoops/scalar-while.ll
311 ↗(On Diff #379298)

It's only small, but it does mean we execute more instructions if the branch is taken..

shchenz marked an inline comment as done.Oct 25 2021, 6:49 AM

Everything tends to be i32's in Arm, so we only rarely see extends in the induction variables in the Hardware loop pass. Even if the IV was previously in i8/i16 it will likely already have been extended to a i32 by indvarsimplify.

Thanks, that is the possible reason why we don't see any regressions after patch D91724 on ARM.

It's true we don't need to add 1 to the loop counter (if I'm understanding this correctly). When running our downstream benchmarks the only changes I saw from this patch is instructions that were in the preheader are now in the pre-preheader (the guard block). That makes things better or worse by ~1%, depending on the benchmark.

Thanks for your test on ARM. The performance impact should be small as it only affects the loop count expansion outside of the loop.

So for this patch, we need to wait for ARM experts find out the solution for overflow issue when ExitCount type is same with CountType?

llvm/test/Transforms/HardwareLoops/scalar-while.ll
311 ↗(On Diff #379298)

Putting the loop count in pre-preheader is required by UseLoopGuard = true? If so, this should be positive, because now we use more loop guards?

shchenz marked an inline comment as done.Dec 2 2021, 8:01 PM

So for this patch, we need to wait for ARM experts find out the solution for the overflow issue when ExitCount type is same with CountType?

Not sure if there is any solution for the above issue? @samparker @dmgreen .

This patch is required by the PowerPC as it can generate better instructions shown in the test changes. And this should not introduce any regression on ARM, because the codes before this patch also have the potential issue. See https://reviews.llvm.org/D109676#3077240

So, is it possible to get this in first? Thanks in advance.

gentle ping

As I said previously, the behaviour of the intrinsics should better defined and then this transform should adhere to those semantics. We're already using different pairs of intrinsics for each backend, so this shouldn't be a problem; however, we may have to ensure the Arm backend rejects the 'old' set.loop.iterations just to be safe.