This is an archive of the discontinued LLVM Phabricator instance.

[ScalarEvolution] Infer loop max trip count from memory accesses
Needs ReviewPublic

Authored by Peakulorain on Jul 12 2023, 1:15 AM.

Details

Summary

Memory accesses in a loop must not touch elements beyond the statically
allocated size of the underlying object; doing so is undefined behavior.
We can therefore infer a loop max trip count from that size.

Diff Detail

Event Timeline

Peakulorain created this revision.Jul 12 2023, 1:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 12 2023, 1:15 AM
Peakulorain requested review of this revision.Jul 12 2023, 1:15 AM

Hi, All,
This patch is a fixed version of an earlier commit.
Related to:
https://reviews.llvm.org/D109821
https://reviews.llvm.org/D113554

In order to infer this trip count, I divide the process into the following steps:

  1. Collect the load/store instructions that execute on every iteration of the loop;
  2. Filter out reads/writes that may overlap;
  3. Calculate the possible wrap value for each pointer index and record the smaller one;
  4. Infer the maximum number of executions from the total memory size divided by the step value;
  5. Compare the calculated value with the smaller wrap value of the index to ensure that the step is strictly increasing.

Happy to hear your feedback!

I am mainly interested in cases where you have a small array; with the newly analyzed size, you can then unroll the loop.

nikic requested changes to this revision.Jul 12 2023, 1:56 AM

Before we start looking at the implementation, please:

  • Add an option that uses the new logic to refine the standard BE count results.
  • Move tests to llvm/test/Analysis/ScalarEvolution using that option.
This revision now requires changes to proceed.Jul 12 2023, 1:56 AM

Thanks for the guide, updated as requested.

nikic added a comment.Jul 18 2023, 2:49 AM

Could you please explain at a high level why we need all this custom wrapping logic? Why are the no-wrap flags on the AddRec insufficient?

The purpose of the logic I implemented is to calculate after how many iterations the GEP index will wrap. With this value, we can know whether the loop will become infinite.

for.body:
  %iv = phi i8 [ %inc, %for.body ], [ 0, %for.body.preheader ]
  %idxprom = zext i8 %iv to i64
  %arrayidx = getelementptr inbounds [500 x i32], [500 x i32]* %a, i64 0, i64 %idxprom
  store i32 0, i32* %arrayidx, align 4
  %inc = add i8 %iv, 1
  %inc_zext = zext i8 %inc to i32
  %cmp = icmp ult i32 %inc_zext, %len
  br i1 %cmp, label %for.body, label %loopexit

If the value of %len (which comes from an argument) is greater than the maximum value an i8 can represent, the loop becomes infinite, but the store keeps cycling through a fixed region without any UB. The custom logic is there to ensure the inference is still correct in this case.

nikic added a comment.Jul 18 2023, 7:57 AM

In such a case the pointer will be something like ((4 * (zext i8 {0,+,1}<%for.body> to i64))<nuw><nsw> + %a)<nuw> rather than the {%a,+,4}<nuw><%for.body> it would be in the non-wrapping case. Why is the restriction to addrec pointers not sufficient for this case?

Thanks for your help; the above case is indeed filtered out by the constraints. But please see:

define void @test(i32 signext %len) {...
for.body:
  %iv = phi i8 [ %inc, %for.body ], [ 0, %for.body.preheader ]
  %idxprom = zext i8 %iv to i64
  %arrayidx = getelementptr inbounds [500 x i32], [500 x i32]* %a, i64 0, i64 %idxprom
  store i32 0, i32* %arrayidx, align 4
  %inc = add nuw nsw i8 %iv, 1
  %inc_zext = zext i8 %inc to i32
  %cmp = icmp slt i32 %inc_zext, %len
  br i1 %cmp, label %for.body, label %loopexit
  ...
}

This case would get {%a,+,4}<nuw><%for.body>. In such a situation, I think it is necessary to calculate after how many iterations the index wraps. :)

Doesn't the add nuw exclude wrapping in this case though? This is why SCEV concludes it's okay to look through the zext.

I know that the nuw flag excludes wrapping. Even so, the value we get from (MemSize / StepSize + 1) is 501. I was concerned that this inferred value might not be usable, so I compared it with the i8 wrap value: if the inferred value is within the loop counter's wrap value, we consider it usable.

nikic added a comment.Jul 20 2023, 1:15 AM

In this example, isn't it okay if the max trip count is reported as 501? We actually know that the max trip count must be 256 due to the nuw on the add, even if SCEV fails to realize it. As such, reporting 501 as a conservative over-estimate should be fine. Am I missing something?

Okay, looks like I was worrying too much. The new implementation is divided into the following steps:

  1. Collect the load/store instructions that execute on every iteration of the loop;
  2. Filter out reads/writes that may overlap;
  3. Infer the maximum number of executions from the total memory size divided by the step value.

Please see if there are any other issues with this patch. :)

Peakulorain added a comment.EditedJul 25 2023, 11:55 PM

Hi nikic, sorry to bother you. Do you have any other concerns about this patch?

Back to prior diff. @nikic

Why? The new diff looked better, though it probably should have NoSelfWrap -> NoUnsignedWrap.

Though even then, this patch seems to have more code/checks whose purpose I don't understand. For example, why do we care whether the memory accesses may overlap?

I do not follow parts of this. The overlapping logic is odd. We should also filter accesses earlier. And finally, lots of the checks seem weird.
What I was expecting is roughly:

Loop count C is unknown.
Size of the object is T.
Access size is A.
Access function is F; we can restrict it to {B,+,S}<loop> with S > 0.

What we know is that
(T - B) >= 0
or the loop trip count is 0.
And
T >= C * (S * A) + B
which is
(T - B) >= C * S * A
which gives us, under some checks
(T - B) / (S * A) >= C
The left hand side should be constant and thereby bound C.

What am I missing?

llvm/lib/Analysis/ScalarEvolution.cpp
255

Do we have compile time numbers with this set to true?

8184–8195
8252–8255

I also don't understand why we need to check overlapping.

8263–8266

I would have expected this code much earlier. The expectation is that most pointer bases are not bounded, so why do we bother collecting the instructions, checking the access expression, etc.? We should look at the base first before we ever consider using the access for reasoning.

8272–8281

I doubt this is necessary given the other restrictions. You already verified the access happens every iteration, and it will happen whenever we enter the loop header. This is only necessary if we relax the conditions and allow early exits. Thus, add 1 only if the exiting block is not unique or not the latch.

8291

I don't understand this.

8321

If we got a proper value, no need to use the secondary reasoning, right?

fmayer added a subscriber: fmayer.Sep 27 2023, 12:21 PM

How is this going to affect sanitizers? We still want them to be able to detect overflows.

More context: sanitizers use SCEV to decide not to instrument accesses where SCEV tells us they are in range. With features like this that exploit UB in SCEV, we can no longer rely on that, because the whole point of the sanitizer is to catch UB.

SCEV (and other helpers) already refine based on UB, don't they? That said, you can disable this feature right now via a command-line flag. Still, we probably want a catch-all "do not exploit UB" flag.

Thanks! Yes, such a catch-all would be great. Just confirming that I understand the CL correctly (I didn't actually read all of the code). If I take the loop from the Discourse thread:

int square(int num) {
    int A[3];
    for (int i = 0; i < num; ++i)
      A[i] = i * num;
    return A[1] + A[2];
}

SCEV would tell me that 0 <= i < 3 is true?

And also num < 3?

We should check function attr.

llvm/lib/Analysis/ScalarEvolution.cpp
8138
&& !(F.hasFnAttribute(Attribute::SanitizeAddress) ||
         F.hasFnAttribute(Attribute::SanitizeThread) ||
         F.hasFnAttribute(Attribute::SanitizeMemory) ||
         F.hasFnAttribute(Attribute::SanitizeHWAddress) ||
         F.hasFnAttribute(Attribute::SanitizeMemTag))
8138

or maybe near the flag check

Yes, or at least that the loop count is bounded by 3, which should be enough for the unroller to get going.

We can check for attributes, is there a helper that encapsulates all the specific attribute names?

I'm asking because I think we don't care about the loop count for sanitizer purposes. But if SCEV told us that num < 3 is always true, we might fail to insert some checks.

@Peakulorain Are you still interested in pushing this forward? I would otherwise take over.

Hi @Peakulorain, if it is okay with you, @jdoerfert and I will push it forward.