This is an archive of the discontinued LLVM Phabricator instance.

[Delinerization] Keep array element byte offset.
Needs ReviewPublic

Authored by Meinersbur on Aug 28 2021, 9:34 PM.

Details

Summary

The offset into the array element cannot just be discarded. In most cases it will be zero, but like any of the detected subscript any memory access must be checked whether it is contained in the byte range of the memory access (or aliases with neighboring array elements). This is because delinearization is just a heuristic, there is not guarantee that the byte offset is a constant, is non-negative, is less than the array element size, or the memory access is entirely contained within the array element (delinerization does not even know the access size).

Fix by removing the special handling of the least significant dimension in ScalarEvolution::computeAccessFunctions. The makes the returned Subscripts array one larger than the Sizes array. This actually would be expected, a subscript each size plus one subscript for the division remainder representing the outermost dimension of unknown size.

This bug caused Polly to miscompile blender (526.blender_r from SPEC CPU 2017) in -polly-process-unprofitable mode. The SCEV expression incorrectly delinearized has been reduced in the test case byte_offset.ll. The dropped offset into the array element of size 4 (a float) is ((sext i32 %mul7.i4534 to i64) + {(sext i32 %i1 to i64),+,((sext i32 (1 + ((1 + %shl.i.i) * (1 + %shl.i.i)) + %shl.i.i) to i64) * (sext i32 %i1 to i64))}<%for.body703>). This significant component was just dropped, and the wrong pointer was computed when regenerating code from the remaining delinearized subscripts.

This occurred during blender's subsurface scattering implementation. As a result, blender's rendering diverged from the reference image.

BugReference

Diff Detail

Event Timeline

Meinersbur created this revision.Aug 28 2021, 9:34 PM
Meinersbur requested review of this revision.Aug 28 2021, 9:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 28 2021, 9:34 PM

[Polly] Also require offset to be zero.

The makes the returned Subscripts array one larger than the Sizes array. This actually would be expected, a subscript each size plus one subscript for the division remainder representing the outermost dimension of unknown size.

The Size array includes the element size, so it already makes up for the lack of a value for the outermost dimension.

Instead of treating the element offset as a subscript, could we make computeAccessFunctions return the offset (if non-zero) in a separate output parameter?
Or maybe we should consider delinearization unsuccessful when there is a non-zero remainder after the last real subscript is already computed.

llvm/include/llvm/Analysis/ScalarEvolution.h
1201

address of a memory access -> address of the above memory access

llvm/lib/Analysis/DependenceAnalysis.cpp
3454

I'm worried that having an extra subscript could confuse the analysis. Each separable subscript typically corresponds to a distinct loop level (unless the subscript is ZIV), but if we include the element offset as a "subscript" it would normally not have any corresponding loop levels.

llvm/test/Analysis/Delinearization/byte_offset.ll
10

could we have the psudo-c for this IR added here as a comment?

Meinersbur marked 2 inline comments as done.
  • Address review

The Size array includes the element size, so it already makes up for the lack of a value for the outermost dimension.

That's the issue. It is contained in the Size array, but its subscript (which ideally would be between 0 <= subscript < EltSize), unlike for any other size, is dropped.

Instead of treating the element offset as a subscript, could we make computeAccessFunctions return the offset (if non-zero) in a separate output parameter?

This would make handling of the element offset more complicated and error prone, while there is no inherent difference between a array dimension of constant size and the byte offset. As illustrated by this bug and the fact that DependenceAnalysis's subscript checking also just works for the array element offset.

Or maybe we should consider delinearization unsuccessful when there is a non-zero remainder after the last real subscript is already computed.

Would be possibility, and was already done by computeAccessFunctions if the outermost operation was a SCEVAddRecExpr justified by being "too complicated". How can it be "too complicated" if it is just forgotten about?
It would also make make it impossible do use delinearization for array-of-structs. I would leave that decision what to support to the caller. At least for Polly I was considering supporting array-of-structs.

llvm/lib/Analysis/DependenceAnalysis.cpp
3454

Changed this to pop the last element before continuing unless isZero() is false and then bail out, like already done for Polly and LoopCacheAnalysis. We can check later how DA had to be changed to support this. At least the DelinearizationChecks below would just work and I would assume that DA works find as long as the accessed memory falls completely into the elements byte change, which is verified by the Delinearization check.

llvm/test/Analysis/Delinearization/byte_offset.ll
10

Not sure how useful you find these manually-decompiled IR that come from llvm-reduce, but here it is.

You can also find the origin here: https://github.com/blender/blender/blob/765b842f9520843183bf0a3cdcd071f152bbbf9e/source/blender/blenkernel/intern/CCGSubSurf.c#L2131

Instead of treating the element offset as a subscript, could we make computeAccessFunctions return the offset (if non-zero) in a separate output parameter?

This would make handling of the element offset more complicated and error prone, while there is no inherent difference between a array dimension of constant size and the byte offset. As illustrated by this bug and the fact that DependenceAnalysis's subscript checking also just works for the array element offset.

The way "element offset" is being handled in this patch is to isolate it (from the rest of the subscripts) and ignore it everywhere except for ScopDetection.cpp where it is still isolated but used to decide if the access is affine. If it needs to be isolated to be handled, then I don't see how returning it as a separate parameter makes it more complicated. The original goal of the delinearization algorithm is to recover source level subscripts. While the "element offset" can be thought of as a byte index into an imaginary inner-most dimension, it is not something that corresponds to a source level array subscript.

Or maybe we should consider delinearization unsuccessful when there is a non-zero remainder after the last real subscript is already computed.

Would be possibility, and was already done by computeAccessFunctions if the outermost operation was a SCEVAddRecExpr justified by being "too complicated". How can it be "too complicated" if it is just forgotten about?
It would also make make it impossible do use delinearization for array-of-structs. I would leave that decision what to support to the caller. At least for Polly I was considering supporting array-of-structs.

Yeah, I'm not sure why it was considered "too complicated" only when the remainder was an AddRec. The paper says that the original polynomial expression (representing the linearized access function) is first divided by the element size, but they don't say what it means if there is a remainder and what to do with it. I would assume that having an access function that doesn't evenly divide the element size cannot be safely delinearized.

I tried a simple example and it seems that the "element offset" is zero regardless of which member of a structure is being accessed, so not sure if it has any bearing on the ability to delinearize arrays of structs:

> cat delin_struct.c
struct MyS {
  float a, b;
};

void foo(int n, int m, struct MyS f1[][n][m]) {
  for (int i = 0; i < 1024; i++)
    for (int j = 0; j < n; j++)
      for (int k = 0; k < m; k++)
        f1[i][j][k].a = f1[i][j][k].b;
}

opt -passes='print<delinearization>' -disable-output delin_struct.simp.ll 2>&1 | grep ArrayRef
ArrayRef[{0,+,2}<%for.body>][{0,+,2}<%for.body4>][{1,+,2}<%for.body8>][0]
ArrayRef[{0,+,2}<%for.body>][{2,+,2}<%for.body4>][-1][0]
ArrayRef[0][{(-1 + (2 * (zext i32 %n to i64) * (zext i32 %m to i64))),+,(2 * (zext i32 %n to i64) * (zext i32 %m to i64))}<%for.body>][0]
ArrayRef[{0,+,2}<%for.body>][{0,+,2}<%for.body4>][{0,+,2}<%for.body8>][0]
ArrayRef[{0,+,2}<%for.body>][{2,+,2}<%for.body4>][-2][0]
ArrayRef[0][{(-2 + (2 * (zext i32 %n to i64) * (zext i32 %m to i64))),+,(2 * (zext i32 %n to i64) * (zext i32 %m to i64))}<%for.body>][0]

Note that the subscript in the last dimension (assumed byte array) is 0! IR attached:

llvm/lib/Analysis/DependenceAnalysis.cpp
3453

[nit] the by offset -> the byte offset

llvm/test/Analysis/Delinearization/byte_offset.ll
10

I haven't looked at the original example yet, but this looks like a case where delinearization should be expected to fail. The recovered subscripts above don't look anything like what the source level subscripts are.

polly/lib/Analysis/ScopDetection.cpp
1029

[nit] ElementSize may be a better name than ArrSize?

If it needs to be isolated to be handled, then I don't see how returning it as a separate parameter makes it more complicated.

It only is handled differently because the legacy code does not expect an additional parameter. However, fixing the bug a higher priority and the individual passes could be improved making proper uses of the returned subscript at a later point.

The original goal of the delinearization algorithm is to recover source level subscripts.

[citation needed]

If this was true than any delinearization that does not correspond to the original source subscript needed to be considered wrong. However, I am pretty sure we also want to delinearize A[i +j*n] even though the source code subscript has been linearized.

I tried a simple example and it seems that the "element offset" is zero regardless of which member of a structure is being accessed, so not sure if it has any bearing on the ability to delinearize arrays of structs:

It is delinearized as array element sizeof(float), not sizeof(MyS) with one dimension twice as large and SCEVAddRecExpr starting with 1 instead of 0. This has to do with how ScalarEvolution tries to remove the struct index from the GEP, modeling it as a simple addition.

Use different elements of different sizes in the struct:

struct __attribute__((packed)) Pair { int x; long y; };
void foo(unsigned long N, struct Pair A[][N]) {
  for (unsigned long i = 0; i < N; i+=1)
      A[i][i].y = 0;
}
opt -delinearize
AccessFunction: {4,+,(12 + (12 * %N))}<%for.cond>
Base offset: %A
ArrayDecl[UnknownSize][%N][8]
ArrayRef[0][{0,+,1}<%for.cond>][{4,+,(4 + (12 * %N))}<%for.cond>]

This is obviously more complicated than it needs to be. Specifically, it still considers element size to be sizeof(long), because that's the access it sees. If accessing x instead of y the element size would be 4. Polly tries to combine the shapes from multiple delinearization results into a common one such that subscripts are comparable.

Yes, with the current implementation of delinearization, I don't think there are a lot of cases where the byte offset subscript would improve the dependence analysis. Still, I think the right API that allows improving the delinerization includes the byte offset.

Having derived this example, it is much better understandable than the regression test I got from llvm-reduce, I could replace it.

If it needs to be isolated to be handled, then I don't see how returning it as a separate parameter makes it more complicated.

It only is handled differently because the legacy code does not expect an additional parameter. However, fixing the bug a higher priority and the individual passes could be improved making proper uses of the returned subscript at a later point.

The original goal of the delinearization algorithm is to recover source level subscripts.

[citation needed]

If this was true than any delinearization that does not correspond to the original source subscript needed to be considered wrong. However, I am pretty sure we also want to delinearize A[i +j*n] even though the source code subscript has been linearized.

I tried a simple example and it seems that the "element offset" is zero regardless of which member of a structure is being accessed, so not sure if it has any bearing on the ability to delinearize arrays of structs:

It is delinearized as array element sizeof(float), not sizeof(MyS) with one dimension twice as large and SCEVAddRecExpr starting with 1 instead of 0. This has to do with how ScalarEvolution tries to remove the struct index from the GEP, modeling it as a simple addition.

Use different elements of different sizes in the struct:

struct __attribute__((packed)) Pair { int x; long y; };
void foo(unsigned long N, struct Pair A[][N]) {
  for (unsigned long i = 0; i < N; i+=1)
      A[i][i].y = 0;
}
opt -delinearize
AccessFunction: {4,+,(12 + (12 * %N))}<%for.cond>
Base offset: %A
ArrayDecl[UnknownSize][%N][8]
ArrayRef[0][{0,+,1}<%for.cond>][{4,+,(4 + (12 * %N))}<%for.cond>]

This is obviously more complicated than it needs to be. Specifically, it still considers element size to be sizeof(long), because that's the access it sees. If accessing x instead of y the element size would be 4. Polly tries to combine the shapes from multiple delinearization results into a common one such that subscripts are comparable.

Yes, with the current implementation of delinearization, I don't think there are a lot of cases where the byte offset subscript would improve the dependence analysis. Still, I think the right API that allows improving the delinerization includes the byte offset.

Having derived this example, it is much better understandable than the regression test I got from llvm-reduce, I could replace it.

Thanks for the example. It does make it much easier to understand the problem. I also have concerns about the current implementation not being able to handle arrays of structures. To understand this better, I derived an example based on your example above:

struct __attribute__((packed)) MyS {
  float a;
  double b;
};

void foo(long long n, long long m, struct MyS f1[][n][m]) {
  for (int i = 0; i < 1024; i++)
    for (int j = 0; j < n; j++)
      for (int k = 0; k < m; k++)
        f1[i][j][k].b = f1[i-1][j][k].b;
}

Ideally we would want delinearization to recover the subscripts for the load to be [i-1][j][k] and the subscripts for the store to be [i][j][k], so that the dependence analysis can produce [-1, 0, 0]. With the changes in this patch we get the following delinearization for the load and stores respectively:

Inst:  %4 = load double, double* %b, align 1, !tbaa !3
In Loop with Header: for.body11
AccessFunction: {{{(4 + (-12 * %n * %m)),+,(12 * %n * %m)}<%for.body>,+,(12 * %m)}<%for.body5>,+,12}<%for.body11>
Base offset: %f1
ArrayDecl[UnknownSize][%n][%m][8]
ArrayRef[0][0][{0,+,1}<nuw><nsw><%for.body11>][{{{(4 + (-12 * %n * %m)),+,(12 * %n * %m)}<%for.body>,+,(12 * %m)}<%for.body5>,+,4}<%for.body11>]
Inst:  store double %4, double* %b22, align 1, !tbaa !3
In Loop with Header: for.body11
AccessFunction: {{{4,+,(12 * %n * %m)}<%for.body>,+,(12 * %m)}<%for.body5>,+,12}<%for.body11>
Base offset: %f1
ArrayDecl[UnknownSize][%n][%m][8]
ArrayRef[0][0][{0,+,1}<nuw><nsw><%for.body11>][{{{4,+,(12 * %n * %m)}<%for.body>,+,(12 * %m)}<%for.body5>,+,4}<%for.body11>]

Even though delinearization claims to be successful most of the complexity of the original access function has now moved to the inner most dimension, leaving the outer subscripts as zeros. This won't help dependence analysis in any way, as it now has to further analyze the complicated subscript for the synthesized inner most dimension. It's not clear to me how that can be done.

Note that if the element size passed to the delinearize function was chosen to be 12 (the true element size of the array), then delinearization would have been able to recover more meaningful subscripts for the outer dimensions without requiring a "byte offset". I wonder if we can improve the results for structure of arrays by choosing a better element size.

Note that if the element size passed to the delinearize function was chosen to be 12 (the true element size of the array), then delinearization would have been able to recover more meaningful subscripts for the outer dimensions without requiring a "byte offset". I wonder if we can improve the results for structure of arrays by choosing a better element size.

Correction: the access functions in my example do not divide 12, so choosing the "true element size" doesn't fix it :(

...but since delinearization is a heuristic, maybe we can heuristically pick something that does divide it evenly (eg a GCD or just choosing 1). If we choose 1, DA is able to produce expected result (with -da-disable-delinearization-checks):

ArrayDecl[UnknownSize][%n][%m][1]
ArrayRef[{-12,+,12}<%for.body>][{0,+,12}<%for.body5>][{4,+,12}<%for.body11>][0]
...
ArrayRef[{0,+,12}<%for.body>][{0,+,12}<%for.body5>][{4,+,12}<%for.body11>][0]
Src:  %4 = load double, double* %b, align 1, !tbaa !3 --> Dst:  store double %4, double* %b22, align 1, !tbaa !3
  da analyze - consistent anti [-1 0 0]!
Meinersbur added a comment.EditedSep 9 2021, 10:53 AM

> I also have concerns about the current implementation not being able to handle arrays of structures.

D109527

maybe we can heuristically pick something that does divide it evenly (eg a GCD or just choosing 1). If we choose 1, DA is able to produce expected result (with -da-disable-delinearization-checks)

A dimension of size 1 can be omitted since its only valid with a subscript of 0 and multiplies by 1. What makes delinearization useful is that allowing to assume that memory accesses only alias if all subscripts are equal (and the arrays themselves dot overlap). For this to work the subscripts must be between 0 and the dimension size. -da-disable-delinearization-checks skips this check and therefore may cause miscompilation.

> I also have concerns about the current implementation not being able to handle arrays of structures.

D109527

Thanks for that. Looks like using the "true element size" can significantly simplify the remainder expression for the added byte dimension. We may be converging on a solution here.

maybe we can heuristically pick something that does divide it evenly (eg a GCD or just choosing 1). If we choose 1, DA is able to produce expected result (with -da-disable-delinearization-checks)

A dimension of size 1 can be omitted since its only valid with a subscript of 0 and multiplies by 1. What makes delinearization useful is that allowing to assume that memory accesses only alias if all subscripts are equal (and the arrays themselves dot overlap)[*]. For this to work the subscripts must be between 0 and the dimension size. -da-disable-delinearization-checks skips this check and therefore may cause miscompilation.

Rather than viewing it as an extra dimension I simply view it as the size of the structure member. From the point of view of dependence analysis (after aliasing properties are computed) the dependence between A[i][j][k].b and A[i-1][j][k].b should yield the same results regardless of the size of b (be it 1-byte or otherwise). I used the -da-disable-delinearization-checks to illustrate the idea, but it's possible that using this scheme requires different validity checks.

bmahjour added inline comments.Sep 9 2021, 3:45 PM
llvm/lib/Analysis/DependenceAnalysis.cpp
3458–3467

I think we should care more about the difference between the byte offset of the source and destination, than its actual value. If the byte offsets are equal, then the rest of the subscripts should be analyzed. If they are different, then we don't know how to handle them yet.

suggestion: replace it with the following and move it after checking that the subscript arrays are equal in size and contain at least 2 elements each (line 3457 of the base).

const SCEV *EltOffsetSrc = SrcSubscripts.pop_back_val();
const SCEV *EltOffsetDst = DstSubscripts.pop_back_val();
if (EltOffsetSrc != EltOffsetDst)
  return false;