This is an archive of the discontinued LLVM Phabricator instance.

[LAA] Relax pointer dependency with runtime pointer checks
AbandonedPublic

Authored by dtemirbulatov on May 27 2022, 5:43 AM.

Details

Summary

Following patch allows us to emit extra run-time pointer checks and gives us extra opportunity to vectorize if one pointer is an invariant in a loop and another one can be deduced to an affine expression. With this change applied we could now vectorize following example:

struct Arrays {

double X[128];
double Y[128];
double Z[128];
double D;

} s1;
int n;
...

for ( int k=0 ; k<n ; k++ )
 s1.X[k] = s1.D+ s1.Y[k] * s1.Z[k];

...

Diff Detail

Event Timeline

dtemirbulatov created this revision.May 27 2022, 5:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 27 2022, 5:43 AM
dtemirbulatov requested review of this revision.May 27 2022, 5:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 27 2022, 5:43 AM
dtemirbulatov edited reviewers, added: bsmith; removed: brad.May 30 2022, 2:32 AM
dtemirbulatov edited reviewers, added: sdesmalen, dmgreen; removed: bsmith.

I found that initially I attached the wrong version of diff. Updating. Also with this patch I have seen the number of improvements with eemb and one regression due to generated runtime checks.

Is this example from the summary the motivating example?

struct Arrays {
double X[128];
double Y[128];
double Z[128];
double D;
} s1;

void test(Arrays *S, int n) {
  for ( int k=0 ; k<n ; k++ ) {
    S->X[k] = S->D + S->Y[k] * S->Z[k];
  }
}

GCC seems to pull the invariant load out of the loop without needing any runtime checks: https://godbolt.org/z/PPn4jP1be
Should clang be able to do the same? It seems that GCC thinks that S->X[k] cannot alias with S->D.

fhahn added a comment.May 31 2022, 1:05 AM

Is this example from the summary the motivating example?

struct Arrays {
double X[128];
double Y[128];
double Z[128];
double D;
} s1;

void test(Arrays *S, int n) {
  for ( int k=0 ; k<n ; k++ ) {
    S->X[k] = S->D + S->Y[k] * S->Z[k];
  }
}

GCC seems to pull the invariant load out of the loop without needing any runtime checks: https://godbolt.org/z/PPn4jP1be
Should clang be able to do the same? It seems that GCC thinks that S->X[k] cannot alias with S->D.

I think what's missing here is that Clang/LLVM doesn't support accesses to fixed-sizes arrays in structs in TBAA. There's been a lot of discussions about improving TBAA a couple of years ago that didn't make any progress unfortunately. I'm planning to summarize the status and see if we can find a way forward to address some of the current limitations.

We're aware of the effort to expand TBAA but as you say this has been at the experimental/proposal stage for a while, which is why we're proposing this patch as a stop gap to enable vectorisation albeit at the cost of a redundant runtime check.

This patch actually also enables support for the cases where we have potentially conflicting accesses that have different strides (well invariant read is technically stride equal to zero). Here's one more real-live example:

 __attribute__((aligned(ALIGNMENT))) TYPE * const a = global_data.a;
 __attribute__((aligned(ALIGNMENT))) TYPE * const b = global_data.b;
 __attribute__((aligned(ALIGNMENT))) TYPE * const c = global_data.c;
 __attribute__((aligned(ALIGNMENT))) TYPE * const d = global_data.d;

...

for (int i = 0; i < LEN/2; i++) {
  a[2*i] = c[i] * b[i] + d[i] * b[i] + c[i] * c[i] + d[i] * b[i] + d[i] * c[i];
}

But we have to keep in mind that if we enable this feature then function RuntimePointerChecking::groupChecks will be running second time with UseDependencies equal to false and it will be emitting a lot of runtime check. There is a way to decrease that number by doing more accurate grouping and it can be implemented as follow-up patch.

mgabka added a subscriber: mgabka.Jun 8 2022, 2:08 AM

Ping!
With the porposed change I notices following run-time improvments:
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt : -20%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl: -13%
And regressions:
MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg: +3%
SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu: +1%

sdesmalen added inline comments.Jul 1 2022, 7:23 AM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
1870–1887

The code currently sets FoundNonConstantDistanceDependence = true if one of the two pointers is affine, but if one of the pointers is not affine (and also not loop-invariant), then there isn't a non-constant dependence distance to check in the SCEV checks. From what I can tell, I think *both* pointers should be either affine or loop invariant. I wasn't able to create a test-case for it from C, but I think there's a hole here that needs fixing (and preferably testing, if you can create one). What do you think?

This code could also do with some comments and tidying up.

dtemirbulatov added inline comments.Jul 4 2022, 3:23 AM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
1870–1887

Correct, I have now a testcase with both non-affine nor loop invariants. I will fix that. Thanks for finding this.

fhahn added inline comments.Jul 4 2022, 4:02 AM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
1887

The variable seems misnamed now, for the case here it seems like the distance is uncomputable, right?

llvm/test/Analysis/LoopAccessAnalysis/pointer-phis.ll
256

Why drop the check lines for the runtime checks?

343

Why drop the check lines for the runtime checks?

llvm/test/Transforms/LoopVectorize/X86/illegal-parallel-loop-uniform-write.ll
146

this comment seems out of date now?

llvm/test/Transforms/LoopVectorize/global_alias.ll
781

This seems like a regression, as here we have a case where the index depends on a load in the loop, right? Same for the change below.

llvm/test/Transforms/LoopVectorize/memory-dep-remarks.ll
274

why remove the test instead of updating the check?

367

Can this be kept?

Fixed remarks, rebased.

dtemirbulatov marked 4 inline comments as done.Jul 6 2022, 2:21 AM

No regression at llvm/test/Transforms/LoopVectorize/global_alias.ll looks with the update.

sdesmalen added inline comments.Jul 6 2022, 2:38 AM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
1876

I think the expression below is equivalent, because if Src is not an affine SCEVAddRecExpr, then StrideAPtr must be zero?

bool SrcAffine = !SrcInvariant && isa<SCEVAddRecExpr>(Src) && cast<SCEVAddRecExpr>(Src)->isAffine();
1886–1887

This expression doesn't allow the case where SrcAffine && SinkAffine && !SrcInvariant && !SinkInvariant.

AIUI the distance can only be computed if:

(SrcAffine && SinkAffine) ||
(SrcAffine && SinkInvariant) ||
(SrcInvariant && SinkAffine) ||
(SrcInvariant && SinkInvariant)

By the way, I couldn't see any new test-cases. Did you forget to add these, or did I miss something?

dtemirbulatov added inline comments.Jul 7 2022, 8:54 AM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
1886–1887

Hmm , For the case if both are invariants we don't need to emit a run-time check and I think we can ignore such cases. For both are affine type pointers I have not considered to support with the change and if I allow it I noticed performance regressions.

fhahn added inline comments.Jul 7 2022, 10:20 PM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
1886–1887

It would be good to have at least all those combinations covered in the unit tests. Do you have any insight in the cause for the performance regression. you are seeing?

Rebased, fixed remarks, added dedicated test.

dtemirbulatov marked 2 inline comments as done.Jul 13 2022, 5:43 AM
dtemirbulatov added inline comments.
llvm/lib/Analysis/LoopAccessAnalysis.cpp
1886–1887

Added dedicated test.

dtemirbulatov marked an inline comment as done.Jul 14 2022, 7:38 AM
dtemirbulatov added inline comments.
llvm/lib/Analysis/LoopAccessAnalysis.cpp
1886–1887

just extra check instructions with a small number of iterations.

Hi @dtemirbulatov thanks for adding some more tests! There is still an unaddressed question around the performance regression that would be good to get cleared up. I've also left a few comments on the new test.

llvm/lib/Analysis/LoopAccessAnalysis.cpp
1876

Unaddressed comment.

1886–1887

For both are affine type pointers I have not considered to support with the change and if I allow it I noticed performance regressions

Do you know why there are performance regressions?

llvm/test/Transforms/LoopVectorize/runtime-check-invariant-and-affine.ll
1 ↗(On Diff #444234)

This should probably be a LoopAccessAnalysis test instead, so please use print-access-info instead of loop-vectorize and move this test to the corresponding directory.

6 ↗(On Diff #444234)

nit: Can you please clean up things like dso_local and local_unnamed_addr, as they are not required for this test.

33 ↗(On Diff #444234)

Did you get this output from Clang or did you modify it manually? I'm asking because the inbounds doesn't look right to me because it is accessing beyond the object @g at @g + sizeof(%struct.f). Should this just be ptr getelementptr inbounds (%struct.f, ptr @g, i32 0) ?

Also, is the outer gep equivalent to:

getelementptr inbounds %struct.c, (ptr ...), i64 %indvars.iv

?

dtemirbulatov marked an inline comment as done.

Rebased, fixed remarks.

dtemirbulatov marked 4 inline comments as done.Aug 2 2022, 8:19 AM
dtemirbulatov added inline comments.
llvm/lib/Analysis/LoopAccessAnalysis.cpp
1886–1887

Sorry for missing that, Done.

dtemirbulatov marked an inline comment as done.

Rebased, enabled run-time checking for two side afffine type addresses since closer look at 471.omnetpp cpu2006 does not reveal any regression.

Rebased, enabled run-time checking for two side afffine type addresses since closer look at 471.omnetpp cpu2006 does not reveal any regression.

Great, thanks for checking. I'll try to do another perf check on my end over the next few days.

fhahn added a comment.Sep 5 2022, 12:20 AM

I've put up a patch to avoid unnecessary checks in case we can prove the dependence distance is larger than the trip count: D132703. This should help to avoid a few regressions due to additional runtime checks that may be generated with this change. I think with that we should look good from the regression side of things.

@fhahn, can we land this one while D132703 in review?

sdesmalen accepted this revision.Sep 8 2022, 3:35 AM

I haven't looked at D132703, but I'm happy with the changes in this patch if you can address the final comments before landing.

llvm/test/Analysis/LoopAccessAnalysis/runtime-check-invariant-and-affine.ll
53–54

Could you add CHECK lines for all the output, similar to what was done in e.g. llvm/test/Analysis/LoopAccessAnalysis/depend_diff_types.ll ?

102–103

nit: move this below the for.body

This revision is now accepted and ready to land.Sep 8 2022, 3:35 AM
fhahn added inline comments.Sep 9 2022, 12:23 AM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
1870

On main, there's now a SE variable that can be used instead of PSE.getSE().

1881

Could you add a comment here describing the logic here for future readers? It might be good to mention how the A[B[I] case is still avoided (we won't generate runtime checks because we won't be able to identify the bounds for the check).

llvm/test/Analysis/LoopAccessAnalysis/runtime-check-invariant-and-affine.ll
18

I might have missed it but could you also add a test case where one side is affine and the other is neither strided nor invariant?

20

please also check the runtime checks to make sure the correct tones are generated.

39

Is the more complicated condition needed here?

44

is this needed?

82

is this needed?

94

pass this as i64 argument to avoid the unnecessary load/zext later? (same above)

96

the check and conditional branch shouldn't be needed.

109

typed pointers are deprecated, could you update the test to use opaque pointers instead here?

llvm/test/Transforms/LoopDistribute/scev-inserted-runtime-check.ll
9

This test doesn't seem to be testing what it is supposed to any longer. I am not sure what changed and the new behavior isn't incorrect - we just don't version, but it would be good to know why and restore the original behavior for the test.

dtemirbulatov updated this revision to Diff 459626.EditedSep 12 2022, 7:23 PM
dtemirbulatov marked 13 inline comments as done.

Fixed remarks, rebased.

Removed two f() and f_with_offset() out of Transforms/LoopDistribute/scev-inserted-runtime-check.ll.

dtemirbulatov marked an inline comment as done.Sep 13 2022, 10:23 AM
dtemirbulatov added inline comments.
llvm/test/Analysis/LoopAccessAnalysis/runtime-check-invariant-and-affine.ll
96

it looks like we need this check

fhahn added inline comments.Sep 18 2022, 3:02 PM
llvm/test/Transforms/LoopDistribute/scev-inserted-runtime-check.ll
9

Did you manage to track down what changed here? Would it be possible to restore the original behavior of the test instead removing it?

dtemirbulatov retitled this revision from [LAA] Relax pointer dependency with runtime pointer checks to WIP: [LAA] Relax pointer dependency with runtime pointer checks .

Rebased.

dtemirbulatov added a comment.EditedSep 23 2022, 4:23 AM

The original version of this change didn't have this regression in Transforms/LoopDistribute/scev-inserted-runtime-check.ll @fhahn, how about only considering cases where one side is pointer invariant as in D132703?

fhahn added a comment.Oct 3 2022, 2:33 PM

The original version of this change didn't have this regression in Transforms/LoopDistribute/scev-inserted-runtime-check.ll @fhahn, how about only considering cases where one side is pointer invariant as in D132703?

If this avoids the regression I think that would be good to start with. We can always extend it in a follow up.

dtemirbulatov retitled this revision from WIP: [LAA] Relax pointer dependency with runtime pointer checks to [LAA] Relax pointer dependency with runtime pointer checks .Oct 12 2022, 6:05 PM
dtemirbulatov abandoned this revision.Oct 12 2022, 6:15 PM

Abandoning, there could be another way to solve the issue without introducing any run-time checks.