Page MenuHomePhabricator

[loop-idiom] Hoist loop memcpys to loop preheader
ClosedPublic

Authored by zhuhan0 on Mar 1 2021, 1:21 AM.

Details

Summary

For a simple loop like:

struct S {
  int x;
  int y;
  char b;
};

unsigned foo(S* __restrict__ a, S* b, int n) {
  for (int i = 0; i < n; i++)
    a[i] = b[i];

  return sizeof(a[0]);
}

We could eliminate the loop and convert it to a large memcpy of 12*n bytes. Currently this is not handled. Output of opt -loop-idiom -S < memcpy_before.ll

%struct.S = type { i32, i32, i8 }

define dso_local i32 @_Z3fooP1SS0_i(%struct.S* noalias nocapture %a, %struct.S* nocapture readonly %b, i32 %n) local_unnamed_addr {
entry:
  %cmp7 = icmp sgt i32 %n, 0
  br i1 %cmp7, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:                               ; preds = %entry
  br label %for.body

for.cond.cleanup.loopexit:                        ; preds = %for.body
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  ret i32 12

for.body:                                         ; preds = %for.body, %for.body.preheader
  %i.08 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
  %idxprom = zext i32 %i.08 to i64
  %arrayidx = getelementptr inbounds %struct.S, %struct.S* %b, i64 %idxprom
  %arrayidx2 = getelementptr inbounds %struct.S, %struct.S* %a, i64 %idxprom
  %0 = bitcast %struct.S* %arrayidx2 to i8*
  %1 = bitcast %struct.S* %arrayidx to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 4 dereferenceable(12) %0, i8* nonnull align 4 dereferenceable(12) %1, i64 12, i1 false)
  %inc = add nuw nsw i32 %i.08, 1
  %cmp = icmp slt i32 %inc, %n
  br i1 %cmp, label %for.body, label %for.cond.cleanup.loopexit
}

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #0

attributes #0 = { argmemonly nofree nosync nounwind willreturn }

The loop idiom pass currently only handles load and store instructions. Since struct S is too big to fit in a register, the loop body contains a memcpy intrinsic.

With this change, re-run opt -loop-idiom -S < memcpy_before.ll. The loop memcpy is promoted to loop preheader. For this trivial case, the loop is dead and will be removed by another pass.

%struct.S = type { i32, i32, i8 }

define dso_local i32 @_Z3fooP1SS0_i(%struct.S* noalias nocapture %a, %struct.S* nocapture readonly %b, i32 %n) local_unnamed_addr {
entry:
  %a1 = bitcast %struct.S* %a to i8*
  %b2 = bitcast %struct.S* %b to i8*
  %cmp7 = icmp sgt i32 %n, 0
  br i1 %cmp7, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:                               ; preds = %entry
  %0 = zext i32 %n to i64
  %1 = mul nuw nsw i64 %0, 12
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %a1, i8* align 4 %b2, i64 %1, i1 false)
  br label %for.body

for.cond.cleanup.loopexit:                        ; preds = %for.body
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  ret i32 12

for.body:                                         ; preds = %for.body, %for.body.preheader
  %i.08 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
  %idxprom = zext i32 %i.08 to i64
  %arrayidx = getelementptr inbounds %struct.S, %struct.S* %b, i64 %idxprom
  %arrayidx2 = getelementptr inbounds %struct.S, %struct.S* %a, i64 %idxprom
  %2 = bitcast %struct.S* %arrayidx2 to i8*
  %3 = bitcast %struct.S* %arrayidx to i8*
  %inc = add nuw nsw i32 %i.08, 1
  %cmp = icmp slt i32 %inc, %n
  br i1 %cmp, label %for.body, label %for.cond.cleanup.loopexit
}

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #0

attributes #0 = { argmemonly nofree nosync nounwind willreturn }

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
foad added a subscriber: foad.Mar 23 2021, 9:23 AM

Typo in description: "perheader".

zhuhan0 retitled this revision from [loop-idiom] Hoist loop memcpys to loop perheader to [loop-idiom] Hoist loop memcpys to loop preheader.Mar 23 2021, 9:43 AM

Typo in description: "perheader".

Good catch. Corrected title.

This revision was landed with ongoing or failed builds.Mar 29 2021, 11:42 PM
This revision was automatically updated to reflect the committed changes.

Is @zino someone's replacement account?
I'm asking because the accept of this revision is the only contribution of that account.

This revision is now accepted and ready to land.Mar 30 2021, 2:49 AM
lebedev.ri requested changes to this revision.Mar 30 2021, 2:49 AM
This revision now requires changes to proceed.Mar 30 2021, 2:49 AM
hoy added a comment.Mar 30 2021, 9:20 AM

Is @zino someone's replacement account?
I'm asking because the accept of this revision is the only contribution of that account.

Zino is a real LLVM developer. @zino looks like your email is not verified.

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
818

No need of this continue?

1233

Nit: Use 'StringRef` to avoid extra mem allocation?

hoy edited reviewers, added: hoy; removed: hoyFB.Mar 30 2021, 9:20 AM

Is @zino someone's replacement account?
I'm asking because the accept of this revision is the only contribution of that account.

Hi Roman, Yes, this is replacement account for Zino (on PTO for the week), and I believe the original account was https://reviews.llvm.org/p/zinob/

Zino, Hongtao and myself did internal code review for this patch before it's sent up here - we should have pointed that out.

Sorry for the breakage though. Instruction or a reproducer would be helpful.

-Wenlei

In D97667#2658951, @hoy wrote:

Is @zino someone's replacement account?
I'm asking because the accept of this revision is the only contribution of that account.

Zino is a real LLVM developer. @zino looks like your email is not verified.

Not as per https://reviews.llvm.org/people/commits/21223/ / https://reviews.llvm.org/people/revisions/21223/, which is why i asked.
Docs aren't quite clear on this, but the subtext is that reviews only count
if the reviewee is actually familiar with the code (and that can be seen from git log),
and is an [upstream!] llvm dev.

Can you please

  1. precommit the tests
  2. split this up into whatever nfc preparatory patch and the actual memcpy patch?

@lebedev.ri @zino is the replacement account for @zinob. He had some issue with the old account and couldn't retrieve it.

  1. precommit the tests

Could you elaborate on this? What's the best way to run the failed buildbots build without committing this?

  1. split this up into whatever nfc preparatory patch and the actual memcpy patch?

Will do. Thanks.

zhuhan0 updated this revision to Diff 339321.Wed, Apr 21, 11:02 AM

Fix build break. The breakage was a situation where the memcpy source and destination were of different types/sizes. Abort the transformation if that's the case. Also added a test case memcpy-intrinsic-different-types.ll.

Also split the preparatory change away into an NFC patch https://reviews.llvm.org/D100979.

zhuhan0 updated this revision to Diff 339324.Wed, Apr 21, 11:09 AM

Address @hoy's comments.

This was already accepted and you fixed build break, I think you can try to reland it.

This was already accepted and you fixed build break, I think you can try to reland it.

Ah I didn't know that. Thanks! Still not very familiar with upstream etiquette.

I'm actually going on a trip soon and will land this stack when I return next week.

foad removed a subscriber: foad.Fri, Apr 23, 12:21 AM
This revision was not accepted when it landed; it landed in state Needs Review.Tue, Apr 27, 5:40 PM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.

To clarify, i was essentially asking to get an independent reviewer to re-review this,
looks like that didn't happen.

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
856

This is the first time i'm seeing Str short for Store. Please use it direcly.

868–870

This doesn't make sense. Strides of load and store must match exactly.
Doesn't this miscompile the case where load goes forward and store backward or vice verse?

lebedev.ri added inline comments.Wed, Apr 28, 2:46 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
868–870

Also, i would like to see better test coverage:

  1. negative stride
  2. load/store stride sign mismatch
tpopp added a subscriber: tpopp.Wed, Apr 28, 4:13 AM

lebedev.ri's concerns seem to have been valid, so I'll be rolling this back. A test case in XLA that reverses data across certain dimensions in a multidimensional change fails with this patch. A sequence of loads and stores is converted into a single memcpy even though the ordering should be different across loads and stores.

Un-optimized input:

; ModuleID = '__compute_module'
source_filename = "__compute_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

@0 = external dso_local unnamed_addr constant [96 x i8], align 16

; Function Attrs: uwtable
define void @Reverse4DFloatArrayOnDim01.3(i8* %retval, i8* noalias %run_options, i8** noalias %params, i8** noalias %buffer_table, i64* noalias %prof_counters) #0 {
entry:
  %reverse.2.invar_address.dim.3 = alloca i64, align 8
  %reverse.2.invar_address.dim.2 = alloca i64, align 8
  %reverse.2.invar_address.dim.1 = alloca i64, align 8
  %reverse.2.invar_address.dim.0 = alloca i64, align 8
  %0 = getelementptr inbounds i8*, i8** %buffer_table, i64 0
  %1 = load i8*, i8** %0, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  %reverse.2 = bitcast i8* %1 to [4 x [3 x [2 x [1 x float]]]]*
  store i64 0, i64* %reverse.2.invar_address.dim.0, align 8
  br label %reverse.2.loop_header.dim.0

reverse.2.loop_header.dim.0:                      ; preds = %reverse.2.loop_exit.dim.1, %entry
  %reverse.2.indvar.dim.0 = load i64, i64* %reverse.2.invar_address.dim.0, align 8
  %2 = icmp uge i64 %reverse.2.indvar.dim.0, 4
  br i1 %2, label %reverse.2.loop_exit.dim.0, label %reverse.2.loop_body.dim.0

reverse.2.loop_body.dim.0:                        ; preds = %reverse.2.loop_header.dim.0
  store i64 0, i64* %reverse.2.invar_address.dim.1, align 8
  br label %reverse.2.loop_header.dim.1

reverse.2.loop_header.dim.1:                      ; preds = %reverse.2.loop_exit.dim.2, %reverse.2.loop_body.dim.0
  %reverse.2.indvar.dim.1 = load i64, i64* %reverse.2.invar_address.dim.1, align 8
  %3 = icmp uge i64 %reverse.2.indvar.dim.1, 3
  br i1 %3, label %reverse.2.loop_exit.dim.1, label %reverse.2.loop_body.dim.1

reverse.2.loop_body.dim.1:                        ; preds = %reverse.2.loop_header.dim.1
  store i64 0, i64* %reverse.2.invar_address.dim.2, align 8
  br label %reverse.2.loop_header.dim.2

reverse.2.loop_header.dim.2:                      ; preds = %reverse.2.loop_exit.dim.3, %reverse.2.loop_body.dim.1
  %reverse.2.indvar.dim.2 = load i64, i64* %reverse.2.invar_address.dim.2, align 8
  %4 = icmp uge i64 %reverse.2.indvar.dim.2, 2
  br i1 %4, label %reverse.2.loop_exit.dim.2, label %reverse.2.loop_body.dim.2

reverse.2.loop_body.dim.2:                        ; preds = %reverse.2.loop_header.dim.2
  store i64 0, i64* %reverse.2.invar_address.dim.3, align 8
  br label %reverse.2.loop_header.dim.3

reverse.2.loop_header.dim.3:                      ; preds = %reverse.2.loop_body.dim.3, %reverse.2.loop_body.dim.2
  %reverse.2.indvar.dim.3 = load i64, i64* %reverse.2.invar_address.dim.3, align 8
  %5 = icmp uge i64 %reverse.2.indvar.dim.3, 1
  br i1 %5, label %reverse.2.loop_exit.dim.3, label %reverse.2.loop_body.dim.3

reverse.2.loop_body.dim.3:                        ; preds = %reverse.2.loop_header.dim.3
  %6 = sub i64 3, %reverse.2.indvar.dim.0
  %7 = sub i64 2, %reverse.2.indvar.dim.1
  %8 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* bitcast ([96 x i8]* @0 to [4 x [3 x [2 x [1 x float]]]]*), i64 0, i64 %6, i64 %7, i64 %reverse.2.indvar.dim.2, i64 0
  %9 = load float, float* %8, align 4, !alias.scope !3, !noalias !6
  %10 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %reverse.2, i64 0, i64 %reverse.2.indvar.dim.0, i64 %reverse.2.indvar.dim.1, i64 %reverse.2.indvar.dim.2, i64 0
  store float %9, float* %10, align 4, !alias.scope !6, !noalias !3
  %invar.inc3 = add nuw nsw i64 %reverse.2.indvar.dim.3, 1
  store i64 %invar.inc3, i64* %reverse.2.invar_address.dim.3, align 8
  br label %reverse.2.loop_header.dim.3

reverse.2.loop_exit.dim.3:                        ; preds = %reverse.2.loop_header.dim.3
  %invar.inc2 = add nuw nsw i64 %reverse.2.indvar.dim.2, 1
  store i64 %invar.inc2, i64* %reverse.2.invar_address.dim.2, align 8
  br label %reverse.2.loop_header.dim.2

reverse.2.loop_exit.dim.2:                        ; preds = %reverse.2.loop_header.dim.2
  %invar.inc1 = add nuw nsw i64 %reverse.2.indvar.dim.1, 1
  store i64 %invar.inc1, i64* %reverse.2.invar_address.dim.1, align 8
  br label %reverse.2.loop_header.dim.1

reverse.2.loop_exit.dim.1:                        ; preds = %reverse.2.loop_header.dim.1
  %invar.inc = add nuw nsw i64 %reverse.2.indvar.dim.0, 1
  store i64 %invar.inc, i64* %reverse.2.invar_address.dim.0, align 8
  br label %reverse.2.loop_header.dim.0

reverse.2.loop_exit.dim.0:                        ; preds = %reverse.2.loop_header.dim.0
  ret void
}

attributes #0 = { uwtable "denormal-fp-math"="preserve-sign" "no-frame-pointer-elim"="false" }

!0 = !{}
!1 = !{i64 96}
!2 = !{i64 16}
!3 = !{!4}
!4 = !{!"buffer: {index:1, offset:0, size:96}", !5}
!5 = !{!"XLA global AA domain"}
!6 = !{!7}
!7 = !{!"buffer: {index:0, offset:0, size:96}", !5}

Before this patch:

; ModuleID = '__compute_module'
source_filename = "__compute_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

; Function Attrs: nofree norecurse nosync nounwind uwtable
define void @Reverse4DFloatArrayOnDim01.3(i8* nocapture readnone %retval, i8* noalias nocapture readnone %run_options, i8** noalias nocapture readnone %params, i8** noalias nocapture readonly %buffer_table, i64* noalias nocapture readnone %prof_counters) local_unnamed_addr #0 {
entry:
  %0 = bitcast i8** %buffer_table to [4 x [3 x [2 x [1 x float]]]]**
  %1 = load [4 x [3 x [2 x [1 x float]]]]*, [4 x [3 x [2 x [1 x float]]]]** %0, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  %2 = bitcast [4 x [3 x [2 x [1 x float]]]]* %1 to <4 x i64>*
  store <4 x i64> <i64 4737786809096339456, i64 4733283209467920384, i64 4728779609839501312, i64 4724276010211082240>, <4 x i64>* %2, align 16
  %scevgep.1.1 = getelementptr [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 1, i64 1, i64 0, i64 0
  %3 = bitcast float* %scevgep.1.1 to <4 x i64>*
  store <4 x i64> <i64 4719772410582138880, i64 4710765211325300736, i64 4701758012068462592, i64 4692750812811624448>, <4 x i64>* %3, align 16
  %scevgep.2.2 = getelementptr [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 2, i64 2, i64 0, i64 0
  %4 = bitcast float* %scevgep.2.2 to <4 x i64>*
  store <4 x i64> <i64 4683743613553737728, i64 4665729215040061440, i64 4647714816524288000, i64 4611686019492741120>, <4 x i64>* %4, align 16
  ret void
}

attributes #0 = { nofree norecurse nosync nounwind uwtable "denormal-fp-math"="preserve-sign" "no-frame-pointer-elim"="false" }

!0 = !{}
!1 = !{i64 96}
!2 = !{i64 16}

After this patch:

; ModuleID = '__compute_module'
source_filename = "__compute_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

@0 = private unnamed_addr constant [96 x i8] c"\00\00\80?\00\00\00@\00\00@@\00\00\80@\00\00\A0@\00\00\C0@\00\00\E0@\00\00\00A\00\00\10A\00\00 A\00\000A\00\00@A\00\00PA\00\00`A\00\00pA\00\00\80A\00\00\88A\00\00\90A\00\00\98A\00\00\A0A\00\00\A8A\00\00\B0A\00\00\B8A\00\00\C0A", align 16

; Function Attrs: nofree norecurse nosync nounwind uwtable
define void @Reverse4DFloatArrayOnDim01.3(i8* nocapture readnone %retval, i8* noalias nocapture readnone %run_options, i8** noalias nocapture readnone %params, i8** noalias nocapture readonly %buffer_table, i64* noalias nocapture readnone %prof_counters) local_unnamed_addr #0 {
entry:
  %0 = load i8*, i8** %buffer_table, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 16 dereferenceable(96) %0, i8* noundef nonnull align 8 dereferenceable(96) getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 88), i64 96, i1 false)
  ret void
}

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #1

attributes #0 = { nofree norecurse nosync nounwind uwtable "denormal-fp-math"="preserve-sign" "no-frame-pointer-elim"="false" }
attributes #1 = { argmemonly nofree nosync nounwind willreturn }

!0 = !{}
!1 = !{i64 96}
!2 = !{i64 16}
lebedev.ri reopened this revision.Wed, Apr 28, 4:14 AM
lebedev.ri requested changes to this revision.

Thank you.

@zhuhan0 please do get someone else familiar with the code to re-review this before relanding.
Thank you!

This revision now requires changes to proceed.Wed, Apr 28, 4:15 AM

@tpopp I cannot reproduce your test failure with opt -O2 and -O3. My patch only affects memcpy intrinsics in the loop body. Therefore running your test case shouldn't hit my code. Output of opt -O3:

; ModuleID = 'reverse_4d_float_array.ll'
source_filename = "__compute_module"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-grtev4-linux-gnu"

@0 = external dso_local unnamed_addr constant [96 x i8], align 16

; Function Attrs: nofree norecurse nosync nounwind uwtable
define void @Reverse4DFloatArrayOnDim01.3(i8* nocapture readnone %retval, i8* noalias nocapture readnone %run_options, i8** noalias nocapture readnone %params, i8** noalias nocapture readonly %buffer_table, i64* noalias nocapture readnone %prof_counters) local_unnamed_addr #0 {
entry:
  %0 = bitcast i8** %buffer_table to [4 x [3 x [2 x [1 x float]]]]**
  %1 = load [4 x [3 x [2 x [1 x float]]]]*, [4 x [3 x [2 x [1 x float]]]]** %0, align 8, !invariant.load !0, !dereferenceable !1, !align !2
  %2 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 88) to float*), align 8, !alias.scope !3, !noalias !6
  %3 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 0, i64 0, i64 0, i64 0
  store float %2, float* %3, align 16, !alias.scope !6, !noalias !3
  %4 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 92) to float*), align 4, !alias.scope !3, !noalias !6
  %5 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 0, i64 0, i64 1, i64 0
  store float %4, float* %5, align 4, !alias.scope !6, !noalias !3
  %6 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 80) to float*), align 16, !alias.scope !3, !noalias !6
  %7 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 0, i64 1, i64 0, i64 0
  store float %6, float* %7, align 8, !alias.scope !6, !noalias !3
  %8 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 84) to float*), align 4, !alias.scope !3, !noalias !6
  %9 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 0, i64 1, i64 1, i64 0
  store float %8, float* %9, align 4, !alias.scope !6, !noalias !3
  %10 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 72) to float*), align 8, !alias.scope !3, !noalias !6
  %11 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 0, i64 2, i64 0, i64 0
  store float %10, float* %11, align 16, !alias.scope !6, !noalias !3
  %12 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 76) to float*), align 4, !alias.scope !3, !noalias !6
  %13 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 0, i64 2, i64 1, i64 0
  store float %12, float* %13, align 4, !alias.scope !6, !noalias !3
  %14 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 64) to float*), align 16, !alias.scope !3, !noalias !6
  %15 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 1, i64 0, i64 0, i64 0
  store float %14, float* %15, align 8, !alias.scope !6, !noalias !3
  %16 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 68) to float*), align 4, !alias.scope !3, !noalias !6
  %17 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 1, i64 0, i64 1, i64 0
  store float %16, float* %17, align 4, !alias.scope !6, !noalias !3
  %18 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 56) to float*), align 8, !alias.scope !3, !noalias !6
  %19 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 1, i64 1, i64 0, i64 0
  store float %18, float* %19, align 16, !alias.scope !6, !noalias !3
  %20 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 60) to float*), align 4, !alias.scope !3, !noalias !6
  %21 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 1, i64 1, i64 1, i64 0
  store float %20, float* %21, align 4, !alias.scope !6, !noalias !3
  %22 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 48) to float*), align 16, !alias.scope !3, !noalias !6
  %23 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 1, i64 2, i64 0, i64 0
  store float %22, float* %23, align 8, !alias.scope !6, !noalias !3
  %24 = load float, float* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 52) to float*), align 4, !alias.scope !3, !noalias !6
  %25 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 1, i64 2, i64 1, i64 0
  store float %24, float* %25, align 4, !alias.scope !6, !noalias !3
  %26 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 2, i64 0, i64 0, i64 0
  %27 = load <4 x float>, <4 x float>* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 32) to <4 x float>*), align 16, !alias.scope !3, !noalias !6
  %shuffle = shufflevector <4 x float> %27, <4 x float> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %28 = bitcast float* %26 to <4 x float>*
  store <4 x float> %shuffle, <4 x float>* %28, align 16, !alias.scope !6, !noalias !3
  %29 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 2, i64 2, i64 0, i64 0
  %30 = load <4 x float>, <4 x float>* bitcast (i8* getelementptr inbounds ([96 x i8], [96 x i8]* @0, i64 0, i64 16) to <4 x float>*), align 16, !alias.scope !3, !noalias !6
  %shuffle7 = shufflevector <4 x float> %30, <4 x float> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %31 = bitcast float* %29 to <4 x float>*
  store <4 x float> %shuffle7, <4 x float>* %31, align 16, !alias.scope !6, !noalias !3
  %32 = getelementptr inbounds [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 3, i64 1, i64 0, i64 0
  %33 = load <4 x float>, <4 x float>* bitcast ([96 x i8]* @0 to <4 x float>*), align 16, !alias.scope !3, !noalias !6
  %shuffle8 = shufflevector <4 x float> %33, <4 x float> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %34 = bitcast float* %32 to <4 x float>*
  store <4 x float> %shuffle8, <4 x float>* %34, align 16, !alias.scope !6, !noalias !3
  ret void
}

attributes #0 = { nofree norecurse nosync nounwind uwtable "denormal-fp-math"="preserve-sign" "no-frame-pointer-elim"="false" }

!0 = !{}
!1 = !{i64 96}
!2 = !{i64 16}
!3 = !{!4}
!4 = !{!"buffer: {index:1, offset:0, size:96}", !5}
!5 = !{!"XLA global AA domain"}
!6 = !{!7}
!7 = !{!"buffer: {index:0, offset:0, size:96}", !5}

Do you have different compiler args to hit this test failure? Or is this not even an llvm test case?

zhuhan0 added inline comments.Wed, Apr 28, 12:16 PM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
868–870

This doesn't make sense. Strides of load and store must match exactly.
Doesn't this miscompile the case where load goes forward and store backward or vice verse?

The strides of load and store can be different if the memcpy source and destination are of different types. See my test memcpy-intrinsic-different-types.ll, which was derived from the build failure we saw earlier. I printed out LoadIntStride and StoreIntStride:

LoadIntStride: 32
StoreIntStride: 12

Line 869-870 check for this case and fix the build failure.

868–870

Also, i would like to see better test coverage:

  1. negative stride
  2. load/store stride sign mismatch

This is reasonable. I'll add those.

lebedev.ri added inline comments.Wed, Apr 28, 12:46 PM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
868–870

I believe, you have answered a question different from the one i asked.

zhuhan0 updated this revision to Diff 341336.Wed, Apr 28, 3:45 PM

Rename variables, fix stride check, add two tests and one more remark.

zhuhan0 marked 4 inline comments as done.Wed, Apr 28, 3:47 PM
zhuhan0 added inline comments.
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
868–870

You're right. That was a silly error. Fixed.

tpopp added a comment.Thu, Apr 29, 1:23 AM
; ModuleID = '__compute_module' 
source_filename = "__compute_module"                                                                                               
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"                                       
target triple = "x86_64-grtev4-linux-gnu"                                                                                          
                                                                                                                                   
@0 = private unnamed_addr constant [96 x i8] c"\00\00\80?\00\00\00@\00\00@@\00\00\80@\00\00\A0@\00\00\C0@\00\00\E0@\00\00\00A\00\00
\10A\00\00 A\00\000A\00\00@A\00\00PA\00\00`A\00\00pA\00\00\80A\00\00\88A\00\00\90A\00\00\98A\00\00\A0A\00\00\A8A\00\00\B0A\00\00\B8
A\00\00\C0A", align 16                                                                                                             
                                                                                                                                   
; Function Attrs: nofree norecurse nosync nounwind uwtable                                                                         
define void @Reverse4DFloatArrayOnDim01.3(i8* nocapture readnone %retval, i8* noalias nocapture readnone %run_options, i8** noalias
 nocapture readnone %params, i8** noalias nocapture readonly %buffer_table, i64* noalias nocapture readnone %prof_counters) local_u
nnamed_addr #0 {                                                                                                                   
entry:                                                                                                                             
  %0 = bitcast i8** %buffer_table to [4 x [3 x [2 x [1 x float]]]]**
  %1 = load [4 x [3 x [2 x [1 x float]]]]*, [4 x [3 x [2 x [1 x float]]]]** %0, align 8, !invariant.load !0, !dereferenceable !1, !
align !2
  br label %reverse.2.loop_header.dim.1.preheader

reverse.2.loop_header.dim.1.preheader:            ; preds = %entry, %reverse.2.loop_exit.dim.1
  %reverse.2.invar_address.dim.0.06 = phi i64 [ 0, %entry ], [ %invar.inc, %reverse.2.loop_exit.dim.1 ]
  %2 = mul nsw i64 %reverse.2.invar_address.dim.0.06, -24
  %3 = add i64 %2, 88
  br label %reverse.2.loop_header.dim.2.preheader

reverse.2.loop_header.dim.2.preheader:            ; preds = %reverse.2.loop_header.dim.1.preheader, %reverse.2.loop_exit.dim
  %reverse.2.invar_address.dim.1.05 = phi i64 [ 0, %reverse.2.loop_header.dim.1.preheader ], [ %invar.inc1, %reverse.2.loop_exit.di
m.2 ]
  %scevgep = getelementptr [4 x [3 x [2 x [1 x float]]]], [4 x [3 x [2 x [1 x float]]]]* %1, i64 0, i64 %reverse.2.invar_address.di
m.0.06, i64 %reverse.2.invar_address.dim.1.05, i64 0, i64 0
  %scevgep7 = bitcast float* %scevgep to i8*
  %4 = mul nsw i64 %reverse.2.invar_address.dim.1.05, -8
  %5 = add i64 %3, %4
  %scevgep8 = getelementptr [96 x i8], [96 x i8]* @0, i64 0, i64 %5
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %scevgep7, i8* align 4 %scevgep8, i64 8, i1 false)
  br label %reverse.2.loop_exit.dim.2

reverse.2.loop_exit.dim.2:                        ; preds = %reverse.2.loop_header.dim.2.preheader
  %invar.inc1 = add nuw nsw i64 %reverse.2.invar_address.dim.1.05, 1
  %6 = icmp ugt i64 %reverse.2.invar_address.dim.1.05, 1
  br i1 %6, label %reverse.2.loop_exit.dim.1, label %reverse.2.loop_header.dim.2.preheader

reverse.2.loop_exit.dim.1:                        ; preds = %reverse.2.loop_exit.dim.2
  %invar.inc = add nuw nsw i64 %reverse.2.invar_address.dim.0.06, 1
  %7 = icmp ugt i64 %reverse.2.invar_address.dim.0.06, 2
  br i1 %7, label %reverse.2.loop_exit.dim.0, label %reverse.2.loop_header.dim.1.preheader

reverse.2.loop_exit.dim.0:                        ; preds = %reverse.2.loop_exit.dim.1
  ret void
}

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly %0, i8* noalias nocapture readonly %1, i64 %2, i1 immarg %3
) #1

attributes #0 = { nofree norecurse nosync nounwind uwtable "denormal-fp-math"="preserve-sign" "no-frame-pointer-elim"="false" }
attributes #1 = { argmemonly nofree nosync nounwind willreturn }

!0 = !{}
!1 = !{i64 96}
!2 = !{i64 16}

opt -loop-idiom <%s -S

This shows the first time that this code is run and different IR is generated before and after. It then diverges further on a subsequent execution (where before an after have different inputs now). I am trying to find how to share a full opt command rather than sharing different snippets for the before/after inputs. I hope this first IR helps though

tpopp added a comment.Thu, Apr 29, 1:37 AM

Describing what the code is intended to do (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/tests/reverse_test.cc#L146).

A 4d array is taking in reversing elements across the 0th and 1st dimensions, so for every value previously indexed at [A,B,C,D] in an array of size [W,X,Y,Z], the new index of the value is [W-A-1, X-B-1, C, D].

The original code indexes into proper locations for the first 2 dimensions, and then copies the subdata, while this change results in a single copy after indexing only in dimension 0, which cannot be done as the data in dimension 1 cannot be copied due to the reversal.

dstenb added a subscriber: dstenb.Thu, Apr 29, 7:38 AM

With this patch we got the following assertion:

bool llvm::APInt::operator==(const llvm::APInt &) const: Assertion `BitWidth == RHS.BitWidth && "Comparison requires equal bit widths"' failed.

in LoopIdiomRecognize::processLoopMemCpy() at the following comparison:

// Check if the load stride matches the store stride.
if (StrIntStride != LoadIntStride && StrIntStride != -LoadIntStride)
  return false;

for a memcpy done between two address spaces with different pointer sizes.

I don't have a upstream reproducer ready for this, but I'll see if I can create one.

zhuhan0 marked an inline comment as done.Fri, Apr 30, 3:26 PM

Describing what the code is intended to do (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/tests/reverse_test.cc#L146).

A 4d array is taking in reversing elements across the 0th and 1st dimensions, so for every value previously indexed at [A,B,C,D] in an array of size [W,X,Y,Z], the new index of the value is [W-A-1, X-B-1, C, D].

The original code indexes into proper locations for the first 2 dimensions, and then copies the subdata, while this change results in a single copy after indexing only in dimension 0, which cannot be done as the data in dimension 1 cannot be copied due to the reversal.

Makes sense. Thanks for the IR and explanation.

@lebedev.ri pointed out the issue earlier and I fixed it in the newest version. I ran opt -S -loop-diom on the IR you provided and it does not hoist the memcpy anymore.

lebedev.ri added inline comments.Fri, Apr 30, 3:32 PM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
865

I don't think this should handle the case of different directions.

zhuhan0 updated this revision to Diff 342083.Fri, Apr 30, 4:41 PM

Compare LoadStride and StoreStride after sign extension.

With this patch we got the following assertion:

bool llvm::APInt::operator==(const llvm::APInt &) const: Assertion `BitWidth == RHS.BitWidth && "Comparison requires equal bit widths"' failed.

in LoopIdiomRecognize::processLoopMemCpy() at the following comparison:

// Check if the load stride matches the store stride.
if (StrIntStride != LoadIntStride && StrIntStride != -LoadIntStride)
  return false;

for a memcpy done between two address spaces with different pointer sizes.

I don't have a upstream reproducer ready for this, but I'll see if I can create one.

The new version should fix the assertion I think. Let me know if you still see it.

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
865

The different directions case is handled below in

// Check if the load stride matches the store stride.
if (StoreStrideInt != LoadStrideInt)
  return false;

Or do you refer to something else?

dstenb added a comment.Mon, May 3, 7:50 AM

With this patch we got the following assertion:

bool llvm::APInt::operator==(const llvm::APInt &) const: Assertion `BitWidth == RHS.BitWidth && "Comparison requires equal bit widths"' failed.

in LoopIdiomRecognize::processLoopMemCpy() at the following comparison:

// Check if the load stride matches the store stride.
if (StrIntStride != LoadIntStride && StrIntStride != -LoadIntStride)
  return false;

for a memcpy done between two address spaces with different pointer sizes.

I don't have a upstream reproducer ready for this, but I'll see if I can create one.

The new version should fix the assertion I think. Let me know if you still see it.

Yes, thanks! That fixed it. I did not manage to create an upstream reproducer from the downstream one unfortunately.

lebedev.ri added inline comments.Mon, May 3, 7:57 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
869

I suspect we will still see crashes here.
Since you use getSExtValue() later, just move that to before this, and compare ints?

zhuhan0 added inline comments.Mon, May 3, 11:16 PM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
869

I tried that earlier and there was a warning when compiling this code:

[1284/2576] Building CXX object lib/Transforms/Scalar/CMakeFiles/LLVMScalarOpts.dir/LoopIdiomRecognize.cpp.o
/data/users/zhuhan/server-llvm/llvm-project/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp:871:19: warning: comparison of integers of different signs: 'uint64_t' (aka 'unsigned long') and 'int64_t' (aka 'long') [-Wsign-compare]
  if (SizeInBytes != StoreStrideInt && SizeInBytes != -StoreStrideInt) {
      ~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
/data/users/zhuhan/server-llvm/llvm-project/llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp:871:52: warning: comparison of integers of different signs: 'uint64_t' (aka 'unsigned long') and 'int64_t' (aka 'long') [-Wsign-compare]
  if (SizeInBytes != StoreStrideInt && SizeInBytes != -StoreStrideInt) {
                                       ~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~~
2 warnings generated.

And I think this code should be correct because the != operator of APInt is properly overloaded:

inline bool operator!=(uint64_t V1, const APInt &V2) { return V2 != V1; }

and

bool operator!=(uint64_t Val) const { return !((*this) == Val); }

and finally

bool operator==(uint64_t Val) const {
  return (isSingleWord() || getActiveBits() <= 64) && getZExtValue() == Val;
}
lebedev.ri accepted this revision.Tue, May 4, 3:17 AM

Alright.
Please ensure that you have added all the necessary tests.
It looks about reasonable to me now, i think you can try re-landing it.

This revision is now accepted and ready to land.Tue, May 4, 3:17 AM
This revision was landed with ongoing or failed builds.Tue, May 4, 5:07 PM
This revision was automatically updated to reflect the committed changes.

Alright.
Please ensure that you have added all the necessary tests.
It looks about reasonable to me now, i think you can try re-landing it.

Thanks! Let me know if there's any issue.