This is an archive of the discontinued LLVM Phabricator instance.

[SLP] fix miscompile on min/max reductions with extra uses (PR43948)
Closed, Public

Authored by spatel on Nov 12 2019, 2:48 PM.

Details

Summary

While working on the recent 2-way enhancements, I noticed that we could miscompile reductions.

The problem appears to be limited to cases where a min/max reduction has extra uses of the compare operand to the select.
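
For illustration, a hypothetical source-level shape that produces this pattern might look like the following (this example is mine, not taken from the test suite): the chain of selects forms a max reduction, and the compare feeding the final select has a use of its own.

// Hypothetical example (not from the test suite): a max reduction where the
// compare operand of the last select also has an extra use.
int max_with_extra_use(const int *a, bool *last_cmp) {
  int m = a[0] > a[1] ? a[0] : a[1];
  m = a[2] > m ? a[2] : m;
  bool c = a[3] > m;  // this compare feeds the final select...
  *last_cmp = c;      // ...and is also used here
  return c ? a[3] : m;
}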

I assume that the existing test in used-reduced-op.ll also shows a miscompile, but nobody noticed in such a large test?

Diff Detail

Event Timeline

spatel created this revision. Nov 12 2019, 2:48 PM
Herald added a project: Restricted Project. Nov 12 2019, 2:48 PM
spatel marked an inline comment as done. Nov 13 2019, 6:23 AM
spatel added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/reduction.ll
117–118

For reference (and I should improve the variable names or test comments): %t14 is the final step of what we recognize as the min/max reduction, and %EXTRA_USE is the compare part of that min/max op.

ABataev added inline comments. Nov 13 2019, 7:04 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6804–6813

I don't think you need the matchers here. You can rely on ReductionData.getKind() and then just do something like this:

switch (ReductionData.getKind()) {
case RK_Min:
case RK_Max:
case RK_UMin:
case RK_UMax:
  cast<SelectInst>(ReductionRoot)->getCondition()->replaceAllUsesWith(
      cast<SelectInst>(VectorizedTree)->getCondition());
  break;
case RK_Arithmetic:
  break;
}

And even better, create a new member function in OperationData that replaces all uses for the vectorized instruction like this code does, plus the final replacement.
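
Roughly, the member-function idea might look like this (a sketch only; the names and OperationData's interface are assumptions, and the cast of VectorizedTree presumes the vectorized code still ends in a select):

// Inside OperationData (sketch with assumed names; not the code that landed):
// fold the min/max-specific compare RAUW and the final scalar-root
// replacement into one member function.
void replaceVectorizedUses(Instruction *ScalarRoot, Value *VectorizedTree) const {
  if (Kind == RK_Min || Kind == RK_Max || Kind == RK_UMin || Kind == RK_UMax)
    cast<SelectInst>(ScalarRoot)->getCondition()->replaceAllUsesWith(
        cast<SelectInst>(VectorizedTree)->getCondition());
  // The "final replacement": redirect uses of the scalar reduction root to
  // the vectorized value.
  ScalarRoot->replaceAllUsesWith(VectorizedTree);
}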

spatel marked 2 inline comments as done. Nov 13 2019, 10:10 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6804–6813

This doesn't work because we don't know that the vectorized code ends in a select at this point (it might end in an extractelement of a vector instruction instead).

But I agree that we can simplify the logic a bit, so I added a helper:
rGe9bf7a60a036

I also looked at extending the OperationData class as suggested, but I don't see how to do that cleanly because the final element of the reduction (the thing that we want to substitute for the scalar op) isn't part of OperationData currently. Let's make that a follow-up cleanup step.
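
To illustrate the point above about the shape of the vectorized tree, a guarded lookup might look like this (names are assumed for illustration; this is not the helper from rGe9bf7a60a036):

// Illustration only: the vectorized tree may end in a select or in an
// extractelement, so an unconditional cast<SelectInst>(VectorizedTree) would
// assert; guard the lookup instead.
static CmpInst *getCmpIfSelectRoot(Value *TreeRoot) {
  if (auto *Sel = dyn_cast<SelectInst>(TreeRoot))
    return dyn_cast<CmpInst>(Sel->getCondition());
  return nullptr; // e.g. the tree ends in an extractelement instead
}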

spatel updated this revision to Diff 229133. Nov 13 2019, 10:12 AM
spatel marked an inline comment as done.

Patch updated:

  1. Reduced code by using the min/max helper (rGe9bf7a60a036).
  2. Tried to make the test clearer (rG142cbe73e9fe).
This revision is now accepted and ready to land. Nov 13 2019, 10:15 AM
This revision was automatically updated to reflect the committed changes.
spatel reopened this revision. Nov 18 2019, 3:38 PM

Reopening - reverted here:
rG6f1cc4151a5a

Looks like we need to adjust the IR insert point for the cmp (the whole reduction?) because we may create invalid IR otherwise ("Instruction does not dominate all uses!").

This revision is now accepted and ready to land. Nov 18 2019, 3:38 PM

For reference (this made it to the mailing list, but not Phab), the code example cited for the revert was likely already miscompiling:

// clang -c -O2 -msse4 repro.cc

using a = void (*)(const void *, long, int *, int *);
int b(int);
template <int, typename> void c(const void *, long, int *d, int *) {
  int a[8];
  int e[1];
  int f;
  for (int g = 0; g < 8; g += 2) {
    for (int h = 0; h < 5; ++h)
      a[3] += e;
    for (int h = 0; h < 3; ++h)
      f = b(f) * 2;
  }
  int i = *d = 0;
  for (int g = 0; g < 8; ++g)
    if (a[g] > i) {
      i = a[g];
      *d = g;
    }
}
a j;
void k() { j = c<8, char>; }

I reduced the IR using bugpoint and then cleaned it up manually a bit so that it still makes some sense (bugpoint introduces undefs that would eventually allow reducing the code to nothing):

define i1 @bad_insertpoint_rdx([8 x i32]* %p) #0 {
; CHECK-LABEL: @bad_insertpoint_rdx(
; CHECK-NEXT:    [[ARRAYIDX22:%.*]] = getelementptr inbounds [8 x i32], [8 x i32]* [[P:%.*]], i64 0, i64 0
; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i32* [[ARRAYIDX22]] to <2 x i32>*
; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i32>, <2 x i32>* [[TMP1]], align 16
; CHECK-NEXT:    [[SPEC_STORE_SELECT87:%.*]] = zext i1 undef to i32
; CHECK-NEXT:    [[RDX_SHUF:%.*]] = shufflevector <2 x i32> [[TMP2]], <2 x i32> undef, <2 x i32> <i32 1, i32 undef>
; CHECK-NEXT:    [[RDX_MINMAX_CMP:%.*]] = icmp sgt <2 x i32> [[TMP2]], [[RDX_SHUF]]
; CHECK-NEXT:    [[RDX_MINMAX_SELECT:%.*]] = select <2 x i1> [[RDX_MINMAX_CMP]], <2 x i32> [[TMP2]], <2 x i32> [[RDX_SHUF]]
; CHECK-NEXT:    [[TMP3:%.*]] = extractelement <2 x i32> [[RDX_MINMAX_SELECT]], i32 0
; CHECK-NEXT:    [[TMP4:%.*]] = icmp sgt i32 [[TMP3]], 0
; CHECK-NEXT:    [[OP_EXTRA:%.*]] = select i1 [[TMP4]], i32 [[TMP3]], i32 0
; CHECK-NEXT:    [[CMP23_2:%.*]] = icmp sgt i32 [[SPEC_STORE_SELECT87]], [[OP_EXTRA]]
; CHECK-NEXT:    ret i1 [[CMP23_2]]
;
  %arrayidx22 = getelementptr inbounds [8 x i32], [8 x i32]* %p, i64 0, i64 0
  %t0 = load i32, i32* %arrayidx22, align 16
  %cmp23 = icmp sgt i32 %t0, 0
  %spec.select = select i1 %cmp23, i32 %t0, i32 0
  %arrayidx22.1 = getelementptr inbounds [8 x i32], [8 x i32]* %p, i64 0, i64 1
  %t1 = load i32, i32* %arrayidx22.1, align 4
  %cmp23.1 = icmp sgt i32 %t1, %spec.select
  %spec.store.select87 = zext i1 %cmp23.1 to i32
  %spec.select88 = select i1 %cmp23.1, i32 %t1, i32 %spec.select
  %cmp23.2 = icmp sgt i32 %spec.store.select87, %spec.select88
  ret i1 %cmp23.2
}

The CHECK lines are based on trunk today (this RAUW patch is not in play). So this example miscompiles independently of this patch - see this line:

; CHECK-NEXT:    [[SPEC_STORE_SELECT87:%.*]] = zext i1 undef to i32

The result of the function is based on that value, so we can get the whole thing wrong depending on what we choose for 'undef'.

In case that's not clear because SLP leaves dead code everywhere - if you pass the output of SLP to instcombine:

define i1 @bad_insertpoint_rdx([8 x i32]* %p) #0 {
  ret i1 false
}

> Reopening - reverted here:
> rG6f1cc4151a5a
>
> Looks like we need to adjust the IR insert point for the cmp (the whole reduction?) because we may create invalid IR otherwise ("Instruction does not dominate all uses!").

Seems to me, we need to fix the code at line 6745: instead of Builder.SetInsertPoint(cast<Instruction>(ReductionRoot));, we need to set the insertion point to the CmpInst if ReductionRoot is a SelectInst.
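
A sketch of that suggestion (context and names are assumed; this is not necessarily how the re-land implements it):

// Around the existing SetInsertPoint call: if the reduction root is the
// select of a min/max, insert new instructions before its compare so that
// they dominate any extra uses of that compare.
Instruction *RdxRootInst = cast<Instruction>(ReductionRoot);
if (auto *Sel = dyn_cast<SelectInst>(RdxRootInst))
  if (auto *Cmp = dyn_cast<Instruction>(Sel->getCondition()))
    RdxRootInst = Cmp;
Builder.SetInsertPoint(RdxRootInst);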

This revision was automatically updated to reflect the committed changes.