This is an archive of the discontinued LLVM Phabricator instance.

Differential D141842

[LoopVectorize] Enable integer Mul and Add as select reduction patterns
Closed · Public

Authored by MattDevereau on Jan 16 2023, 5:16 AM.

Details

Reviewers

sdesmalen
peterwaller-arm
rengolin
fhahn
uabelho

Commits

rGf90103851f9a: [LoopVectorize] Enable integer Mul and Add as select reduction patterns

Summary

[LoopVectorize] Enable integer Mul and Add as select reduction patterns

This patch vectorizes phi-node loop reductions that go through a select whose
condition comes from a floating-point comparison and whose operands are
integers, for Add, Sub, and Mul reductions.

Example:

int foo(float *x, int n) {
    int sum = 0;
    for (int i=0; i<n; ++i) {
        float elem = x[i];
        if (elem > 0) {
            sum += 2;
        }
    }
    return sum;
}

This would previously fail to vectorize due to the integer reduction.
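
As a rough illustration of what the vectorizer is expected to do with this loop, the sketch below rewrites the conditional update as a select and then emulates a 4-wide reduction by hand. The names foo_select and foo_vf4 and the fixed width of 4 are assumptions made for this example only; the real transform operates on LLVM IR, not on C++ source.

```cpp
#include <array>
#include <cassert>

// Scalar loop after if-conversion: the conditional add becomes a select
// between the updated sum and the unchanged sum.
int foo_select(const float *x, int n) {
    assert(n >= 0 && "element count must be non-negative");
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum = (x[i] > 0) ? sum + 2 : sum;
    return sum;
}

// Hand-written emulation of a 4-lane vectorized reduction (VF = 4 is an
// assumption for this sketch). Each lane keeps its own partial sum, and the
// lanes are added together once after the vector loop.
int foo_vf4(const float *x, int n) {
    assert(n >= 0 && "element count must be non-negative");
    std::array<int, 4> lanes = {0, 0, 0, 0};
    int i = 0;
    for (; i + 4 <= n; i += 4)
        for (int l = 0; l < 4; ++l)
            lanes[l] = (x[i + l] > 0) ? lanes[l] + 2 : lanes[l];
    int sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    // Scalar remainder loop for the leftover elements.
    for (; i < n; ++i)
        sum = (x[i] > 0) ? sum + 2 : sum;
    return sum;
}
```

Because integer addition can be reordered freely (with wrap-around semantics), the per-lane partial sums combine to the same result as the scalar loop, which is why no fast-math-style flag is needed for the integer case.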

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

MattDevereau created this revision. (Jan 16 2023, 5:16 AM)

Herald added a project: Restricted Project. (Jan 16 2023, 5:16 AM)

Herald added subscribers: shiva0217, hiraditya.

MattDevereau requested review of this revision. (Jan 16 2023, 5:16 AM)

Herald added a project: Restricted Project. (Jan 16 2023, 5:16 AM)

Herald added subscribers: llvm-commits, pcwang-thead.

MattDevereau added a reviewer: peterwaller-arm. (Jan 16 2023, 5:16 AM)

MattDevereau added inline comments. (Jan 16 2023, 5:21 AM)

llvm/test/Transforms/LoopVectorize/if-reduction.ll

826

Unfortunately, integer flags aren't being propagated here. From a quick look around, the issue appears non-trivial: fast-math flags are propagated for the floating-point case, but only with a disclaimer. In RecurrenceDescriptor::AddReductionVar, just after the point where the changes to RecurrenceDescriptor::isConditionalRdxPattern were made:

// FIXME: FMF is allowed on phi, but propagation is not handled correctly.
if (isa<FPMathOperator>(ReduxDesc.getPatternInst()) && !IsAPhi) {
  FastMathFlags CurFMF = ReduxDesc.getPatternInst()->getFastMathFlags();
  if (auto *Sel = dyn_cast<SelectInst>(ReduxDesc.getPatternInst())) {
    // Accept FMF on either fcmp or select of a min/max idiom.
    // TODO: This is a hack to work-around the fact that FMF may not be
    //       assigned/propagated correctly. If that problem is fixed or we
    //       standardize on fmin/fmax via intrinsics, this can be removed.

After a look around for methods of propagating the IR flags I'm not quite sure how to proceed.

Harbormaster completed remote builds in B208024: Diff 489512. (Jan 16 2023, 6:01 AM)

sdesmalen added reviewers: rengolin, fhahn. (Jan 17 2023, 3:37 AM)

Herald added a subscriber: StephenFan. (Jan 17 2023, 3:37 AM)

Seems like a sensible change to me.

nit: Could you add a motivating (C/C++) example to the commit message?

llvm/test/Transforms/LoopVectorize/if-reduction.ll
826	I wouldn't be too worried about this, it seems the nsw/nuw flags aren't propagated for other reductions either.
842	nit: I guess it also works if you remove this `fast` comparison right?
858	What is the difference between this function and `@fcmp_0_add_select1` ?

MattDevereau updated this revision to Diff 489765. (Jan 17 2023, 4:12 AM)

MattDevereau marked 2 inline comments as done.

MattDevereau edited the summary of this revision.

MattDevereau added inline comments.

llvm/test/Transforms/LoopVectorize/if-reduction.ll
826	Very well. In the C/C++ example I've added, the flags aren't propagated either when 1 is used as an immediate and we don't go through the `select` route.
842	You're correct, yes. I'll go ahead and remove it from the tests I've added.
858	The data type being reduced is i64 in this function, whereas it's i32 in `@fcmp_0_add_select1`. These tests are integer clones of `@fcmp_0_fadd_select1` and `@fcmp_0_fadd_select2` above in this file, except that these integer variants add an immediate instead of loaded data.

MattDevereau edited the summary of this revision. (Jan 17 2023, 4:50 AM)

MattDevereau edited the summary of this revision. (Jan 17 2023, 5:36 AM)

Harbormaster completed remote builds in B208210: Diff 489765. (Jan 17 2023, 6:41 AM)

georges added a subscriber: georges. (Jan 19 2023, 1:19 PM)

sdesmalen accepted this revision. (Jan 23 2023, 8:57 AM)

This revision is now accepted and ready to land. (Jan 23 2023, 8:57 AM)

This revision was landed with ongoing or failed builds. (Jan 25 2023, 5:25 AM)

Closed by commit rGf90103851f9a: [LoopVectorize] Enable integer Mul and Add as select reduction patterns (authored by MattDevereau).

This revision was automatically updated to reflect the committed changes.

MattDevereau added a commit: rGf90103851f9a: [LoopVectorize] Enable integer Mul and Add as select reduction patterns.

uabelho added a subscriber: uabelho. (Jan 25 2023, 10:36 PM)

Hi,

A heads up that I see a miscompile that I bisected back to this patch. I don't have a reproducer I can share yet but I'm working on it.

In D141842#4081974, @uabelho wrote:

Hi,

A heads up that I see a miscompile that I bisected back to this patch. I don't have a reproducer I can share yet but I'm working on it.

I think the miscompile is exposed by this example:

opt -passes="loop-vectorize" bbi-78206.ll -S -o - -force-vector-width=4

(I'm just using -force-vector-width=4 since VF 4 is what I got for my out of tree target when I saw the miscompile. It's probably not required.)

The input function does a backward search through @table, and each time it finds an element larger than the input parameter @val, it remembers that element's index minus 1.
Finally, it returns the last remembered "index - 1".

With this patch, the vectorizer triggers, and instead of returning only _the_ lowest "index - 1", it returns the _sum_ of "index - 1" over all the large enough elements.

So, e.g., if @val is 4660, the input function returns 11 - 1 = 10.
But with this patch, after vectorization, it returns (12 - 1) + (11 - 1) = 21.
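
The arithmetic above can be reproduced with a small scalar model. This is only a sketch: the table values are taken from the non_reduction_index test added later in this revision and are assumed to match the attached reproducer, and the function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Table from the non_reduction_index test in this revision; assumed to
// match the attached reproducer.
static const uint16_t table[13] = {10,   35,   69,   147,  280,  472, 682,
                                   1013, 1559, 2544, 4553, 6494, 10000};

// Intended semantics: walk backwards from index 12 down to 1 and remember
// "index - 1" of the most recently visited element larger than val.
uint16_t last_index_minus_one(uint16_t val) {
    uint16_t k = 0;
    for (uint16_t i = 12; i != 0; --i)
        k = (table[i] > val) ? static_cast<uint16_t>(i - 1) : k;
    return k;
}

// What the loop computes if the select is wrongly treated as an add
// reduction: each matching "index - 1" is accumulated instead of replacing k.
uint16_t misread_as_sum(uint16_t val) {
    uint16_t k = 0;
    for (uint16_t i = 12; i != 0; --i)
        k = (table[i] > val) ? static_cast<uint16_t>(k + (i - 1)) : k;
    return k;
}

// Usage: the two values reported above for val = 4660.
void check_reported_values() {
    assert(last_index_minus_one(4660) == 10);
    assert(misread_as_sum(4660) == 21);
}
```

Only the elements at indices 12 and 11 exceed 4660, so the correct loop ends with 10 while the mistaken reduction adds 11 and 10.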

bbi-78206.ll (910 B)

Hi @uabelho, thanks for the report and reproducer. It seems this snippet of code is incorrectly deduced to be a reduction. I shall revert this patch and make amends.

MattDevereau added a reverting change: rG4468e27d9fff: Revert "[LoopVectorize] Enable integer Mul and Add as select reduction patterns". (Jan 26 2023, 4:03 AM)

MattDevereau reopened this revision. (Jan 26 2023, 4:04 AM)

This revision is now accepted and ready to land. (Jan 26 2023, 4:04 AM)

In D141842#4082310, @MattDevereau wrote:

Hi @uabelho, thanks for the report and reproducer. It seems this snippet of code is incorrectly deduced to be a reduction. I shall revert this patch and make amends.

Thanks!

This patch landed a few days ago but was reverted due to a miscompile.

I've added the tests non_reduction_index and non_reduction_index_half, which are reproducers for the miscompile. For a select between a binary op and a phi node to be treated as a reduction, one of the binary op's operands must be the same phi node that appears as the select's false operand. The missing check caused what should be an index decrement to be miscompiled into a vectorized reduction.
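
The added condition can be sketched with a toy model. Everything below (Kind, Node, Select, is_conditional_reduction) is a hypothetical stand-in written for this example and is not LLVM API; it also only models the case where the binary op is the select's true operand.

```cpp
#include <cassert>

// Toy stand-ins for IR values; these are NOT LLVM classes.
enum class Kind { Phi, BinOp, Other };

struct Node {
    Kind kind;
    const Node *lhs;  // operands, only meaningful for BinOp
    const Node *rhs;
};

// Model of: %r = select %cond, %true_val, %false_val
struct Select {
    const Node *true_val;
    const Node *false_val;
};

// Returns true when the select looks like a conditional reduction update:
// the true value is a binary op, and the phi operand of that op is the same
// phi that the select yields when the condition is false.
bool is_conditional_reduction(const Select &sel) {
    const Node *op = sel.true_val;
    if (!op || op->kind != Kind::BinOp)
        return false;
    const Node *phi = nullptr;
    if (op->lhs && op->lhs->kind == Kind::Phi)
        phi = op->lhs;
    else if (op->rhs && op->rhs->kind == Kind::Phi)
        phi = op->rhs;
    return phi != nullptr && phi == sel.false_val;
}

void check_examples() {
    // sum.2 = select cond, (sum.1 + 2), sum.1  -> accepted as a reduction.
    Node sum_phi{Kind::Phi, nullptr, nullptr};
    Node two{Kind::Other, nullptr, nullptr};
    Node add{Kind::BinOp, &sum_phi, &two};
    assert(is_conditional_reduction(Select{&add, &sum_phi}));

    // spec.select = select cond, (i - 1), k  -> rejected: the binary op
    // uses the induction phi i, not the phi k that the select falls back to.
    Node i_phi{Kind::Phi, nullptr, nullptr};
    Node minus_one{Kind::Other, nullptr, nullptr};
    Node k_phi{Kind::Phi, nullptr, nullptr};
    Node sub{Kind::BinOp, &i_phi, &minus_one};
    assert(!is_conditional_reduction(Select{&sub, &k_phi}));
}
```

The second case mirrors the shape of non_reduction_index: the select chooses between a value derived from the loop index and the previous result, so nothing is being accumulated.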

MattDevereau added a reviewer: uabelho. (Jan 27 2023, 3:33 AM)

I've verified that I don't see the miscompile anymore with the updated patch. I have not done any other wider testing.

Harbormaster completed remote builds in B210309: Diff 492687. (Jan 27 2023, 4:40 AM)

8ff47f6032cbfd49f8fe22d46a48eb602b224661

Revision Contents

Path	Size
llvm/lib/Analysis/IVDescriptors.cpp	23 lines
llvm/test/Transforms/LoopVectorize/if-reduction.ll	137 lines

Diff 492687

llvm/lib/Analysis/IVDescriptors.cpp

RecurrenceDescriptor::isConditionalRdxPattern(RecurKind Kind, Instruction *I) {
  (earlier lines not shown)
   Instruction *I1 =
       isa<PHINode>(*TrueVal) ? dyn_cast<Instruction>(FalseVal)
                              : dyn_cast<Instruction>(TrueVal);
   if (!I1 || !I1->isBinaryOp())
     return InstDesc(false, I);

   Value *Op1, *Op2;
-  if ((m_FAdd(m_Value(Op1), m_Value(Op2)).match(I1) ||
-       m_FSub(m_Value(Op1), m_Value(Op2)).match(I1)) &&
-      I1->isFast())
-    return InstDesc(Kind == RecurKind::FAdd, SI);
-
-  if (m_FMul(m_Value(Op1), m_Value(Op2)).match(I1) && (I1->isFast()))
-    return InstDesc(Kind == RecurKind::FMul, SI);
-
-  return InstDesc(false, I);
+  if (!(((m_FAdd(m_Value(Op1), m_Value(Op2)).match(I1) ||
+          m_FSub(m_Value(Op1), m_Value(Op2)).match(I1)) &&
+         I1->isFast()) ||
+        (m_FMul(m_Value(Op1), m_Value(Op2)).match(I1) && (I1->isFast())) ||
+        ((m_Add(m_Value(Op1), m_Value(Op2)).match(I1) ||
+          m_Sub(m_Value(Op1), m_Value(Op2)).match(I1))) ||
+        (m_Mul(m_Value(Op1), m_Value(Op2)).match(I1))))
+    return InstDesc(false, I);
+
+  Instruction *IPhi = isa<PHINode>(Op1) ? dyn_cast<Instruction>(Op1)
+                                        : dyn_cast<Instruction>(Op2);
+  if (!IPhi || IPhi != FalseVal)
+    return InstDesc(false, I);
+
+  return InstDesc(true, SI);
 }

 RecurrenceDescriptor::InstDesc
 RecurrenceDescriptor::isRecurrenceInstr(Loop *L, PHINode *OrigPhi,
                                         Instruction *I, RecurKind Kind,
                                         InstDesc &Prev, FastMathFlags FuncFMF) {
   assert(Prev.getRecKind() == RecurKind::None || Prev.getRecKind() == Kind);
   switch (I->getOpcode()) {
  (16 unchanged lines not shown)
   case Instruction::FMul:
     return InstDesc(Kind == RecurKind::FMul, I,
                     I->hasAllowReassoc() ? nullptr : I);
   case Instruction::FSub:
   case Instruction::FAdd:
     return InstDesc(Kind == RecurKind::FAdd, I,
                     I->hasAllowReassoc() ? nullptr : I);
   case Instruction::Select:
-    if (Kind == RecurKind::FAdd || Kind == RecurKind::FMul)
+    if (Kind == RecurKind::FAdd || Kind == RecurKind::FMul ||
+        Kind == RecurKind::Add || Kind == RecurKind::Mul)
       return isConditionalRdxPattern(Kind, I);
     [[fallthrough]];
   case Instruction::FCmp:
   case Instruction::ICmp:
   case Instruction::Call:
     if (isSelectCmpRecurrenceKind(Kind))
       return isSelectCmpPattern(L, OrigPhi, I, Prev);
     if (isIntMinMaxRecurrenceKind(Kind) ||
  (remaining lines not shown)

llvm/test/Transforms/LoopVectorize/if-reduction.ll

for.body: ; preds = %for.body, %for.body.preheader
%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry
%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
ret float %sum.0.lcssa
}

		; CHECK-LABEL: @fcmp_0_add_select2(
		; CHECK: %[[V1:.*]] = fcmp ogt <4 x float> %[[V0:.*]], zeroinitializer
		; CHECK: %[[V3:.*]] = add <4 x i64> %[[V2:.*]], <i64 2, i64 2, i64 2, i64 2>
		; CHECK: select <4 x i1> %[[V1]], <4 x i64> %[[V3]], <4 x i64> %[[V2]]
		define i64 @fcmp_0_add_select2(ptr noalias %x, i64 %N) nounwind readonly {
		entry:
		%cmp.1 = icmp sgt i64 %N, 0
		br i1 %cmp.1, label %for.header, label %for.end

		for.header: ; preds = %entry
		br label %for.body

		for.body: ; preds = %header, %for.body
		%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
		%sum.1 = phi i64 [ 0, %for.header ], [ %sum.2, %for.body ]
		%arrayidx = getelementptr inbounds float, ptr %x, i64 %indvars.iv
		%0 = load float, ptr %arrayidx, align 4
		%cmp.2 = fcmp ogt float %0, 0.000000e+00
		%add = add nsw i64 %sum.1, 2
		%sum.2 = select i1 %cmp.2, i64 %add, i64 %sum.1
		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
		%exitcond = icmp eq i64 %indvars.iv.next, %N
		br i1 %exitcond, label %for.end, label %for.body

		for.end: ; preds = %for.body, %entry
		%1 = phi i64 [ 0, %entry ], [ %sum.2, %for.body ]
		ret i64 %1
		}

		; CHECK-LABEL: @fcmp_0_sub_select1(
		; CHECK: %[[V1:.*]] = fcmp ogt <4 x float> %[[V0:.*]], zeroinitializer
		; CHECK: %[[V3:.*]] = sub <4 x i32> %[[V2:.*]], <i32 2, i32 2, i32 2, i32 2>
		; CHECK: select <4 x i1> %[[V1]], <4 x i32> %[[V3]], <4 x i32> %[[V2]]
		define i32 @fcmp_0_sub_select1(ptr noalias %x, i32 %N) nounwind readonly {
		entry:
		%cmp.1 = icmp sgt i32 %N, 0
		br i1 %cmp.1, label %for.header, label %for.end

		for.header: ; preds = %entry
		%zext = zext i32 %N to i64
		br label %for.body

		for.body: ; preds = %header, %for.body
		%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
		%sum.1 = phi i32 [ 0, %for.header ], [ %sum.2, %for.body ]
		%arrayidx = getelementptr inbounds float, ptr %x, i64 %indvars.iv
		%0 = load float, ptr %arrayidx, align 4
		%cmp.2 = fcmp ogt float %0, 0.000000e+00
		%sub = sub nsw i32 %sum.1, 2
		%sum.2 = select i1 %cmp.2, i32 %sub, i32 %sum.1
		%indvars.iv.next = sub nuw nsw i64 %indvars.iv, 1
		%exitcond = icmp eq i64 %indvars.iv.next, %zext
		br i1 %exitcond, label %for.end, label %for.body

		for.end: ; preds = %for.body, %entry
		%1 = phi i32 [ 0, %entry ], [ %sum.2, %for.body ]
		ret i32 %1
		}

		; CHECK-LABEL: @fcmp_0_mult_select1(
		; CHECK: %[[V1:.*]] = fcmp ogt <4 x float> %[[V0:.*]], zeroinitializer
		; CHECK: %[[V3:.*]] = mul <4 x i32> %[[V2:.*]], <i32 2, i32 2, i32 2, i32 2>
		; CHECK: select <4 x i1> %[[V1]], <4 x i32> %[[V3]], <4 x i32> %[[V2]]
		define i32 @fcmp_0_mult_select1(ptr noalias %x, i32 %N) nounwind readonly {
		entry:
		%cmp.1 = icmp sgt i32 %N, 0
		br i1 %cmp.1, label %for.header, label %for.end

		for.header: ; preds = %entry
		%zext = zext i32 %N to i64
		br label %for.body

		for.body: ; preds = %for.body, %for.header
		%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
		%sum.1 = phi i32 [ 0, %for.header ], [ %sum.2, %for.body ]
		%arrayidx = getelementptr inbounds float, ptr %x, i64 %indvars.iv
		%0 = load float, ptr %arrayidx, align 4
		%cmp.2 = fcmp ogt float %0, 0.000000e+00
		%mult = mul nsw i32 %sum.1, 2
		%sum.2 = select i1 %cmp.2, i32 %mult, i32 %sum.1
		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
		%exitcond = icmp eq i64 %indvars.iv.next, %zext
		br i1 %exitcond, label %for.end, label %for.body

		for.end: ; preds = %for.body, %entry
		%1 = phi i32 [ 0, %entry ], [ %sum.2, %for.body ]
		ret i32 %1
		}

		@table = constant [13 x i16] [i16 10, i16 35, i16 69, i16 147, i16 280, i16 472, i16 682, i16 1013, i16 1559, i16 2544, i16 4553, i16 6494, i16 10000], align 1

		; CHECK-LABEL: @non_reduction_index(
		; CHECK-NOT: <4 x i16>
		define i16 @non_reduction_index(i16 noundef %val) {
		entry:
		br label %for.body

		for.cond.cleanup: ; preds = %for.body
		%spec.select.lcssa = phi i16 [ %spec.select, %for.body ]
		ret i16 %spec.select.lcssa

		for.body: ; preds = %entry, %for.body
		%i.05 = phi i16 [ 12, %entry ], [ %sub, %for.body ]
		%k.04 = phi i16 [ 0, %entry ], [ %spec.select, %for.body ]
		%arrayidx = getelementptr inbounds [13 x i16], ptr @table, i16 0, i16 %i.05
		%0 = load i16, ptr %arrayidx, align 1
		%cmp1 = icmp ugt i16 %0, %val
		%sub = add nsw i16 %i.05, -1
		%spec.select = select i1 %cmp1, i16 %sub, i16 %k.04
		%cmp.not = icmp eq i16 %sub, 0
		br i1 %cmp.not, label %for.cond.cleanup, label %for.body
		}

		@tablef = constant [13 x half] [half 10.0, half 35.0, half 69.0, half 147.0, half 280.0, half 472.0, half 682.0, half 1013.0, half 1559.0, half 2544.0, half 4556.0, half 6496.0, half 10000.0], align 1

		; CHECK-LABEL: @non_reduction_index_half(
		; CHECK-NOT: <4 x half>
		define i16 @non_reduction_index_half(half noundef %val) {
		entry:
		br label %for.body

		for.cond.cleanup: ; preds = %for.body
		%spec.select.lcssa = phi i16 [ %spec.select, %for.body ]
		ret i16 %spec.select.lcssa

		for.body: ; preds = %entry, %for.body
		%i.05 = phi i16 [ 12, %entry ], [ %sub, %for.body ]
		%k.04 = phi i16 [ 0, %entry ], [ %spec.select, %for.body ]
		%arrayidx = getelementptr inbounds [13 x i16], ptr @table, i16 0, i16 %i.05
		%0 = load half, ptr %arrayidx, align 1
		%fcmp1 = fcmp ugt half %0, %val
		%sub = add nsw i16 %i.05, -1
		%spec.select = select i1 %fcmp1, i16 %sub, i16 %k.04
		%cmp.not = icmp eq i16 %sub, 0
		br i1 %cmp.not, label %for.cond.cleanup, label %for.body
		}

; Make sure any check-not directives are not triggered by function declarations.
; CHECK: declare
