This is an archive of the discontinued LLVM Phabricator instance.

Differential D96350
Differential D96350

[SVE][LoopVectorize] Enable vectorization of fmin/fmax with nnan
Needs Review · Public

Authored by kmclaughlin on Feb 9 2021, 9:09 AM.

Details

Reviewers

sdesmalen
spatel
dmgreen
efriedma

Summary

The fmin/fmax tests added by D95245 use the no-nans-fp-math function
attribute, and fail to vectorize when the attribute is removed in favour of using
nnan directly in the instructions. This patch changes isRecurrenceInstr
to also check if the no-NaNs flag is set on the fcmp/select.
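
To make the change concrete, here is a minimal standalone sketch that models only the accept/reject decision before and after this patch. It is not LLVM code: `Kind`, `acceptsBefore`, `acceptsAfter` and `checkPredicateModel` are hypothetical names, the booleans stand in for `HasFunNoNaNAttr` and `I->hasNoNaNs()`, and the sketch assumes `isMinMaxSelectCmpPattern` would otherwise match.

```cpp
#include <cassert>

// Hypothetical stand-in for the recurrence kinds relevant here.
enum class Kind { IntMinMax, FPMinMax, Other };

// Before the patch: an FP min/max compare/select is only considered when the
// function carries "no-nans-fp-math"="true".
bool acceptsBefore(Kind K, bool HasFunNoNaNAttr) {
  bool IsInt = K == Kind::IntMinMax;
  bool IsFP = K == Kind::FPMinMax;
  if (!IsInt && (!HasFunNoNaNAttr || !IsFP))
    return false;
  return true; // assumes the min/max select-cmp pattern itself matches
}

// After the patch: nnan on the instruction is accepted as well.
bool acceptsAfter(Kind K, bool HasFunNoNaNAttr, bool InstHasNoNaNs) {
  if (K == Kind::FPMinMax && (HasFunNoNaNAttr || InstHasNoNaNs))
    return true;
  if (K == Kind::IntMinMax)
    return true;
  return false;
}

// The one case that changes: FP min/max, no function attribute, nnan present.
void checkPredicateModel() {
  assert(!acceptsBefore(Kind::FPMinMax, false));
  assert(acceptsAfter(Kind::FPMinMax, false, true));
  assert(!acceptsAfter(Kind::FPMinMax, false, false));
}
```

In this model, integer min/max and the function-attribute path behave the same in both versions; only the instruction-level `nnan` case is new.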

I'm not sure if there are any problems with this approach, which is why I've
split this out from D95245.

Event Timeline

kmclaughlin created this revision. Feb 9 2021, 9:09 AM

Herald added a reviewer: efriedma. Feb 9 2021, 9:09 AM

Herald added subscribers: psnobl, hiraditya, tschuett.

kmclaughlin requested review of this revision. Feb 9 2021, 9:09 AM

Herald added a project: Restricted Project. Feb 9 2021, 9:09 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B88478: Diff 322411. Feb 9 2021, 9:09 AM

kmclaughlin mentioned this in D95245: [SVE] Add support for scalable vectorization of loops with int/fast FP reductions. Feb 9 2021, 9:13 AM

david-arm added a subscriber: david-arm. Feb 9 2021, 9:26 AM

david-arm added inline comments.

llvm/lib/Analysis/IVDescriptors.cpp
598	Hi @kmclaughlin, I've not looked into this in the same detail as you have, but I wonder if the reason we only checked the function attribute previously is because the reduction is used outside the loop? Not saying that's a good reason for not vectorising though. :) Just that perhaps it was technically the easier solution since we might have to also look outside the loop to make sure users of the value also don't care about NaNs? Perhaps @dmgreen or @spatel have an idea?

Thanks for working on this. I was headed this direction after D95690.
Using FMF instead of function attributes should be ok here, but we need to be careful about at least 2 things before we make this change:

- The existing predicate is inadequate for fmin/fmax reductions. We should be requiring nsz too (or the function-level "no-signed-zeros-fp-math"="true"). The code as-is can miscompile because it is missing that check.
- IR-level FMF are currently not fully propagated as we would like. They don't appear on load/store or function arguments. Because of that, we should do a union of flags between the fcmp and select (as shown in D95690 too).
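
The two points above can be sketched in standalone C++. This is a model under stated assumptions, not the patch and not the LLVM API: `Flags`, `unionOf` and `fpMinMaxAllowed` are hypothetical names, and the two function-level booleans stand in for "no-nans-fp-math"="true" and "no-signed-zeros-fp-math"="true".

```cpp
#include <cassert>

// Hypothetical model of the two fast-math flags discussed here.
struct Flags {
  bool NoNaNs;        // nnan
  bool NoSignedZeros; // nsz
};

// Union: a flag counts as set if either the fcmp or the select carries it.
Flags unionOf(Flags Cmp, Flags Sel) {
  Flags Out;
  Out.NoNaNs = Cmp.NoNaNs || Sel.NoNaNs;
  Out.NoSignedZeros = Cmp.NoSignedZeros || Sel.NoSignedZeros;
  return Out;
}

// Accept an FP min/max reduction only when both NaNs and signed zeros can be
// ignored, based on either the instruction flags or the function attributes.
bool fpMinMaxAllowed(Flags Cmp, Flags Sel, bool FnNoNaNs,
                     bool FnNoSignedZeros) {
  Flags FMF = unionOf(Cmp, Sel);
  return (FnNoNaNs || FMF.NoNaNs) && (FnNoSignedZeros || FMF.NoSignedZeros);
}
```

Under this stricter model, `nnan` alone on the fcmp and select would not be enough; `nsz` would also have to come from one of the two instructions or from the function attribute.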

llvm/lib/Analysis/IVDescriptors.cpp
598	I think the use of FP function attributes is historical. The instruction-level FMF were introduced later, and unfortunately they still are not adequate for all use cases: https://llvm.org/PR38086 https://llvm.org/PR35607 https://llvm.org/PR35538

Hi @spatel, thanks for the explanation. I've created D96604 to try and address the missing check for no-signed-zeros at the function-level.
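
One way to see why signed zeros matter for this kind of reduction is the following standalone illustration, which is not taken from the review. `selectMin` is a hypothetical helper that mirrors the scalar `fcmp olt` + `select` form used in the tests. Because `-0.0 < +0.0` is false, the sign of a zero result depends on which zero is seen first, so any transformation that changes the evaluation order can change the result unless signed zeros may be ignored.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scalar form of the reduction: Acc = (X < Acc) ? X : Acc,
// mirroring `fcmp olt` followed by `select`.
float selectMin(const std::vector<float> &Xs, float Init) {
  float Acc = Init;
  for (float X : Xs)
    Acc = (X < Acc) ? X : Acc;
  return Acc;
}
```

For example, `selectMin({-0.0f}, 0.0f)` and `selectMin({0.0f}, -0.0f)` reduce over the same two values in different orders; the results compare equal, but `std::signbit` differs between them. This assumes IEEE-754 semantics and a build without fast-math options.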

I added a comment about this patch to:
https://llvm.org/PR35538#c4
Let me know if you see any other blockers.
(I didn't find the bugzilla ID for @kmclaughlin to cc on the bug report.)

Matt added a subscriber: Matt. May 28 2021, 1:21 PM

Revision Contents

Path	Size
llvm/lib/Analysis/IVDescriptors.cpp	13 lines
llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions.ll	34 lines

Diff 322411

llvm/lib/Analysis/IVDescriptors.cpp

RecurrenceDescriptor::isRecurrenceInstr(Instruction *I, RecurKind Kind,
case Instruction::FSub:		case Instruction::FSub:
case Instruction::FAdd:		case Instruction::FAdd:
return InstDesc(Kind == RecurKind::FAdd, I, UAI);		return InstDesc(Kind == RecurKind::FAdd, I, UAI);
case Instruction::Select:		case Instruction::Select:
if (Kind == RecurKind::FAdd || Kind == RecurKind::FMul)		if (Kind == RecurKind::FAdd || Kind == RecurKind::FMul)
return isConditionalRdxPattern(Kind, I);		return isConditionalRdxPattern(Kind, I);
LLVM_FALLTHROUGH;		LLVM_FALLTHROUGH;
case Instruction::FCmp:		case Instruction::FCmp:
case Instruction::ICmp:		case Instruction::ICmp: {
if (!isIntMinMaxRecurrenceKind(Kind) &&		if (isFPMinMaxRecurrenceKind(Kind) &&
(!HasFunNoNaNAttr || !isFPMinMaxRecurrenceKind(Kind)))		(HasFunNoNaNAttr || I->hasNoNaNs()))
return InstDesc(false, I);		return isMinMaxSelectCmpPattern(I, Prev);
		if (isIntMinMaxRecurrenceKind(Kind))
return isMinMaxSelectCmpPattern(I, Prev);		return isMinMaxSelectCmpPattern(I, Prev);
		return InstDesc(false, I);
		}
}		}
}		}

bool RecurrenceDescriptor::hasMultipleUsesOf(		bool RecurrenceDescriptor::hasMultipleUsesOf(
Instruction *I, SmallPtrSetImpl<Instruction *> &Insts,		Instruction *I, SmallPtrSetImpl<Instruction *> &Insts,
unsigned MaxNumUses) {		unsigned MaxNumUses) {
unsigned NumUses = 0;		unsigned NumUses = 0;
for (User::op_iterator Use = I->op_begin(), E = I->op_end(); Use != E;		for (User::op_iterator Use = I->op_begin(), E = I->op_end(); Use != E;

llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions.ll

for.end:		for.end:
%sum.0.lcssa = phi bfloat [ 0.000000e+00, %entry ], [ %add, %for.body ]		%sum.0.lcssa = phi bfloat [ 0.000000e+00, %entry ], [ %add, %for.body ]
ret bfloat %sum.0.lcssa		ret bfloat %sum.0.lcssa
}		}

; FMIN (FAST)		; FMIN (FAST)

; CHECK-REMARK: vectorized loop (vectorization width: vscale x 8, interleaved count: 2)		; CHECK-REMARK: vectorized loop (vectorization width: vscale x 8, interleaved count: 2)
define float @fmin_fast(float* noalias nocapture readonly %a, i64 %n) #0 {		define float @fmin_fast(float* noalias nocapture readonly %a, i64 %n) {
; CHECK-LABEL: @fmin_fast		; CHECK-LABEL: @fmin_fast
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK: %[[LOAD1:.*]] = load <vscale x 8 x float>		; CHECK: %[[LOAD1:.*]] = load <vscale x 8 x float>
; CHECK: %[[LOAD2:.*]] = load <vscale x 8 x float>		; CHECK: %[[LOAD2:.*]] = load <vscale x 8 x float>
; CHECK: %[[FCMP1:.*]] = fcmp olt <vscale x 8 x float> %[[LOAD1]]		; CHECK: %[[FCMP1:.*]] = fcmp nnan olt <vscale x 8 x float> %[[LOAD1]]
; CHECK: %[[FCMP2:.*]] = fcmp olt <vscale x 8 x float> %[[LOAD2]]		; CHECK: %[[FCMP2:.*]] = fcmp nnan olt <vscale x 8 x float> %[[LOAD2]]
; CHECK: %[[SEL1:.*]] = select <vscale x 8 x i1> %[[FCMP1]], <vscale x 8 x float> %[[LOAD1]]		; CHECK: %[[SEL1:.*]] = select <vscale x 8 x i1> %[[FCMP1]], <vscale x 8 x float> %[[LOAD1]]
; CHECK: %[[SEL2:.*]] = select <vscale x 8 x i1> %[[FCMP2]], <vscale x 8 x float> %[[LOAD2]]		; CHECK: %[[SEL2:.*]] = select <vscale x 8 x i1> %[[FCMP2]], <vscale x 8 x float> %[[LOAD2]]
; CHECK: middle.block:		; CHECK: middle.block:
; CHECK: %[[FCMP:.*]] = fcmp olt <vscale x 8 x float> %[[SEL1]], %[[SEL2]]		; CHECK: %[[FCMP:.*]] = fcmp nnan olt <vscale x 8 x float> %[[SEL1]], %[[SEL2]]
; CHECK-NEXT: %[[SEL:.*]] = select <vscale x 8 x i1> %[[FCMP]], <vscale x 8 x float> %[[SEL1]], <vscale x 8 x float> %[[SEL2]]		; CHECK-NEXT: %[[SEL:.*]] = select nnan <vscale x 8 x i1> %[[FCMP]], <vscale x 8 x float> %[[SEL1]], <vscale x 8 x float> %[[SEL2]]
; CHECK-NEXT: call float @llvm.vector.reduce.fmin.nxv8f32(<vscale x 8 x float> %[[SEL]])		; CHECK-NEXT: call nnan float @llvm.vector.reduce.fmin.nxv8f32(<vscale x 8 x float> %[[SEL]])
entry:		entry:
%cmp6 = icmp sgt i64 %n, 0		%cmp6 = icmp sgt i64 %n, 0
br i1 %cmp6, label %for.body, label %for.end		br i1 %cmp6, label %for.body, label %for.end

for.body:		for.body:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
%sum.07 = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]		%sum.07 = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]
%arrayidx = getelementptr inbounds float, float* %a, i64 %iv		%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
%0 = load float, float* %arrayidx, align 4		%0 = load float, float* %arrayidx, align 4
%cmp.i = fcmp olt float %0, %sum.07		%cmp.i = fcmp nnan olt float %0, %sum.07
%.sroa.speculated = select i1 %cmp.i, float %0, float %sum.07		%.sroa.speculated = select nnan i1 %cmp.i, float %0, float %sum.07
%iv.next = add nuw nsw i64 %iv, 1		%iv.next = add nuw nsw i64 %iv, 1
%exitcond.not = icmp eq i64 %iv.next, %n		%exitcond.not = icmp eq i64 %iv.next, %n
br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0		br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

for.end:		for.end:
%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]		%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]
ret float %sum.0.lcssa		ret float %sum.0.lcssa
}		}

; FMAX (FAST)		; FMAX (FAST)

; CHECK-REMARK: vectorized loop (vectorization width: vscale x 8, interleaved count: 2)		; CHECK-REMARK: vectorized loop (vectorization width: vscale x 8, interleaved count: 2)
define float @fmax_fast(float* noalias nocapture readonly %a, i64 %n) #0 {		define float @fmax_fast(float* noalias nocapture readonly %a, i64 %n) {
; CHECK-LABEL: @fmax_fast		; CHECK-LABEL: @fmax_fast
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK: %[[LOAD1:.*]] = load <vscale x 8 x float>		; CHECK: %[[LOAD1:.*]] = load <vscale x 8 x float>
; CHECK: %[[LOAD2:.*]] = load <vscale x 8 x float>		; CHECK: %[[LOAD2:.*]] = load <vscale x 8 x float>
; CHECK: %[[FCMP1:.*]] = fcmp fast ogt <vscale x 8 x float> %[[LOAD1]]		; CHECK: %[[FCMP1:.*]] = fcmp nnan ogt <vscale x 8 x float> %[[LOAD1]]
; CHECK: %[[FCMP2:.*]] = fcmp fast ogt <vscale x 8 x float> %[[LOAD2]]		; CHECK: %[[FCMP2:.*]] = fcmp nnan ogt <vscale x 8 x float> %[[LOAD2]]
; CHECK: %[[SEL1:.*]] = select <vscale x 8 x i1> %[[FCMP1]], <vscale x 8 x float> %[[LOAD1]]		; CHECK: %[[SEL1:.*]] = select <vscale x 8 x i1> %[[FCMP1]], <vscale x 8 x float> %[[LOAD1]]
; CHECK: %[[SEL2:.*]] = select <vscale x 8 x i1> %[[FCMP2]], <vscale x 8 x float> %[[LOAD2]]		; CHECK: %[[SEL2:.*]] = select <vscale x 8 x i1> %[[FCMP2]], <vscale x 8 x float> %[[LOAD2]]
; CHECK: middle.block:		; CHECK: middle.block:
; CHECK: %[[FCMP:.*]] = fcmp fast ogt <vscale x 8 x float> %[[SEL1]], %[[SEL2]]		; CHECK: %[[FCMP:.*]] = fcmp nnan ogt <vscale x 8 x float> %[[SEL1]], %[[SEL2]]
; CHECK-NEXT: %[[SEL:.*]] = select fast <vscale x 8 x i1> %[[FCMP]], <vscale x 8 x float> %[[SEL1]], <vscale x 8 x float> %[[SEL2]]		; CHECK-NEXT: %[[SEL:.*]] = select nnan <vscale x 8 x i1> %[[FCMP]], <vscale x 8 x float> %[[SEL1]], <vscale x 8 x float> %[[SEL2]]
; CHECK-NEXT: call fast float @llvm.vector.reduce.fmax.nxv8f32(<vscale x 8 x float> %[[SEL]])		; CHECK-NEXT: call nnan float @llvm.vector.reduce.fmax.nxv8f32(<vscale x 8 x float> %[[SEL]])
entry:		entry:
%cmp6 = icmp sgt i64 %n, 0		%cmp6 = icmp sgt i64 %n, 0
br i1 %cmp6, label %for.body, label %for.end		br i1 %cmp6, label %for.body, label %for.end

for.body:		for.body:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
%sum.07 = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]		%sum.07 = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]
%arrayidx = getelementptr inbounds float, float* %a, i64 %iv		%arrayidx = getelementptr inbounds float, float* %a, i64 %iv
%0 = load float, float* %arrayidx, align 4		%0 = load float, float* %arrayidx, align 4
%cmp.i = fcmp fast ogt float %0, %sum.07		%cmp.i = fcmp nnan ogt float %0, %sum.07
%.sroa.speculated = select i1 %cmp.i, float %0, float %sum.07		%.sroa.speculated = select nnan i1 %cmp.i, float %0, float %sum.07
%iv.next = add nuw nsw i64 %iv, 1		%iv.next = add nuw nsw i64 %iv, 1
%exitcond.not = icmp eq i64 %iv.next, %n		%exitcond.not = icmp eq i64 %iv.next, %n
br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0		br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

for.end:		for.end:
%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]		%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %.sroa.speculated, %for.body ]
ret float %sum.0.lcssa		ret float %sum.0.lcssa
}		}
for.body:
%exitcond.not = icmp eq i64 %inc, %n		%exitcond.not = icmp eq i64 %inc, %n
br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0		br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !0

for.end:		for.end:
%sum.0.lcssa = phi i32 [ 2, %entry ], [ %mul, %for.body ]		%sum.0.lcssa = phi i32 [ 2, %entry ], [ %mul, %for.body ]
ret i32 %sum.0.lcssa		ret i32 %sum.0.lcssa
}		}

attributes #0 = { "no-nans-fp-math"="true" }

!0 = distinct !{!0, !1, !2, !3, !4}		!0 = distinct !{!0, !1, !2, !3, !4}
!1 = !{!"llvm.loop.vectorize.width", i32 8}		!1 = !{!"llvm.loop.vectorize.width", i32 8}
!2 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}		!2 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
!3 = !{!"llvm.loop.interleave.count", i32 2}		!3 = !{!"llvm.loop.interleave.count", i32 2}
!4 = !{!"llvm.loop.vectorize.enable", i1 true}		!4 = !{!"llvm.loop.vectorize.enable", i1 true}