This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] improve IR fast-math-flags propagation in reductions
ClosedPublic

Authored by spatel on Jan 29 2021, 11:16 AM.

Details

Summary

This is another step (see D95452) towards correcting fast-math-flags bugs in vector reductions.

There are multiple bugs visible in the test diffs, and this is still not working as it should. We still use function attributes (rather than FMF) to drive part of the logic, but we are not checking for the correct FP function attributes.

Note that FMF may not be propagated optimally on selects (example in https://llvm.org/PR35607). That's why I'm proposing to union the FMF of an fcmp+select pair and avoid regressions on existing vectorizer tests.
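(A rough sketch of that union using the current FastMathFlags API - illustrative only, not the verbatim patch code, and the helper name is made up:)

#include "llvm/IR/Instructions.h"
#include "llvm/IR/Operator.h"

using namespace llvm;

// Union the FMF of a select with the FMF of its fcmp condition.
static FastMathFlags unionSelectFCmpFMF(SelectInst *Sel) {
  FastMathFlags FMF;
  if (isa<FPMathOperator>(Sel)) // FP-typed selects can carry FMF
    FMF |= Sel->getFastMathFlags();
  if (auto *Cmp = dyn_cast<FCmpInst>(Sel->getCondition()))
    FMF |= Cmp->getFastMathFlags();
  return FMF;
}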

Diff Detail

Event Timeline

spatel created this revision. Jan 29 2021, 11:16 AM
spatel requested review of this revision. Jan 29 2021, 11:16 AM
Herald added a project: Restricted Project. Jan 29 2021, 11:16 AM
spatel added inline comments. Jan 29 2021, 11:22 AM
llvm/test/Transforms/LoopVectorize/X86/reduction-fastmath.ll
302–303

This is a miscompile - we created fast ops from the stricter original code.

304

Yet we created a non-FMF reduction intrinsic...

345

The ninf is not actually necessary, but nsz should be required.

llvm/test/Transforms/LoopVectorize/float-minmax-instruction-flag.ll
107–108

This test demonstrates that we are not even relying on the instruction-level FMF, and it's another miscompile. The function only has the equivalent of nnan. I think we should require nsz too.
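(A standalone C++ illustration of the nsz point above - not from the review - showing how a cmp+select max and a libm-style max can legitimately disagree on the sign of zero:)

#include <cmath>
#include <cstdio>

int main() {
  float a = 0.0f, b = -0.0f;
  // Select-based max: +0.0 > -0.0 is false (the zeros compare equal),
  // so this picks b and yields -0.0.
  float selMax = a > b ? a : b;
  // fmax may return either zero here; IEEE maxNum does not order
  // +0.0 and -0.0.
  float libMax = std::fmax(a, b);
  // Without nsz, rewriting one form into the other can flip a sign bit.
  std::printf("select: %g, fmax: %g\n", selMax, libMax);
  return 0;
}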

random idea from a bystander:

Is there some way to do symbolic execution of IR that results in some kind of "output state" of FMF at the end of a block of IR?

If such a thing existed, then running it on the IR before and after a pass, and comparing the state of FMF at the end of that chunk of IR, would validate if the pass was correctly propagating FMF.

I only ask because I've seen other such "let's validate by doing a restricted symbolic execution of some state" tools floating around.

Feel free to ignore me if I'm talking nonsense, it's late here :-)

spatel added a comment. Feb 1 2021, 5:01 AM

Is there some way to do symbolic execution of IR that results in some kind of "output state" of FMF at the end of a block of IR?
If such a thing existed, then running it on the IR before and after a pass, and comparing the state of FMF at the end of that chunk of IR, would validate if the pass was correctly propagating FMF.

The closest that I know of is Alive2:
https://alive2.llvm.org/ce/z/2oNg9N
In that example, I added a nnan clause to the fadd, and Alive shows us how it is invalid (we produce poison in the transformed case where the original did not).
Alive2 is currently limited in its understanding of FMF, and I'm not sure if it can ever provide full FP analysis.

If we just want a weak/cheap solution, we could add a step to IR verification that checks whether any FMF exist in the output that were not present in the input. For example, in the loop vectorizer tests shown here, we are obviously adding flags that are not anywhere in the original code. The verifier check would have to account for FP ops that have special semantics (e.g., the intrinsic maxnum doesn't exactly specify how the sign of zero affects the result), and it would also need to account for FP function attributes, since we still partly rely on those.
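(A minimal sketch of what collecting that "FMF state" could look like - hypothetical helpers, not an existing verifier step, and it ignores the function-attribute and intrinsic caveats just mentioned:)

#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Operator.h"

using namespace llvm;

// Union of all fast-math flags present on FP operations in a function.
static FastMathFlags collectFMF(const Function &F) {
  FastMathFlags Union;
  for (const Instruction &I : instructions(F))
    if (isa<FPMathOperator>(I))
      Union |= I.getFastMathFlags();
  return Union;
}

// Did a pass introduce any flag that was nowhere in the input?
static bool introducedNewFMF(const Function &Before, const Function &After) {
  FastMathFlags In = collectFMF(Before), Out = collectFMF(After);
  return (Out.allowReassoc() && !In.allowReassoc()) ||
         (Out.noNaNs() && !In.noNaNs()) ||
         (Out.noInfs() && !In.noInfs()) ||
         (Out.noSignedZeros() && !In.noSignedZeros()) ||
         (Out.allowReciprocal() && !In.allowReciprocal()) ||
         (Out.allowContract() && !In.allowContract()) ||
         (Out.approxFunc() && !In.approxFunc());
}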

It's beyond the scope of anything I'm trying to do in this patch, but it's an interesting idea - file a bugzilla and/or llvm-dev note to get more attention?

It looks like we don't expand non-fast vecreduce_fmax in ExpandReductions:

// FIXME: We only expand 'fast' reductions here because the underlying
//        code in createMinMaxOp() assumes that comparisons use 'fast'
//        semantics.

And we otherwise expand VECREDUCE_FMAX to (I think) FMAXNUM. Not using ExpandReductions sounds fine to me, but do we also need to fix those assumptions during lowering if we are fixing the assumptions in the vectorizer? I guess we are still requiring NoNan, so FMAXNUM should be fine? I didn't see anything normally requiring nsz for that.

spatel added a comment. Feb 1 2021, 8:45 AM

It looks like we don't expand non-fast vecreduce_fmax in ExpandReductions:

// FIXME: We only expand 'fast' reductions here because the underlying
//        code in createMinMaxOp() assumes that comparisons use 'fast'
//        semantics.

And we otherwise expand VECREDUCE_FMAX to (I think) FMAXNUM. Not using ExpandReductions sounds fine to me, but do we also need to fix those assumptions during lowering if we are fixing the assumptions in the vectorizer? I guess we are still requiring NoNan, so FMAXNUM should be fine? I didn't see anything normally requiring nsz for that.

There's still a bug in ExpandReductions, and we do need to fix that - it's preventing expected vectorization from SLP as noted here:
https://llvm.org/PR23116

But yes, since we are still requiring NoNan here, I think we are safe (this patch can't make things worse unless I've missed some loophole in lowering).

There was also this example:
https://llvm.org/PR43574
...but we either solved that one or made it invisible with the changes so far. I need to investigate the IR after each pass.

dmgreen accepted this revision. Feb 1 2021, 9:05 AM

There's still a bug in ExpandReductions, and we do need to fix that - it's preventing expected vectorization from SLP as noted here:
https://llvm.org/PR23116

Why does ISel not lower it more optimally? ExpandReductions never seemed to me like it should be needed.

But yes, since we are still requiring NoNan here, I think we are safe (this patch can't make things worse unless I've missed some loophole in lowering).

Yeah, this seems more correct. LGTM :)

There was also this example:
https://llvm.org/PR43574
...but we either solved that one or made it invisible with the changes so far. I need to investigate the IR after each pass.

I thought that we haven't been able to vectorize min/max reductions in the loop vectorizer for a while, as the minnum/maxnum intrinsics do not get matched - it's still expecting an icmp/select in the reduction recognition code.

This revision is now accepted and ready to land. Feb 1 2021, 9:05 AM

There's still a bug in ExpandReductions, and we do need to fix that - it's preventing expected vectorization from SLP as noted here:
https://llvm.org/PR23116

Why does ISel not lower it more optimally? ExpandReductions never seemed to me like it should be needed.

I think there was some problem dealing with the intrinsics in SDAG, but I don't remember the details. cc'ing @aemerson @RKSimon @craig.topper for a better answer.

But yes, since we are still requiring NoNan here, I think we are safe (this patch can't make things worse unless I've missed some loophole in lowering).

I thought that we haven't been able to vectorize min/max reductions in the loop vectorizer for a while, as the minnum/maxnum intrinsics do not get matched - it's still expecting an icmp/select in the reduction recognition code.

I think that's correct - I managed to get SLP to recognize the intrinsics, but LV still doesn't afaik. That's another step on this path. So the plan will be to get LV to deal with the FP minnum/maxnum, then handle the recent integer min/max intrinsics, then canonicalize to those intrinsics in instcombine.
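(For context, the select-idiom recognition referred to here is roughly of this shape - a simplified sketch around matchSelectPattern from ValueTracking, not the actual IVDescriptors logic:)

#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// True if Sel is the icmp/select form of a signed max. The min/max
// intrinsics (llvm.smax and friends) never match this form, which is
// why reductions written with the intrinsics go unrecognized until the
// recognition code learns the intrinsic forms.
static bool isSelectSMax(SelectInst *Sel) {
  Value *LHS, *RHS;
  SelectPatternResult SPR = matchSelectPattern(Sel, LHS, RHS);
  return SPR.Flavor == SPF_SMAX;
}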

dmgreen added a subscriber: nikic. Feb 1 2021, 9:32 AM

I think there was some problem dealing with the intrinsics in SDAG, but I don't remember the details. cc'ing @aemerson @RKSimon @craig.topper for a better answer.

I think that might have been true originally, but I believe @nikic improved things significantly with the lowering in ISel. I'm not sure if there were other issues, but it seems like it should be viable - and possibly more efficient to do the lowering in ISel. It's only expanding an instruction after all. ISel is pretty good at that kind of thing!

nikic added a comment. Feb 1 2021, 10:02 AM

I think there was some problem dealing with the intrinsics in SDAG, but I don't remember the details. cc'ing @aemerson @RKSimon @craig.topper for a better answer.

I think that might have been true originally, but I believe @nikic improved things significantly with the lowering in ISel. I'm not sure if there were other issues, but it seems like it should be viable - and possibly more efficient to do the lowering in ISel. It's only expanding an instruction after all. ISel is pretty good at that kind of thing!

Right. Historically we did not have full legalization support for reductions, but those have been implemented in the meantime, and the last holes were plugged in LLVM 12 (sequential reductions). I think that at this point, there isn't a good reason to use the IR expansion into shuffles anymore. Migrating X86 in particular would probably take some effort though, as I believe it has quite a bit of custom shuffle matching code.

MaskRay added a subscriber: MaskRay. Feb 1 2021, 6:25 PM
MaskRay added inline comments.
llvm/lib/Analysis/IVDescriptors.cpp
312

I notice that the assert may fail. We will work on a reproducer.

spatel added inline comments. Feb 1 2021, 7:06 PM
llvm/lib/Analysis/IVDescriptors.cpp
312

Ok, thanks. I must have missed some way that this loop can work, but it should be fine to use dyn_cast here to guard against that possibility.
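(Illustrative shape of that guard - the real code is in llvm/lib/Analysis/IVDescriptors.cpp, and the names here are made up:)

#include "llvm/IR/Instructions.h"

using namespace llvm;

// The pre-fix code effectively did cast<SelectInst>(I), which asserts
// when the speculatively matched min/max pattern has no select (as in
// the reproducer below). dyn_cast<> lets us reject the pattern instead.
static bool isMinMaxSelect(Instruction *I) {
  auto *Sel = dyn_cast<SelectInst>(I);
  if (!Sel)
    return false; // speculative match was wrong; bail out, don't assert
  return isa<CmpInst>(Sel->getCondition());
}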

Here's what the test reduced to. To reproduce, run opt -passes='loop-vectorize' test.ll with the assert in place.

target datalayout = "e-m:e-p:32:32-i64:64-n32:64-S128"
target triple = "wasm32-unknown--wasm"

define void @f9() local_unnamed_addr {
entry:
  br label %"for f9.s0.v0"

"for f9.s0.v0":                                   ; preds = %"for f9.s0.v0", %entry
  %f9.s0.v0 = phi i32 [ 0, %entry ], [ %tmp3, %"for f9.s0.v0" ]
  %t14 = icmp eq i32 %f9.s0.v0, 5
  %t15 = select reassoc nnan ninf nsz contract afn i1 %t14, float 0x36A0000000000000, float 0.000000e+00
  %tmp3 = add nuw nsw i32 %f9.s0.v0, 1
  br i1 false, label %"end for f9.s0.v0", label %"for f9.s0.v0"

"end for f9.s0.v0":                               ; preds = %"for f9.s0.v0"
  ret void
}
spatel added a comment. Feb 3 2021, 6:31 AM

Here's what the test reduced to. To reproduce run opt -passes='loop-vectorize' test.ll with the assert in place.

Thanks! I see what happened - we speculatively assume that we matched a reduction pattern based on the cmp only. So in this example, we are trying to match an integer smax recurrence before checking that the subsequent select is actually part of a valid integer min/max pattern. The logic is confusing, but I'm not sure how to improve it yet. Hopefully, the complexity all goes away when we canonicalize to min/max intrinsics. I added a slightly modified version of the test, so we don't fall into the same bug again:
916c4121c1

I think there was some problem dealing with the intrinsics in SDAG, but I don't remember the details. cc'ing @aemerson @RKSimon @craig.topper for a better answer.

I think that might have been true originally, but I believe @nikic improved things significantly with the lowering in ISel. I'm not sure if there were other issues, but it seems like it should be viable - and possibly more efficient to do the lowering in ISel. It's only expanding an instruction after all. ISel is pretty good at that kind of thing!

Right. Historically we did not have full legalization support for reductions, but those have been implemented in the meantime, and the last holes were plugged in LLVM 12 (sequential reductions). I think that at this point, there isn't a good reason to use the IR expansion into shuffles anymore. Migrating X86 in particular would probably take some effort though, as I believe it has quite a bit of custom shuffle matching code.

Yes, ExpandReductions wasn't intended to be a permanent solution, but a transitional step that let LLVM generate the intrinsics in the vectorizer while the legalization/target-specific shuffle-matching support was fleshed out.