When tail-folding the vectoriser sometimes generates a select instruction
that selects between data and a splat of -0.0. We already have a transform
in InstCombine that folds a select of a masked load and zeroinitializer
into a single masked load with a zeroinitializer passthru value. We can
do something similar for splats of -0.0 when the function has the
attribute no-signed-zeros-fp-math set to true. In this case we can
just replace a splat of -0.0 with zeroinitializer, i.e. a splat of 0.0.
Details
Diff Detail
Unit Tests
Event Timeline
I was hoping to do that originally, but I think that select instructions do not have any fast math flags attached?
OK. The loop vectoriser is kicking out selects that don't have any FMF and there is no other pass afterwards that adds them. So I guess an alternative would be to automatically add FMF flags to all select instructions even though it's actually a bitwise operation? If that's acceptable to everyone of course.
In general how safe it is to use the function's no-signed-zeros-fp-math attribute to infer it's safe to convert one bit pattern to another? Fair enough if we're dealing with floating point operations then we can ignore the signedness of zero but this is just doing a select of some loaded data, which could be immediately stored out. So it feels dangerous to just change the data in this way.
This is a good point, so I wonder if I can restrict this to single uses of the select where the user is a FP operator? We can even check the user's FMF to see how the data is going to be interpreted.
I guess it depends what other reviewers think about this patch and whether it's useful or not. Another way of achieving the same result is to ensure the vectoriser uses a splat of +0.0 for the select when the nsz flag is set.
Hi @nikic, looking again at this I realised that it's not the vectoriser generating the splat of -0.0. I think it's instcombine itself! A sample output from the vectoriser looks like this:
%wide.masked.load = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr %11, i32 4, <vscale x 4 x i1> %active.lane.mask, <vscale x 4 x float> poison), !tbaa !10 %12 = fadd fast <vscale x 4 x float> %wide.masked.load, %vec.phi %13 = select <vscale x 4 x i1> %active.lane.mask, <vscale x 4 x float> %12, <vscale x 4 x float> %vec.phi
and I think that instcombine moves the select between the fadd and masked.load. In doing so it creates a new select that selects between %wide.masked.load and a splat of -0.0. So perhaps the real solution here is to teach instcombine to create a sensible splat value?
If you can get the vectorizer to produce the better IR in the first place, that would be ideal. Function attributes are the legacy annotations for fast-math; the goal is to rely on FMF whenever possible in the optimizer and codegen.
We should probably canonicalize -0.0 to +0.0 in "select nsz":
https://alive2.llvm.org/ce/z/Z_-yGP
We already canonicalize -0.0 to 0.0 in fcmp (no FMF needed in that case).
Based on the comments on this patch I am abandoning it in favour of a different approach (D126774) where we use the nsz flag on a SelectInst to control the identity value used in certain folds.
I also created D126778 for the loop vectoriser to add the fast math flags to select instructions involved in tail-folding + reductions.