This is an archive of the discontinued LLVM Phabricator instance.

[X86] Lower _mm[256|512]_cmp[.]_mask intrinsics to native llvm IR
ClosedPublic

Authored by GBuella on Apr 13 2018, 4:12 AM.

Details

Summary

Lowering some vector comparision builtins to fcmp IR instructions.
This ignores the signaling behaviour specified in the predicate
argument of said builtins.

Affected AVX512 builtins:

builtin_ia32_cmpps128_mask
builtin_ia32_cmpps256_mask
builtin_ia32_cmpps512_mask
builtin_ia32_cmppd128_mask
builtin_ia32_cmppd256_mask
builtin_ia32_cmppd512_mask

Diff Detail

Repository
rC Clang

Event Timeline

GBuella created this revision.Apr 13 2018, 4:12 AM

The fcmp opcode has no defined behavior with NaN operands in the comparisions handled in this patch.

Could you describe the problem here in a bit more detail? As far as I know, the LLVM IR fcmp should return the same result as the x86 CMPPD, even without fast-math.

The fcmp opcode has no defined behavior with NaN operands in the comparisions handled in this patch.

Could you describe the problem here in a bit more detail? As far as I know, the LLVM IR fcmp should return the same result as the x86 CMPPD, even without fast-math.

So, I'm still looking into this.
What I see is, yes, fcmp just so happens to work the same as x86 CMPPD.
An example:

fcmp olt <2 x double> %x, %y

becomes vcmpltpd.

But this only holds for condition codes 0 - 7.

Where LLVM IR has a condition "olt" <- ordered less-than, x86 cmppd has two corresponding condition codes: 0x01->"less-than (ordered, signaling)", which is "vcmpltpd" and 0x11->"less-than (ordered, nonsignaling)" which is "vcmplt_oqps"

Now, if the builtin's CC argument is 1 (which refers to vcmpltps), we lower it to "fcmp olt", which then results in "vcmpltps", we are ok, yes.
But in the IR, there is no information about the user expecting "vcmpltps" vs "vcmplt_oqps".

Do I understand these tricks right?
If we are ok with this (hard to understand) approach, I can just lower these without fast-math as well, as long as CC < 8, by modifying this condition:

if (CC < 8 && !UsesNonDefaultRounding && getLangOpts().FastMath) {

Although, I'm still looked at what happens with sNaN, and with qNaN constants, once these comparisons are lowered to fcmp.

There is no difference between "signalling" and "non-signalling" unless you're using "#pragma STDC FENV_ACCESS", which is currently not supported. Presumably the work to implement that will include some LLVM IR intrinsic which can encode the difference, but for now we can ignore it.

There is no difference between "signalling" and "non-signalling" unless you're using "#pragma STDC FENV_ACCESS", which is currently not supported. Presumably the work to implement that will include some LLVM IR intrinsic which can encode the difference, but for now we can ignore it.

Does that mean, it is OK to generate the vcmpltpd instruction for both of these intrinsic calls:

_mm_cmp_ps_mask(a, b, _CMP_EQ_OQ);
_mm_cmp_ps_mask(a, b, _CMP_EQ_OS);

?
In that case we can lower both of these to fcmp olt.
I'm still not sure, if this is what a user would expect...

GBuella accepted this revision.May 14 2018, 3:44 AM
This comment was removed by GBuella.
This revision is now accepted and ready to land.May 14 2018, 3:44 AM
GBuella requested review of this revision.May 14 2018, 3:46 AM

Oops, accepted this by accident.

I don't see any reason to exactly match the constant specified by the user, as long as the result is semantically equivalent.

Once we have llvm.experimental.constrained.fcmp, this code should be updated to emit it; that will preserve the user's intended exception semantics.

GBuella updated this revision to Diff 150767.Jun 11 2018, 8:34 AM

I altered the code, to ignore the the signaling behaviour, as suggested.
Also, it handles more vector cmp builtins now.

efriedma added inline comments.Jun 11 2018, 2:26 PM
lib/CodeGen/CGBuiltin.cpp
10071

Given we're ignoring floating-point exceptions, we should also ignore the "rounding mode" operand (__MM_FROUND_NO_EXC only affects exceptions, and the other values are irrelevant because there isn't any actual rounding involved).

10156

*interesting.

GBuella added inline comments.Jun 12 2018, 11:18 AM
lib/CodeGen/CGBuiltin.cpp
10071

Oh, alltight. @craig.topper , Do you also agree on ignoring all of these, and just lowering all the comparisions to fcmp?
That is the easiest path of course.

craig.topper added inline comments.Jun 12 2018, 4:22 PM
lib/CodeGen/CGBuiltin.cpp
10071

I think I'm fine with ignoring it, but definitely leave a comment because we will probably have to revisit this code in the future as we continue towards supporting FENV_ACCESS.

10143

Move this into the "ReturnsMask" path, and use getVectorFCmpIR for the other path.

spatel added inline comments.Jun 13 2018, 6:44 AM
lib/CodeGen/CGBuiltin.cpp
10090–10100
llvm::Type *ResultType = ConvertType(E->getType());
if (CC == 0x0f || CC == 0x1f)
  return llvm::Constant::getAllOnesValue(ResultType);
if (CC == 0x0b || CC == 0x1b)
  return llvm::Constant::getNullValue(ResultType);

?

Also, can we use the defined predicate names instead of hex values in this code?

GBuella added inline comments.Jun 13 2018, 10:11 AM
lib/CodeGen/CGBuiltin.cpp
10090–10100

Well, I believe we would need to predefine them first.
I only found those in the client header avxintrin.h.

GBuella updated this revision to Diff 151203.Jun 13 2018, 11:07 AM
GBuella edited the summary of this revision. (Show Details)

Ignoring signaling behviour - and rounding mode with it.
Also lowering __builtin_ia32_cmpsd and __builtin_ia32_cmpss.

spatel added inline comments.Jun 13 2018, 11:15 AM
lib/CodeGen/CGBuiltin.cpp
10205–10210

On 2nd thought, why are we optimizing when we have matching IR predicates for these?
Just translate to FCMP_TRUE / FCMP_FALSE instead of special-casing these values.
InstSimplify can handle the constant folding if optimization is on.

GBuella added inline comments.Jun 14 2018, 12:43 AM
lib/CodeGen/CGBuiltin.cpp
10205–10210

I don't know, these TRUE/FALSE cases were already handled here, I only rearranged the code.
Does this cause any problems? I mean, if it meant an extra dozen lines of code I would get it, but as it is, does it hurt anything?

spatel added inline comments.Jun 14 2018, 6:50 AM
lib/CodeGen/CGBuiltin.cpp
10205–10210

It's mostly about being consistent. I think it's generally out-of-bounds for clang to optimize code. That's not its job.

The potential end user difference is that in unoptimized code, a user might expect to see the vcmpXX asm corresponding to the source-level intrinsic when debugging.

I agree that this is changing existing behavior, so it's better if we make this change before or after this patch.

craig.topper accepted this revision.Jun 17 2018, 6:11 PM

LGTM with that one comment.

lib/CodeGen/CGBuiltin.cpp
10168

Sink this into the switch as the default case with an llvm_unreachable

This revision is now accepted and ready to land.Jun 17 2018, 6:11 PM

The question still left is, should we remove, auto upgrade the LLVM intrinsics not used anymore, or keep them around for when the signal behaviour is going to matter?

GBuella updated this revision to Diff 151710.Jun 18 2018, 7:08 AM

We should probably just leave them around. Leave a NOTE so they don't get deleted.

GBuella updated this revision to Diff 151884.Jun 19 2018, 3:23 AM

Added __builtin_ia32_cmpsd_mask & __builtin_ia32_cmpss_mask.

craig.topper requested changes to this revision.Jun 19 2018, 10:01 AM
craig.topper added inline comments.
test/CodeGen/avx-builtins.c
241

This doesn't look right. This is a scalar instructino it shoudl only be comparing a single double. There should be insertelementss and extractelements.

test/CodeGen/avx512f-builtins.c
7449

I don't think this is right either.

This revision now requires changes to proceed.Jun 19 2018, 10:01 AM
GBuella updated this revision to Diff 152060.Jun 20 2018, 4:52 AM
GBuella edited the summary of this revision. (Show Details)

I was overzealous with the intrinsics, I lower really only the packed comparison now.

This revision is now accepted and ready to land.Jun 20 2018, 9:49 AM
This revision was automatically updated to reflect the committed changes.

Hello. It seems you were well aware that you are changing the semantics of FP operation here by ignoring the signaling/quiet portion of the immediate. But what shall the user do now? There was no way to force quiet FP comparison behavior in C language, so intrinsics and reliance on quiet compare (and SAE bit in AVX512) were natural way of forcing it. And now you are taking them out. Is there a switch that could prevent this optimization? I think it could be more tolerable if you only did this under fast-math.

It seems you were well aware that you are changing the semantics of FP operation here by ignoring the signaling/quiet portion of the immediate.

There's ongoing work to support code that accesses the floating point environment (see http://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics). There isn't any way to turn it on in clang yet, though. And until that's implemented, LLVM can trigger arbitrary floating-point exceptions anywhere in your code, so there's no point to trying to generate a "quiet" compare.

can trigger arbitrary floating-point exceptions anywhere in your code

I believe this statement reflects the current state of many compilers on the market, I guess I just don't see the reason why making things worse. It seems the original intent of the commit was to add support for masked compares, and that could have been achieved without breaking what already worked.

I hope the patch is ultimately helping some performance optimization, but it is breaking the original intent of some legitimate programs that worked before, and introduces correctness regression. So to me it must be at least guarded by a flip-switch.

The reference to constrained floating-point intrinsics work is relevant, but it will obviously take time and extra effort to enable and then to unbreak all the cases that are made broken here. Instead one could postpone lowering of the particular instructions until it was possible without violation of the semantics...

can trigger arbitrary floating-point exceptions anywhere in your code

I believe this statement reflects the current state of many compilers on the market, I guess I just don't see the reason why making things worse. It seems the original intent of the commit was to add support for masked compares, and that could have been achieved without breaking what already worked.

I hope the patch is ultimately helping some performance optimization, but it is breaking the original intent of some legitimate programs that worked before, and introduces correctness regression. So to me it must be at least guarded by a flip-switch.

The reference to constrained floating-point intrinsics work is relevant, but it will obviously take time and extra effort to enable and then to unbreak all the cases that are made broken here. Instead one could postpone lowering of the particular instructions until it was possible without violation of the semantics...

That seems like a legitimate concern to me.

I believe this patch was part of a larger effort to get rid of the dependence on intrinsics. We have a very broad preference for expressing things using native IR whenever possible because (among other reasons) intrinsics block most optimizations and we don't want to teach optimizations to reason about target-specific intrinsics. In this particular case we may have overreached because it isn't strictly possible to express all the semantics of this built-in accurately using native IR.

There is a patch currently active to add constrained fcmp intrinsics, but it doesn't have a masked variant.

Yes, in constrained-fp mode we might need intrinsics, at least short-term. I assume you'll probably add target-independent constrained masked fp vector operations at some point, but that's probably not a priority. But that still leaves two problems. One, clang doesn't currently have any flag that actually makes sense to control this. (I assume it will be added soon, but it doesn't exist yet.) I mean, it's technically possible to gate it under one of the fast-math flags, like @nastafev suggested, but that's not semantically correct. And two, the removed intrinsics didn't have the right semantics for constrained-fp mode anyway: they were marked readnone. So we need new intrinsics anyway.

So yes, it's possible we could revert this patch, and that might fix @nastafev's code for the next few months, but it doesn't help us at all in terms of making constrained fp work correctly in general.

Agreed. Reverting this patch wouldn't move us forward on constrained FP handling. What I'm saying (and what I think @nastafev is saying) is that this patch is taking a built-in that allows the user to express specific signalling behavior and throwing away that aspect of its semantics. Granted we do say that FP environment access is unsupported, so technically unmasking FP exceptions or even reading the status flag is undefined behavior. But it seems like there's a pretty big difference between using that claim to justify enabling some optimization that might do constant folding in a way that assumes the default rounding mode versus using that claim to disregard part of the semantics of a built-in that is modeling a target-specific instruction.

I'm not insisting that we have to revert this patch or anything like that. I'm just saying that we should think about it. My personal preference would actually be to have this code re-implemented using the new constrained fcmp intrinsic when it lands. That still leaves the masking part of this unsettled, but as you say that's probably not a priority right now.

BTW, here's a link to the constrained fcmp review: https://reviews.llvm.org/D54121

Thanks, I agree with @andrew.w.kaylor and his interpretation.
I was trying to convey the message that the programmer operating with intrinsics relies on the semantics they carry because there's no other way to express that semantics. Re-optimizing what's already optimized (hand-written code with intrinsics) may be nice, but not critical in his (my) view, whereas violating semantics defeats the purpose - I could have written that same loop around generic compare myself if that was enough for my purposes. I would not insist on the way you resolve this, this is not urgent, but I do believe this is a regression and it deserves a fix.