Page MenuHomePhabricator

[FPEnv] Add constrained FCMP intrinsic
AbandonedPublic

Authored by cameron.mcinally on Nov 5 2018, 1:40 PM.

Details

Summary

This isn't so much a patch as it is an RFC...

The compares are an odd duck wrt constrained intrinsics, since they are represented internally by the FCMPInst class. There are several ways that we could handle this quirk, one of which is shown in this Diff.

This patch attempts to use one intrinsic to represent the FCMP with a dedicated operand for the condition code. That way we don't need separate intrinsics for EQ/LT/LTE/etc. IMO, this solution is most in line with the other constrained intrinsics.

There are some weaknesses here though, like encoding and decoding the condition code. I'll comment inline on some of the other issues that need addressing too.

Diff Detail

Event Timeline

include/llvm/IR/Intrinsics.td
600

This isn't quite right. The result's vector width should be tied to the operands' vector width. Something like:

def int_experimental_constrained_fcmp : Intrinsic<[ llvm_anyint_ty ],
                                                    [ llvm_i8_ty,
                                                      LLVMVectorSameWidth<0, llvm_anyscalarfloat_ty>,
                                                      LLVMMatchType<1>,
                                                      llvm_metadata_ty,
                                                      llvm_metadata_ty ]>;

We would need to accommodate vectors of floats and doubles here, but TableGen isn't expressive enough to do that right now.

Should we extend TableGen to handle a situation like this? That's assuming the "one compare intrinsic" solution I'm proposing is what we decide on, of course.

lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
772

ExpandStrictFPOp(...) and friends would need to be updated to handle FCmp's operand types. The operand types do not match the result type, as the other constrained intrinsics do.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
5658

Oh, this is a temporary stop-gap. This code will need to be extended to create the STRICT_FCMP node. Just FYI for now...

This is definitely a tricky case. I don't really like any of the available solutions. I'll try to think more about it, and maybe someone else will have a brilliant suggestion.

include/llvm/IR/Intrinsics.td
595

This can't be 'any' int type. It needs to be either i1 or a vector of i1, right?

596

I don't like this. If you define this as an i8 the constant will get hard-coded in the IR. So instead of 'oeq' you'll see '1' or '2' or something. The reason I used metadata arguments for rounding mode and exception behavior was to allow readable strings.

All good points.

Here's another data point. The X86 builtins for compares explicitly list all the condition codes:

v4sf __builtin_ia32_cmpeqps (v4sf, v4sf)
v4sf __builtin_ia32_cmpltps (v4sf, v4sf)
v4sf __builtin_ia32_cmpleps (v4sf, v4sf)
v4sf __builtin_ia32_cmpgtps (v4sf, v4sf)
v4sf __builtin_ia32_cmpgeps (v4sf, v4sf)
v4sf __builtin_ia32_cmpunordps (v4sf, v4sf)
v4sf __builtin_ia32_cmpneqps (v4sf, v4sf)
v4sf __builtin_ia32_cmpnltps (v4sf, v4sf)
v4sf __builtin_ia32_cmpnleps (v4sf, v4sf)
v4sf __builtin_ia32_cmpngtps (v4sf, v4sf)
v4sf __builtin_ia32_cmpngeps (v4sf, v4sf)
v4sf __builtin_ia32_cmpordps (v4sf, v4sf)

So that's one way we could handle this. Although, these don't appear to cover both the quiet and signaling variants that IEEE-754 requires (I'm not sure how the X86 builtins handle that??).

In total, IEEE-754 specifies 22 compares that we need to cover:

boolean compareQuietEqual(source1, source2)
boolean compareQuietNotEqual(source1, source2) 
boolean compareSignalingEqual(source1, source2) 
boolean compareSignalingGreater(source1, source2) 
boolean compareSignalingGreaterEqual(source1, source2) 
boolean compareSignalingLess(source1, source2)
boolean compareSignalingLessEqual(source1, source2)
boolean compareSignalingNotEqual(source1, source2)
boolean compareSignalingNotGreater(source1, source2) 
boolean compareSignalingLessUnordered(source1, source2) 
boolean compareSignalingNotLess(source1, source2)
boolean compareSignalingGreaterUnordered(source1, source2) 
boolean compareQuietGreater(source1, source2)
boolean compareQuietGreaterEqual(source1, source2) 
boolean compareQuietLess(source1, source2)
boolean compareQuietLessEqual(source1, source2)
boolean compareQuietUnordered(source1, source2) 
boolean compareQuietNotGreater(source1, source2) 
boolean compareQuietLessUnordered(source1, source2) 
boolean compareQuietNotLess(source1, source2)
boolean compareQuietGreaterUnordered(source1, source2) 
boolean compareQuietOrdered(source1, source2).

I think my preference would be to have the predicate in the function name. I briefly toyed with the idea of hacking AsmWriter to print a constant integer as the corresponding predicate string, but I think that would look too weird. Also, I don't think we should open the door to something trying to use the value that represents the predicate as if it were a real value. In this sense the predicate argument, if we had one, should really be a token but I don't think we want to add new token constants just for this.

So this is what I'm imagining

def int_experimental_constrained_fcmp_eq_quiet : Intrinsic<[ llvm_anyi1_ty ],
                                                           [ llvm_anyfloat_ty,
                                                             LLVMScalarOrSameVectorWidth<0, llvm_anyfloat_ty>,
                                                             llvm_metadata_ty,
                                                             llvm_metadata_ty ]>;

And then 21 more of those. I've been talking to Craig about a way to make this a little more compact with multiclass, but that's the general idea.

What do you think?

That's cool with me. Definitely a ton of work, but it's clean and complete.

If I'm not mistaken, we could use inverse operations and swap the true/false operands for SOME of these. I'll have to dig through IEEE-754 to confirm which ones, but it could save some typing. Although, the exact relations are fairly dense, so front end writers won't be happy. E.g.:

compareSignalingGreaterEqual(a,b) is equivalent to compareSignalingLessUnordered(b, a)

My opinion is not very strong on this though. The brevity might not be worth the added complexity. At least, not as a first pass at these operations.

Also, I see you added llvm_anyi1_ty. So we would have to extend TableGen to handle that, I assume. And we'd also have to extend TableGen to handle the LLVMScalarOrSameVectorWidth<0, llvm_anyfloat_ty> problem. That should really be "llvm_anyscalarfloat_ty" or something like that.

And finally nit-picking a little here, but this should really be:

def int_experimental_constrained_fcmp_eq_quiet : Intrinsic<[ llvm_anyi1_ty ],
                                                           [ LLVMScalarOrSameVectorWidth<0, llvm_anyfloat_ty>,
                                                             LLVMMatchType<0>,
                                                             llvm_metadata_ty,
                                                             llvm_metadata_ty ]>;
compareSignalingGreaterEqual(a,b) is equivalent to compareSignalingLessUnordered(b, a)

Is it? If a or b is NaN the first one will return false but the second will return true. I was trying to think through the combinations and figure out why IEEE-754 specifies these 22 of the possible 32 combinations and why there are only 12 X86 builtins instead of 16. I think it comes down to eliminating combinations that aren't needed.

Craig pointed out to me that the 16 predicates specified for the fcmp instruction come from a bit matrix of the four possible predicate conditions, even though some of those (like true and false) don't really make any sense. I guess when you're building a processor you need to handle combinations like that in a well-defined manner.

compareSignalingGreaterEqual(a,b) is equivalent to compareSignalingLessUnordered(b, a)

Is it? If a or b is NaN the first one will return false but the second will return true.

Ah, yeah, I botched that. Those tables show the negations:

(a >= b) negated is !(a >= b) [signaling and quiet] or (a ?< b) [quiet]

Which makes sense. There are 4 relations in our set: { LT, EQ, GT, UN }. So:

{ GT, EQ } is !{ LT, UN }

So, yeah, all 22 operations are distinct. Unless we want to try to pattern match NOT(compareXXX(a,b)). That's probably not safe to do though, in case the compare operation and NOT get broken up during optimizations.

More insight...

There are 12 x86 builtins and only 22 IEEE functions (not 24 as expected) since the following two IEEE functions don't make sense:

compareSignalingOrdered(a, b)
compareSignalingUnordered(a, b)

Those would always raise invalid on NaNs (both SNaNs and QNaNs), so are really of no consequence.

That implies the respective x86 builtins, builtin_ia32_cmpunordxx and builtin_ia32_cmpordxx, map to compareQuietUnordered(...) and compareQuietOrdered(...).

The X86 builtin story is weird. There should be 9 builtins. I'm not sure how you found 12. 8 representing the encodings used by the SSE1/SSE2 cmpps/pd/ss/sd listed below. And 9th intrinsic that takes a 5 bit immediate to cover the 32 values that the AVX vcmpps/pd/ss/sd.

TARGET_BUILTIN(__builtin_ia32_cmpeqps, "V4fV4fV4f", "ncV:128:", "sse")
TARGET_BUILTIN(__builtin_ia32_cmpltps, "V4fV4fV4f", "ncV:128:", "sse")
TARGET_BUILTIN(__builtin_ia32_cmpleps, "V4fV4fV4f", "ncV:128:", "sse")
TARGET_BUILTIN(__builtin_ia32_cmpunordps, "V4fV4fV4f", "ncV:128:", "sse")
TARGET_BUILTIN(__builtin_ia32_cmpneqps, "V4fV4fV4f", "ncV:128:", "sse")
TARGET_BUILTIN(__builtin_ia32_cmpnltps, "V4fV4fV4f", "ncV:128:", "sse")
TARGET_BUILTIN(__builtin_ia32_cmpnleps, "V4fV4fV4f", "ncV:128:", "sse")
TARGET_BUILTIN(__builtin_ia32_cmpordps, "V4fV4fV4f", "ncV:128:", "sse")

All 9 builtins map to the same IR intrinsic that takes a 5 bit immediate. Or at least they use to. I think some map directly to fcmp these days. We have 8 separate SSE builtins because that's what gcc did way back in the SSE1 days. When AVX expanded to 32 comparison predicates, gcc decided to use one builtin with an immediate instead of adding 24 more. Clang uses the 8 legacy builtins to match gcc and to prevent users from trying to use an AVX encoding when targetting an SSE only CPU. Within the avxintrin.h file I believe we should have wrappers around the builtin for all 32 possible encodiings that just pass the correct immediate to the builtin.

The X86 builtin story is weird. There should be 9 builtins. I'm not sure how you found 12.

Oh, so those are from the online GCC documentation. In the early days of writing the AVX512 builtins, the GCC AVX512 intrinsics were defined first and were used as the reference to keep the two in sync. Old habit, I suppose.

So it looks like LLVM doesn't have any of the GT/GE intrinsics, but rather uses NLE/NGT. And the extra 9th intrinsic is the general cmp.

Here's a first whack at a list of operations needed internally:

compareSignalingEqual(a, b)  
compareSignalingNotEqual(a, b)
compareSignalingLess(a, b) <-> compareSignalingGreater(b, a)
compareSignalingLessEqual(a, b) <-> compareSignalingGreaterEqual(b, a)
compareSignalingLessUnordered(a, b) <-> compareSignalingGreaterUnordered(b, a)
compareSignalingNotLess(a, b) <-> compareSignalingNotGreater(b, a) 

compareQuietUnordered(a, b) 
compareQuietOrdered(a, b)
compareQuietEqual(a, b)
compareQuietNotEqual(a, b) 
compareQuietLess(a, b) <-> compareQuietGreater(b, a)
compareQuietLessEqual(a, b) <-> compareQuietGreaterEqual(b, a)
compareQuietLessUnordered(a, b) <-> compareQuietGreaterUnordered(b, a) 
compareQuietNotLess(a, b) <-> compareQuietNotGreater(b, a)

Anyone see problems with this?

Thinking aloud...

I'm playing with the idea of being able to implement the signaling variants of each operation through the quiet variants. I mean, a signaling compare is really just:

bool compareSignalingEqual(a, b) {
  return is_ordered(a,b) ? compareQuietEqual(a, b) : signal Invalid;
}

But it may be difficult to get the flags right. I haven't thought that all the way through yet.

This is probably not worth serious consideration at this point. Just wanted to throw it out there.

I'm working on the changes that we discussed, but they're pretty invasive. A prospective patch is coming soon, but I wanted us to start thinking about how we'll handle these intrinsics at the SelectionDAG level. There are no CMP ISD nodes (also, what does legalization look like??), so this will likely be a significant change.

What do you mean there are no CMP ISD nodes? CMPs are represented by SETCC nodes in selectiondag.

Oh, right. I missed that. So there should probably be a new STRICT_SETCC node. Thanks.

cameron.mcinally abandoned this revision.Nov 16 2018, 1:56 PM

Abandon this Revision for D54649...