We do not have an instruction for this in PTX prior to SM 8.0, so we are
expanding it. However, there is no expansion defined for this op in LLVM, so
define a custom expansion for the NVPTX backend instead (the same thing does
not really work on LLVM level due to fminnum/fmaxnum semantics for
-0.0 / +0.0).
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I believe it is:
For FMINIMUM:
Tmp1 = -0.0, Tmp2 = +0.0 => Tmp1 ULT Tmp2 == True => Tmp3 = Tmp1 = -0.0; Tmp2 UO Tmp2 = False => Result = Tmp3 = -0.0 Tmp1 = +0.0, Tmp2 = -0.0 => Tmp1 ULT Tmp2 == False => Tmp3 = Tmp2 = -0.0; Tmp2 UO Tmp2 = False => Result = Tmp3 = -0.0
For FMAXIMUM:
Tmp1 = -0.0, Tmp2 = +0.0 => Tmp1 UGT Tmp2 == False => Tmp3 = Tmp2 = +0.0; Tmp2 UO Tmp2 = False => Result = Tmp3 = +0.0 Tmp1 = +0.0, Tmp2 = -0.0 => Tmp1 UGT Tmp2 == True => Tmp3 = Tmp1 = +0.0; Tmp2 UO Tmp2 = False => Result = Tmp3 = +0.0
Right, I checked the standard now. For comparisons they're considered equal, but for maximum / minimum there's a special exception that they're considered as -0.0 < 0.0. Why would anyone define this so inconsistently ... -.-
The problem for us here is performance. We would need more instructions to implement it correctly via comparisons (2 to compare + select, at least 2 for 0 handling, at least 2 for NaNs). I'm mostly concerned with the PTX backend, and for it the correct and efficient way to expand this would be to use minnum (builtin instruction), and then a NaN check. If that is true, we generate our own NaN constant (if I'm reading the standard correctly it just requires us to return a quiet NaN, which doesn't have to be the same NaN as any of the operands).
I'll try doing that instead.
Change lowering to minnum/maxnum + NaN check
Also update the failing arm test to pass.
I'm not an expert on ARM, but looking at ARMISelLowering.cpp, it does
specify exactly under which conditions minimum / maximum instructions
are available. Thus, they also likely had the same silent failure
that the PTX side had, and the test was likely wrong (since the checks
were auto-generated).
I don't think this is right either. LLVM defines minnum according to the old semantics, which don't specify an order between zeroes. We'd need a separate ISD opcode for minnum according to 2018 semantics.
Unfortunately, I don't have the bandwidth to chase this rabbit hole further, especially since our use case is insensitive to what happens for -0 and +0. I can add a TODO comment to fix this. Though I would still argue for submitting this, as it is correct modulo -0/+0, which is far preferable to the current state where we have a silent failure (and produce invalid code) for the backends that attempt to expand the op (like in NVPTX).
That sounds like an unrelated bug. If an operation is Expand, but we don't support expanding it, shouldn't that result in an isel failure?
Yes, there is an orthogonal bug where this is a silent, and not a real failure. However, even if that bug is fixed, we would still fail (just earlier), which is fixed by this change.
Interestingly though, for the NVPTX backend this will end up producing the correct code, since minnum is lowered to the min/max PTX instructions, which defines the 2018 semantics. (I do agree though that the intermediate code is not correct.)
Limit expansion only to NVPTX backend
There it ends up handling all the cases correctly, due to the expanded
semantics of the min/max PTX instructions for +/-0.0.
I think this is the best we can do short of adding a new op. We only do the expansion for the NVPTX backed, and don't support it otherwise. In the NVPTX backed, the intermediate code still ends up being semantically incorrect for +/-0.0, but since FMINNUM/FMAXNUM lower to PTX min/max, which do implement the 2018 semantics of those ops, the final PTX ends up being correct.
Once we have an op in LLVM that represents the 2018 semantics, we can lower it to that instead, to make the intermediate code semantically correct as well.
It's not OK to have wrong intermediate code. We do have the "new" semantic opcodes already in FMINIMUM/FMAXIMUM
llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp | ||
---|---|---|
2227–2228 ↗ | (On Diff #474535) | Should go through APFloat |
2227–2228 ↗ | (On Diff #474535) | Can also move this to generic code and check which of the variants are legal |
It's even less OK to fail altogether (which is what is happening without this patch). And we're not talking about the new semantics for FMINIMUM/FMAXIMUM, but for the new semantics of FMINNUM/FMAXNUM (the +/-0 handling changed).
We do not have an instruction for this in PTX prior to SM 8.0,
I assume that we're talking about min.nan.*/man.nan.* instruction variants that appeared in PTX7.0 on sm80+.
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#floating-point-instructions-max
docs.nvidia.com/cuda/parallel-thread-execution/index.html#half-precision-floating-point-instructions-max
Looks like we do not properly constrain instruction availability in llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td#L940
There are no predicates on sm80+ or ptx70 and we sort of rely on custom lowering, and not always correctly as things stand:
https://godbolt.org/z/G8vYb5ajT
This patch should help generating correct instructions for fp64.
Still, I think we need to fix instruction definitions to correctly reflect their availability.
On a side note, we may want to add some min/max correctness tests to CUDA tests in llvm test-suite. Considering that we have different lowering on different GPUs, we do want to make sure that we actually do consistently get the results we expect across different GPUs and CUDA versions. We currently do not have any sm80 GPUs on cuda buildbots, but we'll get them eventually.
llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp | ||
---|---|---|
615 ↗ | (On Diff #474565) | I think this should be refactored into a more generic GetGPUAction(sm, ptx, ifAvailableAction, FallbackAction). This would make it clear which actions we take and why. GetMinMax action just says 'magic'. |
llvm/test/CodeGen/NVPTX/fminimum-fmaximum.ll | ||
67 | What exactly do we end up generating here? If both setp and minthe are inputs for selp then -DAG should be removed from selp |
Hard disagree
And we're not talking about the new semantics for FMINIMUM/FMAXIMUM, but for the new semantics of FMINNUM/FMAXNUM (the +/-0 handling changed).
The 2019 final spec does not have minnum or maxnum; they were removed and replaced with minimum and maximum which have specified signed zero behavior. Unless there was a draft revision I missed, there was never a defined minnum with specified -0 behavior. It would be helpful to define minnum/maxnum variants with specified -0 ordered less than +0
I don't have access to the 2019 spec, but as far as I know it specifies both minimum and minimumNumber, where minimumNumber is 2008 minnum with a) specified signed zero behavior and b) fixed sNaN behavior (i.e. the FMINNUM rather than FMINNUM_IEEE behavior). That's what I meant by the 2019 minnum semantics.
OK, yes I was confused by the name change. "minimum" is basically the same with the specified signed zero behavior. Regardless, we should have another pair of min/max with the defined signed zero handling
Alternatively, since I believe we don't actually have any users that don't specify the correct signed zero handling, we could just redefine FMINNUM_IEEE/FMAXNUM_IEEE to have the new behavior.
What exactly do we end up generating here?
If both setp and minthe are inputs for selp then -DAG should be removed from selp