This is an archive of the discontinued LLVM Phabricator instance.

[X86] Support llvm.{min,max}imum.f{16,32,64}
ClosedPublic

Authored by e-kud on Mar 8 2023, 5:01 PM.

Diff Detail

Event Timeline

e-kud created this revision.Mar 8 2023, 5:01 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 8 2023, 5:01 PM
e-kud published this revision for review.Mar 8 2023, 5:09 PM
e-kud edited the summary of this revision. (Show Details)
Herald added a project: Restricted Project. · View Herald TranscriptMar 8 2023, 5:10 PM
pengfei added inline comments.Mar 8 2023, 6:36 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
53866

The code doesn't combine anything, should be moved to LowerFMinimumFMaximum?

53907

Should we not do it for hasNoSignedZeros?

llvm/test/CodeGen/X86/fminimum-fmaximum.ll
182

The test does show anything interesting.

pengfei added inline comments.Mar 8 2023, 6:43 PM
llvm/test/CodeGen/X86/fminimum-fmaximum.ll
182

does -> doesn't

e-kud updated this revision to Diff 504008.Mar 9 2023, 7:07 PM

Rebased.
Supported cases with nsz and nnan.
Updated tests.

e-kud marked 3 inline comments as done.
e-kud added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
53866

I've tried to find a better place to handle ISD::FMAXIMUM but there is no place in chain Selecting->Combining->Legalization. So, I was inspired by ISD::FMAXNUM that is actually lowered during combining.

If you are about naming solely, it is not a big deal to change the name, but all callees in PerformDAGCombine are named as combine* even for ISD::FMAXNUM. Do we really want to have a single LowerFMinimumFMaximum?

llvm/test/CodeGen/X86/fminimum-fmaximum.ll
182

Probably, yes. I wanted to show that we are able to fold all these checks with constant arguments. But we've already tested folding of zero and nan checks. Original FMAX and FMIN are tested as well. Dropped it.

Do you have plan to support minimumNumber?

llvm/lib/Target/X86/X86ISelLowering.cpp
53866

No, not the name. Did you try set action of FMAXIMUM to Custom? But I'm fine given it's similar to combineFMinNumFMaxNum.

53877

Subtarget.hasFP16() && VT == MVT::f16

53898

Better to give a table like above

//                   Op1                        Op1      
//               Num     xNaN                +0     -0  
//            -----------------           --------------
//       Num  |  Max  |  qNaN |       +0  |  +0  |  +0 |
// Op0        -----------------  Op0      --------------
//      xNaN  |  qNaN |  qNaN |       -0  |  +0  |  -0 |
//            -----------------           --------------
53924–53927

Can we check if it is 0 or NaN? We can put both in the second operand I think.

pengfei added inline comments.Mar 10 2023, 12:13 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
53924–53927

We can use vfpclassps/d on AVX512DQ to optimize it.

pengfei added inline comments.Mar 12 2023, 3:08 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
53866

Answer my previous question: combineFMinNumFMaxNum was intended here due to the reason described in D15294. I suppose the problem doesn't exist to combineFMinimumFMaximum. So I prefer to LowerFMinimumFMaximum.

e-kud marked an inline comment as done.Mar 12 2023, 7:26 PM

Do you have plan to support minimumNumber?

Sorry, I didn't get it. Could you be more specific?

llvm/lib/Target/X86/X86ISelLowering.cpp
53866

Yes, thank you. I was unaware of this mechanism. It works but I've needed to include extra logic because with Custom lowering there is no more setcc combining.

53924–53927

It was my first attempt to compare with 0 and NaN at the same time. We have two problems. The first is that comparison with zero returns ZF regardless positive or negative zero is provided. The second is that we still need to know whether the second operand is NaN or not. We may have (0.0, NaN) as arguments. We checked that the first op is not NaN and is zero. Replaced it with the second when the second is NaN. It seems to me that we always need to check both operands on NaN and one check on zero.

We can use vfpclassps/d on AVX512DQ if one of operands is known never NaN. Working on it.

Do you have plan to support minimumNumber?

Sorry, I didn't get it. Could you be more specific?

It's accompanying function of minimum in IEEE-754 2019, it will be introduced in new C/C++ standard too. I thought you are working for that.

e-kud updated this revision to Diff 505312.Mar 14 2023, 4:16 PM

Rebased.
Moved from combine to lowering.
Supported f16 version.
Added optimization for avx512dq.
Added and updated tests.

e-kud marked 7 inline comments as done.Mar 14 2023, 4:21 PM

Do you have plan to support minimumNumber?

Sorry, I didn't get it. Could you be more specific?

It's accompanying function of minimum in IEEE-754 2019, it will be introduced in new C/C++ standard too. I thought you are working for that.

I can't find any lib calls or specific intrinsics for them. I'd like to add them separately if we have any users or needs of them, do we?

e-kud retitled this revision from [X86] Support llvm.{min,max}imum.f{32,64} to [X86] Support llvm.{min,max}imum.f{16,32,64}.Mar 14 2023, 4:26 PM
e-kud updated this revision to Diff 505346.Mar 14 2023, 6:33 PM

Broke formatting for premerge checks

pengfei added inline comments.Mar 14 2023, 8:20 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
1004–1005

Make the format align with its context.

2133–2134

ditto.

30266

Put Max/Min together is a bit confusing. My first impression is it can return either +0 or -0 for a single comparison.

30295

Need hasFP16() for f16.

34102

ditto format.
Why do we need getNode?

Do you have plan to support minimumNumber?

Sorry, I didn't get it. Could you be more specific?

It's accompanying function of minimum in IEEE-754 2019, it will be introduced in new C/C++ standard too. I thought you are working for that.

I can't find any lib calls or specific intrinsics for them. I'd like to add them separately if we have any users or needs of them, do we?

I heard glibc is supporting them. I'm fine with leaving it to the future. Do you have plan on the vector support?

llvm/test/CodeGen/X86/fminimum-fmaximum.ll
5

Guess you missed a AVX512F, i.e. AVX,AVX512,AVX512F

e-kud updated this revision to Diff 505684.Mar 15 2023, 7:56 PM
e-kud marked 5 inline comments as done.

Addressed formatting comments.
Check f16 explicitly even if avx512f16 implies avx512dq for now.

pengfei added inline comments.Mar 15 2023, 8:08 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
30295

(VT == MVT::f16 && Subtarget.hasFP16()) || Subtarget.hasDQI()

pengfei added inline comments.Mar 15 2023, 8:10 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
30295

Sorry, mistake. It should be:
(VT != MVT::f16 && Subtarget.hasDQI()) || Subtarget.hasFP16()

e-kud added a comment.Mar 15 2023, 8:12 PM

Do you have plan to support minimumNumber?

Sorry, I didn't get it. Could you be more specific?

It's accompanying function of minimum in IEEE-754 2019, it will be introduced in new C/C++ standard too. I thought you are working for that.

I can't find any lib calls or specific intrinsics for them. I'd like to add them separately if we have any users or needs of them, do we?

I heard glibc is supporting them. I'm fine with leaving it to the future. Do you have plan on the vector support?

Indeed, found this

  • <math.h> functions for floating-point maximum and minimum, corresponding to new operations in IEEE 754-2019, and corresponding <tgmath.h> macros, are added from draft ISO C2X: fmaximum, fmaximum_num, fmaximum_mag, fmaximum_mag_num, fminimum, fminimum_num, fminimum_mag, fminimum_mag_num and corresponding functions for float, long double, _FloatN and _FloatNx.

About vector support. I want to try to implement vector support alternatively, transform floats into integers using shifts and compare them as integers preserving float semantics, even -0 < +0. I've tried this approach for scalars but current approach produces less code. Probably vectors can benefit more.

llvm/lib/Target/X86/X86ISelLowering.cpp
30295

Yes, thank you, missed it as fp16 implies dq. We actually can check only VT == MVT::f16 because above the predicate Subtarget.hasFP16() && VT == MVT::f16 has been checked already. Do we want to avoid such implicit implication? Alternative is VT != MVT::f16 && Subtarget.hasDQI() || VT == MVT::f16 && Subtarget.hasFP16().

34102

Format doesn't work, we have 80+ chars if it's aligned with return. We don't need getNode, I missed it when moved from combine to lowering.

pengfei accepted this revision.Mar 15 2023, 8:13 PM

LGTM.

llvm/lib/Target/X86/X86ISelLowering.cpp
30295

Oh, we have checked hasFP16 in line 29876. So the way used here is correct. Sorry for the noise.

This revision is now accepted and ready to land.Mar 15 2023, 8:13 PM
e-kud updated this revision to Diff 510659.Apr 3 2023, 6:30 PM

Support i686 target: can't use integer representation of double -0.0

e-kud updated this revision to Diff 511826.Apr 7 2023, 4:34 PM

Rebased

e-kud requested review of this revision.Apr 7 2023, 5:28 PM

I think you can merge it. We don't need all reviewers sign off.

RKSimon added inline comments.Apr 13 2023, 8:01 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
30233

You should be able to drop this early-out - we should never get here

30240

Again, you can drop this as the setOperationAction calls should ensure we never get here - replace it with an assertion if you're worried.

30278

auto *

30280

auto *

llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll
154

please can you add vector test coverage to ensure we scalarize?

llvm/test/CodeGen/X86/fminimum-fmaximum.ll
963

please can you add vector test coverage to ensure we scalarize?

e-kud marked 6 inline comments as done.Apr 14 2023, 7:49 PM
e-kud added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
30240

Yes, you are right. Everything must be handled in X86TargetLowering. Dropped these checks.

llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll
154

Yes, I've added them. Also it reminded me about several commented tests with the intrinsics. Uncommented them as well.

e-kud updated this revision to Diff 513836.Apr 14 2023, 7:49 PM
e-kud marked 2 inline comments as done.

Rebased.
Uncommented existing tests for the intrinsics.
Addressed to comments.

RKSimon added inline comments.Apr 18 2023, 5:19 AM
llvm/test/CodeGen/X86/fminimum-fmaximum.ll
711

add nounwind attribute to get rid of the .cfi noise

e-kud marked an inline comment as done.Apr 18 2023, 5:19 PM
e-kud added inline comments.
llvm/test/CodeGen/X86/fminimum-fmaximum.ll
711

I've allowed myself to add nounwind in half.ll since I've touched it. I think the attribute was missed.

e-kud updated this revision to Diff 514793.Apr 18 2023, 5:20 PM
e-kud marked an inline comment as done.

Rebased.
Added nounwind attribute to tests.

@pengfei @goldstein.w.n Any more comments?

llvm/test/CodeGen/X86/half.ll
957

pre-commit the nounwind change to keep it separate from this patch - you shouldn't really need dso_local either

pengfei accepted this revision.Apr 23 2023, 4:08 AM
This revision is now accepted and ready to land.Apr 23 2023, 4:08 AM

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

e-kud updated this revision to Diff 516594.Apr 24 2023, 6:46 PM

Rebased.
Excluded refactor of half.ll.

e-kud marked an inline comment as done.Apr 24 2023, 7:37 PM

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

llvm/test/CodeGen/X86/half.ll
957

I haven't got commit access yet, here it is https://reviews.llvm.org/D149114

skatkov added a comment.EditedApr 24 2023, 8:57 PM

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

Why we cannot do something like this https://godbolt.org/z/Yxfn3jTj1 ?
I mean at least for avx?

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

Why we cannot do something like this https://godbolt.org/z/Yxfn3jTj1 ?
I mean at least for avx?

BTW. my measurements on micro benchmarking (I know micro might cause that branch predictor works good), the version https://godbolt.org/z/rEj9GPfnY is the best one for scalar but it cannot be implemented in SelectionDAG as has a CFG.

skatkov added inline comments.Apr 24 2023, 9:06 PM
llvm/test/CodeGen/X86/fminimum-fmaximum.ll
121

Here is what I mentioned in terms of non-optimial vectorized version at least on AVX.

RKSimon accepted this revision.Apr 25 2023, 12:31 AM

LGTM

llvm/lib/Target/X86/X86ISelLowering.cpp
30340

(style) remove braces from single line if()

llvm/test/CodeGen/X86/fminimum-fmaximum.ll
121

For now we just want x86 to support the intrinsics, vector optimization is better handled as a followup.

skatkov added inline comments.Apr 25 2023, 1:09 AM
llvm/test/CodeGen/X86/fminimum-fmaximum.ll
121

It is exactly what I said in the beginning of this discussion: side question not delaying landing this patch.

RKSimon added inline comments.Apr 25 2023, 1:58 AM
llvm/test/CodeGen/X86/fminimum-fmaximum.ll
121

yup - cheers

e-kud updated this revision to Diff 516778.Apr 25 2023, 6:29 AM
e-kud marked 5 inline comments as done.

Fix single line if style.

e-kud marked an inline comment as done.Apr 25 2023, 6:29 AM

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

Why we cannot do something like this https://godbolt.org/z/Yxfn3jTj1 ?
I mean at least for avx?

BTW. my measurements on micro benchmarking (I know micro might cause that branch predictor works good), the version https://godbolt.org/z/rEj9GPfnY is the best one for scalar but it cannot be implemented in SelectionDAG as has a CFG.

Both versions are incorrect. They don't work as expected in case of (-0.0, 0.0), (0.0, -0.0) inputs. Because comiss and max treat negative and positive zeros as equal.

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

Why we cannot do something like this https://godbolt.org/z/Yxfn3jTj1 ?
I mean at least for avx?

BTW. my measurements on micro benchmarking (I know micro might cause that branch predictor works good), the version https://godbolt.org/z/rEj9GPfnY is the best one for scalar but it cannot be implemented in SelectionDAG as has a CFG.

Both versions are incorrect. They don't work as expected in case of (-0.0, 0.0), (0.0, -0.0) inputs. Because comiss and max treat negative and positive zeros as equal.

Take a look carefully. In case of equality we check the sign of the first value. Thus comparison using ucommis is ok.

e-kud added a comment.EditedApr 25 2023, 8:20 AM

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

Why we cannot do something like this https://godbolt.org/z/Yxfn3jTj1 ?
I mean at least for avx?

BTW. my measurements on micro benchmarking (I know micro might cause that branch predictor works good), the version https://godbolt.org/z/rEj9GPfnY is the best one for scalar but it cannot be implemented in SelectionDAG as has a CFG.

Both versions are incorrect. They don't work as expected in case of (-0.0, 0.0), (0.0, -0.0) inputs. Because comiss and max treat negative and positive zeros as equal.

Take a look carefully. In case of equality we check the sign of the first value. Thus comparison using ucommis is ok.

There is a need of %cmp = icmp slt i32 %bc, 0 -> %cmp = icmp sge i32 %bc, 0 to make valid.
Yeah, I got the idea that generic case may be more efficient implemented in such way.

I wonder whether any problems with landing this patch?

RKSimon added inline comments.May 4 2023, 5:43 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
1006

Do we have test coverage with SSE1 only?

This revision was landed with ongoing or failed builds.May 4 2023, 6:05 AM
This revision was automatically updated to reflect the committed changes.
e-kud marked an inline comment as done.May 4 2023, 6:40 AM
e-kud added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
1006

Apparently, no. There is a fatal error: error in backend: Access past stack top! with doubles and +sse,-sse2

It seems I need to split float and double tests into two separate files to test SSE1 only. Are there better alternatives?

RKSimon added inline comments.May 4 2023, 7:08 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
1006

I'd be very tempted to limit float maximum/minimum to SSE2 or later tbh

Via Conduit