Page MenuHomePhabricator

[X86] Support llvm.{min,max}imum.f{16,32,64}
ClosedPublic

Authored by e-kud on Mar 8 2023, 5:01 PM.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
e-kud marked 7 inline comments as done.Mar 14 2023, 4:21 PM

Do you have plan to support minimumNumber?

Sorry, I didn't get it. Could you be more specific?

It's accompanying function of minimum in IEEE-754 2019, it will be introduced in new C/C++ standard too. I thought you are working for that.

I can't find any lib calls or specific intrinsics for them. I'd like to add them separately if we have any users or needs of them, do we?

e-kud retitled this revision from [X86] Support llvm.{min,max}imum.f{32,64} to [X86] Support llvm.{min,max}imum.f{16,32,64}.Mar 14 2023, 4:26 PM
e-kud updated this revision to Diff 505346.Mar 14 2023, 6:33 PM

Broke formatting for premerge checks

pengfei added inline comments.Mar 14 2023, 8:20 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
1004–1005

Make the format align with its context.

2133–2134

ditto.

30266

Put Max/Min together is a bit confusing. My first impression is it can return either +0 or -0 for a single comparison.

30295

Need hasFP16() for f16.

34102

ditto format.
Why do we need getNode?

Do you have plan to support minimumNumber?

Sorry, I didn't get it. Could you be more specific?

It's accompanying function of minimum in IEEE-754 2019, it will be introduced in new C/C++ standard too. I thought you are working for that.

I can't find any lib calls or specific intrinsics for them. I'd like to add them separately if we have any users or needs of them, do we?

I heard glibc is supporting them. I'm fine with leaving it to the future. Do you have plan on the vector support?

llvm/test/CodeGen/X86/fminimum-fmaximum.ll
5

Guess you missed a AVX512F, i.e. AVX,AVX512,AVX512F

e-kud updated this revision to Diff 505684.Mar 15 2023, 7:56 PM
e-kud marked 5 inline comments as done.

Addressed formatting comments.
Check f16 explicitly even if avx512f16 implies avx512dq for now.

pengfei added inline comments.Mar 15 2023, 8:08 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
30295

(VT == MVT::f16 && Subtarget.hasFP16()) || Subtarget.hasDQI()

pengfei added inline comments.Mar 15 2023, 8:10 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
30295

Sorry, mistake. It should be:
(VT != MVT::f16 && Subtarget.hasDQI()) || Subtarget.hasFP16()

e-kud added a comment.Mar 15 2023, 8:12 PM

Do you have plan to support minimumNumber?

Sorry, I didn't get it. Could you be more specific?

It's accompanying function of minimum in IEEE-754 2019, it will be introduced in new C/C++ standard too. I thought you are working for that.

I can't find any lib calls or specific intrinsics for them. I'd like to add them separately if we have any users or needs of them, do we?

I heard glibc is supporting them. I'm fine with leaving it to the future. Do you have plan on the vector support?

Indeed, found this

  • <math.h> functions for floating-point maximum and minimum, corresponding to new operations in IEEE 754-2019, and corresponding <tgmath.h> macros, are added from draft ISO C2X: fmaximum, fmaximum_num, fmaximum_mag, fmaximum_mag_num, fminimum, fminimum_num, fminimum_mag, fminimum_mag_num and corresponding functions for float, long double, _FloatN and _FloatNx.

About vector support. I want to try to implement vector support alternatively, transform floats into integers using shifts and compare them as integers preserving float semantics, even -0 < +0. I've tried this approach for scalars but current approach produces less code. Probably vectors can benefit more.

llvm/lib/Target/X86/X86ISelLowering.cpp
30295

Yes, thank you, missed it as fp16 implies dq. We actually can check only VT == MVT::f16 because above the predicate Subtarget.hasFP16() && VT == MVT::f16 has been checked already. Do we want to avoid such implicit implication? Alternative is VT != MVT::f16 && Subtarget.hasDQI() || VT == MVT::f16 && Subtarget.hasFP16().

34102

Format doesn't work, we have 80+ chars if it's aligned with return. We don't need getNode, I missed it when moved from combine to lowering.

pengfei accepted this revision.Mar 15 2023, 8:13 PM

LGTM.

llvm/lib/Target/X86/X86ISelLowering.cpp
30295

Oh, we have checked hasFP16 in line 29876. So the way used here is correct. Sorry for the noise.

This revision is now accepted and ready to land.Mar 15 2023, 8:13 PM
e-kud updated this revision to Diff 510659.Apr 3 2023, 6:30 PM

Support i686 target: can't use integer representation of double -0.0

e-kud updated this revision to Diff 511826.Apr 7 2023, 4:34 PM

Rebased

e-kud requested review of this revision.Apr 7 2023, 5:28 PM

I think you can merge it. We don't need all reviewers sign off.

RKSimon added inline comments.Apr 13 2023, 8:01 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
30233

You should be able to drop this early-out - we should never get here

30240

Again, you can drop this as the setOperationAction calls should ensure we never get here - replace it with an assertion if you're worried.

30278

auto *

30280

auto *

llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll
154

please can you add vector test coverage to ensure we scalarize?

llvm/test/CodeGen/X86/fminimum-fmaximum.ll
963

please can you add vector test coverage to ensure we scalarize?

e-kud marked 6 inline comments as done.Apr 14 2023, 7:49 PM
e-kud added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
30240

Yes, you are right. Everything must be handled in X86TargetLowering. Dropped these checks.

llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll
154

Yes, I've added them. Also it reminded me about several commented tests with the intrinsics. Uncommented them as well.

e-kud updated this revision to Diff 513836.Apr 14 2023, 7:49 PM
e-kud marked 2 inline comments as done.

Rebased.
Uncommented existing tests for the intrinsics.
Addressed to comments.

RKSimon added inline comments.Apr 18 2023, 5:19 AM
llvm/test/CodeGen/X86/fminimum-fmaximum.ll
711

add nounwind attribute to get rid of the .cfi noise

e-kud marked an inline comment as done.Apr 18 2023, 5:19 PM
e-kud added inline comments.
llvm/test/CodeGen/X86/fminimum-fmaximum.ll
711

I've allowed myself to add nounwind in half.ll since I've touched it. I think the attribute was missed.

e-kud updated this revision to Diff 514793.Apr 18 2023, 5:20 PM
e-kud marked an inline comment as done.

Rebased.
Added nounwind attribute to tests.

@pengfei @goldstein.w.n Any more comments?

llvm/test/CodeGen/X86/half.ll
957

pre-commit the nounwind change to keep it separate from this patch - you shouldn't really need dso_local either

pengfei accepted this revision.Apr 23 2023, 4:08 AM
This revision is now accepted and ready to land.Apr 23 2023, 4:08 AM

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

e-kud updated this revision to Diff 516594.Apr 24 2023, 6:46 PM

Rebased.
Excluded refactor of half.ll.

e-kud marked an inline comment as done.Apr 24 2023, 7:37 PM

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

llvm/test/CodeGen/X86/half.ll
957

I haven't got commit access yet, here it is https://reviews.llvm.org/D149114

skatkov added a comment.EditedApr 24 2023, 8:57 PM

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

Why we cannot do something like this https://godbolt.org/z/Yxfn3jTj1 ?
I mean at least for avx?

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

Why we cannot do something like this https://godbolt.org/z/Yxfn3jTj1 ?
I mean at least for avx?

BTW. my measurements on micro benchmarking (I know micro might cause that branch predictor works good), the version https://godbolt.org/z/rEj9GPfnY is the best one for scalar but it cannot be implemented in SelectionDAG as has a CFG.

skatkov added inline comments.Apr 24 2023, 9:06 PM
llvm/test/CodeGen/X86/fminimum-fmaximum.ll
121

Here is what I mentioned in terms of non-optimial vectorized version at least on AVX.

RKSimon accepted this revision.Apr 25 2023, 12:31 AM

LGTM

llvm/lib/Target/X86/X86ISelLowering.cpp
30340

(style) remove braces from single line if()

llvm/test/CodeGen/X86/fminimum-fmaximum.ll
121

For now we just want x86 to support the intrinsics, vector optimization is better handled as a followup.

skatkov added inline comments.Apr 25 2023, 1:09 AM
llvm/test/CodeGen/X86/fminimum-fmaximum.ll
121

It is exactly what I said in the beginning of this discussion: side question not delaying landing this patch.

RKSimon added inline comments.Apr 25 2023, 1:58 AM
llvm/test/CodeGen/X86/fminimum-fmaximum.ll
121

yup - cheers

e-kud updated this revision to Diff 516778.Apr 25 2023, 6:29 AM
e-kud marked 5 inline comments as done.

Fix single line if style.

e-kud marked an inline comment as done.Apr 25 2023, 6:29 AM

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

Why we cannot do something like this https://godbolt.org/z/Yxfn3jTj1 ?
I mean at least for avx?

BTW. my measurements on micro benchmarking (I know micro might cause that branch predictor works good), the version https://godbolt.org/z/rEj9GPfnY is the best one for scalar but it cannot be implemented in SelectionDAG as has a CFG.

Both versions are incorrect. They don't work as expected in case of (-0.0, 0.0), (0.0, -0.0) inputs. Because comiss and max treat negative and positive zeros as equal.

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

Why we cannot do something like this https://godbolt.org/z/Yxfn3jTj1 ?
I mean at least for avx?

BTW. my measurements on micro benchmarking (I know micro might cause that branch predictor works good), the version https://godbolt.org/z/rEj9GPfnY is the best one for scalar but it cannot be implemented in SelectionDAG as has a CFG.

Both versions are incorrect. They don't work as expected in case of (-0.0, 0.0), (0.0, -0.0) inputs. Because comiss and max treat negative and positive zeros as equal.

Take a look carefully. In case of equality we check the sign of the first value. Thus comparison using ucommis is ok.

e-kud added a comment.EditedApr 25 2023, 8:20 AM

I have a side question (not delaying landing this one)

It looks like this change lowers vectorized form of intrinsic in a non-optimal way.
So I wonder whether there are some plans to improve it as follow-up?

Yes, for sure. I've tried to find a way to implement vectorized version for SSE, but nothing's come to my mind. We need at least SSE2 for PCMPEQ{W,D} because all fp comparison instructions treat -0.0 and 0.0 as equal. It seems that pentium3 will suffer from not optimal fmaximum/fminimum...

Why we cannot do something like this https://godbolt.org/z/Yxfn3jTj1 ?
I mean at least for avx?

BTW. my measurements on micro benchmarking (I know micro might cause that branch predictor works good), the version https://godbolt.org/z/rEj9GPfnY is the best one for scalar but it cannot be implemented in SelectionDAG as has a CFG.

Both versions are incorrect. They don't work as expected in case of (-0.0, 0.0), (0.0, -0.0) inputs. Because comiss and max treat negative and positive zeros as equal.

Take a look carefully. In case of equality we check the sign of the first value. Thus comparison using ucommis is ok.

There is a need of %cmp = icmp slt i32 %bc, 0 -> %cmp = icmp sge i32 %bc, 0 to make valid.
Yeah, I got the idea that generic case may be more efficient implemented in such way.

I wonder whether any problems with landing this patch?

RKSimon added inline comments.Thu, May 4, 5:43 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
1006

Do we have test coverage with SSE1 only?

This revision was landed with ongoing or failed builds.Thu, May 4, 6:05 AM
This revision was automatically updated to reflect the committed changes.
e-kud marked an inline comment as done.Thu, May 4, 6:40 AM
e-kud added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
1006

Apparently, no. There is a fatal error: error in backend: Access past stack top! with doubles and +sse,-sse2

It seems I need to split float and double tests into two separate files to test SSE1 only. Are there better alternatives?

RKSimon added inline comments.Thu, May 4, 7:08 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
1006

I'd be very tempted to limit float maximum/minimum to SSE2 or later tbh