This is an archive of the discontinued LLVM Phabricator instance.

[X86] Implement smarter instruction lowering for FP_TO_UINT from vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction.
ClosedPublic

Authored by RKSimon on Oct 19 2020, 6:36 AM.

Details

Summary

We know that "CVTTPS2SI" returns 0x80000000 for out-of-range inputs. We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without a blend instruction, using the following logic:

small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))
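
For illustration, here is a minimal scalar C++ sketch of the same identity (the cvttps2si helper below is a hypothetical stand-in that emulates the x86 out-of-range semantics; it is not the actual DAG lowering):

#include <cstdint>

// Hypothetical scalar model of CVTTPS2SI: truncate toward zero, returning
// 0x80000000 (INT32_MIN) for NaN and out-of-range inputs.
static int32_t cvttps2si(float x) {
  if (!(x >= -2147483648.0f && x < 2147483648.0f))
    return INT32_MIN;
  return (int32_t)x;
}

uint32_t fp_to_ui(float x) {
  int32_t Small = cvttps2si(x);                // correct for x in [0, 2^31)
  int32_t Big = cvttps2si(x - 2147483648.0f);  // correct for x in [2^31, 2^32)
  // If Small overflowed, its sign bit is set and the arithmetic right shift
  // (as implemented by mainstream compilers for signed values) yields an
  // all-ones mask, so the result becomes 0x80000000 | Big.
  return (uint32_t)(Small | (Big & (Small >> 31)));
}

For x in [2^31, 2^32) the subtraction x - 2^31 is exact in single precision, so Big recovers the low 31 bits while the OR with Small restores the top bit.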

Even on targets where "BLENDVPS"/"PBLENDVB" exist, they are often latency-2, low-throughput instructions, so this logic is applied there too (in particular also for AVX2). It furthermore gets rid of one high-latency floating-point comparison used by the previous lowering.

@TomHender checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded).

Diff Detail

Event Timeline

TomHender created this revision.Oct 19 2020, 6:36 AM
TomHender requested review of this revision.Oct 19 2020, 6:36 AM
TomHender retitled this revision from * [x86] Implement smarter instruction lowering for FP_TO_UINT for vXf32 to vXi32 from SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction. to * [x86] Implement smarter instruction lowering for FP_TO_UINT from vXf32 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction..Oct 19 2020, 6:39 AM

For the costs - ideally you need to run the code snippet through llvm-mca (you can do this in godbolt) for various CPUs of that level (e.g. avx1 -> sandybridge/btver2/bdver2, avx2 -> znver2/haswell, etc.) and use the worst-case throughput cost of those runs. For older targets (sse2...) we're more limited on testable CPU targets; I tend to just use slm's costs as they tend to match weak pre-AVX CPUs.

llvm/lib/Target/X86/X86ISelLowering.cpp
21176

Avoid operator order warnings:

if ((VT == MVT::v4i32 && SrcVT == MVT::v4f32) || 
    (VT == MVT::v8i32 && SrcVT == MVT::v8f32))
21201

clang-format this block

TomHender updated this revision to Diff 299048.Oct 19 2020, 7:45 AM
TomHender edited the summary of this revision. (Show Details)

@RKSimon Thank you for your very speedy response.

Should I change the other costs that seem wrong to me? Git blame suggests that they are from 2014 and I think they are just way outdated.

TomHender marked 2 inline comments as done.Oct 19 2020, 7:45 AM

@RKSimon I ran it through llvm-mca now. It gives me a reciprocal throughput of 3.5 for Silvermont and 3 for Haswell for the new instruction sequence.

The problem I see with these values is that they seem inconsistent relative to the others: many of the existing entries are off by a large factor from what I get when I run them through llvm-mca.
Take { ISD::SINT_TO_FP, MVT::v2f64, MVT::v16i8, 16*10 } for SSE2, for example: I don't understand how anyone came up with an astronomical cost of 160. @RKSimon's method gives a value of 3 for me (Godbolt-Link). Even exploring the machine code generated by ancient LLVM versions or the Agner Fog instruction tables for the VIA Nano 3000 cannot explain this magnitude.

I must be misunderstanding something still.

Gentle ping.

I am still unsure about the cost model issue. Is it possible to get any more information about this? Or is the policy to just not worry about the other entries and ignore them?

yubing added a comment.Jan 6 2021, 9:29 PM

No regression appeared in our internal testcases.
It seems the transform is correct; have you verified it with alive-tv?

I was curious to see if I could model it:
https://alive2.llvm.org/ce/z/RXcYY9

Converting #x4f800000 (4294967296) to uint32_t is poison, not 0, though. Am I reading the Alive output correctly? (cc @lebedev.ri @aqjune @nlopes @nikic)

That is how I would read that Alive2 report, yes.
Perhaps one needs https://llvm.org/docs/LangRef.html#llvm-fptoui-sat-intrinsic?
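
(For context, a rough C++ model of the saturating behaviour that intrinsic documents, as opposed to the poison-on-overflow behaviour of a plain fptoui; this is an illustration only, not code from this patch:)

#include <cmath>
#include <cstdint>

// Saturating float -> u32 conversion in the style of llvm.fptoui.sat.i32.f32:
// round toward zero, clamp out-of-range values, and map NaN to 0.
uint32_t fptoui_sat_u32(float x) {
  if (std::isnan(x) || x <= 0.0f)
    return 0;                    // NaN and non-positive inputs all yield 0
  if (x >= 4294967296.0f)        // 2^32 and above cannot fit in uint32_t
    return UINT32_MAX;
  return (uint32_t)x;            // in range: truncate toward zero
}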

That's a nice bug in Alive 😅
It's because UINT_MAX is not representable accurately as a float, so the generated constraints are wrong. I'll have a look, thanks!

FYI: a fixed version of Alive says your file is correct. Will commit a fix soon (optimizing it now).

@TomHender Reverse ping - are you still looking at this please?

@TomHender Would you like to continue with this? If not, I'd like to commandeer this and complete it.

Sorry for the delayed response from me. I was not actively checking this anymore since there was no further input for a while.

I think back in October I was still looking for a resolution of the apparent cost model inconsistencies. @RKSimon suggested using the llvm-mca numbers, but it seemed to me that I was adding to the mess, given the apparent inconsistencies with the other numbers. Thus I was hoping for either a clarification of why this is indeed correct, or alternatively an acknowledgement of the issue and that the other costs are still to be updated to be llvm-mca-based in the future, so this change 'fits in' in the long term.

In any case you are welcome to take this over as I currently don't expect to have time for LLVM this month.

RKSimon commandeered this revision.Jul 4 2021, 12:32 PM
RKSimon edited reviewers, added: TomHender; removed: RKSimon.

OK, I'm going to have a go at getting this finished - I'm slowly cleaning up the costs based on real world throughput numbers from llvm-mca (see D103695) instead of the inconsistent magic numbers we had (which often were based on instruction counts). Thanks for your work on this!

RKSimon planned changes to this revision.Jul 10 2021, 5:49 AM

I'm still overhauling the cast costs

RKSimon updated this revision to Diff 358048.Jul 12 2021, 1:24 PM
RKSimon retitled this revision from * [x86] Implement smarter instruction lowering for FP_TO_UINT from vXf32 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction. to * [x86] Implement smarter instruction lowering for FP_TO_UINT from vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction..
RKSimon edited the summary of this revision. (Show Details)

Updated patch to also handle vXf64->vXi32 vector and f32/f64->u64/u32 scalar fptoui cases.

I've also added AVX1 support for v8f32->v8i32 where we still need a VBLENDPS (instead of a VSRAI ymm).

Refreshed costs using the helper script from D103695 for reference.

Can we make the cost model changes as a preliminary/independent commit?

Are there any specific cost changes that you think can be pulled out? Most of the updates are necessary to match the change in codegen.

spatel accepted this revision.Jul 13 2021, 9:41 AM

Ah, right.
LGTM - some of the test diffs with an extra instruction seem like they would not be wins, but I'm guessing those patterns are rare and the perf diff would be in the noise.

This revision is now accepted and ready to land.Jul 13 2021, 9:41 AM
RKSimon retitled this revision from * [x86] Implement smarter instruction lowering for FP_TO_UINT from vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction. to [X86] Implement smarter instruction lowering for FP_TO_UINT from vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction..Jul 14 2021, 3:18 AM
RKSimon edited the summary of this revision. (Show Details)
This revision was landed with ongoing or failed builds.Jul 14 2021, 4:04 AM
This revision was automatically updated to reflect the committed changes.

Just FYI, this patch is causing problems in https://crbug.com/1230187.
Still investigating.

Any luck?

Not yet; it's hard to figure out exactly which file is causing the problem.
Does this patch exploit UB? Or should it just work for all inputs?

If the floating-point value is negative (apart from -0.0) then there can be diffs - but given it's fp_to_uint, then yes, that's UB.
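
(For reference, a minimal C++ illustration of the UB being discussed; the helper name and values are only for this example:)

#include <cstdint>

uint32_t convert(float f) {
  // In C/C++ (and for LLVM's fptoui), the result is undefined/poison whenever
  // the value truncated toward zero does not fit in uint32_t, e.g. for
  // f <= -1.0f, f >= 2^32, or NaN. Different lowerings may then legitimately
  // produce different bit patterns for the same input.
  return (uint32_t)f;
}

So e.g. convert(-1.5f) has no defined result, and the old and new codegen are allowed to disagree on it.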

AFAICT, this patch probably just adjusted some UB behavior due to converting negative floats to unsigned ints. Sorry for the noise.