Page MenuHomePhabricator

* [x86] Implement smarter instruction lowering for FP_TO_UINT from vXf32 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction.
Needs ReviewPublic

Authored by TomHender on Oct 19 2020, 6:36 AM.



We know that "CVTTPS2SI" returns 0x80000000 for out of range inputs. We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend using the following logic:
small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))

Even on targets where "PBLENDVPS"/"PBLENDVB" exists, it is often a latency 2, low throughput instruction so this logic is applied there too (in particular for AVX2 also). It furthermore gets rid of one high latency floating point comparison in the previous lowering.

I checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded).

I have adjusted some cost model values for this but I am not sure if I have done that right. The given costs don't look very consistent to me. For example a conversion from 8 floats to 8 uint8/int8 gives me a cost of 7 although fewer instructions are generated and latencywise the dependency chain is way shorter. v4i32 to v4f64 is even more extreme with a cost of 16 although the generated code is nice and simple. I have set the new cost for the conversion of this patch to 8 based on what was set previously and based on the latency of the longest dependency chain (The explanation on top of the file says it should be that). Additionally type legalization isn't done before looking up the cost tables so I had to add multiple entries for different vector widths which seems redundant.

Diff Detail

Event Timeline

TomHender created this revision.Oct 19 2020, 6:36 AM
TomHender requested review of this revision.Oct 19 2020, 6:36 AM
TomHender retitled this revision from * [x86] Implement smarter instruction lowering for FP_TO_UINT for vXf32 to vXi32 from SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction. to * [x86] Implement smarter instruction lowering for FP_TO_UINT from vXf32 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction..Oct 19 2020, 6:39 AM

For the costs - ideally you need to run the code snippet through llvm-mca (you can do this in godbolt) for various cpus of that level (e.g. avx1 -> sandybridge/btver2/bdver2, avx2 -> znver2/haswell etc.) and use the worst case throughput cost of those runs. For older targets (sse2...) we're more limited on testable cpu targets, I tend to just use slm's costs as they tend to match weak pre-avx cpus).


Avoid operator order warnings:

if ((VT == MVT::v4i32 && SrcVT == MVT::v4f32) || 
    (VT == MVT::v8i32 && SrcVT == MVT::v8f32))

clang format this block

TomHender updated this revision to Diff 299048.Oct 19 2020, 7:45 AM
TomHender edited the summary of this revision. (Show Details)

@RKSimon Thank you for your very speedy response.

Should I change the other costs that seem wrong to me? Git blame suggests that they are from 2014 and I think they are just way outdated.

TomHender marked 2 inline comments as done.Oct 19 2020, 7:45 AM

@RKSimon I ran it through llvm-mca now. It gives me a reciprocal throughput of 3.5 for Silvermont and 3 for Haswell for the new instruction sequence.

The problem I see with these values is that they seem incorrect relative to the others. Many values are off by a large factor from what I get when I run them through llvm-mca.
Like { ISD::SINT_TO_FP, MVT::v2f64, MVT::v16i8, 16*10 } for SSE2 for example. I don't understand how anyone came up with an astronomic cost of 160. @RKSimon's method gives a value of 3 for me (Godbolt-Link). Even exploring the generated machine code for ancient LLVM versions or the Agner Fog instruction table for VIA Nano 3000 cannot explain this magnitude.

I must be misunderstanding something still.

Gentle ping.

I am still unsure about the cost model thing. Is it possible to get any information about this? Is the policy to just not worry and ignore the other entries?