This is an archive of the discontinued LLVM Phabricator instance.

[x86] use vector instructions to lower FP->int->FP casts
ClosedPublic

Authored by spatel on Apr 10 2020, 1:16 PM.

Details

Summary

As discussed in PR36617:
https://bugs.llvm.org/show_bug.cgi?id=36617#c13
...we can avoid the likely slow round-trip from XMM to GPR to XMM by using the vector versions of the convert instructions.

Based on experimental results from recent Intel/AMD chips, we don't need to worry about triggering denorm stalls while operating on garbage data in the high lanes with convert instructions, so this is expected to always be as good or better perf than the scalar instruction equivalent. FP exceptions are also not a concern because strict code should not be using the regular SDAG opcodes.

Diff Detail

Event Timeline

spatel created this revision.Apr 10 2020, 1:16 PM

Looks ok to me in principle.

pcordes accepted this revision.Apr 10 2020, 2:07 PM
This revision is now accepted and ready to land.Apr 10 2020, 2:07 PM

Looks good to me, too. SSE4.1 still uses roundss which is at least as good. (1 uop on SnB and AMD, 2 uops on HSW/SKL).

While we're looking at this, could we use [v]roundss $11, src, src, dst when available for no-fast-math?

float->int overflow is UB so ISO C allows us to assume it doesn't happen. Unless we want to go beyond that, we don't need to preserve the behaviour of producing 0x80000000 as an integer result from the overflowing (int) cast and then converting that back to a float that represents that negative value, for out-of-range inputs. But if this optimization stage happens late enough on IR that does expect those semantics, that's different. But behaviour for similar asm instructions on other ISAs might not give the same bit-pattern so I'm not sure how useful this is. e.g. ARM conversions saturate to the bounds of the int value-range.

(I meant to just mark my review as "looks good to me", but I think that actually marked it as officially accepted. Let me know if that wasn't what the preferred course of action.)

While we're looking at this, could we use [v]roundss $11, src, src, dst when available for no-fast-math?

float->int overflow is UB so ISO C allows us to assume it doesn't happen. Unless we want to go beyond that, we don't need to preserve the behaviour of producing 0x80000000 as an integer result from the overflowing (int) cast and then converting that back to a float that represents that negative value, for out-of-range inputs. But if this optimization stage happens late enough on IR that does expect those semantics, that's different. But behaviour for similar asm instructions on other ISAs might not give the same bit-pattern so I'm not sure how useful this is. e.g. ARM conversions saturate to the bounds of the int value-range.

The problem is -0.0 rather than overflow. (The non-obvious "#0" attribute on some of the tests is for "no-signed-zeros-fp-math"="true"; I should change the test name to make it clearer.)
We used to produce roundss more aggressively, but we had to back that out because it's wrong:
D48085
We also added a bailout flag for the transform because there's too much code in the wild that relied on some particular UB overflow behavior.

The problem is -0.0 rather than overflow.

Ah yes, that's a showstopper for correctness, thanks.

Is anyone working on a similar patch for double-precision? Zen2 apparently has single-uop CVTPD2DQ XMM, XMM and DQ2PD.

https://www.uops.info/table.html?search=cvtpd2dq&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_CON=on&cb_SNB=on&cb_HSW=on&cb_SKX=on&cb_ICL=on&cb_ZEN%2B=on&cb_ZEN2=on&cb_measurements=on&cb_base=on&cb_avx=on&cb_avx2=on&cb_sse=on

It may be break-even on other CPUs for throughput, and a win for latency.

Conroe and Nehalem have single-uop CVTTSD2SI (R32, XMM) so scalar round trip is best there for throughput.

SnB-family CPUs have 2 uop cvt to/from SI or PD<->DQ, but going to xmm->GP-int is port 0. So a scalar round trip distributes back-end port pressure more evenly. So the best choice depends on surrounding code if tuning for Intel without caring about Zen2. In a loop that *only* does (float)(int)x, on Skylake scalar has 1/clock throughput because it avoids a port5 bottleneck. The 2 instructions are p0 + p01 and p5 + p01. vs. cvttpd2dq and back would both be p5 + p01.

We can avoid a false dependency by converting back into the XMM reg we came from, so scalar round trip can avoid needing an extra instruction for xor-zeroing the destination. (But apparently we missed that optimization for float in the testcases without this patch).

RKSimon accepted this revision.Apr 11 2020, 4:08 AM

LGTM, cheers - I'm not sure whats stopping us handling the float/double-int-double/float conversions as well in this patch though.

llvm/lib/Target/X86/X86ISelLowering.cpp
19157

What's stopping us getting this done in this patch as well?

spatel marked 2 inline comments as done.Apr 11 2020, 5:36 AM
spatel added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
19157

useVectorCast() needs to be updated (or we just account for instruction availability directly here).

Not a big deal, but I figured it was better to go piecemeal with the legacy of broken software from the related fptrunc patches. :)

I'll add more tests and make the enhancements soon.

This revision was automatically updated to reflect the committed changes.
spatel marked an inline comment as done.
Herald added a project: Restricted Project. · View Herald TranscriptApr 12 2020, 7:28 AM