This is an archive of the discontinued LLVM Phabricator instance.

[X86] Avoid folding scalar loads into unary sse intrinsics
Closed, Public

Authored by mkuper on Dec 23 2015, 3:46 AM.

Details

Summary

Not folding these cases tends to avoid partial register update stalls:

sqrtss (%eax), %xmm0

has a partial update of %xmm0, while

movss (%eax), %xmm0
sqrtss %xmm0, %xmm0

has a clobber of the high lanes immediately before the partial update, avoiding a potential stall.

Given this, we only want to fold when optimizing for size.
This is consistent with the patterns we already have for the fp/int converts, and with what we do in X86InstrInfo::foldMemoryOperandImpl().
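
As a source-level illustration (a hypothetical test case, not part of this patch), the kind of code affected looks roughly like this, with the expected output under the patch sketched in the comments:

#include <xmmintrin.h>

/* Expected at -O2 with this patch (load kept separate):
 *   movss  (%rdi), %xmm0     ; clobbers the upper lanes of %xmm0
 *   sqrtss %xmm0, %xmm0      ; partial update, but no stale dependency
 * Under OptForSize the load may still be folded:
 *   sqrtss (%rdi), %xmm0     ; partial update depending on the old %xmm0
 */
__m128 sqrt_load(const float *p) {
  return _mm_sqrt_ss(_mm_load_ss(p));
}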

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 43519. Dec 23 2015, 3:46 AM
mkuper retitled this revision from to [X86] Avoid folding scalar loads into unary sse intrinsics.
mkuper updated this object.
mkuper added reviewers: RKSimon, spatel, andreadb.
mkuper added subscribers: llvm-commits, DavidKreitzer.
spatel edited edge metadata. Dec 29 2015, 10:10 AM

This is consistent with the patterns we already have for the fp/int converts...

We still need to fix converts?

#include <xmmintrin.h>
__m128 foo(__m128 x, int *y) { return _mm_cvtsi32_ss(x, *y); }

$ ./clang -O1 ss2si.c -S -o -

cvtsi2ssl  (%rdi), %xmm1  <--- false dependency on xmm1?
movss      %xmm1, %xmm0
lib/Target/X86/X86InstrSSE.td
3392 ↗(On Diff #43519)

80-cols.

3433 ↗(On Diff #43519)

80-cols.

This is consistent with the patterns we already have for the fp/int converts...

We still need to fix converts?

#include <xmmintrin.h>
__m128 foo(__m128 x, int *y) { return _mm_cvtsi32_ss(x, *y); }

$ ./clang -O1 ss2si.c -S -o -

cvtsi2ssl  (%rdi), %xmm1  <--- false dependency on xmm1?
movss      %xmm1, %xmm0

Right, I was talking about this:

def CVTSD2SSrm  : I<0x5A, MRMSrcMem, (outs FR32:$dst), (ins f64mem:$src),
                      "cvtsd2ss\t{$src, $dst|$dst, $src}",
                      [(set FR32:$dst, (fround (loadf64 addr:$src)))],
                      IIC_SSE_CVT_Scalar_RM>,
                      XD,
                  Requires<[UseSSE2, OptForSize]>, Sched<[WriteCvtF2FLd]>;

But this is actually the non-intrinsic pattern.
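
For context, the two kinds of patterns correspond roughly to the following source forms (an illustrative sketch with placeholder function names, not taken from the patch's tests):

#include <emmintrin.h>

/* Non-intrinsic pattern: a plain scalar conversion, selected through the
 * (fround (loadf64 addr:$src)) DAG in the CVTSD2SSrm def quoted above,
 * which is already gated on OptForSize. */
float plain_convert(const double *p) { return (float)*p; }

/* Intrinsic pattern: goes through _mm_cvtsd_ss (llvm.x86.sse2.cvtsd2ss)
 * and is matched by the separate intrinsic defs. */
__m128 intrinsic_convert(__m128 a, __m128d b) { return _mm_cvtsd_ss(a, b); }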

lib/Target/X86/X86InstrSSE.td
3392 ↗(On Diff #43519)

The TDs don't enforce 80-cols consistently, and I never remember whether they should. Thanks. :-)

spatel accepted this revision. Dec 30 2015, 8:39 AM
spatel edited edge metadata.
def CVTSD2SSrm  : I<0x5A, MRMSrcMem, (outs FR32:$dst), (ins f64mem:$src),
                    "cvtsd2ss\t{$src, $dst|$dst, $src}",
                    [(set FR32:$dst, (fround (loadf64 addr:$src)))],
                    IIC_SSE_CVT_Scalar_RM>,
                    XD,
                Requires<[UseSSE2, OptForSize]>, Sched<[WriteCvtF2FLd]>;

Ah, I managed to miss that one.
How about adding some 'FIXME' notes and/or changing the other defs since we're currently inconsistent about this? LGTM otherwise.

float f1(int *x) { return *x; } 
double f2(int *x) { return *x; }
float f3(long long *x) { return *x; }
double f4(long long *x) { return *x; }
float f5(double *x) { return *x; }
double f6(float *x) { return *x; }

$ ./clang -O1 ss2si.c -S -o - |grep cvt
cvtsi2ssl	(%rdi), %xmm0
cvtsi2sdl	(%rdi), %xmm0
cvtsi2ssq	(%rdi), %xmm0
cvtsi2sdq	(%rdi), %xmm0
cvtsd2ss	%xmm0, %xmm0
cvtss2sd	%xmm0, %xmm0

Regarding handling this via ExeDepsFix - it's not clear to me that its current solution:

xorps %xmm0, %xmm0
cvtsi2ssl (%rdi), %xmm0

would be better than unfolding the load. I think the xorps instruction saves a byte in all cases, but it may be micro-arch-dependent whether that's actually cheaper?
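
To make the comparison concrete, here is a sketch using the f5() shape from above; the two instruction sequences in the comments are my reconstruction of the alternatives being discussed, not output from this patch, and their relative cost is micro-arch-dependent as noted:

/* Scalar double -> float conversion from memory; %xmm0 is otherwise
 * undefined on entry, so a folded convert inherits a false dependency on
 * its previous contents.
 *
 * Folded load plus a dependency-breaking zero (the ExeDepsFix approach):
 *   xorps    %xmm0, %xmm0
 *   cvtsd2ss (%rdi), %xmm0
 *
 * Unfolded load (movsd from memory already clears the upper lanes, so no
 * extra dependency-breaking instruction is needed):
 *   movsd    (%rdi), %xmm0
 *   cvtsd2ss %xmm0, %xmm0
 */
float f5_like(const double *x) { return (float)*x; }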

lib/Target/X86/X86InstrSSE.td
3392 ↗(On Diff #43519)

It seems like we mostly try to follow the law, but I would fully support a new rule for these files.

80-cols causes a lot of extra suffering trying to make sense of this code that's already hard to read. :)

This revision is now accepted and ready to land. Dec 30 2015, 8:39 AM

Thanks, Sanjay.

How about adding some 'FIXME' notes and/or changing the other defs since we're currently inconsistent about this? LGTM otherwise.

float f1(int *x) { return *x; } 
double f2(int *x) { return *x; }
float f3(long long *x) { return *x; }
double f4(long long *x) { return *x; }
float f5(double *x) { return *x; }
double f6(float *x) { return *x; }

I'll add FIXMEs.

Regarding handling this via ExeDepsFix - it's not clear to me that its current solution:

xorps %xmm0, %xmm0
cvtsi2ssl (%rdi), %xmm0

would be better than unfolding the load. I think the xorps instruction saves a byte in all cases, but it may be micro-arch-dependent whether that's actually cheaper?

I think it generally is better (the xorps idiom should be recognized by any modern Intel CPU, at least). But David is the real authority on this.

This revision was automatically updated to reflect the committed changes.