This is an archive of the discontinued LLVM Phabricator instance.

Fix for bug 41512: lower INSERT_VECTOR_ELT(ZeroVec, 0, Elt) to SCALAR_TO_VECTOR(Elt) for all SSE flavors
ClosedPublic

Authored by Serge_Preis on Apr 17 2019, 10:21 PM.

Details

Summary

Current LLVM uses pxor+pinsrb on SSE4+ for INSERT_VECTOR_ELT(ZeroVec, 0, Elt) instead of a much simpler movd.
INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is an idiomatic construct which is used, e.g., for _mm_cvtsi32_si128(Elt) and for lowest-element initialization in _mm_set_epi32.
Such inefficient lowering leads to significant performance degradations in certain cases when switching from SSSE3 to SSE4.
https://bugs.llvm.org/show_bug.cgi?id=41512
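
For illustration only (not part of the patch), the idiom typically comes from intrinsic code like the sketch below; the helper names are made up for this example:

#include <emmintrin.h>

// Both helpers build a vector whose low element is x and whose remaining
// elements are zero, i.e. the INSERT_VECTOR_ELT(ZeroVec, 0, Elt) shape
// described above. The expectation is a single movd; the report is that
// SSE4+ targets were instead getting pxor + pinsr in some contexts.
__m128i from_scalar(int x) {
  return _mm_cvtsi32_si128(x);
}

__m128i from_low_lane(int x) {
  return _mm_set_epi32(0, 0, 0, x); // lowest element is x, rest are zero
}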

Here INSERT_VECTOR_ELT(ZeroVec, 0, Elt) is simply converted to SCALAR_TO_VECTOR(Elt) when applicable, since the latter is a closer match to the desired behavior and is always efficiently lowered to movd and the like.
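
A minimal sketch of the shape of such a conversion, in the style of the X86ISelLowering.cpp code the patch touches (the helper name and the exact guards here are illustrative assumptions, not the committed diff):

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Sketch only: fold INSERT_VECTOR_ELT(all-zeros vector, 0, Elt) into
// SCALAR_TO_VECTOR(Elt) so that ISel can pick a movd/movq-style move.
static SDValue foldInsertIntoZeroVector(SDValue Op, SelectionDAG &DAG) {
  SDValue Vec = Op.getOperand(0); // vector being inserted into
  SDValue Elt = Op.getOperand(1); // scalar element
  SDValue Idx = Op.getOperand(2); // insertion index
  MVT VT = Op.getSimpleValueType();

  // Only the "insert into lane 0 of an all-zeros vector" idiom is handled.
  if (!isNullConstant(Idx) || !ISD::isBuildVectorAllZeros(Vec.getNode()))
    return SDValue();

  // Note: SCALAR_TO_VECTOR leaves the upper lanes undefined in generic DAG
  // semantics; the actual patch sits in X86-specific lowering, where the
  // selected movd/movq patterns zero those lanes, and it carries the exact
  // type/legality guards that this sketch omits.
  return DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(Op), VT, Elt);
}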

Diff Detail

Event Timeline

Serge_Preis created this revision. Apr 17 2019, 10:21 PM
Serge_Preis edited the summary of this revision. Apr 17 2019, 10:22 PM

Your test cases need to be a lot simpler - I'd recommend looking at buildvec-insertvec.ll and possibly adding your tests to that file instead of adding this new file.

You will need to use the utils/update_llc_test_checks.py script to autogenerate your codegen checks as well.

Your test cases need to be a lot simpler - I'd recommend looking at buildvec-insertvec.ll and possibly adding your tests to that file instead of adding this new file.

The problem is that in a tiny kernel LLVM behaves as I expect it to, while in a loop it underperforms: https://gcc.godbolt.org/z/PljujX -- compare the Sum2 and Loop() code generation.
I will do my best to minimize the case.

You will need to use the utils/update_llc_test_checks.py script to autogenerate your codegen checks as well.

Ok, I will look into this.

Your test cases need to be a lot simpler - I'd recommend looking at buildvec-insertvec.ll and possibly adding your tests to that file instead of adding this new file.

The problem is that in a tiny kernel LLVM behaves as I expect it to, while in a loop it underperforms: https://gcc.godbolt.org/z/PljujX -- compare the Sum2 and Loop() code generation.
I will do my best to minimize the case.

If we are getting this right sometimes, then we might already have the transform that we want, but it is limited in some way that prevents getting the larger case.

I doubt that the loop itself is needed to demonstrate the problem because I see 'movd' codegen even with a loop as long as it is not unrolled.

If we are getting this right sometimes, then we might already have the transform that we want, but it is limited in some way that prevents getting the larger case.
I doubt that the loop itself is needed to demonstrate the problem because I see 'movd' codegen even with a loop as long as it is not unrolled.

After more experimentation I tend to agree. Also, the most basic case produces pinsrd even in a small kernel (https://gcc.godbolt.org/z/HAmNha), so I will just create a test from it.

If we are getting this right sometimes, then we might already have the transform that we want, but it is limited in some way that prevents getting the larger case.
I doubt that the loop itself is needed to demonstrate the problem because I see 'movd' codegen even with a loop as long as it is not unrolled.

After more experimentation I tend to agree. Also, the most basic case produces pinsrd even in a small kernel (https://gcc.godbolt.org/z/HAmNha), so I will just create a test from it.

Not sure if this is minimal, but this seems to show the problem:

define <2 x i64> @pinsr(i32 %x, i32 %y) {
  %ins1 = insertelement <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>, i32 %x, i32 0
  %ins2 = insertelement <4 x i32> <i32 undef, i32 0, i32 undef, i32 undef>, i32 %y, i32 0
  %b1 = bitcast <4 x i32> %ins1 to <2 x i64>
  %b2 = bitcast <4 x i32> %ins2 to <2 x i64>
  %r = shufflevector <2 x i64> %b1, <2 x i64> %b2, <2 x i32> <i32 0, i32 2>
  ret <2 x i64> %r
}

$ llc -o - pinsr.ll -mattr=sse4.2
	pxor	%xmm1, %xmm1
	pxor	%xmm0, %xmm0
	pinsrd	$0, %edi, %xmm0
	pinsrd	$0, %esi, %xmm1
	punpcklqdq	%xmm1, %xmm0    ## xmm0 = xmm0[0],xmm1[0]
	retq

Fixed test according to review comments

I added some baseline tests here:
rL358687

Please rebase/update against those and upload a new patch.

With updated insertelement-zero.ll test

This looks like the expected improvement to me, but given the 'blend' comment/test, I'd like someone else to confirm that I'm not overlooking some weird corner case.

llvm/test/CodeGen/X86/insertelement-zero.ll
503 (On Diff #195795)

Remove 'FIXME' from this line.

633–635 (On Diff #195795)

I looked at this case, and it is intentional because there's a comment in X86InstrSSE.td:
"these were changed to use blends because blends have better throughput on sandybridge and haswell"

Serge_Preis marked 2 inline comments as done.

Deleted FIXME

Serge_Preis marked an inline comment as done. Apr 18 2019, 9:29 PM
Serge_Preis added inline comments.
llvm/test/CodeGen/X86/insertelement-zero.ll
633–635 (On Diff #195795)

My change itself doesn't dictate any particular instruction selection; it just makes the idiom it deals with more obvious to ISel. If some instructions are better than others for first-element initialization, they should naturally be used after the fix. So I'm just happy to see confirmation here.

Friendly ping: how should I proceed with this review?

RKSimon added inline comments. Apr 29 2019, 6:36 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
16972

Subtarget.is64Bit() should be superfluous

Serge_Preis added inline comments. Apr 29 2019, 11:28 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
16972

This is a good question. On the one (conservative) hand, I am aiming at the movq instruction, which is only available on 64-bit subtargets (and similar places have the same set of checks). On the other hand, the logic behind the conversion is pretty generic and doesn't depend on the subtarget. So the question is: are we sure that a 64-bit SCALAR_TO_VECTOR will now be handled efficiently on a 32-bit subtarget, or will such a generalization require further patching?

craig.topper added inline comments. Apr 29 2019, 11:44 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
16972

If this is not a 64-bit target, type legalization would have guaranteed there are no i64 values in the DAG before we get to this code.
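
As a concrete illustration of that point (the element-type list below is an assumption for the example, not the exact committed condition), the fold's type guard can simply omit the subtarget test for the i64 case:

#include "llvm/Support/MachineValueType.h"
using namespace llvm;

// Illustrative sketch: which scalar element types such a fold might accept.
// No Subtarget.is64Bit() check is needed for MVT::i64, since on a 32-bit
// target type legalization splits illegal i64 values before this lowering
// code ever runs.
static bool isFoldableScalarType(MVT EltVT) {
  return EltVT == MVT::i32 || EltVT == MVT::i64 ||
         EltVT == MVT::f32 || EltVT == MVT::f64;
}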

Removed subtarget check in response to review comments

Serge_Preis marked an inline comment as done. Apr 29 2019, 11:59 PM
Serge_Preis added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
16972

OK, uploaded a new iteration without the subtarget check.

RKSimon accepted this revision. Apr 30 2019, 1:26 AM

LGTM - @Serge_Preis do you have commit access?

This revision is now accepted and ready to land. Apr 30 2019, 1:26 AM

@Serge_Preis do you have commit access?

No, unfortunately I don't.

This revision was automatically updated to reflect the committed changes.