This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Add general memory folding for (V)INSERTPS instruction
Closed, Public

Authored by RKSimon on Oct 22 2015, 10:27 AM.

Details

Summary

This patch improves the memory folding of the inserted float element for the (V)INSERTPS instruction.

The existing implementation occurs in the DAGCombiner and relies on the narrowing of a whole vector load into a scalar load (into a vector) to then allow folding to occur later on. Not only has this proven problematic for debug builds, but it also prevents other memory folds (notably stack reloads) from happening.

This patch removes the old implementation and moves the folding code to the X86 foldMemoryOperand handler. A new private 'special case' function - foldMemoryOperandSpecial - has been added to deal with memory folding of instructions that can't just use the lookup tables (insertps is the first of several that could be done).
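As a rough illustration of that two-tier dispatch (this is not the actual LLVM code; the opcode names and fold table below are toy stand-ins), the shape is: try the special-case handler first, then fall back to the regular fold tables:

```
#include <cstdio>
#include <map>
#include <optional>

// Toy model of the dispatch described above. Opcodes are stand-ins; the
// real code works on MachineInstrs and the X86 memory-folding tables.
enum Opcode { INSERTPSrr, ADDPSrr, INSERTPSrm, ADDPSrm };

// Handles instructions whose folding needs more than an opcode swap
// (rewriting the immediate, offsetting the address, etc.).
std::optional<Opcode> foldMemoryOperandSpecial(Opcode Op) {
  if (Op == INSERTPSrr)
    return INSERTPSrm;
  return std::nullopt;
}

std::optional<Opcode> foldMemoryOperand(Opcode Op) {
  // Special cases first...
  if (std::optional<Opcode> Folded = foldMemoryOperandSpecial(Op))
    return Folded;
  // ...then the regular one-to-one fold tables.
  static const std::map<Opcode, Opcode> FoldTable = {{ADDPSrr, ADDPSrm}};
  auto It = FoldTable.find(Op);
  return It == FoldTable.end() ? std::nullopt
                               : std::optional<Opcode>(It->second);
}

int main() {
  std::printf("insertps folds: %s\n",
              foldMemoryOperand(INSERTPSrr) ? "yes" : "no");
  std::printf("addps folds:    %s\n",
              foldMemoryOperand(ADDPSrr) ? "yes" : "no");
}
```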

It also tweaks the memory operand folding code with an additional pointer offset that allows existing memory addresses to be modified, in this case to convert the vector address to the explicit address of the scalar element that will be inserted.

Unlike the previous implementation, we now set the insertion source index to zero. The index is ignored by the (V)INSERTPSrm version, so the main benefit is that shuffle decodes don't show a pointer offset.
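To make the offset and immediate rewrite concrete, here is a minimal standalone sketch of the arithmetic (the struct and function names are hypothetical; the imm8 layout, bits [7:6] = CountS, bits [5:4] = CountD, bits [3:0] = ZMask, is the documented INSERTPS encoding):

```
#include <cassert>
#include <cstdio>

// Result of folding the INSERTPS source operand into a memory load.
struct FoldedInsertPS {
  unsigned NewImm;    // immediate with the CountS (source index) bits cleared
  unsigned PtrOffset; // byte offset added to the folded load address
};

// Given the register-register INSERTPS immediate, compute the immediate
// and address offset for the memory form: the scalar is loaded directly
// from element CountS of the source vector, so CountS becomes 0 and the
// address is advanced by CountS * sizeof(float) bytes instead.
FoldedInsertPS foldInsertPSImm(unsigned Imm) {
  unsigned ZMask = Imm & 0xF;       // bits [3:0]: lanes zeroed in the result
  unsigned DstIdx = (Imm >> 4) & 3; // bits [5:4]: destination element
  unsigned SrcIdx = (Imm >> 6) & 3; // bits [7:6]: source element
  return {(DstIdx << 4) | ZMask, SrcIdx * 4};
}

int main() {
  // insertps $0x70: take element 1 of the source, write element 3 of the
  // destination, no zeroing. Folded: offset the load by 4 bytes, imm 0x30.
  FoldedInsertPS F = foldInsertPSImm(0x70);
  assert(F.NewImm == 0x30 && F.PtrOffset == 4);
  std::printf("new imm = 0x%02X, offset = %u bytes\n", F.NewImm, F.PtrOffset);
}
```

As discussed further down the thread, clearing CountS rather than merely relying on it being ignored also matters if the load is later unfolded back to the register form.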

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 38145.Oct 22 2015, 10:27 AM
RKSimon retitled this revision from to [X86][SSE] Add general memory folding for (V)INSERTPS instruction.
RKSimon updated this object.
RKSimon added reviewers: qcolombet, filcab, andreadb, spatel.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
filcab edited edge metadata.Nov 4 2015, 4:22 AM

LGTM, but I'm not the most well versed in the MachineInstr stuff.
Thanks for making the filecheck stuff match more of the instruction, too!

lib/Target/X86/X86ISelLowering.cpp
26651 (On Diff #38145)

Please split the clang-format (or manual format) changes from the rest.

lib/Target/X86/X86InstrInfo.cpp
4894 (On Diff #38145)

Please split the clang-format changes from the rest.

4923 (On Diff #38145)

I would s/Special/Custom/, but that's just me bikeshedding :-)

lib/Target/X86/X86InstrInfo.h
515 (On Diff #38145)

Can you double-check the doxygen that gets generated? IIRC, it stops at the first '.'.

Unless there's a special case for "e.g.", it's probably best to replace it with a more colloquial "like", or "for example:".
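For context, this is the standard Doxygen autobrief gotcha (assuming JAVADOC_AUTOBRIEF is enabled, as I believe LLVM's Doxygen setup does): the brief description ends at the first '.' followed by whitespace, so an abbreviation truncates it. A hypothetical illustration:

```
/// Folds instructions that need custom handling, e.g. INSERTPS, into the
/// memory form.
// Brief becomes: "Folds instructions that need custom handling, e.g."

/// Folds instructions that need custom handling (for example INSERTPS)
/// into the memory form.
// Brief keeps the whole sentence.
```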

RKSimon updated this revision to Diff 39199.Nov 4 2015, 6:46 AM
RKSimon edited edge metadata.

Refreshed patch based on Filipe's comments

RKSimon marked 4 inline comments as done.Nov 4 2015, 6:47 AM
rob.lougher added a comment (edited). Nov 4 2015, 6:48 AM

Hi Simon,

We discovered a bug internally, caused by the non-zeroing of the countS bits when folding the insertps load. Although the countS bits are ignored when insertps loads from memory, we need to explicitly set them to 0, as another optimization may later "unfold" the load. This is demonstrated by the following test case (the checks are based on the RUN lines from the sse41.ll file).

```
define <4 x float> @foo(<4 x float>* %v0, <4 x float>* %v1) {
; X32-LABEL: foo:
; X32:       BB#0:
; X32-NEXT:    movl {{[0-9]+}}(%esp), %eax
; X32-NEXT:    movl {{[0-9]+}}(%esp), %ecx
; X32-NEXT:    movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; X32-NEXT:    movaps (%eax), %xmm0
; X32-NEXT:    insertps {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[0]
; X32-NEXT:    addps %xmm1, %xmm0
; X32-NEXT:    retl
;
; X64-LABEL: foo:
; X64:       BB#0:
; X64-NEXT:    movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
; X64-NEXT:    movaps (%rdi), %xmm0
; X64-NEXT:    insertps {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[0]
; X64-NEXT:    addps %xmm1, %xmm0
; X64-NEXT:    retq
  %a = getelementptr inbounds <4 x float>, <4 x float>* %v1, i64 0, i64 1
  %b = load float, float* %a, align 4
  %c = insertelement <4 x float> undef, float %b, i32 0
  %d = load <4 x float>, <4 x float>* %v1, align 16
  %e = load <4 x float>, <4 x float>* %v0, align 16
  %f = shufflevector <4 x float> %e, <4 x float> %d, <4 x i32> <i32 0, i32 1, i32 2, i32 5>
  %g = fadd <4 x float> %c, %f
  ret <4 x float> %g
}
```
Another minor comment is that your change will do general memory load folding in addition to stack folding, but you've only got tests for stack folding.

RKSimon added a comment.

> Another minor comment is that your change will do general memory load folding in addition to stack folding, but you've only got tests for stack folding.

The changes in test/CodeGen/X86/avx.ll and test/CodeGen/X86/sse41.ll cover general folded loads.

I'll add the test case to the patch shortly.

RKSimon updated this revision to Diff 39203.Nov 4 2015, 7:25 AM

Added Rob's additional test case

rob.lougher accepted this revision.Nov 4 2015, 7:44 AM
rob.lougher added a reviewer: rob.lougher.

Hi Simon,

Thanks for adding the test. This looks good to me.

Rob.

This revision is now accepted and ready to land.Nov 4 2015, 7:44 AM
This revision was automatically updated to reflect the committed changes.