This is an archive of the discontinued LLVM Phabricator instance.

Improve EmitLoweredSelect for contiguous pseudo CMOV instructions.
ClosedPublic

Authored by kbsmith1 on Jul 22 2015, 2:00 PM.

Details

Summary

This change improves EmitLoweredSelect so that multiple contiguous pseudo CMOV
instructions with the same (or exactly opposite) conditions get lowered using a single
new basic block. This eliminates unnecessary extra basic blocks (and CFG merge points)
when contiguous CMOVs are being lowered.
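For illustration, a minimal source-level analogue of the pattern this change targets (the function name is invented, not from the patch): two selects guarded by the same condition. On a target without CMOV, each ternary becomes a pseudo CMOV, and with this change both are lowered through one shared branch diamond instead of two back-to-back diamonds.

```cpp
// Hypothetical example, not taken from the patch: both selects test the
// same condition, so both pseudo CMOVs can share one branch diamond.
int selectPair(int c, int a, int b, int x, int y) {
  int r1 = (c < 0) ? a : b;  // first pseudo CMOV
  int r2 = (c < 0) ? x : y;  // second pseudo CMOV, same condition
  return r1 - r2;
}
```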

Diff Detail

Repository
rL LLVM

Event Timeline

kbsmith1 updated this revision to Diff 30393.Jul 22 2015, 2:00 PM
kbsmith1 retitled this revision from to Improve EmitLoweredSelect for contiguous pseudo CMOV instructions..
kbsmith1 updated this object.
kbsmith1 added reviewers: mkuper, spatel, ab.

Ping on this review.

ab requested changes to this revision.Jul 30 2015, 10:56 AM
ab edited edge metadata.

A few nits here and there.
Testing all types is nice, but isn't IMO the most interesting part: the PHI trickery is!

lib/Target/X86/X86ISelLowering.cpp
19785 ↗(On Diff #30393)

of -> if

19788 ↗(On Diff #30393)

What about something like the IMO more descriptive "isCMOVPseudo" ?

19790–19810 ↗(On Diff #30393)

Can you sort these? I see the pseudo lowering switch is equally unordered, can you fix that too?

19814 ↗(On Diff #30393)

return false here?

20042 ↗(On Diff #30393)

Since this isn't shared between the loops, can you define it in the for(;;) instead?

20044–20047 ↗(On Diff #30393)

IMO the field names don't add much information, as they're always accompanied with the same-name variable. What about a pair instead of this struct?

20097 ↗(On Diff #30393)

That "all" bothers me ;)

-> "Now remove the CMOV(s)." ?

test/CodeGen/X86/pseudo_cmov_lower.ll
1–2 ↗(On Diff #30393)

Why the explicit cpu?

Also, IIRC we always use the CMOV pseudos for SSE selects, so can you do x86_64 tests for those as well? I don't expect interesting differences (except much more readable tests), so you can probably use i386 only for non-SSE types.

7 ↗(On Diff #30393)

I think these could use more explicit tests, for the copies and PHIs.

217–222 ↗(On Diff #30393)

These probably require AVX512. You can ignore them, I think.

This revision now requires changes to proceed.Jul 30 2015, 10:56 AM

Ahmed, thank you for the review. Please see my responses to your comments, and I will be uploading an improved version in just a bit.

lib/Target/X86/X86ISelLowering.cpp
19785 ↗(On Diff #30393)

OK.

19788 ↗(On Diff #30393)

OK.

19790–19810 ↗(On Diff #30393)

OK.

19814 ↗(On Diff #30393)

OK.

20042 ↗(On Diff #30393)

OK.

20044–20047 ↗(On Diff #30393)

Let me try a pair and see whether the code looks cleaner.

20097 ↗(On Diff #30393)

OK.

test/CodeGen/X86/pseudo_cmov_lower.ll
1–2 ↗(On Diff #30393)

This specific test is meant to test the 32-bit CPU case, as that is where most of the CMOV pseudos occur. For this specific test, the few FP things in here need a 32-bit CPU in order to test the RFP register type CMOV pseudos. pseudo_cmov_lower1.ll tests the SSE/SSE2 type pseudo CMOVs (for a 32-bit CPU) by using -mcpu=pentium, which I thought was a nice way of explicitly saying that this is a CPU without CMOV support. I also wanted to make sure that the test worked properly when a CPU with CMOV support was used, but cmov was explicitly turned off. I can add an x86_64 test based on pseudo_cmov_lower1.ll, or perhaps just another run line in that test. I don't understand why x86_64 will make the tests any more readable; can you elaborate on that?

7 ↗(On Diff #30393)

Can you explain more what you are looking for here? I cannot really test for copies and phis in the resulting output assembly code, so maybe I just don't quite understand your comment.

217–222 ↗(On Diff #30393)

Thanks for that info.

kbsmith1 updated this revision to Diff 31071.Jul 30 2015, 1:48 PM
kbsmith1 edited edge metadata.

This addresses all of Ahmed's comments in the source code. Still working on addressing comments in the test cases.

kbsmith1 updated this revision to Diff 31081.Jul 30 2015, 3:24 PM
kbsmith1 edited edge metadata.

This adds a new test, pseudo_cmov_lower2.ll which tests the PHI operand rewrite portion of the change-set. With this change, I think all of Ahmed's comments have been addressed.

ab added a comment.Jul 31 2015, 11:06 AM

Thanks for the update! A couple answers inline.

lib/Target/X86/X86ISelLowering.cpp
20074–20075 ↗(On Diff #31081)

Merge both lines into

RegRewriteTable[DestReg] = std::make_pair(Op1Reg, Op2Reg)

?
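As an aside, a self-contained sketch of the structure this suggestion implies (the map type, the `Register` alias, and the `recordCMOV` helper are invented for illustration; they are not the actual LLVM code): each PHI destination register maps to the pair of source registers chosen by the true/false edges of the shared diamond.

```cpp
#include <unordered_map>
#include <utility>

// Hypothetical names for illustration only, not the patch's actual code.
using Register = unsigned;

// Maps a CMOV destination register to its (true-edge, false-edge) sources,
// for rewriting PHI operands in the merge block later.
std::unordered_map<Register, std::pair<Register, Register>> RegRewriteTable;

void recordCMOV(Register DestReg, Register Op1Reg, Register Op2Reg) {
  // The single-line form ab suggested, replacing two field assignments:
  RegRewriteTable[DestReg] = std::make_pair(Op1Reg, Op2Reg);
}
```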

test/CodeGen/X86/pseudo_cmov_lower.ll
2–3 ↗(On Diff #31081)

This specific test is meant to test the 32-bit CPU case, as that is where most of the CMOV pseudos occur. For this specific test, the few FP things in here need a 32-bit CPU in order to test the RFP register type CMOV pseudos.

Makes sense

pseudo_cmov_lower1.ll tests the SSE/SSE2 type pseudo CMOVs (for a 32-bit CPU) by using -mcpu=pentium, which I thought was a nice way of explicitly saying that this is a CPU without CMOV support. I also wanted to make sure that the test worked properly when a CPU with CMOV support was used, but cmov was explicitly turned off.

That sounds like testing -mattr and CPU features, which should be done elsewhere, no? Here, if all you want is to assert that we're compiling for a target without CMOV, -mattr=-cmov does exactly that (but could be left out with just i386). IMO, explicit CPUs should only be used when the CPU actually matters (e.g. when testing scheduling models?).

I can add a x86_64 test based on pseudo_cmov_lower1.ll, or perhaps just as another run line in that test.

Great!

I don't understand why x86_64 will make the tests any more readable, can you elaborate on that?

Only because of the less noisy ABI, really. For tests like foo4/foo5, with very explicit checks.

8 ↗(On Diff #31081)

Right now the tests check for a specific number of branches. I'm also interested in the flow of data and register assignments, leading to the final return value. Basically, the various MOVs in each block, and the final SUB.

Consider, for instance, a bug where one of the PHIs has inverted operands (or even the feature where opposite CCs lead to inverted PHI operands): we should test for that.

By the way, I was testing this out when I noticed that this:

define i32 @foo(i32 %v0, i32 %v1, i32 %v2, i32 %v3, i32 %v4) nounwind {
entry:
  %cmp = icmp slt i32 %v0, 0  ;; <- %v0 disables the opt,  %v1 enables it
  %v3.v4 = select i1 %cmp, i32 %v3, i32 %v4
  %v1.v2 = select i1 %cmp, i32 %v1, i32 %v2
  %sub = sub i32 %v1.v2, %v3.v4
  ret i32 %sub
}

seems to not trigger the optimization (I get JS twice). Shouldn't it?

Thanks for the additional comments. See some explanations inline. I'll see if I can improve a few of the tests to check for operand order, but that isn't generally doable in all the tests, due to the complexities of how downstream passes affect the code.

lib/Target/X86/X86ISelLowering.cpp
20074–20075 ↗(On Diff #31081)

OK.

test/CodeGen/X86/pseudo_cmov_lower.ll
2–3 ↗(On Diff #31081)

OK, I'll just use i386-linux-gnu rather than the -mcpu options.

8 ↗(On Diff #31081)

Regarding the first part: it is very difficult to test that the PHI operands are in the correct order (well, really, that they come from the proper BB). The code effectively seen after the transform looks like:

BB1:
  cmp
  jns BB3

BB2:  // empty, falls through to BB3

BB3:
  phi op1 (from BB1), op2 (from BB2)
  phi op11 (from BB1), op12 (from BB2)

The actual assembly generated for the movs is fairly tricky, and very dependent on the whims of register allocation. I didn't put more specific tests for operands in, because that would tend to make the tests very brittle as downstream passes are changed. In fact, depending on the moves that needed to be inserted, downstream passes can effectively introduce an else block by making BB2 jmp to BB3 and retargeting the JNS to the else block. So the order of the operands in the later generated instructions is very dependent on downstream passes, and not on this change itself.

Now to the second part of the comment. Yes, it looks like this should have hit the optimization, but it doesn't, for an interesting reason. A select of two memory operands ends up being lowered into a select of two LEAs that represent the addresses of the two memory operands, followed by a single dereference of the selected pointer. This means you don't have to speculate a load, or increase the number of loads in the program. The actual IR for the example in question, by the time you get to EmitLoweredSelect, is this:

(gdb) print BB->dump()
BB#0: derived from LLVM BB %entry

CMP32mi8 <fi#-1>, 1, %noreg, 0, %noreg, 0, %EFLAGS<imp-def>; mem:LD4[FixedStack-1](align=16)
%vreg0<def> = LEA32r <fi#-4>, 1, %noreg, 0, %noreg; GR32:%vreg0
%vreg1<def> = LEA32r <fi#-5>, 1, %noreg, 0, %noreg; GR32:%vreg1
%vreg2<def> = CMOV_GR32 %vreg1<kill>, %vreg0<kill>, 15, %EFLAGS<imp-use> ;GR32:%vreg2,%vreg1,%vreg0
%vreg3<def> = LEA32r <fi#-2>, 1, %noreg, 0, %noreg; GR32:%vreg3
%vreg4<def> = LEA32r <fi#-3>, 1, %noreg, 0, %noreg; GR32:%vreg4
%vreg5<def> = CMOV_GR32 %vreg4<kill>, %vreg3<kill>, 15, %EFLAGS<imp-use>; GR32:%vreg5,%vreg4,%vreg3
%vreg6<def> = MOV32rm %vreg5<kill>, 1, %noreg, 0, %noreg; mem:LD4[<unknown>] GR32:%vreg6,%vreg5
%vreg7<def,tied1> = SUB32rm %vreg6<tied0>, %vreg2<kill>, 1, %noreg, 0, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[<unknown>] GR32:%vreg7,%vreg6,%vreg2
%EAX<def> = COPY %vreg7; GR32:%vreg7
RETL %EAX

$1 = void

As you can see the pseudo CMOVs are not contiguous, and thus this new code doesn't apply.
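The lowering described above can be sketched at the source level (the function name is invented for illustration; this is just the shape of the transformation, not the patch's code): selecting between two in-memory values becomes a select between their addresses followed by one load, so neither load is speculated, and the address-producing LEAs end up between the two CMOVs.

```cpp
// Hypothetical sketch: the select is applied to the pointers, not the
// loaded values, so only the chosen address is ever dereferenced.
int selectOfMemory(bool cond, const int &v3, const int &v4) {
  const int *p = cond ? &v3 : &v4;  // lowered as LEA/LEA/CMOV on addresses
  return *p;                        // single dereference of the winner
}
```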

ab added inline comments.Jul 31 2015, 12:53 PM
test/CodeGen/X86/pseudo_cmov_lower.ll
8 ↗(On Diff #31081)

The actual assembly generated for the movs is fairly tricky, and very dependent on the whims of register allocation. I didn't put more specific tests for operands in, because that would tend to make the tests very brittle as downstream passes are changed. In fact, depending on the moves that needed to be inserted, downstream passes can effectively introduce an else block by making BB2 jmp to BB3 and retargeting the JNS to the else block. So the order of the operands in the later generated instructions is very dependent on downstream passes, and not on this change itself.

I realize that, and this is a shortcoming of our test infrastructure that the MI serialization effort intends to address. Still, right now, in my opinion, overly explicit tests are better than no tests.

So, my (radical) advice is: use the utils/update_llc_test_checks.py script, and tweak it locally to remove the SP-offset scrubs (l46). You'll probably want to get rid of foo9, and simplify foo8 (just focus on two types rather than the combination of all?). Also, I think the reuse of %v2 might make the output harder to follow than it would be if you had v1.v2 and v3.v4, but that's just a gut feeling. When someone changes anything downstream that affects this, they can just run the script again and have a nice diff of what changed.

What do you think? If you can come up with a cleaner way, that'd be great! But I'm uncomfortable with no testing at all :/

Now to the second part of the comment. Yes, it looks like this should have hit the optimization, but it doesn't for an interesting reason. [...]

Thanks for investigating the example!

kbsmith1 updated this revision to Diff 31161.Jul 31 2015, 2:01 PM

Updated code to fix the one remaining suggestion. Added a test foo3 to pseudo_cmov_lower2.ll which does the checks that Ahmed requested for exact operand order. This should be pretty stable, because the x86_64 register-passing conventions really force specific patterns downstream.

kbsmith1 updated this revision to Diff 31187.Jul 31 2015, 10:01 PM

Updated pseudo_cmov_lower2.ll test routine foo3, and added routine foo4. foo3 is now a simpler pattern to check for, and foo4 tests the correctness of the transformation when the "opposite condition" code in EmitLoweredSelect gets used.

kbsmith1 marked 12 inline comments as done.Aug 4 2015, 8:02 AM

Ahmed, I think I have addressed all the comments and improvements that you asked for. Is this now good to go from your point of view?

ab accepted this revision.Aug 5 2015, 10:40 AM
ab edited edge metadata.

I'd still prefer tests with only -NEXT from the label to the ret, but this LGTM as is, thanks!

-Ahmed

lib/Target/X86/X86ISelLowering.cpp
19858 ↗(On Diff #31187)

rewrting -> rewriting

test/CodeGen/X86/pseudo_cmov_lower.ll
218–223 ↗(On Diff #31187)

If this test isn't actually testing CMOV_VNI1, should it be removed?

This revision is now accepted and ready to land.Aug 5 2015, 10:40 AM
kbsmith1 updated this revision to Diff 31374.Aug 5 2015, 10:58 AM
kbsmith1 edited edge metadata.

Fixed "rewrting" spelling. Added a comment on the CMOV_V*I1 test to indicate that this test is to be updated if a way is found to generate these pseudo CMOV opcodes.

kbsmith1 marked an inline comment as done.Aug 5 2015, 11:00 AM
ab added inline comments.Aug 5 2015, 11:02 AM
test/CodeGen/X86/pseudo_cmov_lower.ll
219–224 ↗(On Diff #31374)

Does enabling, say, avx512f work?

Michael,

Could you commit these changes for me since I don't yet have commit permission to the llvm svn repository?

Thank you,
Kevin

Added inline comment.

test/CodeGen/X86/pseudo_cmov_lower.ll
219–224 ↗(On Diff #31374)

I will look into that. If it does, then I will remove this specific test, and add the necessary test back as a separate test file in a separate change set.

Michael, Can you commit these changes for me since I don't yet have commit permission in LLVM svn?

This revision was automatically updated to reflect the committed changes.