This change improves EmitLoweredSelect so that multiple contiguous pseudo CMOV
instructions with the same (or exactly opposite) conditions are lowered using a single
new basic block. This eliminates unnecessary extra basic blocks (and CFG merge points)
when contiguous CMOVs are lowered.
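In outline, the lowering scans forward from the first CMOV pseudo and collects every immediately following CMOV pseudo whose condition code is the same as, or the exact opposite of, the first one; the whole group then shares one conditional branch and one merge block. Below is a minimal self-contained C++ model of that scan; the `MIish` type, `contiguousCMOVGroup`, and the integer condition codes are simplified stand-ins invented for this writeup, not the actual LLVM types or the patch's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for a MachineInstr; the real code walks
// MachineBasicBlock iterators in X86ISelLowering.cpp.
struct MIish {
  bool IsCMOVPseudo;  // would be an opcode check (the suggested isCMOVPseudo)
  uint8_t CondCode;   // X86 condition code as an integer
};

// X86's encoding pairs each condition with its opposite in adjacent values
// (e.g. COND_E/COND_NE), so flipping the low bit yields the opposite CC;
// the real backend uses X86::GetOppositeBranchCondition for this.
static uint8_t oppositeCC(uint8_t CC) { return CC ^ 1; }

// Count how many contiguous CMOV pseudos starting at First can share a
// single new basic block: each must use the group CC or its exact opposite.
static std::size_t contiguousCMOVGroup(const std::vector<MIish> &Block,
                                       std::size_t First) {
  if (First >= Block.size() || !Block[First].IsCMOVPseudo)
    return 0;
  const uint8_t CC = Block[First].CondCode;
  std::size_t N = 1;
  while (First + N < Block.size() && Block[First + N].IsCMOVPseudo &&
         (Block[First + N].CondCode == CC ||
          Block[First + N].CondCode == oppositeCC(CC)))
    ++N;
  return N;  // all N CMOVs get one conditional branch and one merge block
}
```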
A few nits here and there.
Testing all types is nice, but isn't IMO the most interesting part: the PHI trickery is!
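For context on that PHI trickery: each CMOV in a contiguous group becomes a PHI in the common merge block, and a CMOV whose condition is the exact opposite of the group's condition simply gets its two incoming values swapped. Here is a hedged stand-alone model of that operand choice; the `Cmov`/`PhiNode` types and field names are invented for illustration (in LLVM these are CMOV_* pseudo MachineInstrs and TargetOpcode::PHI instructions).

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct Cmov {
  unsigned Dst, FalseReg, TrueReg;  // Dst = CC ? TrueReg : FalseReg
  uint8_t CC;
};

struct PhiNode {
  unsigned Dst;
  unsigned FromCopyBB;  // incoming value when the branch is not taken
  unsigned FromHeadBB;  // incoming value when the branch is taken
};

// One PHI per CMOV. GroupCC is the condition the single conditional branch
// tests; an opposite-CC CMOV swaps its incoming values instead of needing
// its own diamond.
std::vector<PhiNode> buildPhis(const std::vector<Cmov> &Group,
                               uint8_t GroupCC) {
  std::vector<PhiNode> Phis;
  Phis.reserve(Group.size());
  for (const Cmov &C : Group) {
    PhiNode P{C.Dst, C.FalseReg, C.TrueReg};
    if (C.CC != GroupCC)  // exact opposite condition within the group
      std::swap(P.FromCopyBB, P.FromHeadBB);
    Phis.push_back(P);
  }
  return Phis;
}
```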
| lib/Target/X86/X86ISelLowering.cpp | |
|---|---|
| 19785 | of -> if |
| 19788 | What about something like the IMO more descriptive "isCMOVPseudo"? |
| 19790–19810 | Can you sort these? I see the pseudo lowering switch is equally unordered; can you fix that too? |
| 19814 | return false here? |
| 20040 | Since this isn't shared between the loops, can you define it in the for(;;) instead? |
| 20042–20045 | IMO the field names don't add much information, as they're always accompanied by the same-name variable. What about a pair instead of this struct? |
| 20089–20092 | That "all" bothers me ;) -> "Now remove the CMOV(s)." ? |
| test/CodeGen/X86/pseudo_cmov_lower.ll | |
|---|---|
| 2–3 | Why the explicit CPU? Also, IIRC we always use the CMOV pseudos for SSE selects, so can you do x86_64 tests for those as well? I don't expect interesting differences (except much more readable tests), so you can probably use i386 only for the non-SSE types. |
| 8 | I think these could use more explicit tests, for the copies and PHIs. |
| 218–223 | These probably require AVX512. You can ignore them, I think. |
Ahmed, thank you for the review. Please see my responses to your comments, and I will be uploading an improved version in just a bit.
| lib/Target/X86/X86ISelLowering.cpp | |
|---|---|
| 19785 | OK. |
| 19788 | OK. |
| 19790–19810 | OK. |
| 19814 | OK. |
| 20040 | OK. |
| 20042–20045 | Let me try a pair and see whether the code looks cleaner. |
| 20089–20092 | OK. |
| test/CodeGen/X86/pseudo_cmov_lower.ll | |
|---|---|
| 2–3 | This specific test is meant to cover the 32-bit CPU case, as that is where most of the CMOV pseudos occur; the few FP cases in here need a 32-bit CPU to exercise the RFP register type CMOV pseudos. pseudo_cmov_lower1.ll tests the SSE/SSE2-type pseudo CMOVs (for a 32-bit CPU) by using -mcpu=pentium, which I thought would be a nice way of explicitly saying that this is a CPU without CMOV support. I also wanted to make sure that the test worked properly when a CPU with CMOV support was used but CMOV was explicitly turned off. I can add an x86_64 test based on pseudo_cmov_lower1.ll, or perhaps just another RUN line in that test. I don't understand why x86_64 will make the tests any more readable; can you elaborate on that? |
| 8 | Can you explain more about what you are looking for here? I cannot really test for copies and PHIs in the resulting output assembly code, so maybe I just don't quite understand your comment. |
| 218–223 | Thanks for that info. |
This addresses all of Ahmed's comments in the source code. Still working on addressing comments in the test cases.
This adds a new test, pseudo_cmov_lower2.ll, which tests the PHI operand rewriting portion of the changeset. With this change, I think all of Ahmed's comments have been addressed.
Thanks for the update! A couple answers inline.
| lib/Target/X86/X86ISelLowering.cpp | |
|---|---|
| 20103–20104 | Merge both lines into `RegRewriteTable[DestReg] = std::make_pair(Op1Reg, Op2Reg)`? (See the sketch just below.) |
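A later CMOV in the group can use a register defined by an earlier CMOV; that def now exists only as a PHI in the merge block, so the use has to be rewritten to the incoming value appropriate for whichever predecessor feeds the new PHI. The following self-contained sketch of that bookkeeping uses the one-line assignment suggested above; `std::unordered_map` stands in for the DenseMap-style container in the real code, and the function names are illustrative.

```cpp
#include <unordered_map>
#include <utility>

// Old CMOV dest reg -> (value on the false/copy path, value on the true path).
using RegRewriteMap =
    std::unordered_map<unsigned, std::pair<unsigned, unsigned>>;

// After building a PHI, record its operands in one statement, as suggested.
void recordPhi(RegRewriteMap &RegRewriteTable, unsigned DestReg,
               unsigned Op1Reg, unsigned Op2Reg) {
  RegRewriteTable[DestReg] = std::make_pair(Op1Reg, Op2Reg);
}

// When wiring a later PHI's incoming value for a given predecessor, a use of
// an earlier CMOV's dest must be replaced by that CMOV's value on that path.
unsigned rewriteForPath(const RegRewriteMap &RegRewriteTable, unsigned Reg,
                        bool OnFalsePath) {
  auto It = RegRewriteTable.find(Reg);
  if (It == RegRewriteTable.end())
    return Reg;  // not produced by an earlier CMOV in the group
  return OnFalsePath ? It->second.first : It->second.second;
}
```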
`test/CodeGen/X86/pseudo_cmov_lower.ll`

Lines 3–4:

- Makes sense.
- That sounds like testing -mattr and CPU features, which should be done elsewhere, no? Here, if all you want is to assert that we're compiling for a target without CMOV, -mattr=-cmov does exactly that (but it could be left out with just i386). IMO, explicit CPUs should only be used when the CPU actually matters (e.g., when testing scheduling models?).
- Great!
- Only because of the less noisy ABI, really. For tests like foo4/foo5, with very explicit checks.

Line 9:

Right now the tests check for a specific number of branches. I'm also interested in the flow of data and register assignments leading to the final return value: basically, the various MOVs in each block, and the final SUB. Consider, for instance, a bug where one of the PHIs has inverted operands (or even the feature where opposite CCs lead to inverted PHI operands): we should test for that.

By the way, I was testing this out when I noticed that this:

```llvm
define i32 @foo(i32 %v0, i32 %v1, i32 %v2, i32 %v3, i32 %v4) nounwind {
entry:
  %cmp = icmp slt i32 %v0, 0   ;; <- %v0 disables the opt, %v1 enables it
  %v3.v4 = select i1 %cmp, i32 %v3, i32 %v4
  %v1.v2 = select i1 %cmp, i32 %v1, i32 %v2
  %sub = sub i32 %v1.v2, %v3.v4
  ret i32 %sub
}
```

seems to not trigger the optimization (I get JS twice). Shouldn't it?
Thanks for the additional comments; see some explanations inline. I'll see if I can improve a few of the tests to check operand order, but that isn't generally doable for all the tests because of how downstream passes affect the code.
| lib/Target/X86/X86ISelLowering.cpp | |
|---|---|
| 20103–20104 | OK. |
`test/CodeGen/X86/pseudo_cmov_lower.ll`

Lines 3–4:

OK, I'll just use i386-linux-gnu rather than the -mcpu options.

Line 9:

Regarding the first part: it is very difficult to test that the PHI operands are in the correct order (well, really, that they come from the proper BB). The code seen effectively after the transform looks like:

```
BB1:
  cmp
  jns BB3
BB2:            ; empty
BB3:
  phi op11 (from BB1), op12 (from BB2)
```

The actual assembly generated for the MOVs is fairly tricky, and very dependent on the whims of register allocation. I didn't put more specific tests for operands in, because that would tend to make the tests very brittle as downstream passes are changed. In fact, depending on the moves that needed to be inserted, downstream passes can effectively introduce an else block by making BB2 jump to BB3 and retargeting the JNS to the else block. So the order of the operands in the later generated instructions is very dependent on downstream passes, and not on this change itself.

Now to the second part of the comment. Yes, it looks like this should have hit the optimization, but it doesn't, for an interesting reason: a select of two memory operands ends up being lowered into a select of two LEAs that represent the addresses of the two memory operands, followed by a single dereference of the selected pointer. This means you don't have to speculate a load or increase the number of loads in the program. The actual IR for the example in question, when it gets to EmitLoweredSelect, is this (see also the small illustration just after this comment):

```
(gdb) print BB->dump()
CMP32mi8 <fi#-1>, 1, %noreg, 0, %noreg, 0, %EFLAGS<imp-def>; mem:LD4[FixedStack-1](align=16)
%vreg0<def> = LEA32r <fi#-4>, 1, %noreg, 0, %noreg; GR32:%vreg0
%vreg1<def> = LEA32r <fi#-5>, 1, %noreg, 0, %noreg; GR32:%vreg1
%vreg2<def> = CMOV_GR32 %vreg1<kill>, %vreg0<kill>, 15, %EFLAGS<imp-use>; GR32:%vreg2,%vreg1,%vreg0
%vreg3<def> = LEA32r <fi#-2>, 1, %noreg, 0, %noreg; GR32:%vreg3
%vreg4<def> = LEA32r <fi#-3>, 1, %noreg, 0, %noreg; GR32:%vreg4
%vreg5<def> = CMOV_GR32 %vreg4<kill>, %vreg3<kill>, 15, %EFLAGS<imp-use>; GR32:%vreg5,%vreg4,%vreg3
%vreg6<def> = MOV32rm %vreg5<kill>, 1, %noreg, 0, %noreg; mem:LD4[<unknown>] GR32:%vreg6,%vreg5
%vreg7<def,tied1> = SUB32rm %vreg6<tied0>, %vreg2<kill>, 1, %noreg, 0, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[<unknown>] GR32:%vreg7,%vreg6,%vreg2
%EAX<def> = COPY %vreg7; GR32:%vreg7
RETL %EAX
$1 = void
```

As you can see, the pseudo CMOVs are not contiguous, and thus this new code doesn't apply.
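To make the address-select lowering above concrete, here is a C++ analogue of what the generated code does; it is purely illustrative, and the function and parameter names are invented. The two LEAs compute addresses, one CMOV picks an address, and a single load dereferences it, so neither value is loaded speculatively.

```cpp
// Mirrors the MIR dump above: two LEA32r feed a CMOV_GR32, then one MOV32rm
// dereferences whichever pointer was selected.
int selectThenLoad(bool c, const int &v1, const int &v2) {
  const int *p = c ? &v1 : &v2;  // CMOV on the two computed addresses
  return *p;                     // single load; no speculative load of v1 or v2
}
```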
`test/CodeGen/X86/pseudo_cmov_lower.ll`

Line 9:

I realize that, and this is a shortcoming of our test infrastructure that the MI serialization effort intends to address. Still, right now, in my opinion, overly explicit tests are better than no tests. So my (radical) advice is: use the utils/update_llc_test_checks.py script, and tweak it locally to remove the SP-offset scrubs (l46). You'll probably want to get rid of foo9, and simplify foo8 (just focus on two types rather than the combination of all?). Also, I think the reuse of %v2 might make the output harder to follow than it would be if you had v1.v2 and v3.v4, but that's just a gut feeling. When someone changes anything downstream that affects this, they can just run the script again and have a nice diff of what changed. What do you think? If you can come up with a cleaner way, that'd be great! But I'm uncomfortable with no testing at all :/

Thanks for investigating the example!
Updated the code to fix the one remaining suggestion. Added a test, foo3, to pseudo_cmov_lower2.ll, which does the checks that Ahmed requested for exact operand order. This should be pretty stable, because the x86_64 register passing conventions really do force specific patterns downstream.
Updated pseudo_cmov_lower2.ll test routine foo3 and added routine foo4. foo3 is now a simpler pattern to check for, and foo4 tests the correctness of the transformation when the "opposite condition" code in EmitLoweredSelect gets used.
Ahmed, I think I have addressed all the comments and improvements that you asked for. Is this now good to go from your point of view?
Fixed "rewrting" spelling. Added comment on CMOV_V*I1 test to indicate this test is to be
updated if a way is found to generate these pseudo CMOV opcodes.
| test/CodeGen/X86/pseudo_cmov_lower.ll | |
|---|---|
| 220–225 | Does enabling, say, avx512f work? |
Michael,
Could you commit these changes for me since I don't yet have commit permission to the llvm svn repository?
Thank you,
Kevin
Added inline comment.
| test/CodeGen/X86/pseudo_cmov_lower.ll | |
|---|---|
| 220–225 | I will look into that. If it does, then I will remove this specific test and add the necessary test back as a separate test file in a separate changeset. |
Michael, can you commit these changes for me, since I don't yet have commit permission in LLVM svn?