This is an archive of the discontinued LLVM Phabricator instance.

[x86] use more shift or LEA for select-of-constants
ClosedPublic

Authored by spatel on Jul 12 2017, 5:10 PM.

Details

Summary

We can convert any select-of-constants to math ops:
http://rise4fun.com/Alive/d7d
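
For illustration, a minimal C sketch of the identity (not part of the patch): for constants C1 and C2, `cond ? C1 : C2` equals `C2 + cond * (C1 - C2)`, and a mixed negative/positive pair works the same way.

  #include <stdbool.h>

  /* select-of-constants as math: C2 + cond * (C1 - C2) */
  int sel_pos(bool c) { return c ? 7 : 3; }  /* == 3 + c * 4; the *4 is a shl */
  int sel_mix(bool c) { return c ? -4 : 2; } /* == 2 + c * -6; a small multiply
                                                becomes a shl/lea sequence */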

For this patch, I'm proposing to enhance an existing x86 transform that uses fake multiplies (they always become shl/lea) to avoid cmov or branching. The current code misses cases where we have a negative constant and a positive constant, so this is just trying to plug that hole.

The DAGCombiner diff prevents us from hitting a terrible inefficiency: we can start with a select in IR, create a select DAG node, convert it into a sext, convert it back into a select, and then lower it to sext machine code.
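
To make the round trip concrete, a sketch of the two equivalent forms (these should lower to the same code):

  #include <stdbool.h>

  /* select form: where the IR starts */
  int as_select(bool c) { return c ? -1 : 0; }

  /* sign-extension form: sext i1 -> i32 of the same condition */
  int as_sext(bool c) { return -(int)c; }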

Some notes about the test diffs:

  1. 2010-08-04-MaskedSignedCompare.ll - We were creating control flow that didn't exist in the IR.
  2. memcmp.ll - Choosing between -1 and 1 is the case that got me looking at this again. I think we could avoid the push/pop in some cases if we used 'movzbl %al' instead of an xor on a different reg? That's a post-DAG problem though.
  3. mul-constant-result.ll - The trade-off between sbb+not vs. setne+neg could be addressed if that's a regression, but I think those would always be nearly equivalent.
  4. pr22338.ll and sext-i1.ll - These tests have undef operands, so I don't think we actually care about these diffs.
  5. sbb.ll - This shows a win for what I think is a common case: choosing -1 or 0 (see the sketch just after this list).
  6. select.ll - There's another borderline case here: cmp+sbb+or vs. test+set+lea? Also, sbb+not vs. setae+neg shows up again.
  7. select_const.ll - These are motivating cases for the enhancement; replace cmov with cheaper ops.
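
For the case in item 5, a sketch of the -1-or-0 idiom; the asm in the comment is roughly what one would expect on x86-64, not output copied from the tests:

  #include <stdbool.h>

  /* Choose -1 or 0: the borrow from an unsigned compare can be
     broadcast with sbb, avoiding cmov entirely. Expected shape:
       cmpl %esi, %edi   ; CF set when a < b (unsigned)
       sbbl %eax, %eax   ; EAX = EAX - EAX - CF = -CF = -1 or 0 */
  int neg_or_zero(unsigned a, unsigned b) { return a < b ? -1 : 0; }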

Diff Detail

Event Timeline

spatel created this revision. Jul 12 2017, 5:10 PM
zvi added inline comments. Jul 13 2017, 7:45 AM
test/CodeGen/X86/memcmp.ll
31

A write to AL followed by a read from EAX may cause a partial-register stall, or a lesser penalty if the processor supports special 'merge register parts' micro-ops (which are also undesirable).
As the tests show, this seems to be a recurring pattern.
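
For readers unfamiliar with the issue, a sketch of the pattern being described here; the asm in the comment is illustrative, not the test output:

  #include <stdbool.h>

  /* Typical codegen writes only the low byte:
       cmpl %esi, %edi
       setl %al            ; partial write to AL
     A later read of the full EAX then forces the CPU to merge AL with
     the stale upper bits (a stall, or a merge micro-op) unless EAX was
     zeroed up front or the result is widened with movzbl. */
  bool is_less(int a, int b) { return a < b; }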

zvi added a subscriber: aaboud. Jul 13 2017, 7:46 AM
spatel added inline comments. Jul 13 2017, 8:29 AM
test/CodeGen/X86/memcmp.ll
31

Good point. FWIW, I think the memcmp diffs will disappear if D34904 is accepted. But given that this is a general problem, the answer might be to adjust the x86-fixup-setcc pass? That's where the 'movzbl' is replaced by 'xorl' IIUC.
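
For context, a sketch of the two codegen strategies under discussion; the asm in the comment is illustrative:

  /* Two ways to widen a setcc result to 32 bits:

     (a) zero-extend afterwards:    (b) pre-zero, then set the low byte:
         cmpl   %esi, %edi              xorl %eax, %eax
         setg   %al                     cmpl %esi, %edi
         movzbl %al, %eax               setg %al

     x86-fixup-setcc rewrites (a) into (b) when it can place the xor
     ahead of the flag-producing instruction; the xor both zeroes the
     register and breaks the dependence on its old value. */
  int is_greater(int a, int b) { return a > b; }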

spatel added inline comments. Jul 13 2017, 8:45 AM
test/CodeGen/X86/memcmp.ll
31

On second thought, the whole point of that pass is to avoid partial-register stalls, so if there's still a problem for a CPU even with the leading xor to clear the reg, then we should just avoid this transform completely?

craig.topper added inline comments.
test/CodeGen/X86/memcmp.ll
31

There shouldn't be a merge penalty after a zeroing idiom, so the xor at line 23 should protect it. It looks like the zeroing in BB#2 is unnecessary.

zvi added inline comments. Jul 13 2017, 2:49 PM
test/CodeGen/X86/memcmp.ll
31

I missed the xor on line 23 - sorry, my bad.

spatel updated this revision to Diff 108960. Jul 31 2017, 12:01 PM

Ping.

Patch rebased after rL309597:
Note that the memcmp code with the potentially questionable partial-reg codegen has changed. We're producing setcc+movzbl more consistently now, but I'm not sure why x86-fixup-setcc doesn't transform those.

> Note that the memcmp code with the potentially questionable partial-reg codegen has changed. We're producing setcc+movzbl more consistently now, but I'm not sure why x86-fixup-setcc doesn't transform those.

Disregard that - I didn't scroll down far enough. Sometimes we use movzbl and other times we use xor, and I don't know what causes the difference. IMO the improvements stand independent of that question/problem, but I plan to look at that next.

spatel added a comment. Aug 1 2017, 8:30 AM

> Sometimes we use movzbl and other times we use xor, and I don't know what causes the difference. IMO the improvements stand independent of that question/problem, but I plan to look at that next.

I stepped through a couple of the memcmp examples in 'X86 Fixup SetCC'. The difference is that we only transform to xor if we can find the instruction that def'd the flags in the same BB. In the memcmp examples, if there's a 'cmp' in the res_block, we'll get the xor transform, but if the flags are defined by a sub in a preceding block, we'll see movzbl instead. I think either way avoids partial-reg problems in these examples. I'm still looking at memcmp expansion improvements, so these differences may disappear for memcmp.

zvi accepted this revision. Aug 6 2017, 5:51 AM

LGTM

This revision is now accepted and ready to land. Aug 6 2017, 5:51 AM
This revision was automatically updated to reflect the committed changes.
spatel reopened this revision. Aug 7 2017, 8:59 AM

Reopening - the patch was reverted at rL310264 because it caused/exposed failures in test-suite jpeg tests:
https://bugs.llvm.org/show_bug.cgi?id=34097
https://bugs.llvm.org/show_bug.cgi?id=34105

This revision is now accepted and ready to land. Aug 7 2017, 8:59 AM
spatel updated this revision to Diff 110421. Aug 9 2017, 10:08 AM

Patch updated:
As shown in PR34097 ( https://bugs.llvm.org/show_bug.cgi?id=34097 ), I failed to account for the subtract overflow case, so this update adds a check for that.
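
One hypothetical instance of the problem (illustrative constants, not the ones from the PR; this assumes the transform inspects the sign of the constant difference):

  #include <limits.h>
  #include <stdbool.h>

  /* diff = INT_MAX - (-1) wraps to INT_MIN, so any signed test on the
     difference (e.g. "is it positive?") misfires unless the subtract
     is first checked for overflow. */
  int sel(bool c) { return c ? INT_MAX : -1; }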

Ideally, we'll update this all to use xor/and+add for all constants and avoid that issue (mul+add is really only here because that matches to LEA). For now, this is still the smallest code and codegen diff, so I think it's best to go with that and make sure I haven't introduced any other bugs.

This revision was automatically updated to reflect the committed changes.
aivchenk removed a subscriber: aivchenk.