Page MenuHomePhabricator

[X86] Only reorder srl/and on last DAG combiner run
ClosedPublic

Authored by craig.topper on Feb 12 2018, 10:15 AM.

Details

Summary

This seems to interfere with a target independent brcond combine that looks for the (srl (and X, C1), C2) pattern to enable TEST instructions. Once we flip, that combine doesn't fire and we end up exposing it to the X86 specific BT combine which causes us to emit a BT instruction. BT has lower throughput than TEST.

We could try to make the brcond combine aware of the alternate pattern, but since the flip was just a code size reduction and not likely to enable other combines, it seemed easier to just delay it until after lowering.

Diff Detail

Repository
rL LLVM

Event Timeline

craig.topper created this revision.Feb 12 2018, 10:15 AM

Does this also imply that we're abandoning D37418? Ie, we're effectively blacklisting 'bt' in isel?

BT has lower throughput than TEST.

This isn't true universally based on Agner's tables. Ryzen has 1 uop and full throughput for a plain 'bt' (btc/btr/bts need 2 uops).

I don't have any actual perf evidence of 'bt' being a factor in a benchmark, but in D37418, it was suggested that a 'btc' would likely be an improvement. See also the comments in:
https://bugs.llvm.org/show_bug.cgi?id=35911

Given that and the fact that gcc/icc already choose 'bt' more often than llvm (again based on earlier comments in the previous review and the PR), I think we'd be better off choosing bt and friends by default in isel and transforming to something with a bigger constant as a specialization at a later stage.

All valid points. I don't know of any benchmarks that care either. I guess my goal with this patch is to get DAG going into isel to be consistent. The test cases in test-shrink.ll look very similar, but some use test and some use bt. This patch gets them all back to using test. We can isel them back to bt for D37418.

All valid points. I don't know of any benchmarks that care either. I guess my goal with this patch is to get DAG going into isel to be consistent. The test cases in test-shrink.ll look very similar, but some use test and some use bt. This patch gets them all back to using test. We can isel them back to bt for D37418.

Yes, I'm all for more consistency; it's a mess.

If I'm seeing it correctly, D37418 would need to be enhanced to handle these cases because it's only looking for 64-bit constants right now?

So let's try to move these forward. AFAICT, there's no opposition to D37418 with -Os. Start there?

I don't have any benchmarks. Anecdotally speaking from my experience at Apple: improving code density tends to improve overall system performance. That being said, microbenchmarks sometime suffer.

(Beyond the scope of this patch…) I wish LLVM or the x86 backend could change its instruction selection on the fly. Then it could prefer "reasonable density" (not microcoded) by default and "throughput" in loops (or otherwise obvious hot paths). This would probably yield the best overall/pragmatic performance.

It looks like the behavior of the cases in test-shrink are currently dependent on the order of the operands to the branch...

Rebase with changes to test-vs-bittest.ll

(Beyond the scope of this patch…) I wish LLVM or the x86 backend could change its instruction selection on the fly. Then it could prefer "reasonable density" (not microcoded) by default and "throughput" in loops (or otherwise obvious hot paths). This would probably yield the best overall/pragmatic performance.

Just to second this, I think there's a lot of potential gain to be had by being able to select cold blocks for size and hot blocks for speed. As stated, definitely off topic for this review though!

Rebase now that we use BT under optsize.

(Beyond the scope of this patch…) I wish LLVM or the x86 backend could change its instruction selection on the fly. Then it could prefer "reasonable density" (not microcoded) by default and "throughput" in loops (or otherwise obvious hot paths). This would probably yield the best overall/pragmatic performance.

Just to second this, I think there's a lot of potential gain to be had by being able to select cold blocks for size and hot blocks for speed. As stated, definitely off topic for this review though!

@davezarzycki @reames Any chance that you could put some of this request into a bugzilla please? Its the kind of thing that keeps getting bounced around for scheduler driven MC optimizations.

Not sure where we're at - rL325287 was supposed to include a new test file, but it didn't get committed yet? Will this patch affect those tests?

Are we converting all >8 bit constants to bt with optsize yet, or that's another patch?

Looks like I forgot to 'svn add' the test when i applied the patch before commit.

This patch should not effect those tests, but I'll double check.

We are not converting >8-bit or/and/xor to bts/btr/btc under optsize yet. Just >32 bit. Using bts/btr/btc will cripple our load folding ability so its not a clear win.

I've committed the test case, and this patch had no effect on it.

test/CodeGen/X86/test-vs-bittest.ll
55 ↗(On Diff #134493)

Note this test is identical to test64 above with the operands of the 'br' instruction reversed. We were always using 'test'. Seems the initial selectionDAG in one of the cases(i forgot which one) has an ISD::XOR inverting the setcc result before the branch. This somehow causes a difference in DAG combine ordering or something that leads to different results.

spatel accepted this revision.Feb 16 2018, 7:31 AM

LGTM.

test/CodeGen/X86/test-vs-bittest.ll
8 ↗(On Diff #134493)

Nit: all of the tests in this file have .cfi noise; use 'nounwind' attribute to remove that.

55 ↗(On Diff #134493)

Mysteries of SDAG... :)
Please add this as a test comment here or the other test, so we have a record of it.

This revision is now accepted and ready to land.Feb 16 2018, 7:31 AM
This revision was automatically updated to reflect the committed changes.