This is an archive of the discontinued LLVM Phabricator instance.

[X86] Stop promoting integer loads to vXi64
ClosedPublic

Authored by craig.topper on Oct 15 2018, 4:48 PM.

Details

Summary

Theoretically this was done to reduce the number of isel patterns that were needed. But it also meant that a substantial number of our isel patterns had to match an explicit bitcast. By making the vXi32/vXi16/vXi8 types legal for loads, the DAG combiner should be able to change the load type and remove the bitcast.
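
To make the mechanism concrete, here is a minimal sketch using standard TargetLowering hooks (a sketch only: the hook names follow common LLVM idioms and may differ from what the actual patch touches):

```
// Before: narrow vector integer loads are promoted to v2i64, which
// wraps each such load in a bitcast that isel patterns must match.
setOperationAction(ISD::LOAD, MVT::v4i32, Promote);
AddPromotedToType(ISD::LOAD, MVT::v4i32, MVT::v2i64);

// After: the narrow load type is simply legal, so no bitcast is
// created and the DAG combiner can change load types freely.
setOperationAction(ISD::LOAD, MVT::v4i32, Legal);
```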

I had to add some additional plain load instruction patterns and a few other special cases, but overall the isel table shrank by ~12000 bytes. So it looks like this promotion was hurting us more than helping.

I still have one crash in vector-trunc.ll that I'm hoping @RKSimon can help with. It seems to relate to using getTargetConstantFromNode on a load that was shrunk by an extract_subvector combine after the constant pool entry was created, so we end up decoding more mask elements than the shrunken load actually reads.

I'm hoping this patch will reduce the number of patterns needed to also remove the and/or/xor promotion.

Diff Detail

Event Timeline

craig.topper created this revision.Oct 15 2018, 4:48 PM
craig.topper added inline comments.Oct 15 2018, 4:54 PM
test/CodeGen/X86/oddshuffles.ll
1633

Looks like we're now reusing something we previously reloaded.

test/CodeGen/X86/pshufb-mask-comments.ll
60–62

This changed because the loads and stores in the test all use undef pointers. Previously the two loads were combined because the promotion gave them the same type; now they will always have different types.

test/CodeGen/X86/widened-broadcast.ll
124–138

Looks like matching broadcasts from shuffles is a little weak on AVX1. This test regressed, but the load_splat_8i32_8i32_01010101 case improved.

242

Similar to load_splat_8i32_4i32_01010101

449–463

Similar to load_splat_8i32_4i32_01010101

Add a hack to prevent the crash in vector-trunc, though now we miss a combine.

RKSimon added inline comments.Oct 21 2018, 3:40 AM
test/CodeGen/X86/vector-trunc.ll
1926 ↗(On Diff #170329)

I think you're going to need to add a similar size check to the get-constant code in X86MCInstLower.cpp.

Thread the expected width into the constant-pool shuffle decoders so we don't over-decode the constant.
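
As a rough illustration of that idea (the helper below is hypothetical, not one of the actual LLVM decoders):

```
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
using namespace llvm;

// Decode a PSHUFB-style byte mask, but only for as many bytes as the
// (possibly shrunk) load actually reads, even when the constant-pool
// entry was created for a wider original load.
static void decodeByteShuffleMask(const Constant *C,
                                  unsigned ExpectedWidthInBits,
                                  SmallVectorImpl<int> &Mask) {
  unsigned NumBytes = ExpectedWidthInBits / 8;
  const auto *CDV = cast<ConstantDataVector>(C);
  for (unsigned I = 0; I != NumBytes; ++I) {
    uint64_t Byte = CDV->getElementAsInteger(I);
    // Bit 7 set means "write zero"; otherwise the low bits select a
    // source byte. This sketch uses -1 as its zero sentinel.
    Mask.push_back((Byte & 0x80) ? -1 : int(Byte & 0xF));
  }
}
```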

RKSimon accepted this revision.Oct 21 2018, 11:33 AM

LGTM - thanks.

PR33399 should already cover the missed AVX1 broadcasts.

This revision is now accepted and ready to land.Oct 21 2018, 11:33 AM
This revision was automatically updated to reflect the committed changes.

This broke our continuous integration; it appears to cause miscompiles in tensorflow/compiler when running on X86 CPUs:

Specifically the PoolGradTest.testMaxPool test here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/tests/pooling_ops_test.py#L545

Unfortunately I'm not familiar enough with either the LLVM backend or tensorflow/XLA to work out how to reproduce/reduce this.

@sammccall I've reverted the change in r344921. Is there anything you can do to help narrow this down? Ideally, provide the LLVM IR for the failing case.

Thanks Craig!
I've looped in @sanjoy, who knows way more about this than I do and has access to our internal continuous integration.
(I bet it's reproducible in the upstream TensorFlow repo too, but I don't have a build environment for that).

Hi Craig,

+CC My work email

I've attached the IR that I think is miscompiled after this patch (this is the post-optimization IR, after LLVM's IR optimizations have run). The miscompile happens on Broadwell at least, but may also happen on other architectures.

Let me know if there's anything I can do to help you diagnose this.

@sanjoy @sammccall I've recommitted this in r344965 with a fix for the miscompile. I believe DAGCombiner::ForwardStoreValueToDirectLoad was forwarding a v4i64 store to a v4i32 load by replacing the load with a truncate of the stored value, which doesn't work for vectors; we would need an extract_subvector plus a bitcast. I've put in a qualification to only forward scalars when the types don't match. Please let me know if you see any more issues.
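
In spirit, the qualification looks something like the sketch below (illustrative names, not the verbatim DAGCombiner change):

```
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Forward a stored value directly to a load of the same address.
// ISD::TRUNCATE can drop the high bits of a scalar integer, but it
// cannot take the low subvector of a wider vector (that would need an
// extract_subvector plus a bitcast), so mismatched types are only
// forwarded when both sides are scalar integers.
static SDValue forwardStoreValueToLoad(SelectionDAG &DAG, LoadSDNode *LD,
                                       SDValue StoredVal) {
  EVT LDVT = LD->getValueType(0);
  EVT STVT = StoredVal.getValueType();
  if (STVT == LDVT)
    return StoredVal;
  if (!STVT.isScalarInteger() || !LDVT.isScalarInteger())
    return SDValue(); // e.g. a v4i64 store feeding a v4i32 load: bail.
  return DAG.getNode(ISD::TRUNCATE, SDLoc(LD), LDVT, StoredVal);
}
```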