Differential D14151
[X86][AVX] Fix lowering of X86ISD::VZEXT_MOVL for 128-bit -> 256-bit extension
Closed, Public · Authored by RKSimon on Oct 28 2015, 9:04 AM
Summary

The lowering patterns for X86ISD::VZEXT_MOVL for 128-bit to 256-bit vectors were just copying the lower xmm instead of actually masking off the first scalar using an xmm blend (we make use of the implicit zeroing of the upper ymm). Fix for PR25320.
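To make the failure mode concrete, here is a minimal IR sketch of the pattern in question; the function name and exact types are illustrative, not taken from the committed test.

; Keep lane 0 of a <4 x i32> source, zero the other seven lanes of the
; <8 x i32> result.
define <8 x i32> @vzmovl_v4i32_v8i32(<4 x i32> %x) {
  ; Widen the 128-bit source to 256 bits with an undef upper half.
  %w = shufflevector <4 x i32> %x, <4 x i32> undef,
       <8 x i32> <i32 0, i32 1, i32 2, i32 3,
                  i32 undef, i32 undef, i32 undef, i32 undef>
  ; Select lane 0 from %w and zero everywhere else: the 256-bit
  ; X86ISD::VZEXT_MOVL pattern.
  %r = shufflevector <8 x i32> %w, <8 x i32> zeroinitializer,
       <8 x i32> <i32 0, i32 9, i32 10, i32 11,
                  i32 12, i32 13, i32 14, i32 15>
  ret <8 x i32> %r
}

The removed patterns lowered this to a bare xmm-to-xmm copy, which (via VEX implicit zeroing) clears only bits 255:128 and leaves lanes 1-3 of the source intact in the result; a correct lowering also has to blend those lanes with zero in the xmm half (e.g. vxorps + vblendps) before relying on the implicit zeroing of the upper half.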
Event Timeline

andreadb: Hi Simon,

I don't think these new patterns are needed. We already have SSE4.1/AVX patterns to select a blend from a vzmovl node. If your goal is just to fix the miscompile, then the minimal fix consists of removing the offending patterns between lines 939 and 952.

The poor codegen reported by Jeroen is caused by the lack of smart x86 combine rules for 256-bit shuffles in the function 'PerformShuffleCombine256'. That function implements a very simple rule for when there is a shuffle between two concat_vectors nodes. Ideally we should extend it and add rules for the case where the second operand is a build_vector of all zeroes.

Currently we check whether a shuffle takes two concat_vectors as input, and we try to fold it to a zero-extending load or to an insert of a 128-bit vector into a zero vector.

I think we are just missing rules for the case where we are inserting a 64/32-bit quantity into a zero vector.
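As a sketch of the case andreadb describes (names and types here are illustrative, not Jeroen's actual test case), consider a 64-bit quantity, the low two floats of a 128-bit source, inserted into an all-zero 256-bit vector:

; A 64-bit quantity (lanes 0-1 of %x) inserted into a zero <8 x float>.
define <8 x float> @insert_64bit_into_zero(<4 x float> %x) {
  ; Widen the source to 256 bits with undef lanes beyond the low two.
  %w = shufflevector <4 x float> %x, <4 x float> undef,
       <8 x i32> <i32 0, i32 1, i32 undef, i32 undef,
                  i32 undef, i32 undef, i32 undef, i32 undef>
  ; Shuffle against a zero vector: only lanes 0-1 of %w survive.
  %r = shufflevector <8 x float> %w, <8 x float> zeroinitializer,
       <8 x i32> <i32 0, i32 1, i32 10, i32 11,
                  i32 12, i32 13, i32 14, i32 15>
  ret <8 x float> %r
}

With such a rule this could presumably become something like a single vmovq, since vmovq zeroes bits 127:64 and the VEX encoding implicitly zeroes bits 255:128.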
RKSimon: I'll update this patch to be just the fix, with just the removed lines (and the altered tests), and then prepare a second patch that improves the code quality. We are missing a potentially big performance gain for cases where the upper half of a register can be implicitly zeroed by VEX-encoded instructions, especially on 128-bit ALUs such as Jaguar and Sandy Bridge.
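An illustrative example of that gain (the function name is mine, and the expected codegen is an assumption rather than output from this patch): when the 128-bit value being widened is itself produced by a VEX-encoded instruction, that instruction already zeroes bits 255:128 of the destination ymm register, so widening with a zero upper half should cost no extra instructions.

; The fadd below becomes a VEX-encoded vaddps when AVX is enabled; any
; VEX instruction that writes an xmm register also zeroes bits 255:128
; of the corresponding ymm register, so widening %s with a zero upper
; half should be free.
define <8 x float> @widen_after_add(<4 x float> %a, <4 x float> %b) {
  %s = fadd <4 x float> %a, %b
  %r = shufflevector <4 x float> %s, <4 x float> zeroinitializer,
       <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x float> %r
}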
RKSimon: Simplified patch to just the bugfix; improved codegen to follow in a future patch.

andreadb: Thanks Simon for working on this! The patch LGTM. Cheers,

This revision is now accepted and ready to land. Nov 19 2015, 3:17 AM

Closed by commit rL253561: [X86][AVX] Fix lowering of X86ISD::VZEXT_MOVL for 128-bit -> 256-bit extension (authored by RKSimon). Nov 19 2015, 4:21 AM

This revision was automatically updated to reflect the committed changes.
Revision Contents
Diff 40565
lib/Target/X86/X86InstrSSE.td
test/CodeGen/X86/vec_extract-avx.ll