[X86][AVX] Fix lowering of X86ISD::VZEXT_MOVL for 128-bit -> 256-bit extension
ClosedPublic

Authored by RKSimon on Oct 28 2015, 9:04 AM.

Details

Summary

The lowering patterns for X86ISD::VZEXT_MOVL for 128-bit to 256-bit vectors were just copying the lower xmm instead of actually zeroing everything but the first scalar with an xmm blend (we make use of the implicit zeroing of the upper ymm).

Fix for PR25320.
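To make the miscompile concrete, here is a minimal LLVM IR sketch of the affected shuffle (the function name is hypothetical, not taken from the patch); lane 0 must come from the source and every other lane must be zero:

; A minimal sketch of the miscompiled pattern (function name hypothetical).
define <4 x i64> @vzmovl_4i64(<4 x i64> %a) {
  ; Lane 0 from %a, lanes 1-3 from the zero vector: an X86ISD::VZEXT_MOVL.
  %v = shufflevector <4 x i64> %a, <4 x i64> zeroinitializer,
                     <4 x i32> <i32 0, i32 5, i32 6, i32 7>
  ret <4 x i64> %v
}
; The buggy patterns emitted a plain copy of the lower xmm: the implicit
; VEX zeroing clears lanes 2-3 (the upper ymm half) but leaves lane 1
; holding stale data from %a. A correct lowering must also clear lane 1,
; e.g. with a blend against zero or a vmovq %xmm0, %xmm0.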

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 38662. Oct 28 2015, 9:04 AM
RKSimon retitled this revision from to [X86][AVX] Fix lowering of X86ISD::VZEXT_MOVL for 128-bit -> 256-bit extension.
RKSimon updated this object.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
jketema added inline comments.
test/CodeGen/X86/vec_extract-avx.ll
112 (On Diff #38662)

I think there still might be something wrong here. If I understand this correctly, only the lower 48 bytes will now be copied into xmm0.

RKSimon updated this revision to Diff 38677. Oct 28 2015, 11:53 AM
RKSimon marked an inline comment as done.
RKSimon added inline comments.
test/CodeGen/X86/vec_extract-avx.ll
112 (On Diff #38677)

Nice catch!

RKSimon marked an inline comment as done. Oct 29 2015, 7:05 AM

Jeroen has confirmed that this fixes PR24935.

andreadb edited edge metadata. Nov 5 2015, 8:25 AM

Hi Simon,

lib/Target/X86/X86InstrSSE.td
7207–7227 (On Diff #38677)

I don't think these new patterns are needed. We already have sse4.1/avx patterns to select a blend from a vzmovl node.

If your goal is just to fix the miscompile, then the minimal fix is to remove the offending patterns between lines 939 and 952.

The poor codegen reported by Jeroen is caused by the lack of smart x86 combine rules for 256-bit shuffles in the function 'PerformShuffleCombine256'. That function implements a very simple rule for the case where a shuffle takes two concat_vector nodes as inputs. Ideally we should extend it with rules for the case where the second operand is a build_vector of all zeroes.

Currently we check whether a shuffle takes two concat_vectors as input, and we try to fold it to a zero-extending load or an insert of a 128-bit vector into a zero vector.
I think we are just missing rules for the case where a 64-bit or 32-bit quantity is inserted into a zero vector, as sketched below.
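As an illustration of the missing case (a sketch under assumed names, not a test from the patch), inserting the low 64 bits of a vector into a 256-bit zero vector:

; A sketch of the missing case (hypothetical function name): insert a
; 64-bit quantity (two i32 lanes) into a 256-bit zero vector.
define <8 x i32> @vzmovl_low64_8i32(<8 x i32> %x) {
  %v = shufflevector <8 x i32> %x, <8 x i32> zeroinitializer,
                     <8 x i32> <i32 0, i32 1, i32 8, i32 9,
                                i32 10, i32 11, i32 12, i32 13>
  ret <8 x i32> %v
}
; Ideally this folds to a single vmovq %xmm0, %xmm0: the move zero-extends
; the low 64 bits within the xmm, and the VEX encoding implicitly zeroes
; the upper ymm half.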

I'll update this patch to be purely the bugfix, with just the removed lines (and the altered tests), and then prepare a second patch that improves the code quality. We are missing a potentially big perf gain for cases where the upper half of a register can be implicitly zeroed with VEX-encoded instructions, especially on CPUs with 128-bit ALUs such as Jaguar and Sandy Bridge.

lib/Target/X86/X86InstrSSE.td
7207–7227 (On Diff #38677)

I can confirm that just removing lines 939 to 952 fixes the problem. It does, however, leave AVX1 targets with a lot of integer/float domain-crossing stalls when dealing with 256-bit vectors.
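For example (a hedged sketch with made-up function names, not output from this patch), a 256-bit integer blend against zero on AVX1 has to go through float-domain instructions, since vpblendd is AVX2-only:

; Hypothetical illustration of the AVX1 domain-crossing concern.
define <8 x i32> @zero_odd_lanes(<8 x i32> %a, <8 x i32> %b) {
  %sum = add <8 x i32> %a, %b            ; integer-domain ops (vpaddd pair)
  %v = shufflevector <8 x i32> %sum, <8 x i32> zeroinitializer,
                     <8 x i32> <i32 0, i32 8, i32 2, i32 10,
                                i32 4, i32 12, i32 6, i32 14>
  ret <8 x i32> %v                       ; blend with zero in the odd lanes
}
; AVX1 has no 256-bit integer blend, so the integer result is expected to
; pass through float-domain instructions such as vblendps or vandps, which
; can incur bypass-delay (domain-crossing) stalls on some CPUs.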

RKSimon updated this revision to Diff 40565. Nov 18 2015, 3:00 PM
RKSimon edited edge metadata.

Simplified patch to just the bugfix; improved codegen to follow in a future patch.

andreadb accepted this revision. Nov 19 2015, 3:17 AM
andreadb edited edge metadata.

Thanks Simon for working on this!

The patch LGTM.

Cheers,
Andrea

This revision is now accepted and ready to land. Nov 19 2015, 3:17 AM
This revision was automatically updated to reflect the committed changes.