This is an archive of the discontinued LLVM Phabricator instance.

[x86] avoid vector load narrowing with extracted store uses (PR42305)
ClosedPublic

Authored by spatel on Jun 18 2019, 1:50 PM.

Details

Summary

This is an exception to the rule that we should prefer xmm ops to ymm ops. As shown in PR42305:
https://bugs.llvm.org/show_bug.cgi?id=42305
...the store folding opportunity with vextractf128 may result in better perf by reducing the instruction count.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Jun 18 2019, 1:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 18 2019, 1:50 PM
craig.topper added inline comments.Jun 18 2019, 11:11 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
4819 ↗(On Diff #205429)

I wonder if it would be better to use use_begin()/use_end() for the loop. Then you can use getUse().getResNo() to check if its the chain or not. That would be more efficient than scanning the operands of the user.

spatel updated this revision to Diff 205600.Jun 19 2019, 8:19 AM

Patch updated:
Use iterator loop rather than range loop. That allows checking the result number more efficiently from the SDUse.

RKSimon accepted this revision.Jun 19 2019, 10:36 AM

LGTM - cheers

llvm/test/CodeGen/X86/sandybridge-loads.ll
44 ↗(On Diff #205600)

Not (very) related - but do we need a variant of this with that has an unaligned load:

%v0 = load <8 x float>, <8 x float>* %a, align 32 ; <--- aligned
%v1 = load <8 x float>, <8 x float>* %b, align 16 ; <--- unaligned
store <8 x float> %v0, <8 x float>* %b, align 32 ; <--- aligned
store <8 x float> %v1, <8 x float>* %a, align 16 ; <--- unaligned
ret void
This revision is now accepted and ready to land.Jun 19 2019, 10:36 AM
spatel marked an inline comment as done.Jun 19 2019, 10:50 AM
spatel added inline comments.
llvm/test/CodeGen/X86/sandybridge-loads.ll
44 ↗(On Diff #205600)

I think for the purpose of this patch, that's covered by the next test diff (in widen_load-3.ll).

And for the purpose of SandyBridge-specific (actual CPU feature is -slow-unaligned-mem-32), the test just above here covers it. But I can add that suggestion for completeness.

This revision was automatically updated to reflect the committed changes.