This is an archive of the discontinued LLVM Phabricator instance.

combine consecutive subvector 16-byte loads into one 32-byte load (PR21709)
ClosedPublic

Authored by spatel on Dec 2 2014, 6:03 PM.

Details

Summary

This is a fix for PR21709 ( http://llvm.org/bugs/show_bug.cgi?id=21709 ). When we have 2 consecutive 16-byte loads that are merged into one 32-byte vector, we can use a single 32-byte load instead. But we don't do this for SandyBridge / IvyBridge because they have slower 32-byte memops. We also don't bother using 32-byte *integer* loads on a machine that only has AVX1 (btver2) because those operands would have to be split in half anyway since there is no support for 32-byte integer math ops.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 16841.Dec 2 2014, 6:03 PM
spatel retitled this revision from to combine consecutive subvector 16-byte loads into one 32-byte load (PR21709).
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: craig.topper, delena, andreadb.
spatel added a subscriber: Unknown Object (MLST).
delena edited edge metadata.Dec 3 2014, 4:16 AM

In general, this optimization is right.
Will it work if I write code without intrinsics, like

%a = load < v4 x float>*%ptr
%ptr1 = GEP ..
%b = load < v4 x float>*%ptr1
insertelement ..
insertelement ..
?
If it does not work for all possible combinations, this optimization should be done in DAG combine.
If it works, please add a test.

I also think that all possible types should be done in one patch. Just to be sure that this feature is completed. But it is up to you.

lib/Target/X86/X86InstrSSE.td
8461 ↗(On Diff #16841)

If you swap places of 2 insert_subvector this pattern will not work.

spatel updated this revision to Diff 16867.Dec 3 2014, 9:01 AM
spatel edited edge metadata.

Hi Elena - thanks for your feedback! Updated patch attached.

I've added test cases that use shufflevector rather than the vinsertf128 intrinsic. This is derived from Andrea's alternate code example in the bug report. Does this cover the scenario you're thinking about using insertelement(s)? Unless I'm mistaken, we can only use insertelement with scalars, so this would be the expected pattern from IR that had insert/extract?

I've also added test cases that swap the order of the operands to the intrinsic and shuffles. Does this answer your concern about changing the order of the pattern parameters?

spatel updated this revision to Diff 17307.Dec 15 2014, 2:54 PM
spatel updated this object.
spatel added a reviewer: mkuper.

Removed 'TODO' comment. Added patterns to match all AVX/AVX2 vector types. Added test cases for each pattern. Please let me know if there's a more efficient way to do this. Thanks!

This revision was automatically updated to reflect the committed changes.

Thanks, Elena - committed at r224344.