This is an archive of the discontinued LLVM Phabricator instance.

Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2)
ClosedPublic

Authored by spatel on Sep 14 2014, 2:05 PM.

Details

Summary

Currently, we generate broadcast instructions on CPUs with AVX2 to load some constant splat vectors.
This patch should preserve all existing behavior with regular optimization levels, but also use splats whenever possible when optimizing for *size* on any CPU with AVX or AVX2.

The tradeoff is up to 5 extra instruction bytes for the broadcast instruction to save at least 8 bytes (up to 31 bytes) of constant pool data.

With -Os (function attribute "optsize"), the change for the included testcase file, which covers all 12 AVX2 vector data types (f32, f64, i8, i16, i32, i64 for 128-bit and 256-bit vectors), is:

AVX: +29 inst -112 data = 83 bytes saved
AVX2: +29 inst -106 data = 77 bytes saved
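
As a rough back-of-the-envelope model of that tradeoff (a standalone sketch using illustrative sizes consistent with the numbers above, not code from the patch):

    // Rough model of the size tradeoff: a broadcast adds up to ~5 instruction
    // bytes but shrinks the constant-pool entry from a full vector down to one
    // scalar element. Illustrative numbers only.
    #include <cstdio>

    int main() {
      const int BroadcastInstBytes = 5;  // worst-case extra instruction bytes
      const int VectorBytes = 32;        // e.g. a v4f64 or v8f32 constant
      const int ScalarBytes = 8;         // one f64 element
      std::printf("net bytes saved per splat: %d\n",
                  (VectorBytes - ScalarBytes) - BroadcastInstBytes); // 19
      return 0;
    }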

Note: Is there any optimization pass in LLVM that merges constant pool data from different functions? Could that also be done at link time? If such merging exists, it might change the criteria for generating a broadcast, because we might not want to generate extra instructions if the same constant data is loaded multiple times.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 13686.Sep 14 2014, 2:05 PM
spatel retitled this revision from to Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2).
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: nadav, chandlerc, andreadb.
spatel added a subscriber: Unknown Object (MLST).
delena added a subscriber: delena.Sep 15 2014, 12:31 AM

if (ConstSplatVal && (Subtarget->hasAVX2() || OptForSize)) {

  EVT CVT = Ld.getValueType();
  assert(!CVT.isVector() && "Must not broadcast a vector type");
  unsigned Opcode = X86ISD::VBROADCAST; // This only changes for v2[f|i]64.

You can't generate VBROADCAST for SSE. You should check the target here.

  • Elena

One more:

// The v2[f/i]64 case is a mess because there is no VBROADCAST to handle it.

You can take the broadcast to ymm as well.
VBROADCASTSD ymm1,m64

Just add a pattern to the td file.

  • Elena
spatel updated this revision to Diff 13713.Sep 15 2014, 9:22 AM
spatel added a reviewer: delena.

Hi Elena -

Thank you for the feedback. This function is already guarded against non-AVX CPUs. I've made that more explicit in the updated patch.

You can take the broadcast to ymm as well.
VBROADCASTSD ymm1,m64

This patch generates that - please see the testcases for expected output in both the AVX and AVX2 cases. Let me know if I missed something. I've tried to match the correct broadcast instruction with the vector element type in all cases, unless the ISA does not allow it.

delena edited edge metadata.Sep 16 2014, 1:25 AM

I just suggest adding this pattern to X86InstrSSE.td:

def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
          (v2i64 (EXTRACT_SUBREG (v4i64 (VBROADCASTSDYrm addr:$src)), sub_xmm))>;

and remove this code:

+  if (VecSize == 128 && ScalarSize == 64) {
+    // This is only a size optimization - could be slightly slower in time.
+    if (OptForSize) {
+      if (ConstantSDNode *CI = dyn_cast<ConstantSDNode>(Ld)) {
+        C = CI->getConstantIntValue();
+        if (!Subtarget->hasAVX2()) {
+          // For an AVX CPU, fake an int splat with FP splat.
+          Opcode = X86ISD::MOVDDUP;
+          CVT = MVT::v2f64;
+          VT = MVT::v2f64;
+        }
+      } else if (ConstantFPSDNode *CF = dyn_cast<ConstantFPSDNode>(Ld)) {
+        C = CF->getConstantFPValue();
+        Opcode = X86ISD::MOVDDUP;
+      }
+    }

  • Elena
In D5347#10, @delena wrote:

I just suggest adding this pattern to X86InstrSSE.td:

def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
          (v2i64 (EXTRACT_SUBREG (v4i64 (VBROADCASTSDYrm addr:$src)), sub_xmm))>;

I tried this, but it's not producing the codegen that I want. Specifically, we want to use movddup when possible, and we don't want to alter codegen at all when not optimizing for size. (Apologies for pattern ignorance - I haven't used these yet.)

  1. In the testcase for v2f64, no splat is generated (movddup expected).
  2. In the testcase for v2i64 with AVX, we get:

     vbroadcastsd	LCPI4_0(%rip), %ymm1
     vpaddq	%xmm1, %xmm0, %xmm0
     vzeroupper

     Can the pattern be rewritten to avoid the vzeroupper? Even if yes, movddup is smaller than vbroadcastsd.

This is worse in size than what my patch produces:

vmovddup	LCPI4_0(%rip), %xmm1
vpaddq	%xmm1, %xmm0, %xmm0
  3. In the testcase for v4i64 with AVX, we again would generate vbroadcastsd:

     vbroadcastsd	LCPI5_0(%rip), %ymm1
     vextractf128	$1, %ymm0, %xmm2
     vpaddq	%xmm1, %xmm2, %xmm2
     vpaddq	%xmm1, %xmm0, %xmm0
     vinsertf128	$1, %xmm2, %ymm0, %ymm0

But movddup is better because it is one byte smaller than vbroadcastsd.

  4. Using the pattern also caused a failure in test/CodeGen/X86/exedepsfix-broadcast.ll because a broadcast is generated even when not optimizing for size. I don't think we want to use a broadcast in that case?
spatel updated this revision to Diff 13756.Sep 16 2014, 10:18 AM
spatel edited edge metadata.

Patch rebased and added possible movddup return value to function-level comment.

Note: Is there any optimization pass in LLVM that merges constant pool data from different functions? Could that also be done at link time? If such merging exists, it might change the criteria for generating a broadcast, because we might not want to generate extra instructions if the same constant data is loaded multiple times.

Depends on the type. For ELF, if the data is being put in a section like

.section .foo,"aM",@progbits,8

Then yes, the linker will merge them. For ELF the entsize can be any
value; not sure if linkers actually merge all possible sizes. We could
do a better job at merging these in the IR, but we don't at the
moment.

Cheers,
Rafael
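
To make the duplication scenario concrete, here is a minimal hypothetical example (not from the patch; build with -mavx): both functions splat the same 64-bit constant, so each one pays for its own broadcast instruction, and collapsing the two identical 8-byte pool entries into one is up to the kind of mergeable-section handling described above.

    // Hypothetical illustration of the duplication concern: two functions
    // splatting the same constant. With a broadcast, each function carries
    // ~5 instruction bytes to load an 8-byte pool entry; whether the two
    // identical entries are folded into one depends on linker merging of
    // the mergeable constant section.
    #include <immintrin.h>

    __m256d add_k_a(__m256d x) {
      return _mm256_add_pd(x, _mm256_set1_pd(42.0)); // splat of 42.0
    }

    __m256d add_k_b(__m256d x) {
      return _mm256_add_pd(x, _mm256_set1_pd(42.0)); // same constant again
    }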

Ok, if you want to use VMOVDDUP, you can still do it via a pattern in the td file. This pattern works perfectly:

--- lib/Target/X86/X86InstrSSE.td	(revision 217862)
+++ lib/Target/X86/X86InstrSSE.td	(working copy)
@@ -5279,6 +5279,11 @@
                   (v2i64 (scalar_to_vector (loadi64 addr:$src))))),
            (VMOVDDUPrm addr:$src)>, Requires<[HasAVX]>;
 
+  def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
+            (VMOVDDUPrm addr:$src)>;
+  def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
+            (VMOVDDUPrm addr:$src)>;
+
 // 256-bit version
 def : Pat<(X86Movddup (loadv4f64 addr:$src)),
           (VMOVDDUPYrm addr:$src)>;
  • Elena
In D5347#14, @rafael wrote:

Then yes, the linker will merge them. For ELF the entsize can be any
value; not sure if linkers actually merge all possible sizes. We could
do a better job at merging these in the IR, but we don't at the
moment.

Thanks, Rafael. I'll stick with the assumption that it's still worthwhile to splat for size then, but I'll add a comment to revisit the optimization if we start merging in IR.

In D5347#16, @delena wrote:

Ok, if you want to use VMOVDDUP, you can still do it via a pattern in the td file. This pattern works perfectly:

+  def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
+            (VMOVDDUPrm addr:$src)>;
+  def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
+            (VMOVDDUPrm addr:$src)>;
+

Thanks! That does solve all of the testcases in my splat-for-size test file... it even replaces the vpbroadcastq for v2i64 on AVX2 with a vmovddup, which is even better for size.

But that's a problem...according to Intel's optimization guides, when optimizing for speed, we don't want to use vmovddup for v2i64 when AVX2 is available because that's a mismatch between FP and int domains. This also causes the "Q64" test in test/CodeGen/X86/avx2-vbroadcast.ll to fail - it is expecting vpbroadcastq.

Is there a way to use patterns but still distinguish between the conflicting optimization goals of speed and size in that one case? Or just let it slide that vpbroadcastq is an extra byte and always use that instruction for v2i64 with AVX2? (That's what was happening with my patch anyway.)

In D5347#18, @spatel wrote:

Is there a way to use patterns but still distinguish between the conflicting optimization goals of speed and size in that one case?
Or just let it slide that vpbroadcastq is an extra byte and always use that instruction for v2i64 with AVX2? (That's what was
happening with my patch anyway.)

I thought I had stumbled into the answer with:

def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
          (VMOVDDUPrm addr:$src)>, Requires<[OptForSize]>;
def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
          (VMOVDDUPrm addr:$src)>, Requires<[OptForSize]>;

But that doesn't change the failing testcase in avx2-vbroadcast.ll - we're still generating vmovddup even without an OptForSize attribute on the function.

I tried this:
let Predicates = [OptForSize, UseAVX] in {

def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
          (VMOVDDUPrm addr:$src)>;
def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
          (VMOVDDUPrm addr:$src)>;

}

It solves the Q64 test in avx2-vbroadcast.ll.

  • Elena
spatel updated this revision to Diff 13843.Sep 18 2014, 12:36 PM

Use patterns (thanks, Elena!) instead of code to handle just the cases we want to optimize for size.

With this patch, the only actual code change is in the 2 'if' statements that guard the generation of the VBROADCAST node.
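
For readers without the diff open, here is a rough standalone model of the shape of that guard, using hypothetical names (TargetCaps, shouldBroadcastConstantSplat); the real code operates on the SelectionDAG state quoted earlier in this review:

    // Standalone sketch of the guard logic being described, not the actual
    // LLVM code: broadcast a constant splat when AVX2 makes broadcasts cheap,
    // or when optimizing for size on any AVX-capable target.
    struct TargetCaps {
      bool HasAVX;
      bool HasAVX2;
    };

    bool shouldBroadcastConstantSplat(bool IsConstSplat, bool OptForSize,
                                      const TargetCaps &ST) {
      if (!IsConstSplat || !ST.HasAVX) // lowering is already guarded for non-AVX
        return false;
      return ST.HasAVX2 || OptForSize;
    }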

In D5347#17, @spatel wrote:
In D5347#14, @rafael wrote:

Then yes, the linker will merge them. For ELF the entsize can be any
value, no sure if linkers actually merge all possible sizes. We could
do a better job at merging these in the IR, but we don't at the
moment.

Thanks, Rafael. I'll stick with the assumption that it's still worthwhile to splat for size then, but I'll add a comment to revisit the optimization if we start merging in IR.

Bug for doing constant pool merging in IR:
http://llvm.org/bugs/show_bug.cgi?id=16711

I added a TODO comment to this patch about multiple loads of the same constant. It's possible that this patch will already increase overall size today just based on link-time constant merging, but I think that's unlikely in general.

For this patch to be detrimental to size, we would have to generate 2 or more new splat loads (10 new bytes of instructions) of a single 64-bit scalar instead of 1 fused load/op of a 128-bit vector. That's the worst case. For a v8f32, it would take at least 6 loads of the same constant (+30 bytes of splat load instructions) to override the 28 data bytes of savings from using a scalar constant instead of a vector constant.
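
Spelling out the arithmetic behind those two claims (same numbers as above: roughly 5 instruction bytes per splat load, an 8-byte scalar vs. a 16-byte vector constant, and a 4-byte scalar vs. a 32-byte vector constant):

    2 splat loads x 5 bytes = 10 instruction bytes  >  16 - 8 = 8 data bytes saved   (64-bit scalar case)
    n splat loads x 5 bytes > 32 - 4 = 28 data bytes saved  =>  n >= 6               (v8f32 case)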

Please let me know if this is OK to check in now.

For reference, I asked about the Predicates vs. Requires tablegen behavior here:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-September/076935.html

LGTM

  • Elena
spatel closed this revision.Sep 22 2014, 12:04 PM
spatel updated this revision to Diff 13945.

Closed by commit rL218263 (authored by @spatel).