This is an archive of the discontinued LLVM Phabricator instance.

Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2)
ClosedPublic

Authored by spatel on Sep 14 2014, 2:05 PM.

Details

Summary

Currently, we generate broadcast instructions on CPUs with AVX2 to load some constant splat vectors.
This patch should preserve all existing behavior with regular optimization levels, but also use splats whenever possible when optimizing for *size* on any CPU with AVX or AVX2.

The tradeoff is up to 5 extra instruction bytes for the broadcast instruction to save at least 8 bytes (up to 31 bytes) of constant pool data.

With -Os (function attribute "optsize"), the change for the included testcase file, which covers all 12 AVX2 vector data types (f32, f64, i8, i16, i32, i64 for 128-bit and 256-bit vectors), is:

AVX: +29 inst -112 data = 83 bytes saved
AVX2: +29 inst -106 data = 77 bytes saved
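
As a rough back-of-the-envelope model of that tradeoff (a standalone sketch using illustrative sizes consistent with the numbers above, not code from the patch):

    // Rough model of the size tradeoff: a broadcast adds up to ~5 instruction
    // bytes but shrinks the constant-pool entry from a full vector down to one
    // scalar element. Illustrative numbers only.
    #include <cstdio>

    int main() {
      const int BroadcastInstBytes = 5;  // worst-case extra instruction bytes
      const int VectorBytes = 32;        // e.g. a v4f64 or v8f32 constant
      const int ScalarBytes = 8;         // one f64 element
      std::printf("net bytes saved per splat: %d\n",
                  (VectorBytes - ScalarBytes) - BroadcastInstBytes); // 19
      return 0;
    }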

Note: Is there any optimization pass in LLVM that merges constant pool data from different functions? Could that also be done at link time? If such merging exists, it might change the criteria for generating a broadcast, because we might not want to generate extra instructions if the same constant data is loaded multiple times.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 13686.Sep 14 2014, 2:05 PM
spatel retitled this revision from to Use broadcasts to optimize overall size when loading constant splat vectors (x86-64 with AVX or AVX2).
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: nadav, chandlerc, andreadb.
spatel added a subscriber: Unknown Object (MLST).
delena added a subscriber: delena.Sep 15 2014, 12:31 AM

if (ConstSplatVal && (Subtarget->hasAVX2() || OptForSize)) {

  EVT CVT = Ld.getValueType();
  assert(!CVT.isVector() && "Must not broadcast a vector type");
  unsigned Opcode = X86ISD::VBROADCAST; // This only changes for v2[f|i]64.

You can't generate VBROADCAST for SSE. You should check the target here.

  • Elena

One more:

// The v2[f/i]64 case is a mess because there is no VBROADCAST to handle it.

You can take the broadcast to ymm as well.
VBROADCASTSD ymm1,m64

Just add a pattern to the td file.

  • Elena
spatel updated this revision to Diff 13713.Sep 15 2014, 9:22 AM
spatel added a reviewer: delena.

Hi Elena -

Thank you for the feedback. This function is already guarded against non-AVX CPUs. I've made that more explicit in the updated patch.

You can take the broadcast to ymm as well.
VBROADCASTSD ymm1,m64

This patch generates that - please see the testcases for expected output in both the AVX and AVX2 cases. Let me know if I missed something. I've tried to match the correct broadcast instruction with the vector element type in all cases, unless the ISA does not allow it.

delena edited edge metadata.Sep 16 2014, 1:25 AM

I just suggest adding this pattern to X86InstrSSE.td:

def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
          (v2i64 (EXTRACT_SUBREG (v4i64 (VBROADCASTSDYrm addr:$src)), sub_xmm))>;

and remove this code:

+  if (VecSize == 128 && ScalarSize == 64) {
+    // This is only a size optimization - could be slightly slower in time.
+    if (OptForSize) {
+      if (ConstantSDNode *CI = dyn_cast<ConstantSDNode>(Ld)) {
+        C = CI->getConstantIntValue();
+        if (!Subtarget->hasAVX2()) {
+          // For an AVX CPU, fake an int splat with FP splat.
+          Opcode = X86ISD::MOVDDUP;
+          CVT = MVT::v2f64;
+          VT = MVT::v2f64;
+        }
+      } else if (ConstantFPSDNode *CF = dyn_cast<ConstantFPSDNode>(Ld)) {
+        C = CF->getConstantFPValue();
+        Opcode = X86ISD::MOVDDUP;
+      }
+    }

  • Elena
In D5347#10, @delena wrote:

I just suggest adding this pattern to X86InstrSSE.td:

def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
          (v2i64 (EXTRACT_SUBREG (v4i64 (VBROADCASTSDYrm addr:$src)), sub_xmm))>;

I tried this, but it's not producing the codegen that I want. Specifically, we want to use movddup when possible, and we don't want to alter codegen at all when not optimizing for size. (Apologies for pattern ignorance - I haven't used these yet.)

  1. In the testcase for v2f64, no splat is generated (movddup expected).
  2. In the testcase for v2i64 with AVX, we get:

     vbroadcastsd	LCPI4_0(%rip), %ymm1
     vpaddq	%xmm1, %xmm0, %xmm0
     vzeroupper

     Can the pattern be rewritten to avoid the vzeroupper? Even if yes, movddup is smaller than vbroadcastsd.

This is worse in size than what my patch produces:

vmovddup	LCPI4_0(%rip), %xmm1
vpaddq	%xmm1, %xmm0, %xmm0
  3. In the testcase for v4i64 with AVX, we again would generate vbroadcastsd:

     vbroadcastsd	LCPI5_0(%rip), %ymm1
     vextractf128	$1, %ymm0, %xmm2
     vpaddq	%xmm1, %xmm2, %xmm2
     vpaddq	%xmm1, %xmm0, %xmm0
     vinsertf128	$1, %xmm2, %ymm0, %ymm0

But movddup is better because it is one byte smaller than vbroadcastsd.

  4. Using the pattern also caused a failure in test/CodeGen/X86/exedepsfix-broadcast.ll because a broadcast is generated even when not optimizing for size. I don't think we want to use a broadcast in that case?
spatel updated this revision to Diff 13756.Sep 16 2014, 10:18 AM
spatel edited edge metadata.

Patch rebased and added possible movddup return value to function-level comment.

Note: Is there any optimization pass in LLVM that merges constant pool data from different functions? Could that also be done at link time? If such merging exists, it might change the criteria for generating a broadcast, because we might not want to generate extra instructions if the same constant data is loaded multiple times.

Depends on the type. For ELF, if the data is being put in a section like

.section .foo,"aM",@progbits,8

Then yes, the linker will merge them. For ELF the entsize can be any
value; not sure if linkers actually merge all possible sizes. We could
do a better job at merging these in the IR, but we don't at the
moment.

Cheers,
Rafael
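
To make the duplication scenario concrete, here is a minimal hypothetical example (not from the patch; build with -mavx): both functions splat the same 64-bit constant, so each one pays for its own broadcast instruction, and collapsing the two identical 8-byte pool entries into one is up to the kind of mergeable-section handling described above.

    // Hypothetical illustration of the duplication concern: two functions
    // splatting the same constant. With a broadcast, each function carries
    // ~5 instruction bytes to load an 8-byte pool entry; whether the two
    // identical entries are folded into one depends on linker merging of
    // the mergeable constant section.
    #include <immintrin.h>

    __m256d add_k_a(__m256d x) {
      return _mm256_add_pd(x, _mm256_set1_pd(42.0)); // splat of 42.0
    }

    __m256d add_k_b(__m256d x) {
      return _mm256_add_pd(x, _mm256_set1_pd(42.0)); // same constant again
    }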

Ok, if you want to use VMOVDDUP, you can still do it via a pattern in the td file. This pattern works perfectly:

--- lib/Target/X86/X86InstrSSE.td	(revision 217862)
+++ lib/Target/X86/X86InstrSSE.td	(working copy)
@@ -5279,6 +5279,11 @@
                   (v2i64 (scalar_to_vector (loadi64 addr:$src))))),
            (VMOVDDUPrm addr:$src)>, Requires<[HasAVX]>;
 
+  def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
+            (VMOVDDUPrm addr:$src)>;
+  def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
+            (VMOVDDUPrm addr:$src)>;
+
 // 256-bit version
 def : Pat<(X86Movddup (loadv4f64 addr:$src)),
           (VMOVDDUPYrm addr:$src)>;
  • Elena
In D5347#14, @rafael wrote:

Then yes, the linker will merge them. For ELF the entsize can be any
value; not sure if linkers actually merge all possible sizes. We could
do a better job at merging these in the IR, but we don't at the
moment.

Thanks, Rafael. I'll stick with the assumption that it's still worthwhile to splat for size then, but I'll add a comment to revisit the optimization if we start merging in IR.

In D5347#16, @delena wrote:

Ok, if you want to use VMOVDDUP, you can still do it via a pattern in the td file. This pattern works perfectly:

+  def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
+            (VMOVDDUPrm addr:$src)>;
+  def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
+            (VMOVDDUPrm addr:$src)>;
+

Thanks! That does solve all of the testcases in my splat-for-size test file... it even replaces the vpbroadcastq for v2i64 on AVX2 with a vmovddup, which is even better for size.

But that's a problem...according to Intel's optimization guides, when optimizing for speed, we don't want to use vmovddup for v2i64 when AVX2 is available because that's a mismatch between FP and int domains. This also causes the "Q64" test in test/CodeGen/X86/avx2-vbroadcast.ll to fail - it is expecting vpbroadcastq.

Is there a way to use patterns but still distinguish between the conflicting optimization goals of speed and size in that one case? Or just let it slide that vpbroadcastq is an extra byte and always use that instruction for v2i64 with AVX2? (That's what was happening with my patch anyway.)

In D5347#18, @spatel wrote:

Is there a way to use patterns but still distinguish between the conflicting optimization goals of speed and size in that one case?
Or just let it slide that vpbroadcastq is an extra byte and always use that instruction for v2i64 with AVX2? (That's what was
happening with my patch anyway.)

I thought I had stumbled into the answer with:

def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
          (VMOVDDUPrm addr:$src)>, Requires<[OptForSize]>;
def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
          (VMOVDDUPrm addr:$src)>, Requires<[OptForSize]>;

But that doesn't change the failing testcase in avx2-vbroadcast.ll - we're still generating vmovddup even without an OptForSize attribute on the function.

I tried this:
let Predicates = [OptForSize, UseAVX] in {

def : Pat<(v2f64 (X86VBroadcast (loadf64 addr:$src))),
          (VMOVDDUPrm addr:$src)>;
def : Pat<(v2i64 (X86VBroadcast (loadi64 addr:$src))),
          (VMOVDDUPrm addr:$src)>;

}

It solves the Q64 test in avx2-vbroadcast.ll.

  • Elena
spatel updated this revision to Diff 13843.Sep 18 2014, 12:36 PM

Use patterns (thanks, Elena!) instead of code to handle just the cases we want to optimize for size.

With this patch, the only actual code change is in the 2 'if' statements that guard the generation of the VBROADCAST node.
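
For readers without the diff open, here is a rough standalone model of the shape of that guard, using hypothetical names (TargetCaps, shouldBroadcastConstantSplat); the real code operates on the SelectionDAG state quoted earlier in this review:

    // Standalone sketch of the guard logic being described, not the actual
    // LLVM code: broadcast a constant splat when AVX2 makes broadcasts cheap,
    // or when optimizing for size on any AVX-capable target.
    struct TargetCaps {
      bool HasAVX;
      bool HasAVX2;
    };

    bool shouldBroadcastConstantSplat(bool IsConstSplat, bool OptForSize,
                                      const TargetCaps &ST) {
      if (!IsConstSplat || !ST.HasAVX) // lowering is already guarded for non-AVX
        return false;
      return ST.HasAVX2 || OptForSize;
    }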

In D5347#17, @spatel wrote:
In D5347#14, @rafael wrote:

Then yes, the linker will merge them. For ELF the entsize can be any
value, no sure if linkers actually merge all possible sizes. We could
do a better job at merging these in the IR, but we don't at the
moment.

Thanks, Rafael. I'll stick with the assumption that it's still worthwhile to splat for size then, but I'll add a comment to revisit the optimization if we start merging in IR.

Bug for doing constant pool merging in IR:
http://llvm.org/bugs/show_bug.cgi?id=16711

I added a TODO comment to this patch about multiple loads of the same constant. It's possible that this patch will already increase overall size today just based on link-time constant merging, but I think that's unlikely in general.

For this patch to be detrimental to size, we would have to generate 2 or more new splat loads (10 new bytes of instructions) of a single 64-bit scalar instead of 1 fused load/op of a 128-bit vector. That's the worst case. For a v8f32, it would take at least 6 loads of the same constant (+30 bytes of splat load instructions) to override the 28 data bytes of savings from using a scalar constant instead of a vector constant.
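
Spelling out the arithmetic behind those two claims (same numbers as above: roughly 5 instruction bytes per splat load, an 8-byte scalar vs. a 16-byte vector constant, and a 4-byte scalar vs. a 32-byte vector constant):

    2 splat loads x 5 bytes = 10 instruction bytes  >  16 - 8 = 8 data bytes saved   (64-bit scalar case)
    n splat loads x 5 bytes > 32 - 4 = 28 data bytes saved  =>  n >= 6               (v8f32 case)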

Please let me know if this is OK to check in now.

For reference, I asked about the Predicates vs. Requires tablegen behavior here:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-September/076935.html

LGTM

  • Elena
spatel closed this revision.Sep 22 2014, 12:04 PM
spatel updated this revision to Diff 13945.

Closed by commit rL218263 (authored by @spatel).