This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Back end improvements to vec_splat
ClosedPublic

Authored by nemanjai on Mar 30 2016, 4:27 AM.

Download Raw Diff

Details

Reviewers

wschmidt
kbarton
amehsan
seurer
hfinkel

Summary

We currently have no way to emit the xxspltw for a word splat and a call to vec_splat for vectors with 8-byte elements get translated to vperm with a mask vector that comes from memory.
This patch provides intrinsics to get at xxspltw and xxspltd (extended mnemonic). In a subsequent patch, altivec.h will be modified to use these intrinsics for the respective vec_splat definitions.

This provides a significant improvement in one benchmark that uses vec_splat.

Diff Detail

Repository: rL LLVM

Event Timeline

nemanjai updated this revision to Diff 52035.Mar 30 2016, 4:27 AM

nemanjai retitled this revision from to [PowerPC] Back end improvements to vec_splat.

nemanjai updated this object.

nemanjai added reviewers: hfinkel, kbarton, wschmidt, amehsan, seurer.

nemanjai set the repository for this revision to rL LLVM.

nemanjai added a subscriber: llvm-commits.

amehsan added inline comments.Mar 30 2016, 4:37 AM

include/llvm/IR/IntrinsicsPowerPC.td
752–755	Are you going to define a builtin and emit this intrinsics in the clang, when that builtin is seen? Can't we use shufflevector instruction, instead of target specific intrinsic?

nemanjai mentioned this in D18593: [PowerPC] Front end improvements for vec_splat.Mar 30 2016, 5:24 AM

nemanjai added inline comments.Mar 30 2016, 8:07 AM

include/llvm/IR/IntrinsicsPowerPC.td
752–755	Yes, D18593 has the FE portions of this patch. I don't think having this intrinsic prevents us from using these for the shufflevector instruction. These can be subsequently added as options to the framework for emitting vector shuffles. Namely, shufflevector is probably too general of an instruction to do this all in the .td files. Notice that all the Altivec instructions that can be used for vector shuffles have code fragments that check whether the node has a mask that is eligible for that instruction. The canonical type for all the vector shuffles is v16i8 which isn't a type that can be in vsrc. Of course, this can be changed... I think that the simplest thing to do would be similar to the way the QPX QVESPLATI instruction is handled (namely, check for VSX, check that the node is a splat of the correct type, emit xxspltw/xxpermdi). Of course, this is just my opinion and if people feel that I should implement this as a vector shuffle immediately, I can do so.

amehsan added inline comments.Mar 30 2016, 11:48 AM

include/llvm/IR/IntrinsicsPowerPC.td
752–755	You definitely know more about our codegen than I do and I may be missing something here. Also it will be good to hear from other reviewers about this (@hfinkel, @kbarton, @wschmidt, @seurer). The way I see it (and I maybe wrong) is this. We have two options: 1- Current implementation which is simpler. 2- Using vector shuffle, which requires extra C++ code, to get it right. It is more complicated. I think (1) results in introducing a new target specific intrinsic which is not understood by optimizations. Also this intrinsic maybe used for other purposes in the future. Potentially causing optimization problems. So even though (2) is more complicated, to me it seems that (2) is the right way to go. One reason that I see we may want to choose (1) is that the complexity of (2) is too much. There might be other reasons that I have not realized.

hfinkel added inline comments.Mar 30 2016, 12:16 PM

include/llvm/IR/IntrinsicsPowerPC.td
752–755	I don't really see a difference between what the two of you are proposing, except whether the frontend uses an intrinsic or not. Is this right? Our general preference, across all targets, is that the frontend headers emit generic IR whenever possible (including using things like __builtin_shufflevector, or equivalent syntax. Then we match it in the backend.

Yes, please use generic IR if it is possible to represent. Thanks.

nemanjai added inline comments.Mar 30 2016, 2:38 PM

include/llvm/IR/IntrinsicsPowerPC.td
752–755	OK, I think it's settled. I'll update the patch so that we retain the information that a vec_splat is a vector_shuffle rather than using a target-specific builtin. Thank you all for your comments. New patch coming up tomorrow :).

I've changed this around a bit to match an SDAG pattern that will be emitted when lowering vector_shuffle nodes rather than matching an intrinsic. This way, we'll get the FE to emit vector shuffles for splats (in the update to D18593).

Also, since XXSPLTW is a splat, I've added the handling for it to the swap removal pass.

A couple of comments, otherwise LGTM.

lib/Target/PowerPC/PPCVSXSwapRemoval.cpp
410 ↗	(On Diff #52216)	Looks like the "This is not yet implemented. When it is, we need to uncomment the following:" part of this comment is out-dated. If so, please remove it.
test/CodeGen/PowerPC/swaps-le-2.ll
90 ↗	(On Diff #52216)	Please add a few more test cases for other index values. The space of bugs that might happen to continue to cause this index to be 0, but incorrectly compute others is pretty large.

This revision is now accepted and ready to land.Apr 26 2016, 6:11 PM

Committed revision 268516.

Revision Contents

Path

Size

include/

llvm/

IR/

IntrinsicsPowerPC.td

7 lines

lib/

Target/

PowerPC/

PPCInstrVSX.td

8 lines

Diff 52035

include/llvm/IR/IntrinsicsPowerPC.td

Show First 20 Lines • Show All 740 Lines • ▼ Show 20 Lines	def int_ppc_vsx_xvcmpgtsp :
PowerPC_VSX_Intrinsic<"xvcmpgtsp", [llvm_v4i32_ty],		PowerPC_VSX_Intrinsic<"xvcmpgtsp", [llvm_v4i32_ty],
[llvm_v4f32_ty, llvm_v4f32_ty], [IntrNoMem]>;		[llvm_v4f32_ty, llvm_v4f32_ty], [IntrNoMem]>;
def int_ppc_vsx_xvcmpgtsp_p : GCCBuiltin<"__builtin_vsx_xvcmpgtsp_p">,		def int_ppc_vsx_xvcmpgtsp_p : GCCBuiltin<"__builtin_vsx_xvcmpgtsp_p">,
Intrinsic<[llvm_i32_ty],[llvm_i32_ty,llvm_v4f32_ty,llvm_v4f32_ty],		Intrinsic<[llvm_i32_ty],[llvm_i32_ty,llvm_v4f32_ty,llvm_v4f32_ty],
[IntrNoMem]>;		[IntrNoMem]>;
def int_ppc_vsx_xxleqv :		def int_ppc_vsx_xxleqv :
PowerPC_VSX_Intrinsic<"xxleqv", [llvm_v4i32_ty],		PowerPC_VSX_Intrinsic<"xxleqv", [llvm_v4i32_ty],
[llvm_v4i32_ty, llvm_v4i32_ty], [IntrNoMem]>;		[llvm_v4i32_ty, llvm_v4i32_ty], [IntrNoMem]>;
		def int_ppc_vsx_xxspltw :
		PowerPC_VSX_Intrinsic<"xxspltw", [llvm_v4i32_ty],
		[llvm_v4i32_ty, llvm_i32_ty], [IntrNoMem]>;
		def int_ppc_vsx_xxpermdi :
		PowerPC_VSX_Intrinsic<"xxpermdi", [llvm_v2i64_ty],
		[llvm_v2i64_ty, llvm_v2i64_ty, llvm_i32_ty],
		[IntrNoMem]>;
		amehsanUnsubmitted Not Done Reply Inline Actions Are you going to define a builtin and emit this intrinsics in the clang, when that builtin is seen? Can't we use shufflevector instruction, instead of target specific intrinsic? amehsan: Are you going to define a builtin and emit this intrinsics in the clang, when that builtin is…
		nemanjaiAuthorUnsubmitted Not Done Reply Inline Actions Yes, D18593 has the FE portions of this patch. I don't think having this intrinsic prevents us from using these for the shufflevector instruction. These can be subsequently added as options to the framework for emitting vector shuffles. Namely, shufflevector is probably too general of an instruction to do this all in the .td files. Notice that all the Altivec instructions that can be used for vector shuffles have code fragments that check whether the node has a mask that is eligible for that instruction. The canonical type for all the vector shuffles is v16i8 which isn't a type that can be in vsrc. Of course, this can be changed... I think that the simplest thing to do would be similar to the way the QPX QVESPLATI instruction is handled (namely, check for VSX, check that the node is a splat of the correct type, emit xxspltw/xxpermdi). Of course, this is just my opinion and if people feel that I should implement this as a vector shuffle immediately, I can do so. nemanjai: Yes, D18593 has the FE portions of this patch. I don't think having this intrinsic prevents us…
		amehsanUnsubmitted Not Done Reply Inline Actions You definitely know more about our codegen than I do and I may be missing something here. Also it will be good to hear from other reviewers about this (@hfinkel, @kbarton, @wschmidt, @seurer). The way I see it (and I maybe wrong) is this. We have two options: 1- Current implementation which is simpler. 2- Using vector shuffle, which requires extra C++ code, to get it right. It is more complicated. I think (1) results in introducing a new target specific intrinsic which is not understood by optimizations. Also this intrinsic maybe used for other purposes in the future. Potentially causing optimization problems. So even though (2) is more complicated, to me it seems that (2) is the right way to go. One reason that I see we may want to choose (1) is that the complexity of (2) is too much. There might be other reasons that I have not realized. amehsan: You definitely know more about our codegen than I do and I may be missing something here. Also…
		hfinkelUnsubmitted Not Done Reply Inline Actions I don't really see a difference between what the two of you are proposing, except whether the frontend uses an intrinsic or not. Is this right? Our general preference, across all targets, is that the frontend headers emit generic IR whenever possible (including using things like __builtin_shufflevector, or equivalent syntax. Then we match it in the backend. hfinkel: I don't really see a difference between what the two of you are proposing, except whether the…
		nemanjaiAuthorUnsubmitted Not Done Reply Inline Actions OK, I think it's settled. I'll update the patch so that we retain the information that a vec_splat is a vector_shuffle rather than using a target-specific builtin. Thank you all for your comments. New patch coming up tomorrow :). nemanjai: OK, I think it's settled. I'll update the patch so that we retain the information that a…
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// PowerPC QPX Intrinsics.		// PowerPC QPX Intrinsics.
//		//

let TargetPrefix = "ppc" in { // All PPC intrinsics start with "llvm.ppc.".		let TargetPrefix = "ppc" in { // All PPC intrinsics start with "llvm.ppc.".
/// PowerPC_QPX_Intrinsic - Base class for all QPX intrinsics.		/// PowerPC_QPX_Intrinsic - Base class for all QPX intrinsics.
▲ Show 20 Lines • Show All 229 Lines • Show Last 20 Lines

lib/Target/PowerPC/PPCInstrVSX.td

Show First 20 Lines • Show All 759 Lines • ▼ Show 20 Lines	def XXMRGHW : XX3Form<60, 18,
(outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB),		(outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB),
"xxmrghw $XT, $XA, $XB", IIC_VecPerm, []>;		"xxmrghw $XT, $XA, $XB", IIC_VecPerm, []>;
def XXMRGLW : XX3Form<60, 50,		def XXMRGLW : XX3Form<60, 50,
(outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB),		(outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB),
"xxmrglw $XT, $XA, $XB", IIC_VecPerm, []>;		"xxmrglw $XT, $XA, $XB", IIC_VecPerm, []>;

def XXPERMDI : XX3Form_2<60, 10,		def XXPERMDI : XX3Form_2<60, 10,
(outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB, u2imm:$DM),		(outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB, u2imm:$DM),
"xxpermdi $XT, $XA, $XB, $DM", IIC_VecPerm, []>;		"xxpermdi $XT, $XA, $XB, $DM", IIC_VecPerm,
		[(set v2i64:$XT, (int_ppc_vsx_xxpermdi v2i64:$XA,
		v2i64:$XB, imm:$DM))]>;
def XXSEL : XX4Form<60, 3,		def XXSEL : XX4Form<60, 3,
(outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB, vsrc:$XC),		(outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB, vsrc:$XC),
"xxsel $XT, $XA, $XB, $XC", IIC_VecPerm, []>;		"xxsel $XT, $XA, $XB, $XC", IIC_VecPerm, []>;

def XXSLDWI : XX3Form_2<60, 2,		def XXSLDWI : XX3Form_2<60, 2,
(outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB, u2imm:$SHW),		(outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB, u2imm:$SHW),
"xxsldwi $XT, $XA, $XB, $SHW", IIC_VecPerm, []>;		"xxsldwi $XT, $XA, $XB, $SHW", IIC_VecPerm, []>;
def XXSPLTW : XX2Form_2<60, 164,		def XXSPLTW : XX2Form_2<60, 164,
(outs vsrc:$XT), (ins vsrc:$XB, u2imm:$UIM),		(outs vsrc:$XT), (ins vsrc:$XB, u2imm:$UIM),
"xxspltw $XT, $XB, $UIM", IIC_VecPerm, []>;		"xxspltw $XT, $XB, $UIM", IIC_VecPerm,
		[(set v4i32:$XT,
		(int_ppc_vsx_xxspltw v4i32:$XB, imm:$UIM))]>;
} // hasSideEffects		} // hasSideEffects

// SELECT_CC_* - Used to implement the SELECT_CC DAG operation. Expanded after		// SELECT_CC_* - Used to implement the SELECT_CC DAG operation. Expanded after
// instruction selection into a branch sequence.		// instruction selection into a branch sequence.
let usesCustomInserter = 1, // Expanded after instruction selection.		let usesCustomInserter = 1, // Expanded after instruction selection.
PPC970_Single = 1 in {		PPC970_Single = 1 in {

def SELECT_CC_VSRC: Pseudo<(outs vsrc:$dst),		def SELECT_CC_VSRC: Pseudo<(outs vsrc:$dst),
▲ Show 20 Lines • Show All 1,355 Lines • Show Last 20 Lines