This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] look through bitcasts when trying to narrow vector binops
Closed, Public

Authored by spatel on Nov 11 2018, 8:29 AM.

Details

Reviewers

efriedma
craig.topper
RKSimon
andreadb
t.p.northover

Commits

rG357053f2899f: [DAGCombiner] look through bitcasts when trying to narrow vector binops
rL347356: [DAGCombiner] look through bitcasts when trying to narrow vector binops

Summary

This is another step in vector narrowing - a follow-up to D53784 (and hoping to eventually squash potential regressions seen in D51553).

Most of the x86 test diffs are wins, but the AArch64 diff is probably not. Do we need the target hook to distinguish between "isExtractSubvectorCheap" and "isExtractSubvectorFree"?

This problem probably already exists independent of this patch, but it went unnoticed in the previous patch because there were no regression tests that showed that possibility.

The x86 diff in i64-mem-copy.ll is also close. Given the frequency throttling concerns with using wider vector ops, I think an extra extract to reduce vector width is a reasonable trade-off at this level of codegen, but if we're going to refine the target hook for AArch64, we might adjust the x86 override too.

Diff Detail

Event Timeline

spatel created this revision. Nov 11 2018, 8:29 AM

Herald added subscribers: kristof.beyls, javed.absar, mcrosier. Nov 11 2018, 8:29 AM

spatel added inline comments. Nov 11 2018, 8:40 AM

test/CodeGen/AArch64/arm64-ld1.ll
918–919	Side note for the ARM folks - I think this applies here? UXTL{2} <Vd>.<Ta>, <Vn>.<Tb> is equivalent to USHLL{2} <Vd>.<Ta>, <Vn>.<Tb>, #0 and is the preferred disassembly...

The X86 code looks reasonable to me - not sure how far down the rabbit hole we want to go with 'free vs cheap' - especially as 'cheap' can be so relative, it'll quickly end up needing a cost model...

@t.p.northover The aarch64 regression looks like a failure in performAddSubLongCombine?

Ping (AArch64 expertise requested).

It's probably okay to canonicalize the way you are, but you're hitting a missing pattern for AArch64. Something like the following appears to work:

def : Pat<(sub (extract_subvector (zext v8i8:$LHS), (i64 0)),
               (extract_subvector (zext v8i8:$RHS), (i64 0))),
          (EXTRACT_SUBREG (USUBLv8i8_v8i16 v8i8:$LHS, v8i8:$RHS), dsub)>;

Of course, this needs to be rewritten to match all the relevant types and operations. x86 doesn't really have those sorts of operations, I guess?

(performAddSubLongCombine probably also should be extended, but that's not what you're seeing.)
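
A minimal scalar model of the pattern above, written in plain C++ rather than LLVM code, may help show why it is sound. The helper names (`zext8`, `usubl`, `lowHalf`, `narrowSub`) are made up for illustration and do not exist in LLVM. The sketch checks that subtracting the low four lanes of two zero-extended vectors gives the same lanes as the low half of a full widening subtract, which is what selecting `USUBLv8i8_v8i16` and taking `dsub` would compute.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using V8i8 = std::array<std::uint8_t, 8>;
using V8i16 = std::array<std::uint16_t, 8>;
using V4i16 = std::array<std::uint16_t, 4>;

// Models (zext v8i8 -> v8i16).
V8i16 zext8(const V8i8 &V) {
  V8i16 R{};
  for (std::size_t I = 0; I < 8; ++I)
    R[I] = V[I];
  return R;
}

// Models a widening subtract (usubl.8h): zero-extend both inputs, then
// subtract lane by lane with 16-bit wraparound.
V8i16 usubl(const V8i8 &A, const V8i8 &B) {
  V8i16 ZA = zext8(A), ZB = zext8(B), R{};
  for (std::size_t I = 0; I < 8; ++I)
    R[I] = static_cast<std::uint16_t>(ZA[I] - ZB[I]);
  return R;
}

// Models extract_subvector at index 0 (the low 64 bits of the register).
V4i16 lowHalf(const V8i16 &V) { return {V[0], V[1], V[2], V[3]}; }

// Models the narrowed DAG:
//   sub (extract_subvector (zext A), 0), (extract_subvector (zext B), 0)
V4i16 narrowSub(const V8i8 &A, const V8i8 &B) {
  V4i16 XA = lowHalf(zext8(A)), XB = lowHalf(zext8(B)), R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = static_cast<std::uint16_t>(XA[I] - XB[I]);
  return R;
}
```

For any inputs, `narrowSub(A, B)` and `lowHalf(usubl(A, B))` agree lane for lane, so matching the narrowed form back to the widening instruction loses nothing.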

test/CodeGen/AArch64/arm64-ld1.ll
918–919	Not sure why the alias isn't getting automatically applied; please file a bug.
test/CodeGen/X86/i64-mem-copy.ll
95	This appears to be one instruction more... but maybe it's worth avoiding 256-bit operations on x86?

In D54392#1303764, @efriedma wrote:
It's probably okay to canonicalize the way you are, but you're hitting a missing pattern for AArch64. Something like the following appears to work:
def : Pat<(sub (extract_subvector (zext v8i8:$LHS), (i64 0)),
               (extract_subvector (zext v8i8:$RHS), (i64 0))),
          (EXTRACT_SUBREG (USUBLv8i8_v8i16 v8i8:$LHS, v8i8:$RHS), dsub)>;
Of course, this needs to be rewritten to match all the relevant types and operations. x86 doesn't really have those sorts of operations, I guess?

Filed here:
https://bugs.llvm.org/show_bug.cgi?id=39722

Given that it's an existing bug, there's probably not much incentive to make this patch dependent on that getting fixed?

test/CodeGen/AArch64/arm64-ld1.ll
918–919	https://bugs.llvm.org/show_bug.cgi?id=39721
test/CodeGen/X86/i64-mem-copy.ll
95	Right - this is the test I mentioned in the initial summary. Given the current HW implementation choices (frequency throttling based on count of vector ops), I think this is the preferred form despite the extra instruction.

Yes, we don't need to block this fix, I think; LGTM

This revision is now accepted and ready to land. Nov 20 2018, 11:28 AM

Closed by commit rL347356: [DAGCombiner] look through bitcasts when trying to narrow vector binops (authored by spatel). Nov 20 2018, 2:29 PM

This revision was automatically updated to reflect the committed changes.

spatel marked 2 inline comments as done.

spatel mentioned this in D36650: [X86] WIP support narrowing operations when only a subvector is demanded. Jan 9 2019, 9:56 AM

Revision Contents

Path                                             Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp         27 lines
test/CodeGen/AArch64/arm64-ld1.ll                4 lines
test/CodeGen/X86/avx-vperm2x128.ll               4 lines
test/CodeGen/X86/avx1-logical-load-folding.ll    40 lines
test/CodeGen/X86/i64-mem-copy.ll                 3 lines
test/CodeGen/X86/pr36199.ll                      2 lines
test/CodeGen/X86/sad.ll                          4 lines

Diff 173558

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

@@ static SDValue narrowExtractedVectorBinOp(SDNode *Extract, SelectionDAG &DAG) {
   EVT NarrowBVT = EVT::getVectorVT(*DAG.getContext(), WideBVT.getScalarType(),
                                    WideNumElts / NarrowingRatio);
   const TargetLowering &TLI = DAG.getTargetLoweringInfo();
   if (!TLI.isOperationLegalOrCustomOrPromote(BOpcode, NarrowBVT))
     return SDValue();

   // If extraction is cheap, we don't need to look at the binop operands
   // for concat ops. The narrow binop alone makes this transform profitable.
-  // TODO: We're not dealing with the bitcasted pattern here. That limitation
-  // should be lifted.
-  if (Extract->getOperand(0) == BinOp && BinOp.hasOneUse() &&
-      TLI.isExtractSubvectorCheap(NarrowBVT, WideBVT, ExtractIndex)) {
+  // We can't just reuse the original extract index operand because we may have
+  // bitcasted.
+  unsigned ConcatOpNum = ExtractIndex / NumElems;
+  unsigned ExtBOIdx = ConcatOpNum * NarrowBVT.getVectorNumElements();
+  EVT ExtBOIdxVT = Extract->getOperand(1).getValueType();
+  if (TLI.isExtractSubvectorCheap(NarrowBVT, WideBVT, ExtBOIdx) &&
+      BinOp.hasOneUse() && Extract->getOperand(0)->hasOneUse()) {
     // extract (binop B0, B1), N --> binop (extract B0, N), (extract B1, N)
     SDLoc DL(Extract);
+    SDValue NewExtIndex = DAG.getConstant(ExtBOIdx, DL, ExtBOIdxVT);
     SDValue X = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, NarrowBVT,
-                            BinOp.getOperand(0), Extract->getOperand(1));
+                            BinOp.getOperand(0), NewExtIndex);
     SDValue Y = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, NarrowBVT,
-                            BinOp.getOperand(1), Extract->getOperand(1));
-    return DAG.getNode(BOpcode, DL, NarrowBVT, X, Y,
-                       BinOp.getNode()->getFlags());
+                            BinOp.getOperand(1), NewExtIndex);
+    SDValue NarrowBinOp = DAG.getNode(BOpcode, DL, NarrowBVT, X, Y,
+                                      BinOp.getNode()->getFlags());
+    return DAG.getBitcast(VT, NarrowBinOp);
   }

   // Only handle the case where we are doubling and then halving. A larger ratio
   // may require more than two narrow binops to replace the wide binop.
   if (NarrowingRatio != 2)
     return SDValue();

   // TODO: The motivating case for this transform is an x86 AVX1 target. That
@@ static SDValue narrowExtractedVectorBinOp(SDNode *Extract, SelectionDAG &DAG) {
   bool ConcatL =
       LHS.getOpcode() == ISD::CONCAT_VECTORS && LHS.getNumOperands() == 2;
   bool ConcatR =
       RHS.getOpcode() == ISD::CONCAT_VECTORS && RHS.getNumOperands() == 2;
   if (!ConcatL && !ConcatR)
     return SDValue();

   // If one of the binop operands was not the result of a concat, we must
-  // extract a half-sized operand for our new narrow binop. We can't just reuse
-  // the original extract index operand because we may have bitcasted.
-  unsigned ConcatOpNum = ExtractIndex / NumElems;
-  unsigned ExtBOIdx = ConcatOpNum * NarrowBVT.getVectorNumElements();
-  EVT ExtBOIdxVT = Extract->getOperand(1).getValueType();
+  // extract a half-sized operand for our new narrow binop.
   SDLoc DL(Extract);

   // extract (binop (concat X1, X2), (concat Y1, Y2)), N --> binop XN, YN
   // extract (binop (concat X1, X2), Y), N --> binop XN, (extract Y, N)
   // extract (binop X, (concat Y1, Y2)), N --> binop (extract X, N), YN
   SDValue X = ConcatL ? DAG.getBitcast(NarrowBVT, LHS.getOperand(ConcatOpNum))
                       : DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, NarrowBVT,
                                     BinOp.getOperand(0),
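
The index arithmetic in this hunk can be tried in isolation. The following is a minimal sketch, not LLVM code: `remapExtractIndex` is a hypothetical helper, and it assumes that `NumElems` is the element count of the extracted (possibly bitcasted) type and that the third argument stands in for `NarrowBVT.getVectorNumElements()`.

```cpp
// Hypothetical stand-in for the index remapping in the patch.
//   ExtractIndex:  index operand of the extract, counted in elements of the
//                  extracted (possibly bitcasted) vector type.
//   NumElems:      element count of the extracted type (assumption).
//   NarrowBOElts:  element count of the narrow binop type, NarrowBVT.
unsigned remapExtractIndex(unsigned ExtractIndex, unsigned NumElems,
                           unsigned NarrowBOElts) {
  // Which narrow-sized chunk of the wide vector is being extracted?
  unsigned ConcatOpNum = ExtractIndex / NumElems;
  // Re-express that chunk's start in elements of the binop's own type.
  return ConcatOpNum * NarrowBOElts;
}
```

As an illustration, extracting a 2 x i64 subvector at index 2 from a 16 x i16 add that was bitcast to 4 x i64 gives `remapExtractIndex(2, 2, 8)`, which is 8: the upper half of the add, counted in i16 lanes. With no bitcast in between, `remapExtractIndex(8, 8, 8)` is also 8, so the original index is reproduced.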

test/CodeGen/AArch64/arm64-ld1.ll

 ; Add rdar://13098923 test case: vld1_dup_u32 doesn't generate ld1r.2s
 define void @ld1r_2s_from_dup(i8* nocapture %a, i8* nocapture %b, i16* nocapture %diff) nounwind ssp {
 entry:
 ; CHECK: ld1r_2s_from_dup
 ; CHECK: ld1r.2s { [[ARG1:v[0-9]+]] }, [x0]
 ; CHECK-NEXT: ld1r.2s { [[ARG2:v[0-9]+]] }, [x1]
-; CHECK-NEXT: usubl.8h v[[RESREGNUM:[0-9]+]], [[ARG1]], [[ARG2]]
+; CHECK-NEXT: ushll.8h [[ARG1]], [[ARG1]], #0
+; CHECK-NEXT: ushll.8h [[ARG2]], [[ARG2]], #0
+; CHECK-NEXT: sub.4h v[[RESREGNUM:[0-9]+]], [[ARG1]], [[ARG2]]
 ; CHECK-NEXT: str d[[RESREGNUM]], [x2]
 ; CHECK-NEXT: ret
   %tmp = bitcast i8* %a to i32*
   %tmp1 = load i32, i32* %tmp, align 4
   %tmp2 = insertelement <2 x i32> undef, i32 %tmp1, i32 0
   %lane = shufflevector <2 x i32> %tmp2, <2 x i32> undef, <2 x i32> zeroinitializer
   %tmp3 = bitcast <2 x i32> %lane to <8 x i8>
   %tmp4 = bitcast i8* %b to i32*

test/CodeGen/X86/avx-vperm2x128.ll

 ; AVX1: # %bb.0: # %entry
 ; AVX1-NEXT: vpcmpeqd %xmm2, %xmm2, %xmm2
 ; AVX1-NEXT: vpsubw %xmm2, %xmm0, %xmm0
 ; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: shuffle_v16i16_4501:
 ; AVX2: # %bb.0: # %entry
-; AVX2-NEXT: vpcmpeqd %ymm2, %ymm2, %ymm2
-; AVX2-NEXT: vpsubw %ymm2, %ymm0, %ymm0
+; AVX2-NEXT: vpcmpeqd %xmm2, %xmm2, %xmm2
+; AVX2-NEXT: vpsubw %xmm2, %xmm0, %xmm0
 ; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
 ; AVX2-NEXT: retq
 entry:
 ; add forces execution domain
   %a2 = add <16 x i16> %a, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
   %shuffle = shufflevector <16 x i16> %a2, <16 x i16> %b, <16 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
   ret <16 x i16> %shuffle
 }

test/CodeGen/X86/avx1-logical-load-folding.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -O3 -disable-peephole -mtriple=i686-apple-macosx10.9.0 -mcpu=corei7-avx -mattr=+avx | FileCheck %s --check-prefix=X86
 ; RUN: llc < %s -O3 -disable-peephole -mtriple=x86_64-apple-macosx10.9.0 -mcpu=corei7-avx -mattr=+avx | FileCheck %s --check-prefix=X64

 ; Function Attrs: nounwind ssp uwtable
 define void @test1(float* %A, float* %C) #0 {
 ; X86-LABEL: test1:
 ; X86: ## %bb.0:
 ; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
 ; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-NEXT: vmovaps (%ecx), %ymm0
-; X86-NEXT: vandps LCPI0_0, %ymm0, %ymm0
+; X86-NEXT: vmovaps (%ecx), %xmm0
+; X86-NEXT: vandps LCPI0_0, %xmm0, %xmm0
 ; X86-NEXT: vmovss %xmm0, (%eax)
-; X86-NEXT: vzeroupper
 ; X86-NEXT: retl
 ;
 ; X64-LABEL: test1:
 ; X64: ## %bb.0:
-; X64-NEXT: vmovaps (%rdi), %ymm0
-; X64-NEXT: vandps {{.*}}(%rip), %ymm0, %ymm0
+; X64-NEXT: vmovaps (%rdi), %xmm0
+; X64-NEXT: vandps {{.*}}(%rip), %xmm0, %xmm0
 ; X64-NEXT: vmovss %xmm0, (%rsi)
-; X64-NEXT: vzeroupper
 ; X64-NEXT: retq
   %tmp1 = bitcast float* %A to <8 x float>*
   %tmp2 = load <8 x float>, <8 x float>* %tmp1, align 32
   %tmp3 = bitcast <8 x float> %tmp2 to <8 x i32>
   %tmp4 = and <8 x i32> %tmp3, <i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647>
   %tmp5 = bitcast <8 x i32> %tmp4 to <8 x float>
   %tmp6 = extractelement <8 x float> %tmp5, i32 0
   store float %tmp6, float* %C
   ret void
 }

 ; Function Attrs: nounwind ssp uwtable
 define void @test2(float* %A, float* %C) #0 {
 ; X86-LABEL: test2:
 ; X86: ## %bb.0:
 ; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
 ; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-NEXT: vmovaps (%ecx), %ymm0
-; X86-NEXT: vorps LCPI1_0, %ymm0, %ymm0
+; X86-NEXT: vmovaps (%ecx), %xmm0
+; X86-NEXT: vorps LCPI1_0, %xmm0, %xmm0
 ; X86-NEXT: vmovss %xmm0, (%eax)
-; X86-NEXT: vzeroupper
 ; X86-NEXT: retl
 ;
 ; X64-LABEL: test2:
 ; X64: ## %bb.0:
-; X64-NEXT: vmovaps (%rdi), %ymm0
-; X64-NEXT: vorps {{.*}}(%rip), %ymm0, %ymm0
+; X64-NEXT: vmovaps (%rdi), %xmm0
+; X64-NEXT: vorps {{.*}}(%rip), %xmm0, %xmm0
 ; X64-NEXT: vmovss %xmm0, (%rsi)
-; X64-NEXT: vzeroupper
 ; X64-NEXT: retq
   %tmp1 = bitcast float* %A to <8 x float>*
   %tmp2 = load <8 x float>, <8 x float>* %tmp1, align 32
   %tmp3 = bitcast <8 x float> %tmp2 to <8 x i32>
   %tmp4 = or <8 x i32> %tmp3, <i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647>
   %tmp5 = bitcast <8 x i32> %tmp4 to <8 x float>
   %tmp6 = extractelement <8 x float> %tmp5, i32 0
   store float %tmp6, float* %C
   ret void
 }

 ; Function Attrs: nounwind ssp uwtable
 define void @test3(float* %A, float* %C) #0 {
 ; X86-LABEL: test3:
 ; X86: ## %bb.0:
 ; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
 ; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-NEXT: vmovaps (%ecx), %ymm0
-; X86-NEXT: vxorps LCPI2_0, %ymm0, %ymm0
+; X86-NEXT: vmovaps (%ecx), %xmm0
+; X86-NEXT: vxorps LCPI2_0, %xmm0, %xmm0
 ; X86-NEXT: vmovss %xmm0, (%eax)
-; X86-NEXT: vzeroupper
 ; X86-NEXT: retl
 ;
 ; X64-LABEL: test3:
 ; X64: ## %bb.0:
-; X64-NEXT: vmovaps (%rdi), %ymm0
-; X64-NEXT: vxorps {{.*}}(%rip), %ymm0, %ymm0
+; X64-NEXT: vmovaps (%rdi), %xmm0
+; X64-NEXT: vxorps {{.*}}(%rip), %xmm0, %xmm0
 ; X64-NEXT: vmovss %xmm0, (%rsi)
-; X64-NEXT: vzeroupper
 ; X64-NEXT: retq
   %tmp1 = bitcast float* %A to <8 x float>*
   %tmp2 = load <8 x float>, <8 x float>* %tmp1, align 32
   %tmp3 = bitcast <8 x float> %tmp2 to <8 x i32>
   %tmp4 = xor <8 x i32> %tmp3, <i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647>
   %tmp5 = bitcast <8 x i32> %tmp4 to <8 x float>
   %tmp6 = extractelement <8 x float> %tmp5, i32 0
   store float %tmp6, float* %C
   ret void
 }

 define void @test4(float* %A, float* %C) #0 {
 ; X86-LABEL: test4:
 ; X86: ## %bb.0:
 ; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
 ; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-NEXT: vmovaps (%ecx), %ymm0
-; X86-NEXT: vandnps LCPI3_0, %ymm0, %ymm0
+; X86-NEXT: vmovaps (%ecx), %xmm0
+; X86-NEXT: vandnps LCPI3_0, %xmm0, %xmm0
 ; X86-NEXT: vmovss %xmm0, (%eax)
-; X86-NEXT: vzeroupper
 ; X86-NEXT: retl
 ;
 ; X64-LABEL: test4:
 ; X64: ## %bb.0:
-; X64-NEXT: vmovaps (%rdi), %ymm0
-; X64-NEXT: vandnps {{.*}}(%rip), %ymm0, %ymm0
+; X64-NEXT: vmovaps (%rdi), %xmm0
+; X64-NEXT: vandnps {{.*}}(%rip), %xmm0, %xmm0
 ; X64-NEXT: vmovss %xmm0, (%rsi)
-; X64-NEXT: vzeroupper
 ; X64-NEXT: retq
   %tmp1 = bitcast float* %A to <8 x float>*
   %tmp2 = load <8 x float>, <8 x float>* %tmp1, align 32
   %tmp3 = bitcast <8 x float> %tmp2 to <8 x i32>
   %tmp4 = xor <8 x i32> %tmp3, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
   %tmp5 = and <8 x i32> %tmp4, <i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647>
   %tmp6 = bitcast <8 x i32> %tmp5 to <8 x float>
   %tmp7 = extractelement <8 x float> %tmp6, i32 0
   store float %tmp7, float * %C
   ret void
 }

test/CodeGen/X86/i64-mem-copy.ll

 ; X32-NEXT: movl %ebp, %esp
 ; X32-NEXT: popl %ebp
 ; X32-NEXT: .cfi_def_cfa %esp, 4
 ; X32-NEXT: retl
 ;
 ; X32AVX-LABEL: store_i64_from_vector256:
 ; X32AVX: # %bb.0:
 ; X32AVX-NEXT: movl {{[0-9]+}}(%esp), %eax
-; X32AVX-NEXT: vpaddw %ymm1, %ymm0, %ymm0
+; X32AVX-NEXT: vextracti128 $1, %ymm1, %xmm1
 ; X32AVX-NEXT: vextracti128 $1, %ymm0, %xmm0
+; X32AVX-NEXT: vpaddw %xmm1, %xmm0, %xmm0
 ; X32AVX-NEXT: vmovq %xmm0, (%eax)
 ; X32AVX-NEXT: vzeroupper
 ; X32AVX-NEXT: retl
   %z = add <16 x i16> %x, %y ; force execution domain
   %bc = bitcast <16 x i16> %z to <4 x i64>
   %vecext = extractelement <4 x i64> %bc, i32 2
   store i64 %vecext, i64* %i, align 8
   ret void
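
The two instruction sequences in this test compute the same 64-bit value, which a small scalar model in C++ can confirm. This is only a sketch of the IR semantics, not of the DAG combine itself; `wideForm` and `narrowForm` are hypothetical names, and `memcpy` stands in for the bitcast.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using V16i16 = std::array<std::uint16_t, 16>;
using V8i16 = std::array<std::uint16_t, 8>;

// Wide form: add <16 x i16>, bitcast to <4 x i64>, extractelement 2.
std::uint64_t wideForm(const V16i16 &X, const V16i16 &Y) {
  V16i16 Z{};
  for (std::size_t I = 0; I < 16; ++I)
    Z[I] = static_cast<std::uint16_t>(X[I] + Y[I]);
  std::array<std::uint64_t, 4> BC{};
  std::memcpy(BC.data(), Z.data(), sizeof(BC)); // models the bitcast
  return BC[2];
}

// Narrow form: take the upper 128 bits of each operand first, add only
// those 8 lanes, then read the low 64 bits of the narrow result.
std::uint64_t narrowForm(const V16i16 &X, const V16i16 &Y) {
  V8i16 Z{};
  for (std::size_t I = 0; I < 8; ++I)
    Z[I] = static_cast<std::uint16_t>(X[8 + I] + Y[8 + I]);
  std::uint64_t Lo = 0;
  std::memcpy(&Lo, Z.data(), sizeof(Lo));
  return Lo;
}
```

Both functions read the same bytes of the sum (lanes 8 through 11), so they agree for any inputs. The narrow form costs one more extract but never issues a 256-bit add.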

test/CodeGen/X86/pr36199.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512f | FileCheck %s

 define void @foo(<16 x float> %x) {
 ; CHECK-LABEL: foo:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: vaddps %zmm0, %zmm0, %zmm0
+; CHECK-NEXT: vaddps %xmm0, %xmm0, %xmm0
 ; CHECK-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
 ; CHECK-NEXT: vinsertf64x4 $1, %ymm0, %zmm0, %zmm0
 ; CHECK-NEXT: vmovups %zmm0, (%rax)
 ; CHECK-NEXT: vzeroupper
 ; CHECK-NEXT: retq
   %1 = fadd <16 x float> %x, %x
   %bc256 = bitcast <16 x float> %1 to <4 x i128>
   %2 = extractelement <4 x i128> %bc256, i32 0
   %3 = bitcast i128 %2 to <4 x float>
   %4 = shufflevector <4 x float> %3, <4 x float> undef, <16 x i32> <i32 0, i32
   1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0,
   i32 1, i32 2, i32 3>
   store <16 x float> %4, <16 x float>* undef, align 4
   ret void
 }

test/CodeGen/X86/sad.ll

 ;
 ; AVX2-LABEL: sad_nonloop_32i8:
 ; AVX2: # %bb.0:
 ; AVX2-NEXT: vmovdqu (%rdi), %ymm0
 ; AVX2-NEXT: vpsadbw (%rdx), %ymm0, %ymm0
 ; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
 ; AVX2-NEXT: vpaddq %ymm1, %ymm0, %ymm0
 ; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
-; AVX2-NEXT: vpaddq %ymm1, %ymm0, %ymm0
+; AVX2-NEXT: vpaddq %xmm1, %xmm0, %xmm0
 ; AVX2-NEXT: vmovd %xmm0, %eax
 ; AVX2-NEXT: vzeroupper
 ; AVX2-NEXT: retq
 ;
 ; AVX512-LABEL: sad_nonloop_32i8:
 ; AVX512: # %bb.0:
 ; AVX512-NEXT: vmovdqu (%rdi), %ymm0
 ; AVX512-NEXT: vpsadbw (%rdx), %ymm0, %ymm0
 ; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1
 ; AVX512-NEXT: vpaddq %ymm1, %ymm0, %ymm0
 ; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
-; AVX512-NEXT: vpaddq %ymm1, %ymm0, %ymm0
+; AVX512-NEXT: vpaddq %xmm1, %xmm0, %xmm0
 ; AVX512-NEXT: vmovd %xmm0, %eax
 ; AVX512-NEXT: vzeroupper
 ; AVX512-NEXT: retq
   %v1 = load <32 x i8>, <32 x i8>* %p, align 1
   %z1 = zext <32 x i8> %v1 to <32 x i32>
   %v2 = load <32 x i8>, <32 x i8>* %q, align 1
   %z2 = zext <32 x i8> %v2 to <32 x i32>
   %sub = sub nsw <32 x i32> %z1, %z2
