This is an archive of the discontinued LLVM Phabricator instance.

[x86] enable storeOfVectorConstantIsCheap() target hook
Closed · Public

Authored by spatel on Sep 4 2017, 4:30 PM.


Details

Reviewers

craig.topper
zvi
RKSimon

Commits

rG65d67807039d: [x86] enable storeOfVectorConstantIsCheap() target hook
rL313458: [x86] enable storeOfVectorConstantIsCheap() target hook

Summary

This allows vector-sized store merging of constants in DAGCombiner using the existing code in MergeConsecutiveStores(). All of the twisted logic that decides exactly what vector operations are legal and fast for each particular CPU is handled separately in there using the appropriate hooks.

Some notes:

For the motivating tests in merge-store-constants.ll, we already produce the same vector code in IR via the SLP vectorizer. So this is just providing a backend backstop for code that doesn't go through that pass (-O1). More details in PR24449:

https://bugs.llvm.org/show_bug.cgi?id=24449 (this change should be the last step to resolve that bug)

At the minimum vector size limit (16 bytes), we're trading two 8-byte scalar immediate stores for one 16-byte constant pool load and one 16-byte store (e.g., fold-vector-sext-crash2.ll::test_sext1). I think that's a reasonable trade-off because offloading any work to vector registers should ease pressure on register-starved scalar code, but let me know if there are other considerations. We could adjust this in the hook by returning true only for >2*max scalar size, so we know there would be an instruction reduction.

There's a likely regression in vector-sext-crash2.ll::test_zext1 and mod128.ll where we materialize a constant in scalar and then send it over to the vector unit. I know we have some bug reports related to that. A quick scan turned up:

https://bugs.llvm.org/show_bug.cgi?id=26301
...but there are probably others.
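
For context, the kind of source that leads to the merge-store-constants.ll pattern can be sketched in C++ as below. This is an assumption for illustration only: the review does not show the original source, and the function name is simply borrowed from the test. Compiled without the SLP vectorizer, code like this would typically reach the backend as four adjacent scalar constant stores, which is the case MergeConsecutiveStores() is meant to catch.

```cpp
#include <cassert>

// Hypothetical source for the 16-byte test case: four adjacent 32-bit
// constant stores at offsets 0, 4, 8 and 12 from the pointer argument.
void big_nonzero_16_bytes(int *a) {
  a[0] = 1;
  a[1] = 2;
  a[2] = 3;
  a[3] = 4;
}

// Small helper (not part of the patch) that confirms the stores land
// where the test's CHECK lines expect them.
void check_big_nonzero_16_bytes() {
  int buf[4] = {0, 0, 0, 0};
  big_nonzero_16_bytes(buf);
  assert(buf[0] == 1 && buf[1] == 2 && buf[2] == 3 && buf[3] == 4);
}
```

Whether the backend emits four `movl` instructions or one vector store for this function is exactly what the new hook influences.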

Diff Detail

Event Timeline

spatel created this revision. Sep 4 2017, 4:30 PM

Herald added a subscriber: mcrosier. Sep 4 2017, 4:30 PM

Intuitively, it seems to me that choosing a minimum threshold, as suggested in note 2, is a better option.
Another concern for store-merging in general: are we more susceptible to losing store-to-load forwarding? I know that Intel pre-Nehalem processors had some limitations that were later improved. Sorry, I can't recall the full details off the top of my head. Will look later at the Optimization Manual for the info.

In D37451#863494, @zvi wrote:

Intuitively, it seems to me that choosing a minimum threshold, as suggested in note 2, is a better option.

Yes, that should be a clearer win. It should also sidestep the scalar imm -> move to vector -> store regressions we see here.

Another concern for store-merging in general: are we more susceptible to losing store-to-load forwarding? I know that Intel pre-Nehalem processors had some limitations that were later improved. Sorry, I can't recall the full details off the top of my head. Will look later at the Optimization Manual for the info.

I can see that concern in general, but we're only dealing with constant stores in this patch, so I would hope there's no problem rematerializing constants.

Patch updated:
Don't enable vector store of constants unless we can replace more than 2 scalar stores. This avoids the borderline cases and sidesteps the regressions in the earlier rev.
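
The threshold described here can be checked with a small standalone model of the hook as it appears in this revision of the diff. This is a sketch under stated assumptions: plain integers stand in for LLVM's `EVT` and `Subtarget`, and the name `storeOfVectorConstantIsCheapModel` is made up for illustration.

```cpp
// Simplified stand-in for the hook in this revision. ElemBits plays the
// role of MemVT.getSizeInBits(); Is64Bit plays the role of
// Subtarget.is64Bit(). Both parameter names are hypothetical.
constexpr bool storeOfVectorConstantIsCheapModel(unsigned ElemBits,
                                                 unsigned NumElts,
                                                 bool Is64Bit) {
  // The merged store must be wider than two of the largest scalar stores.
  return ElemBits * NumElts > 2u * (Is64Bit ? 64u : 32u);
}

// 4 x i32 (16 bytes) on 32-bit x86: 128 > 64, so the merge is allowed.
static_assert(storeOfVectorConstantIsCheapModel(32, 4, false), "16B on x86-32");
// The same 16 bytes on x86-64: 128 > 128 is false, so the merge is rejected.
static_assert(!storeOfVectorConstantIsCheapModel(32, 4, true), "16B on x86-64");
// 8 x i32 (32 bytes) on x86-64: 256 > 128, so the merge is allowed.
static_assert(storeOfVectorConstantIsCheapModel(32, 8, true), "32B on x86-64");
```

These three cases line up with the updated checks in merge-store-constants.ll: big_nonzero_16_bytes becomes a vector store only for X32, while big_nonzero_32_bytes_splat becomes one for both X32 and X64.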

big_nonzero_16_bytes() for x86-64 shows a different merging problem. We can't directly store 64-bit immediates, so we have to materialize the constants in registers and then store as a separate instruction. The test is trying to store four 32-bit imms, so we probably shouldn't have done any merging there?

In D37451#863708, @spatel wrote:

big_nonzero_16_bytes() for x86-64 shows a different merging problem. We can't directly store 64-bit immediates, so we have to materialize the constants in registers and then store as a separate instruction. The test is trying to store four 32-bit imms, so we probably shouldn't have done any merging there?

On 2nd thought, that doesn't make sense. We probably should merge that case, but the hook doesn't allow us to distinguish constant values. So we don't know when an immediate can be placed directly in the store or not.

I think it's ok to only handle the larger cases in this patch and make that potential enhancement a follow-up. Fixing that might also solve the scalar imm to vector to memory bug.
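
To make the immediate problem concrete, here is a minimal sketch of the distinction the hook cannot currently see. It relies on the fact that an x86-64 `mov` to memory accepts at most a 32-bit immediate that is sign-extended to 64 bits; the helper names `packPair` and `fitsInSignExtendedImm32` are hypothetical and are not LLVM APIs.

```cpp
#include <cstdint>

// Combine two adjacent 32-bit constants into the 64-bit value that a
// merged little-endian store would write (Lo is at the lower address).
constexpr std::uint64_t packPair(std::uint32_t Lo, std::uint32_t Hi) {
  return (static_cast<std::uint64_t>(Hi) << 32) | Lo;
}

// True if V can be encoded as a sign-extended 32-bit immediate, i.e. it
// lies in [0, 2^31 - 1] or in the top 2^31 values of the 64-bit range.
constexpr bool fitsInSignExtendedImm32(std::uint64_t V) {
  return V <= 0x7FFFFFFFull || V >= 0xFFFFFFFF80000000ull;
}

// The pairs from big_nonzero_16_bytes give the movabsq constants in the test.
static_assert(packPair(1, 2) == 0x200000001ull, "matches movabsq $8589934593");
static_assert(packPair(3, 4) == 0x400000003ull, "matches movabsq $17179869187");
// Neither fits an imm32, so each needs a register plus a separate store.
static_assert(!fitsInSignExtendedImm32(packPair(1, 2)), "needs movabsq");
// A small constant such as 5 can be stored directly (movq $5, 32(%rdi)).
static_assert(fitsInSignExtendedImm32(5), "direct immediate store");
```

A follow-up that passed the constant values to the hook could, in principle, use a check along these lines to decide when a scalar merge is free and when a vector store is the better choice.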

Patch updated:
On 3rd thought... :)
We can simplify the code and handle the previously mentioned test on 64-bit too. I've added another test to show the potential follow-up improvement.

Ping.

LGTM

This revision is now accepted and ready to land. Sep 15 2017, 9:39 AM

Closed by commit rL313458: [x86] enable storeOfVectorConstantIsCheap() target hook (authored by spatel). Sep 16 2017, 6:30 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                         Size
lib/Target/X86/X86ISelLowering.h             3 lines
lib/Target/X86/X86ISelLowering.cpp           12 lines
test/CodeGen/X86/avx512-regcall-Mask.ll      43 lines
test/CodeGen/X86/merge-store-constants.ll    49 lines

Diff 114224

lib/Target/X86/X86ISelLowering.h

[1,034 unchanged lines hidden; context: public:]

   bool convertSelectOfConstantsToMath(EVT VT) const override;

   /// Return true if EXTRACT_SUBVECTOR is cheap for this result type
   /// with this index.
   bool isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
                                unsigned Index) const override;

+  bool storeOfVectorConstantIsCheap(EVT MemVT, unsigned NumElem,
+                                    unsigned AddrSpace) const override;
+
   /// Intel processors have a unified instruction and data cache
   const char * getClearCacheBuiltinName() const override {
     return nullptr; // nothing to do, move along.
   }

   unsigned getRegisterByName(const char* RegName, EVT VT,
                              SelectionDAG &DAG) const override;

[454 unchanged lines hidden]

lib/Target/X86/X86ISelLowering.cpp

[4,597 unchanged lines hidden; context: bool X86TargetLowering::isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,]
   // extract half of vector.
   if (ResVT.getVectorElementType() == MVT::i1)
     return Index == 0 || ((ResVT.getSizeInBits() == SrcVT.getSizeInBits()*2) &&
                           (Index == ResVT.getVectorNumElements()));

   return (Index % ResVT.getVectorNumElements()) == 0;
 }

+bool X86TargetLowering::storeOfVectorConstantIsCheap(EVT MemVT,
+                                                     unsigned NumElts,
+                                                     unsigned) const {
+  // Don't replace two scalar stores of immediates with a vector constant load
+  // and a vector store. There should be a net reduction in instructions from
+  // using the wider operations, so make sure that the vector store is larger
+  // than 2 of the largest scalar stores.
+  unsigned VectorStoreSize = MemVT.getSizeInBits() * NumElts;
+  unsigned LargestScalarStoreSize = Subtarget.is64Bit() ? 64 : 32;
+  return VectorStoreSize > 2 * LargestScalarStoreSize;
+}
+
 bool X86TargetLowering::isCheapToSpeculateCttz() const {
   // Speculate cttz only if we can directly use TZCNT.
   return Subtarget.hasBMI();
 }

 bool X86TargetLowering::isCheapToSpeculateCtlz() const {
   // Speculate ctlz only if we can directly use LZCNT.
   return Subtarget.hasLZCNT();
[32,379 unchanged lines hidden]

test/CodeGen/X86/avx512-regcall-Mask.ll

[90 unchanged lines hidden; context: define x86_regcallcc i64 @test_argv64i1(<64 x i1> %x0, <64 x i1> %x1, <64 x i1> %x2,]
   %add9 = add i64 %add8, %y9
   %add10 = add i64 %add9, %y10
   %add11 = add i64 %add10, %y11
   %add12 = add i64 %add11, %y12
   ret i64 %add12
 }

 ; X32-LABEL: caller_argv64i1:
+; X32: pushl %edi
+; X32: subl $88, %esp
+; X32: vmovaps __xmm@00000001000000020000000100000002, %xmm0 # xmm0 = [2,1,2,1]
+; X32: vmovups %xmm0, 64(%esp)
+; X32: vmovaps LCPI1_1, %zmm0 # zmm0 = [2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1]
+; X32: vmovups %zmm0, (%esp)
+; X32: movl $1, 84(%esp)
+; X32: movl $2, 80(%esp)
 ; X32: movl $2, %eax
 ; X32: movl $1, %ecx
 ; X32: movl $2, %edx
 ; X32: movl $1, %edi
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: pushl ${{1|2}}
-; X32: call{{.*}} _test_argv64i1
+; X32: vzeroupper
+; X32: calll _test_argv64i1

 ; WIN64-LABEL: caller_argv64i1:
 ; WIN64: movabsq $4294967298, %rax
 ; WIN64: movq %rax, (%rsp)
 ; WIN64: movq %rax, %rcx
 ; WIN64: movq %rax, %rdx
 ; WIN64: movq %rax, %rdi
 ; WIN64: movq %rax, %rsi
 ; WIN64: movq %rax, %r8
[216 unchanged lines hidden]

test/CodeGen/X86/merge-store-constants.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=avx | FileCheck %s --check-prefix=X32
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx | FileCheck %s --check-prefix=X64

 define void @big_nonzero_16_bytes(i32* nocapture %a) {
 ; X32-LABEL: big_nonzero_16_bytes:
 ; X32: # BB#0:
 ; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
-; X32-NEXT: movl $1, (%eax)
-; X32-NEXT: movl $2, 4(%eax)
-; X32-NEXT: movl $3, 8(%eax)
-; X32-NEXT: movl $4, 12(%eax)
+; X32-NEXT: vmovaps {{.*#+}} xmm0 = [1,2,3,4]
+; X32-NEXT: vmovups %xmm0, (%eax)
 ; X32-NEXT: retl
 ;
 ; X64-LABEL: big_nonzero_16_bytes:
 ; X64: # BB#0:
 ; X64-NEXT: movabsq $8589934593, %rax # imm = 0x200000001
 ; X64-NEXT: movq %rax, (%rdi)
 ; X64-NEXT: movabsq $17179869187, %rax # imm = 0x400000003
 ; X64-NEXT: movq %rax, 8(%rdi)
[10 unchanged lines hidden]
 }

 ; Splats may be an opportunity to use a broadcast op.

 define void @big_nonzero_32_bytes_splat(i32* nocapture %a) {
 ; X32-LABEL: big_nonzero_32_bytes_splat:
 ; X32: # BB#0:
 ; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
-; X32-NEXT: movl $42, (%eax)
-; X32-NEXT: movl $42, 4(%eax)
-; X32-NEXT: movl $42, 8(%eax)
-; X32-NEXT: movl $42, 12(%eax)
-; X32-NEXT: movl $42, 16(%eax)
-; X32-NEXT: movl $42, 20(%eax)
-; X32-NEXT: movl $42, 24(%eax)
-; X32-NEXT: movl $42, 28(%eax)
+; X32-NEXT: vmovaps {{.*#+}} ymm0 = [42,42,42,42,42,42,42,42]
+; X32-NEXT: vmovups %ymm0, (%eax)
+; X32-NEXT: vzeroupper
 ; X32-NEXT: retl
 ;
 ; X64-LABEL: big_nonzero_32_bytes_splat:
 ; X64: # BB#0:
-; X64-NEXT: movabsq $180388626474, %rax # imm = 0x2A0000002A
-; X64-NEXT: movq %rax, (%rdi)
-; X64-NEXT: movq %rax, 8(%rdi)
-; X64-NEXT: movq %rax, 16(%rdi)
-; X64-NEXT: movq %rax, 24(%rdi)
+; X64-NEXT: vmovaps {{.*#+}} ymm0 = [42,42,42,42,42,42,42,42]
+; X64-NEXT: vmovups %ymm0, (%rdi)
+; X64-NEXT: vzeroupper
 ; X64-NEXT: retq
   %arrayidx1 = getelementptr inbounds i32, i32* %a, i64 1
   %arrayidx2 = getelementptr inbounds i32, i32* %a, i64 2
   %arrayidx3 = getelementptr inbounds i32, i32* %a, i64 3
   %arrayidx4 = getelementptr inbounds i32, i32* %a, i64 4
   %arrayidx5 = getelementptr inbounds i32, i32* %a, i64 5
   %arrayidx6 = getelementptr inbounds i32, i32* %a, i64 6
   %arrayidx7 = getelementptr inbounds i32, i32* %a, i64 7
[10 unchanged lines hidden]
 }

 ; Verify that we choose the best-sized store(s) for each chunk.

 define void @big_nonzero_63_bytes(i8* nocapture %a) {
 ; X32-LABEL: big_nonzero_63_bytes:
 ; X32: # BB#0:
 ; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
-; X32-NEXT: movl $0, 4(%eax)
-; X32-NEXT: movl $1, (%eax)
-; X32-NEXT: movl $0, 12(%eax)
-; X32-NEXT: movl $2, 8(%eax)
-; X32-NEXT: movl $0, 20(%eax)
-; X32-NEXT: movl $3, 16(%eax)
-; X32-NEXT: movl $0, 28(%eax)
-; X32-NEXT: movl $4, 24(%eax)
-; X32-NEXT: movl $0, 36(%eax)
-; X32-NEXT: movl $5, 32(%eax)
-; X32-NEXT: movl $0, 44(%eax)
-; X32-NEXT: movl $6, 40(%eax)
+; X32-NEXT: vmovaps {{.*#+}} ymm0 = [1,0,2,0,3,0,4,0]
+; X32-NEXT: vmovups %ymm0, (%eax)
+; X32-NEXT: vmovaps {{.*#+}} xmm0 = [5,0,6,0]
+; X32-NEXT: vmovups %xmm0, 32(%eax)
 ; X32-NEXT: movl $0, 52(%eax)
 ; X32-NEXT: movl $7, 48(%eax)
 ; X32-NEXT: movl $8, 56(%eax)
 ; X32-NEXT: movw $9, 60(%eax)
 ; X32-NEXT: movb $10, 62(%eax)
+; X32-NEXT: vzeroupper
 ; X32-NEXT: retl
 ;
 ; X64-LABEL: big_nonzero_63_bytes:
 ; X64: # BB#0:
-; X64-NEXT: movq $1, (%rdi)
-; X64-NEXT: movq $2, 8(%rdi)
-; X64-NEXT: movq $3, 16(%rdi)
-; X64-NEXT: movq $4, 24(%rdi)
+; X64-NEXT: vmovaps {{.*#+}} ymm0 = [1,2,3,4]
+; X64-NEXT: vmovups %ymm0, (%rdi)
 ; X64-NEXT: movq $5, 32(%rdi)
 ; X64-NEXT: movq $6, 40(%rdi)
 ; X64-NEXT: movq $7, 48(%rdi)
 ; X64-NEXT: movl $8, 56(%rdi)
 ; X64-NEXT: movw $9, 60(%rdi)
 ; X64-NEXT: movb $10, 62(%rdi)
+; X64-NEXT: vzeroupper
 ; X64-NEXT: retq
   %a8 = bitcast i8* %a to i64*
   %arrayidx8 = getelementptr inbounds i64, i64* %a8, i64 1
   %arrayidx16 = getelementptr inbounds i64, i64* %a8, i64 2
   %arrayidx24 = getelementptr inbounds i64, i64* %a8, i64 3
   %arrayidx32 = getelementptr inbounds i64, i64* %a8, i64 4
   %arrayidx40 = getelementptr inbounds i64, i64* %a8, i64 5
   %arrayidx48 = getelementptr inbounds i64, i64* %a8, i64 6
[19 unchanged lines hidden]