This is an archive of the discontinued LLVM Phabricator instance.


Differential D104836

[PowerPC] Combine 64-bit bswap(load) without LDBRX
ClosedPublic

Authored by nemanjai on Jun 23 2021, 8:43 PM.


Details

Reviewers

spatel
nathanchance

Group Reviewers

Restricted Project

Commits

rG0464586ac515: [PowerPC] Combine 64-bit bswap(load) without LDBRX

Summary

When targeting CPUs that don't have LDBRX, we end up producing code that is very inefficient and large for this common idiom. This patch optimizes it to two 32-bit LWBRX instructions along with a merge.

This fixes https://bugs.llvm.org/show_bug.cgi?id=49610
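As an illustration of the idiom in the summary, here is a minimal, self-contained C++ sketch (not code from the patch). It models a byte-swapped 64-bit load and the same result built from two byte-swapped 32-bit loads plus a merge. All function names are hypothetical, and the order in which the halves are merged depends on the byte order of the machine, which the sketch probes at run time.

```cpp
#include <cstdint>
#include <cstring>

// Portable 32-bit byte swap (stands in for what lwbrx does to a loaded word).
static uint32_t bswap32(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
         ((v << 8) & 0x00FF0000u) | (v << 24);
}

// 64-bit byte swap expressed with two 32-bit swaps and a half exchange.
static uint64_t bswap64(uint64_t v) {
  return (uint64_t(bswap32(uint32_t(v))) << 32) | bswap32(uint32_t(v >> 32));
}

// Hypothetical helper: true when the lowest-addressed byte is the least
// significant one.
static bool hostIsLittleEndian() {
  const uint16_t probe = 1;
  unsigned char first;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// The idiom: one 64-bit load followed by a byte swap.
uint64_t bswapOfWideLoad(const unsigned char *p) {
  uint64_t v;
  std::memcpy(&v, p, 8); // memcpy keeps the access valid at any alignment
  return bswap64(v);
}

// The split form: two 32-bit loads, each byte-swapped, then merged.
uint64_t bswapViaTwoNarrowLoads(const unsigned char *p) {
  uint32_t first, second;
  std::memcpy(&first, p, 4);      // word at offset 0
  std::memcpy(&second, p + 4, 4); // word at offset 4
  const uint64_t a = bswap32(first);
  const uint64_t b = bswap32(second);
  // Which swapped word becomes the upper half depends on byte order.
  return hostIsLittleEndian() ? (a << 32) | b : (b << 32) | a;
}
```

Both functions should agree for any 8 readable bytes, including a pointer that is not 8-byte aligned.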

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

nemanjai created this revision. Jun 23 2021, 8:43 PM

Herald added subscribers: shchenz, kbarton, hiraditya. Jun 23 2021, 8:43 PM

nemanjai requested review of this revision. Jun 23 2021, 8:43 PM

Herald added a project: Restricted Project. Jun 23 2021, 8:43 PM

Harbormaster completed remote builds in B110757: Diff 354145. Jun 23 2021, 9:18 PM

spatel added inline comments. Jun 24 2021, 5:31 AM

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15245	If we honor the clang-tidy warning (no else after return), it becomes more obvious that we are repeating conditions in both clauses. It would be cleaner to hoist those into a common 'if' clause, and probably better still to make it an early exit for the inverted clause (possibly adding a helper function too):
    if (!Is64BitBswapOn64BitTarget || !IsSingleUseNonExtLd)
      return SDValue();
15246–15247	Do we need to check for 'volatile' or other modifiers? Are we ok with misaligned accesses? Should add regression tests either way.
llvm/test/CodeGen/PowerPC/ld-bswap64-no-ldbrx.ll
4	It would be better to include tests for the minimal patterns - just a load or just a store. We are not handling the store case yet, so that might be a "TODO" item (or doesn't matter if we don't need to worry about the pre-stdbrx targets):
    define void @split_store(i64 %x, i64* %p) {
      %b = call i64 @llvm.bswap.i64(i64 %x)
      store i64 %b, i64* %p, align 8
      ret void
    }
    define i64 @split_load(i64* %p) {
      %x = load i64, i64* %p, align 8
      %b = call i64 @llvm.bswap.i64(i64 %x)
      ret i64 %b
    }

nemanjai added inline comments. Jun 24 2021, 6:07 AM

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15245	Oh, I probably should have noticed that the if clause has a return. I'll update it. Thanks!
15246–15247	There are no alignment requirements for the byte-reversed loads/stores. I'll add a test.
llvm/test/CodeGen/PowerPC/ld-bswap64-no-ldbrx.ll
4	Yes, the store combine is definitely a TODO :) I'll update the tests, thank you.

Fix up clang-tidy warnings and add some more regression tests.

Harbormaster completed remote builds in B110818: Diff 354237. Jun 24 2021, 7:12 AM

LangRef says "the backend should never split or merge target-legal volatile load/store instructions":
https://llvm.org/docs/LangRef.html#volatile-memory-accesses

I haven't looked at the use cases in detail, but the target does support 64-bit loads via plain ld, so we shouldn't do the transform?

In D104836#2838555, @spatel wrote:

LangRef says "the backend should never split or merge target-legal volatile load/store instructions":
https://llvm.org/docs/LangRef.html#volatile-memory-accesses

I haven't looked at the use cases in detail, but the target does support 64-bit loads via plain ld, so we shouldn't do the transform?

Hmm... clearly this requirement cannot possibly be met for volatile operations that are wider than the available load for the target. Of course, that is not the case here. So I suppose it is possible for a volatile load to load the two halves of the value before and after a store to the same memory from another thread.
I'll add the volatile check and bail on the combine. Thanks for bringing this up.
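The tearing concern above can be made concrete with a deterministic, single-threaded model. This is only a sketch: the struct and function names are hypothetical, and the store from "another thread" is simulated by an ordinary memcpy placed between the two narrow loads.

```cpp
#include <cstdint>
#include <cstring>

// The two halves observed by a 64-bit read that was split into 32-bit reads.
struct TornRead {
  uint32_t first;  // word read at offset 0
  uint32_t second; // word read at offset 4
};

// Models a split load where a full 64-bit store lands between the two halves.
TornRead splitReadWithInterveningStore(unsigned char *mem, uint64_t newValue) {
  TornRead r;
  std::memcpy(&r.first, mem, 4);      // first narrow load sees the old value
  std::memcpy(mem, &newValue, 8);     // simulated store from another thread
  std::memcpy(&r.second, mem + 4, 4); // second narrow load sees the new value
  return r;
}
```

With memory initially all zero and a stored value of all ones, the first half reads as zero and the second as all ones, a combination that a single 64-bit load would not observe in this model.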

Do not combine volatile loads.

See inline for a couple of small adjustments, otherwise LGTM.
I'm not up-to-speed on current PPC targets though, so might want to wait for a 2nd opinion.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15254	I forgot to mention atomics (add one more test...). I think we want to use "LD->isSimple()" to be safe.
llvm/test/CodeGen/PowerPC/ld-bswap64-no-ldbrx.ll
20	IIUC, we are ok doing the transform with any alignment, so this test isn't capturing that. Please add a misaligned-only test or adjust the first test to have "align 1" so it can verify that we perform the fold independent of alignment.

This revision is now accepted and ready to land. Jun 24 2021, 11:20 AM

I can confirm that this reduces the stack size of the function in the Linux kernel that prompted the report:

Before:

arch/powerpc/kvm/book3s_hv_nested.c:289:6: warning: stack frame size (2048) exceeds limit (768) in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
     ^
1 warning generated.

After:

arch/powerpc/kvm/book3s_hv_nested.c:289:6: warning: stack frame size (1856) exceeds limit (768) in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than]
long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
     ^
1 warning generated.

nickdesaulniers removed a reviewer: nickdesaulniers. Jun 24 2021, 11:48 AM

nickdesaulniers added a subscriber: nickdesaulniers.

Harbormaster completed remote builds in B110864: Diff 354304. Jun 24 2021, 11:51 AM

Closed by commit rG0464586ac515: [PowerPC] Combine 64-bit bswap(load) without LDBRX (authored by nemanjai). Jun 24 2021, 1:12 PM

This revision was automatically updated to reflect the committed changes.

nemanjai added a commit: rG0464586ac515: [PowerPC] Combine 64-bit bswap(load) without LDBRX.

Herald added a subscriber: jfb. Jun 24 2021, 1:12 PM

Revision Contents

Path	Size
llvm/lib/Target/PowerPC/PPCISelLowering.cpp	41 lines
llvm/test/CodeGen/PowerPC/bswap-load-store.ll	20 lines
llvm/test/CodeGen/PowerPC/ld-bswap64-no-ldbrx.ll	54 lines

Diff 354340

llvm/lib/Target/PowerPC/PPCISelLowering.cpp


(15,196 unchanged lines not shown; enclosing context: if (Subtarget.needsSwapsForVSXMemOps()) {)
       default:
         break;
       case Intrinsic::ppc_vsx_stxvw4x:
       case Intrinsic::ppc_vsx_stxvd2x:
         return expandVSXStoreForLE(N, DCI);
       }
     }
     break;
-  case ISD::BSWAP:
+  case ISD::BSWAP: {
     // Turn BSWAP (LOAD) -> lhbrx/lwbrx.
-    if (ISD::isNON_EXTLoad(N->getOperand(0).getNode()) &&
-        N->getOperand(0).hasOneUse() &&
+    // For subtargets without LDBRX, we can still do better than the default
+    // expansion even for 64-bit BSWAP (LOAD).
+    bool Is64BitBswapOn64BitTgt =
+        Subtarget.isPPC64() && N->getValueType(0) == MVT::i64;
+    bool IsSingleUseNormalLd = ISD::isNormalLoad(N->getOperand(0).getNode()) &&
+                               N->getOperand(0).hasOneUse();
+    if (IsSingleUseNormalLd &&
         (N->getValueType(0) == MVT::i32 || N->getValueType(0) == MVT::i16 ||
-         (Subtarget.hasLDBRX() && Subtarget.isPPC64() &&
-          N->getValueType(0) == MVT::i64))) {
+         (Subtarget.hasLDBRX() && Is64BitBswapOn64BitTgt))) {
       SDValue Load = N->getOperand(0);
       LoadSDNode *LD = cast<LoadSDNode>(Load);
       // Create the byte-swapping load.
       SDValue Ops[] = {
         LD->getChain(), // Chain
         LD->getBasePtr(), // Ptr
         DAG.getValueType(N->getValueType(0)) // VT
       };
(13 unchanged lines not shown; enclosing context: if (IsSingleUseNormalLd &&)
       DCI.CombineTo(N, ResVal);

       // Next, combine the load away, we give it a bogus result value but a real
       // chain result. The result value is dead because the bswap is dead.
       DCI.CombineTo(Load.getNode(), ResVal, BSLoad.getValue(1));

       // Return N so it doesn't get rechecked!
       return SDValue(N, 0);
     }
-    break;
+    // Convert this to two 32-bit bswap loads and a BUILD_PAIR. Do this only
+    // before legalization so that the BUILD_PAIR is handled correctly.
+    if (!DCI.isBeforeLegalize() || !Is64BitBswapOn64BitTgt ||
+        !IsSingleUseNormalLd)
+      return SDValue();
+    LoadSDNode *LD = cast<LoadSDNode>(N->getOperand(0));
+
+    // Can't split volatile or atomic loads.
+    if (!LD->isSimple())
+      return SDValue();
+    SDValue BasePtr = LD->getBasePtr();
+    SDValue Lo = DAG.getLoad(MVT::i32, dl, LD->getChain(), BasePtr,
+                             LD->getPointerInfo(), LD->getAlignment());
+    Lo = DAG.getNode(ISD::BSWAP, dl, MVT::i32, Lo);
+    BasePtr = DAG.getNode(ISD::ADD, dl, BasePtr.getValueType(), BasePtr,
+                          DAG.getIntPtrConstant(4, dl));
+    SDValue Hi = DAG.getLoad(MVT::i32, dl, LD->getChain(), BasePtr,
+                             LD->getPointerInfo(), LD->getAlignment());
+    Hi = DAG.getNode(ISD::BSWAP, dl, MVT::i32, Hi);
+    SDValue Res = DAG.getNode(ISD::BUILD_PAIR, dl, MVT::i64, Hi, Lo);
+    SDValue TF =
+        DAG.getNode(ISD::TokenFactor, dl, MVT::Other,
+                    Hi.getOperand(0).getValue(1), Lo.getOperand(0).getValue(1));
+    DAG.ReplaceAllUsesOfValueWith(SDValue(LD, 1), TF);
+    return Res;
+  }
   case PPCISD::VCMP:
     // If a VCMP_rec node already exists with exactly the same operands as this
     // node, use its result instead of this node (VCMP_rec computes both a CR6
     // and a normal output).
     //
     if (!N->getOperand(0).hasOneUse() &&
         !N->getOperand(1).hasOneUse() &&
         !N->getOperand(2).hasOneUse()) {
(2,154 unchanged lines not shown)

llvm/test/CodeGen/PowerPC/bswap-load-store.ll

(95 unchanged lines not shown)
 ; PWR7_64-NEXT: blr
   %tmp1 = getelementptr i8, i8* %ptr, i32 %off
   %tmp1.upgrd.4 = bitcast i8* %tmp1 to i16*
   %tmp = load i16, i16* %tmp1.upgrd.4
   %tmp6 = call i16 @llvm.bswap.i16( i16 %tmp )
   ret i16 %tmp6
 }

+; TODO: combine the bswap feeding a store on subtargets
+; that do not have an STDBRX.
 define void @STDBRX(i64 %i, i8* %ptr, i64 %off) {
 ; PWR7_32-LABEL: STDBRX:
 ; PWR7_32: # %bb.0:
 ; PWR7_32-NEXT: li r6, 4
 ; PWR7_32-NEXT: add r7, r5, r8
 ; PWR7_32-NEXT: stwbrx r4, r5, r8
 ; PWR7_32-NEXT: stwbrx r3, r7, r6
 ; PWR7_32-NEXT: blr
(32 unchanged lines not shown)
 ; PWR7_32-NEXT: li r5, 4
 ; PWR7_32-NEXT: add r7, r3, r6
 ; PWR7_32-NEXT: lwbrx r4, r3, r6
 ; PWR7_32-NEXT: lwbrx r3, r7, r5
 ; PWR7_32-NEXT: blr
 ;
 ; X64-LABEL: LDBRX:
 ; X64: # %bb.0:
-; X64-NEXT: ldx r4, r3, r4
-; X64-NEXT: rotldi r5, r4, 16
-; X64-NEXT: rotldi r3, r4, 8
-; X64-NEXT: rldimi r3, r5, 8, 48
-; X64-NEXT: rotldi r5, r4, 24
-; X64-NEXT: rldimi r3, r5, 16, 40
-; X64-NEXT: rotldi r5, r4, 32
-; X64-NEXT: rldimi r3, r5, 24, 32
-; X64-NEXT: rotldi r5, r4, 48
-; X64-NEXT: rldimi r3, r5, 40, 16
-; X64-NEXT: rotldi r5, r4, 56
-; X64-NEXT: rldimi r3, r5, 48, 8
-; X64-NEXT: rldimi r3, r4, 56, 0
+; X64-NEXT: li r5, 4
+; X64-NEXT: lwbrx r6, r3, r4
+; X64-NEXT: add r3, r3, r4
+; X64-NEXT: lwbrx r3, r3, r5
+; X64-NEXT: rldimi r3, r6, 32, 0
 ; X64-NEXT: blr
 ;
 ; PWR7_64-LABEL: LDBRX:
 ; PWR7_64: # %bb.0:
 ; PWR7_64-NEXT: ldbrx r3, r3, r4
 ; PWR7_64-NEXT: blr
   %tmp1 = getelementptr i8, i8* %ptr, i64 %off
   %tmp1.upgrd.2 = bitcast i8* %tmp1 to i64*
   %tmp = load i64, i64* %tmp1.upgrd.2
   %tmp14 = tail call i64 @llvm.bswap.i64( i64 %tmp )
   ret i64 %tmp14
 }

 declare i16 @llvm.bswap.i16(i16)
 declare i32 @llvm.bswap.i32(i32)
 declare i64 @llvm.bswap.i64(i64)

llvm/test/CodeGen/PowerPC/ld-bswap64-no-ldbrx.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=powerpc64-- -mcpu=pwr5 -verify-machineinstrs < %s | \
				; RUN: FileCheck %s
				define void @bs(i64* %p) {
				; CHECK-LABEL: bs:
				; CHECK: # %bb.0:
				; CHECK-NEXT: li 4, 4
				; CHECK-NEXT: lwbrx 5, 0, 3
				; CHECK-NEXT: lwbrx 4, 3, 4
				; CHECK-NEXT: rldimi 4, 5, 32, 0
				; CHECK-NEXT: std 4, 0(3)
				; CHECK-NEXT: blr
				%x = load i64, i64* %p, align 8
				%b = call i64 @llvm.bswap.i64(i64 %x)
				store i64 %b, i64* %p, align 8
				ret void
				}

				define i64 @volatile_ld(i64* %p) {
				; CHECK-LABEL: volatile_ld:
				; CHECK: # %bb.0:
				; CHECK-NEXT: ld 4, 0(3)
				; CHECK-NEXT: rotldi 5, 4, 16
				; CHECK-NEXT: rotldi 3, 4, 8
				; CHECK-NEXT: rldimi 3, 5, 8, 48
				; CHECK-NEXT: rotldi 5, 4, 24
				; CHECK-NEXT: rldimi 3, 5, 16, 40
				; CHECK-NEXT: rotldi 5, 4, 32
				; CHECK-NEXT: rldimi 3, 5, 24, 32
				; CHECK-NEXT: rotldi 5, 4, 48
				; CHECK-NEXT: rldimi 3, 5, 40, 16
				; CHECK-NEXT: rotldi 5, 4, 56
				; CHECK-NEXT: rldimi 3, 5, 48, 8
				; CHECK-NEXT: rldimi 3, 4, 56, 0
				; CHECK-NEXT: blr
				%x = load volatile i64, i64* %p, align 8
				%b = call i64 @llvm.bswap.i64(i64 %x)
				ret i64 %b
				}

				define i64 @misaligned_ld(i64* %p) {
				; CHECK-LABEL: misaligned_ld:
				; CHECK: # %bb.0:
				; CHECK-NEXT: li 4, 4
				; CHECK-NEXT: lwbrx 5, 0, 3
				; CHECK-NEXT: lwbrx 3, 3, 4
				; CHECK-NEXT: rldimi 3, 5, 32, 0
				; CHECK-NEXT: blr
				%x = load i64, i64* %p, align 1
				%b = call i64 @llvm.bswap.i64(i64 %x)
				ret i64 %b
				}

				declare i64 @llvm.bswap.i64(i64) #2