This is an archive of the discontinued LLVM Phabricator instance.

Differential D134168

[RISCV] Make preferred alignment of PointerArgs for MemIntrinsic
Needs Revision · Public

Authored by JojoR on Sep 19 2022, 1:11 AM.

Details

Reviewers

kito.cheng
craig.topper
arichardson

Summary

Set default preferred alignment for MemIntrinsic like memcpy according to arch32 or arch64,
it will improve performance.

e.g. dhrystone with "-O2" boosts performance by 50% on arch RV32.

Event Timeline

JojoR created this revision. Sep 19 2022, 1:11 AM

JojoR requested review of this revision. Sep 19 2022, 1:11 AM

Please use something like `git diff -U999999` to make context available (see https://llvm.org/docs/DeveloperPolicy.html#making-and-submitting-a-patch).

Harbormaster completed remote builds in B187452: Diff 461153. Sep 19 2022, 2:14 AM

Please upload patches using -U99999 as indicated here https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12987	`Align(Subtarget.getXLen() / 8)`

JojoR updated this revision to Diff 461443. Sep 19 2022, 6:47 PM

JojoR marked an inline comment as done.

Set default preferred alignment for MemIntrinsic like memcpy according to arch32 or arch64,
it will improve performance.

What are arch32 and arch64?

e.g. dhrystone with "-O2" boosts performance by 50% on arch RV32.

On what implementation? Does this affect actually-useful benchmarks, not just dhrystone? I would assume so, but it'd be more useful to get numbers for meaningful benchmarks rather than ones people should've long since abandoned.

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12986	This seems arbitrary. I'd expect at least XLEN / 8 here?

In D134168#3801742, @jrtc27 wrote:

Set default preferred alignment for MemIntrinsic like memcpy according to arch32 or arch64,
it will improve performance.

What are arch32 and arch64?

e.g. dhrystone with "-O2" boosts performance by 50% on arch RV32.

On what implementation? Does this affect actually-useful benchmarks, not just dhrystone? I would assume so, but it'd be more useful to get numbers for meaningful benchmarks rather than ones people should've long since abandoned.

I think using WORD_SIZE as the preferred alignment is a basic way to improve performance (it allows lw/sw instead of lb/sb), and dhrystone benefits considerably from this patch.
Some other useful benchmarks improve slightly (< 5%), so I did not list them here.
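
The lw/sw versus lb/sb point can be illustrated with a small host-side model. This is only a sketch and not the memcpy used on any RISC-V target: `copy_count_ops`, `is_word_aligned` and the 4-byte `Word` type are hypothetical names chosen here to mimic RV32, and the function counts load/store pairs rather than measuring performance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical model of an RV32-style copy routine: use 4-byte word copies
// (lw/sw) only when both pointers are word-aligned, otherwise fall back to
// byte copies (lb/sb). Returns the number of load/store pairs performed.
using Word = std::uint32_t;

inline bool is_word_aligned(const void *p) {
  return reinterpret_cast<std::uintptr_t>(p) % sizeof(Word) == 0;
}

inline std::size_t copy_count_ops(unsigned char *dst, const unsigned char *src,
                                  std::size_t n) {
  assert(dst != nullptr && src != nullptr);
  std::size_t ops = 0;
  if (is_word_aligned(dst) && is_word_aligned(src)) {
    // Word loop: one lw/sw pair per 4 bytes.
    for (; n >= sizeof(Word); n -= sizeof(Word)) {
      Word w;
      std::memcpy(&w, src, sizeof(Word)); // stands in for lw
      std::memcpy(dst, &w, sizeof(Word)); // stands in for sw
      src += sizeof(Word);
      dst += sizeof(Word);
      ++ops;
    }
  }
  // Byte loop: one lb/sb pair per remaining byte.
  for (; n > 0; --n) {
    *dst++ = *src++;
    ++ops;
  }
  return ops;
}
```

In this model, copying the 31-byte string from the test between word-aligned buffers takes 7 word copies plus 3 byte copies (10 pairs), while a source that is off by one byte falls back to 31 byte copies.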

In D134168#3801753, @JojoR wrote:

I think using WORD_SIZE as the preferred alignment is a basic way to improve performance (it allows lw/sw instead of lb/sb), and dhrystone benefits considerably from this patch.
Some other useful benchmarks improve slightly (< 5%), so I did not list them here.

I'm not saying it's a bad idea, I'm just saying that the 50% isn't a very useful figure to quote as it's unlikely to be representative of real-world performance, so some less awful benchmarks would be more meaningful and helpful to quote.

The description doesn't clearly describe the effect of the patch. My understanding from reading the user of this function is that the alignment of allocas and global variables used by memcpy is increased in CodeGenPrepare. This results in fewer memory operations. In the case of 32-bit dhrystone, it looks like we have an explicit call to the memcpy library function. I guess by aligning the pointers we allow the source and dest to both be word_size aligned so we can use the full word copy loop?

In my local testing for riscv64 this didn't seem to affect dhrystone performance.
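
The effect described above can be modelled outside LLVM. The following is a minimal sketch under stated assumptions: `MemObject`, `AlignHint`, `riscvAlignHint` and `maybeRaiseAlignment` are hypothetical names, not LLVM types or functions, and the real CodeGenPrepare logic has further conditions (for example on pointer offsets and on whether a global's alignment may be increased) that are left out here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an alloca or global variable seen by the pass.
struct MemObject {
  std::uint64_t sizeInBytes;
  std::uint64_t alignment; // power of two, in bytes
};

// The two values a target hook reports: a size threshold and a preferred
// alignment.
struct AlignHint {
  unsigned minSize;
  std::uint64_t prefAlign;
};

// Mirrors the patch under review: both values are derived from XLEN.
inline AlignHint riscvAlignHint(unsigned xlen) {
  assert(xlen == 32 || xlen == 64);
  return {xlen / 8, xlen / 8};
}

// Raise the alignment only if the object is under-aligned and at least
// minSize bytes large. Returns true if the alignment changed.
inline bool maybeRaiseAlignment(MemObject &obj, const AlignHint &hint) {
  if (obj.alignment < hint.prefAlign && obj.sizeInBytes >= hint.minSize) {
    obj.alignment = hint.prefAlign;
    return true;
  }
  return false;
}
```

With these assumptions, a 31-byte object at `align 1` ends up at alignment 4 for XLEN 32 and 8 for XLEN 64, which corresponds to the `.p2align 2` and `.p2align 3` directives checked in memcpy-align.ll.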

JojoR added inline comments. Sep 19 2022, 8:11 PM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12986	The caller already checks this condition and sets PrefAlign only when the original alignment is smaller :)

jrtc27 added inline comments. Sep 19 2022, 8:15 PM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12986	I don't understand this comment at all. This is about MinSize not PrefAlign, and CGP just uses this value as a threshold. Why 8? Why not XLEN / 8, or XLEN / 4 if the 8 was chosen to optimise for RV32. This needs justification.

JojoR updated this revision to Diff 461458. Sep 19 2022, 8:23 PM

JojoR marked an inline comment as done.

In D134168#3801800, @craig.topper wrote:

The description doesn't clearly describe the effect of the patch. My understanding from reading the user of this function is that the alignment of allocas and global variables used by memcpy is increased in CodeGenPrepare. This results in fewer memory operations. In the case of 32-bit dhrystone, it looks like we have an explicit call to the memcpy library function. I guess by aligning the pointers we allow the source and dest to both be word_size aligned so we can use the full word copy loop?

In my local testing for riscv64 this didn't seem to affect dhrystone performance.

Yes, my test is on the RV32.

craig.topper added inline comments. Sep 19 2022, 8:40 PM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
12986	Having the same value for rv32 and rv64 kind of makes sense from a certain perspective. If the object is smaller than MinSize, then you're saying you're OK with "object size in bytes" number of stores if the object is align 1. What you want to allow doesn't necessarily vary with XLEN.
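
The "object size in bytes" remark can be made concrete with a simplified store-count model. This is a sketch, not how LLVM lowers memcpy: `countStores` is a hypothetical helper that assumes power-of-two alignments, no misaligned or overlapping stores, and a widest store of min(alignment, XLEN / 8) bytes.

```cpp
#include <cassert>
#include <cstdint>

// Count the stores needed to write `size` bytes when the widest usable store
// is min(alignment, xlenBytes) and stores may be neither misaligned nor
// overlapping. Assumes alignment and xlenBytes are powers of two.
inline std::uint64_t countStores(std::uint64_t size, std::uint64_t alignment,
                                 std::uint64_t xlenBytes) {
  assert(alignment != 0 && xlenBytes != 0);
  std::uint64_t width = alignment < xlenBytes ? alignment : xlenBytes;
  std::uint64_t stores = 0;
  while (size > 0) {
    while (width > size)
      width /= 2; // step down to a narrower store for the tail
    stores += size / width;
    size %= width;
  }
  return stores;
}
```

Under this model an `align 1` object always costs one store per byte, regardless of XLEN: 7 bytes need 7 stores on both RV32 and RV64. Only once the alignment is raised does XLEN matter, for example 31 bytes need 9 stores at alignment 4 on RV32 and 6 stores at alignment 8 on RV64.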

Harbormaster completed remote builds in B187674: Diff 461458. Sep 19 2022, 9:40 PM

Any other suggestions?

@craig.topper @jrtc27

It looks like this hook is only used by codegenprepare, so the test should be in llvm/test/Transforms/CodeGenPrepare and check that the alignment of the IR variable was changed rather than indirectly testing it via assembly code. I'd also add a version with an alloca.

Additionally, I don't think this should be specific to RISC-V. I would assume that all architectures that don't have unaligned accesses (and even those with slower unaligned accesses) would benefit from aligning global string constants that are used in memcpy operations. I see that right now this hook is only used by ARM targets (D7908).

Can this be changed to call allowsMisalignedMemoryAccesses() and adjust the alignment to the size of pointers?
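
One way to read this suggestion is as a target-independent policy. The sketch below is hypothetical and is not the implementation that landed in D134282: `TargetInfo` and `shouldAlignPointerArgsModel` are illustrative names, the boolean field merely stands in for an allowsMisalignedMemoryAccesses()-style query, and using the pointer size for MinSize as well as PrefAlign is an assumption carried over from this patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a target; not an LLVM type.
struct TargetInfo {
  bool allowsFastMisalignedAccess; // stand-in for the target query
  unsigned pointerSizeInBytes;     // 4 on a 32-bit target, 8 on a 64-bit one
};

// Sketch of a generic policy: ask for pointer-size alignment only for memory
// intrinsics on targets where misaligned accesses are unavailable or slow.
// Returns true and fills in the out-parameters when alignment is requested.
inline bool shouldAlignPointerArgsModel(const TargetInfo &target,
                                        bool isMemIntrinsic, unsigned &minSize,
                                        std::uint64_t &prefAlign) {
  if (!isMemIntrinsic)
    return false;
  if (target.allowsFastMisalignedAccess)
    return false;
  assert(target.pointerSizeInBytes != 0);
  minSize = target.pointerSizeInBytes;   // assumption, mirrors this patch
  prefAlign = target.pointerSizeInBytes; // align to the size of pointers
  return true;
}
```

The out-parameters are left untouched when the function returns false, so a caller in this model must check the return value before using them.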

This revision now requires changes to proceed. Sep 20 2022, 3:32 AM

arichardson mentioned this in D134282: [CGP] Add generic TargetLowering::shouldAlignPointerArgs() implementation. Sep 20 2022, 6:23 AM

In D134168#3802424, @arichardson wrote:

It looks like this hook is only used by codegenprepare, so the test should be in llvm/test/Transforms/CodeGenPrepare and check that the alignment of the IR variable was changed rather than indirectly testing it via assembly code. I'd also add a version with an alloca.

Additionally, I don't think this should be specific to RISC-V. I would assume that all architectures that don't have unaligned accesses (and even those with slower unaligned accesses) would benefit from aligning global string constants that are used in memcpy operations. I see that right now this hook is only used by ARM targets (D7908).

Can this be changed to call allowsMisalignedMemoryAccesses() and adjust the alignment to the size of pointers?

I've made this generic in D134282 (depends on D134281).

arichardson mentioned this in rGbd87a2449da0: [CGP] Add generic TargetLowering::shouldAlignPointerArgs() implementation. Feb 9 2023, 2:14 AM

Revision Contents

Path                                           Size
llvm/lib/Target/RISCV/RISCVISelLowering.h      2 lines
llvm/lib/Target/RISCV/RISCVISelLowering.cpp    10 lines
llvm/test/CodeGen/RISCV/memcpy-align.ll        22 lines
llvm/test/CodeGen/RISCV/memcpy-inline.ll       61 lines

Diff 461458

llvm/lib/Target/RISCV/RISCVISelLowering.h

 public:

   /// If a physical register, this returns the register that receives the
   /// exception typeid on entry to a landing pad.
   Register
   getExceptionSelectorRegister(const Constant *PersonalityFn) const override;

   bool shouldExtendTypeInLibCall(EVT Type) const override;
   bool shouldSignExtendTypeInLibCall(EVT Type, bool IsSigned) const override;
+  bool shouldAlignPointerArgs(CallInst *CI, unsigned &MinSize,
+                              Align &PrefAlign) const override;

   /// Returns the register with the specified architectural or ABI name. This
   /// method is necessary to lower the llvm.read_register.* and
   /// llvm.write_register.* intrinsics. Allocatable registers must be reserved
   /// with the clang -ffixed-xX flag for access to be allowed.
   Register getRegisterByName(const char *RegName, LLT VT,
                              const MachineFunction &MF) const override;

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

 bool RISCVTargetLowering::shouldSignExtendTypeInLibCall(EVT Type, bool IsSigned) const {
   if (Subtarget.is64Bit() && Type == MVT::i32)
     return true;

   return IsSigned;
 }

+bool RISCVTargetLowering::shouldAlignPointerArgs(CallInst *CI,
+                                                 unsigned &MinSize,
+                                                 Align &PrefAlign) const {
+  if (!isa<MemIntrinsic>(CI))
+    return false;
+  MinSize = Subtarget.getXLen() / 8;
     Inline comment, jrtc27 (Done): This seems arbitrary. I'd expect at least XLEN / 8 here?
     Inline comment, JojoR (Done): The caller already checks this condition and sets PrefAlign only when the original alignment is smaller :)
     Inline comment, jrtc27 (Done): I don't understand this comment at all. This is about MinSize not PrefAlign, and CGP just uses this value as a threshold. Why 8? Why not XLEN / 8, or XLEN / 4 if the 8 was chosen to optimise for RV32. This needs justification.
     Inline comment, craig.topper (Not Done): Having the same value for rv32 and rv64 kind of makes sense from a certain perspective. If the object is smaller than MinSize, then you're saying you're OK with "object size in bytes" number of stores if the object is align 1. What you want to allow doesn't necessarily vary with XLEN.
+  PrefAlign = Align(MinSize);
     Inline comment, craig.topper (Done): `Align(Subtarget.getXLen() / 8)`
+  return true;
+}

 bool RISCVTargetLowering::decomposeMulByConstant(LLVMContext &Context, EVT VT,
                                                  SDValue C) const {
   // Check integral scalar types.
   const bool HasExtMOrZmmul =
       Subtarget.hasStdExtM() || Subtarget.hasStdExtZmmul();
   if (VT.isScalarInteger()) {
     // Omit the optimization if the sub target has the M extension and the data
     // size exceeds XLen.

llvm/test/CodeGen/RISCV/memcpy-align.ll

This file was added.

; RUN: llc -mtriple=riscv32 -verify-machineinstrs < %s \
; RUN:   | FileCheck %s -check-prefix=RV32
; RUN: llc -mtriple=riscv64 -verify-machineinstrs < %s \
; RUN:   | FileCheck %s -check-prefix=RV64

@.str = private unnamed_addr constant [31 x i8] c"DHRYSTONE PROGRAM, SOME STRING\00", align 1
@dst = internal global [31 x i8] zeroinitializer, align 1

define void @foo() {
; RV32-LABEL: foo:
; RV32: .p2align 2
; RV32-NEXT: .L.str:

; RV64-LABEL: foo:
; RV64: .p2align 3
; RV64-NEXT: .L.str:
entry:
  tail call void @llvm.memcpy.p0i8.p0i8.i32(ptr noundef nonnull align 1 dereferenceable(31) @dst, ptr noundef nonnull align 1 dereferenceable(31) @.str, i32 31, i1 false)
  ret void
}

declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture, i8* nocapture, i32, i1) nounwind

llvm/test/CodeGen/RISCV/memcpy-inline.ll

 ; RV64UNALIGNED-NEXT: sw a1, 0(a0)
 ; RV64UNALIGNED-NEXT: ret
 entry:
   tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %C, i8* getelementptr inbounds ([7 x i8], [7 x i8]* @.str5, i64 0, i64 0), i64 7, i1 false)
   ret void
 }

 define void @t6() nounwind {
-; RV32ALIGNED-LABEL: t6:
-; RV32ALIGNED: # %bb.0: # %entry
-; RV32ALIGNED-NEXT: addi sp, sp, -16
-; RV32ALIGNED-NEXT: sw ra, 12(sp) # 4-byte Folded Spill
-; RV32ALIGNED-NEXT: lui a0, %hi(spool.splbuf)
-; RV32ALIGNED-NEXT: addi a0, a0, %lo(spool.splbuf)
-; RV32ALIGNED-NEXT: lui a1, %hi(.L.str6)
-; RV32ALIGNED-NEXT: addi a1, a1, %lo(.L.str6)
-; RV32ALIGNED-NEXT: li a2, 14
-; RV32ALIGNED-NEXT: call memcpy@plt
-; RV32ALIGNED-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
-; RV32ALIGNED-NEXT: addi sp, sp, 16
-; RV32ALIGNED-NEXT: ret
+; RV32-LABEL: t6:
+; RV32: # %bb.0: # %entry
+; RV32-NEXT: lui a0, %hi(spool.splbuf)
+; RV32-NEXT: li a1, 88
+; RV32-NEXT: sh a1, %lo(spool.splbuf+12)(a0)
+; RV32-NEXT: lui a1, 361862
+; RV32-NEXT: addi a1, a1, -1960
+; RV32-NEXT: sw a1, %lo(spool.splbuf+8)(a0)
+; RV32-NEXT: lui a1, 362199
+; RV32-NEXT: addi a1, a1, 559
+; RV32-NEXT: sw a1, %lo(spool.splbuf+4)(a0)
+; RV32-NEXT: lui a1, 460503
+; RV32-NEXT: addi a1, a1, 1071
+; RV32-NEXT: sw a1, %lo(spool.splbuf)(a0)
+; RV32-NEXT: ret
 ;
 ; RV64ALIGNED-LABEL: t6:
 ; RV64ALIGNED: # %bb.0: # %entry
-; RV64ALIGNED-NEXT: addi sp, sp, -16
-; RV64ALIGNED-NEXT: sd ra, 8(sp) # 8-byte Folded Spill
 ; RV64ALIGNED-NEXT: lui a0, %hi(spool.splbuf)
-; RV64ALIGNED-NEXT: addi a0, a0, %lo(spool.splbuf)
-; RV64ALIGNED-NEXT: lui a1, %hi(.L.str6)
-; RV64ALIGNED-NEXT: addi a1, a1, %lo(.L.str6)
-; RV64ALIGNED-NEXT: li a2, 14
-; RV64ALIGNED-NEXT: call memcpy@plt
-; RV64ALIGNED-NEXT: ld ra, 8(sp) # 8-byte Folded Reload
-; RV64ALIGNED-NEXT: addi sp, sp, 16
+; RV64ALIGNED-NEXT: li a1, 88
+; RV64ALIGNED-NEXT: sh a1, %lo(spool.splbuf+12)(a0)
+; RV64ALIGNED-NEXT: lui a1, %hi(.LCPI6_0)
+; RV64ALIGNED-NEXT: ld a1, %lo(.LCPI6_0)(a1)
+; RV64ALIGNED-NEXT: lui a2, 361862
+; RV64ALIGNED-NEXT: addiw a2, a2, -1960
+; RV64ALIGNED-NEXT: sw a2, %lo(spool.splbuf+8)(a0)
+; RV64ALIGNED-NEXT: sd a1, %lo(spool.splbuf)(a0)
 ; RV64ALIGNED-NEXT: ret
 ;
-; RV32UNALIGNED-LABEL: t6:
-; RV32UNALIGNED: # %bb.0: # %entry
-; RV32UNALIGNED-NEXT: lui a0, %hi(spool.splbuf)
-; RV32UNALIGNED-NEXT: li a1, 88
-; RV32UNALIGNED-NEXT: sh a1, %lo(spool.splbuf+12)(a0)
-; RV32UNALIGNED-NEXT: lui a1, 361862
-; RV32UNALIGNED-NEXT: addi a1, a1, -1960
-; RV32UNALIGNED-NEXT: sw a1, %lo(spool.splbuf+8)(a0)
-; RV32UNALIGNED-NEXT: lui a1, 362199
-; RV32UNALIGNED-NEXT: addi a1, a1, 559
-; RV32UNALIGNED-NEXT: sw a1, %lo(spool.splbuf+4)(a0)
-; RV32UNALIGNED-NEXT: lui a1, 460503
-; RV32UNALIGNED-NEXT: addi a1, a1, 1071
-; RV32UNALIGNED-NEXT: sw a1, %lo(spool.splbuf)(a0)
-; RV32UNALIGNED-NEXT: ret
-;
 ; RV64UNALIGNED-LABEL: t6:
 ; RV64UNALIGNED: # %bb.0: # %entry
 ; RV64UNALIGNED-NEXT: lui a0, %hi(.L.str6)
 ; RV64UNALIGNED-NEXT: ld a0, %lo(.L.str6)(a0)
 ; RV64UNALIGNED-NEXT: lui a1, %hi(spool.splbuf)
 ; RV64UNALIGNED-NEXT: li a2, 88
 ; RV64UNALIGNED-NEXT: sh a2, %lo(spool.splbuf+12)(a1)
 ; RV64UNALIGNED-NEXT: sd a0, %lo(spool.splbuf)(a1)