This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Reduce alignment of vector constant pool entries
AbandonedPublic

Authored by reames on Jul 18 2023, 9:07 AM.


Details

Reviewers

craig.topper
asb
frasercrmck
kito-cheng
luke

Summary

For RVV, we rely heavily on constant pools for fixed-length vector constants. The default constant pool lowering aligns each entry to the ABI alignment of its type. For a vector, this is usually the size of the type in question. As this isn't actually exposed in the ABI (right?!), the resulting alignment creates a bunch of extra padding with no value.

This change reduces the alignment used to the vector element alignment. This closely matches the reasoning in the allowsMisalignedMemoryAccesses routine (and we assert they're in sync). Note that our instruction choice doesn't change; only the alignment of the constant pool entry does.
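As a rough illustration of the arithmetic, the sketch below compares the whole-vector alignment with the element alignment for a fixed-length vector. It assumes the ABI alignment of the vector equals its size in bytes, which the summary says is only usually the case. `VecType`, `vectorAlign`, `elementAlign`, and `log2Align` are hypothetical names for this example, not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a fixed-length vector type; not an LLVM class.
struct VecType {
  unsigned NumElts;  // number of elements
  unsigned EltBytes; // store size of one element, in bytes
};

// Assumption: the ABI alignment of the vector equals its total size in bytes.
inline uint64_t vectorAlign(VecType T) {
  return uint64_t(T.NumElts) * T.EltBytes;
}

// Alignment proposed by the patch: the store size of a single element.
inline uint64_t elementAlign(VecType T) { return T.EltBytes; }

// Exponent that a .p2align directive would carry for a power-of-two alignment.
inline unsigned log2Align(uint64_t A) {
  assert(A != 0 && (A & (A - 1)) == 0 && "alignment must be a power of two");
  unsigned L = 0;
  while (A > 1) {
    A >>= 1;
    ++L;
  }
  return L;
}
```

Under that assumption, `<4 x i64>` (`VecType{4, 8}`) goes from an exponent of 5 to 3 and `<4 x i32>` from 4 to 2, which lines up with the `.p2align` changes in this revision's test. For `<16 x i8>` the element alignment is a single byte, so no `.p2align` directive is needed at all.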

Performance effects here may be a bit complicated, but I think (hope?) it should be generally positive. Potential downsides include:

Placing data immediately after the end of the previous function. This may confuse instruction decode, which fetches in chunks and may try to decode the data as instructions.
Changing the working set of the following function. By removing alignment, we may either decrease or increase the size of this set. Note that we actually have two working sets to consider - d-cache and i-cache. Each can change independently.

Note that the downsides above already apply to non-vector data (since they are naturally less aligned). If we have a processor which has problems with the above items, we should probably be trying to mitigate the general issues as opposed to getting lucky due to vector constant pools. :)

Note: I'd originally tried to do something here which was more target independent, but I found that a) reducing alignment caused massive test diffs, and b) exposed what appeared to be a number of missing folds on x86. Thus the target specific hook approach taken here.
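The hook approach can be pictured with a small stand-alone model. The sketch below is not LLVM code: `ConstInfo`, `GenericLowering`, `RVVLowering`, and `pickAlign` are hypothetical stand-ins, and the alignments are passed in as plain numbers rather than queried from a data layout. It only mirrors the shape of the change: a virtual default that chooses between ABI and preferred alignment, a target override that returns the element alignment for vectors, and a caller that consults the hook only when no alignment was requested explicitly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a constant's type; stands in for the
// Constant / DataLayout queries used by the real patch.
struct ConstInfo {
  bool IsVector;
  uint64_t ABIAlign;  // ABI alignment of the whole type, in bytes
  uint64_t PrefAlign; // preferred alignment of the whole type, in bytes
  uint64_t EltBytes;  // store size of one element (vectors only)
};

// Target-independent default: ABI alignment when optimizing for size,
// preferred alignment otherwise.
struct GenericLowering {
  virtual ~GenericLowering() = default;
  virtual uint64_t prefConstantPoolAlign(const ConstInfo &C,
                                         bool OptForSize) const {
    return OptForSize ? C.ABIAlign : C.PrefAlign;
  }
};

// Target override modeled on the RISC-V change: vectors only need element
// alignment; everything else defers to the default.
struct RVVLowering : GenericLowering {
  uint64_t prefConstantPoolAlign(const ConstInfo &C,
                                 bool OptForSize) const override {
    if (C.IsVector)
      return C.EltBytes;
    return GenericLowering::prefConstantPoolAlign(C, OptForSize);
  }
};

// Caller side: a non-zero requested alignment wins; otherwise ask the hook.
inline uint64_t pickAlign(const GenericLowering &TLI, const ConstInfo &C,
                          uint64_t Requested, bool OptForSize) {
  assert((Requested & (Requested - 1)) == 0 && "zero or a power of two");
  return Requested ? Requested : TLI.prefConstantPoolAlign(C, OptForSize);
}
```

Keeping the default in the base class means targets that do not override the hook see no behavior change, which is presumably how the x86 test churn mentioned above is avoided.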

Diff Detail

Event Timeline

reames created this revision. Jul 18 2023, 9:07 AM

Herald added a project: Restricted Project. Jul 18 2023, 9:07 AM

Herald added subscribers: jobnoorman, VincentWu, vkmr and 27 others.

reames requested review of this revision. Jul 18 2023, 9:07 AM

Herald added a project: Restricted Project. Jul 18 2023, 9:07 AM

Herald added subscribers: wangpc, eopXD, MaskRay.

reames added a parent revision: D155612: [RISCV] Add test which shows alignment of constant pools and the functions which followed. Jul 18 2023, 9:08 AM

Harbormaster completed remote builds in B246269: Diff 541586. Jul 18 2023, 4:47 PM

That reminds me of this patch: D112531. Back to the psABI part: yes, we don't say much about fixed-length vectors. One point I am hesitant about is the P extension (though it was never ratified and has no clear timeline), which uses GPRs to hold fixed-length vectors and would therefore require alignment to XLEN.

Note: I'd originally tried to do something here which was more target independent, but I found that a) reducing alignment caused massive test diffs, and b) exposed what appeared to be a number of missing folds on x86. Thus the target specific hook approach taken here.

SSE arithmetic ops can only fold aligned loads. Is that what you were seeing?

efriedma added a subscriber: efriedma. Jul 19 2023, 1:35 AM

efriedma added inline comments.

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-constant-pool-alignment.ll
5	Messing with the alignment of constants in .rodata.cst16 is going to have nearly zero effect; the data is getting emitted in 16-byte chunks, so there's no padding anyway. (Not sure if this is what you meant to test.)

Abandoning as the benefit here appears to be marginal at best due to a misunderstanding pointed out by @efriedma's comment. (My inline response expands a bit on his point.)

In D155618#4513630, @kito-cheng wrote:

That reminds me of this patch: D112531. Back to the psABI part: yes, we don't say much about fixed-length vectors. One point I am hesitant about is the P extension (though it was never ratified and has no clear timeline), which uses GPRs to hold fixed-length vectors and would therefore require alignment to XLEN.

P can be safely ignored at this point; we can worry about it when we add toolchain support. If ever.

In D155618#4513655, @craig.topper wrote:

Note: I'd originally tried to do something here which was more target independent, but I found that a) reducing alignment caused massive test diffs, and b) exposed what appeared to be a number of missing folds on x86. Thus the target specific hook approach taken here.

SSE arithmetic ops can only fold aligned loads. Is that what you were seeing?

Yep, that looks like what I was seeing.

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-constant-pool-alignment.ll
5	I went and dug through this carefully. As you point out, the data fragments do end up being concatenated together into the .rodata section, and the alignment of the entries within the section doesn't end up mattering much. We group 16-byte chunks into .rodata.cst16, and 32-byte chunks into .rodata.cst32. These are then combined to form the final .rodata section, and thus the internal alignment doesn't create any additional padding within that section. This change does end up reducing the alignment requirement of the .rodata section as a whole (from 32 to 8 in this example), but that's a pretty marginal gain. When I wrote the patch, I was thinking the data fragments were in the .text section between the functions. This is not true.
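To make the no-padding observation concrete, the sketch below lays out a run of equally sized entries at a given per-entry alignment and reports the padding and the resulting section alignment. It is a simplified model, not a description of what the assembler or linker actually does: `Layout`, `alignTo`, and `layoutSection` are hypothetical names, and the model assumes every entry in the section has the same size, as in a `.rodata.cstN` section.

```cpp
#include <cassert>
#include <cstdint>

// Result of laying out a section of equally sized entries.
struct Layout {
  uint64_t TotalBytes;   // section size including any padding
  uint64_t PaddingBytes; // bytes inserted purely for alignment
  uint64_t SectionAlign; // alignment the section as a whole must honor
};

// Round Offset up to the next multiple of Align (a power of two).
inline uint64_t alignTo(uint64_t Offset, uint64_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 && "power of two expected");
  return (Offset + Align - 1) & ~(Align - 1);
}

// Simplified model: place Count entries of EntryBytes each, one after
// another, aligning each entry to EntryAlign.
inline Layout layoutSection(unsigned Count, uint64_t EntryBytes,
                            uint64_t EntryAlign) {
  Layout L{0, 0, 1};
  for (unsigned I = 0; I < Count; ++I) {
    uint64_t Start = alignTo(L.TotalBytes, EntryAlign);
    L.PaddingBytes += Start - L.TotalBytes;
    L.TotalBytes = Start + EntryBytes;
    if (EntryAlign > L.SectionAlign)
      L.SectionAlign = EntryAlign;
  }
  return L;
}
```

In this model, three 32-byte entries occupy 96 bytes with no padding whether each entry is aligned to 32 or to 8; only `SectionAlign` differs (32 versus 8). Padding appears only when the entry size is not a multiple of the entry alignment, for example 12-byte entries aligned to 16.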

Revision Contents

Path                                                                   Size
llvm/include/llvm/CodeGen/TargetLowering.h                             8 lines
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp                         5 lines
llvm/lib/Target/RISCV/RISCVISelLowering.h                              4 lines
llvm/lib/Target/RISCV/RISCVISelLowering.cpp                            26 lines
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-constant-pool-alignment.ll   6 lines

Diff 541586

llvm/include/llvm/CodeGen/TargetLowering.h

[1,882 lines hidden; enclosing context: #include "llvm/IR/ConstrainedOps.def"]
   }

   /// Return the minimum function alignment.
   Align getMinFunctionAlignment() const { return MinFunctionAlignment; }

   /// Return the preferred function alignment.
   Align getPrefFunctionAlignment() const { return PrefFunctionAlignment; }

+  /// Return the preferred alignment for a constant pool entry for C.
+  virtual Align getPrefConstantPoolAlignment(const Constant *C,
+                                             bool ShouldOptForSize,
+                                             const DataLayout &DL) const {
+    return ShouldOptForSize ? DL.getABITypeAlign(C->getType()) :
+                              DL.getPrefTypeAlign(C->getType());
+  }
+
   /// Return the preferred loop alignment.
   virtual Align getPrefLoopAlignment(MachineLoop *ML = nullptr) const;

   /// Return the maximum amount of bytes allowed to be emitted when padding for
   /// alignment
   virtual unsigned
   getMaxPermittedBytesForAlignment(MachineBasicBlock *MBB) const;

[3,447 lines hidden]

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp

[1,801 lines hidden]
 }

 SDValue SelectionDAG::getConstantPool(const Constant *C, EVT VT,
                                       MaybeAlign Alignment, int Offset,
                                       bool isTarget, unsigned TargetFlags) {
   assert((TargetFlags == 0 || isTarget) &&
          "Cannot set target flags on target-independent globals");
   if (!Alignment)
-    Alignment = shouldOptForSize()
-                    ? getDataLayout().getABITypeAlign(C->getType())
-                    : getDataLayout().getPrefTypeAlign(C->getType());
+    Alignment = TLI->getPrefConstantPoolAlignment(C, shouldOptForSize(),
+                                                  getDataLayout());
   unsigned Opc = isTarget ? ISD::TargetConstantPool : ISD::ConstantPool;
   FoldingSetNodeID ID;
   AddNodeIDNode(ID, Opc, getVTList(VT), std::nullopt);
   ID.AddInteger(Alignment->value());
   ID.AddInteger(Offset);
   ID.AddPointer(C);
   ID.AddInteger(TargetFlags);
   void *IP = nullptr;
[10,885 lines hidden]

llvm/lib/Target/RISCV/RISCVISelLowering.h

[667 lines hidden; enclosing context: public:]

   /// Returns true if the target allows unaligned memory accesses of the
   /// specified type.
   bool allowsMisalignedMemoryAccesses(
       EVT VT, unsigned AddrSpace = 0, Align Alignment = Align(1),
       MachineMemOperand::Flags Flags = MachineMemOperand::MONone,
       unsigned *Fast = nullptr) const override;

+  /// Return the preferred alignment for a constant pool entry for C.
+  Align getPrefConstantPoolAlignment(const Constant *C, bool ShouldOptForSize,
+                                     const DataLayout &DL) const override;
+
   bool splitValueIntoRegisterParts(
       SelectionDAG & DAG, const SDLoc &DL, SDValue Val, SDValue *Parts,
       unsigned NumParts, MVT PartVT, std::optional<CallingConv::ID> CC)
       const override;

   SDValue joinRegisterPartsIntoValue(
       SelectionDAG & DAG, const SDLoc &DL, const SDValue *Parts,
       unsigned NumParts, MVT PartVT, EVT ValueVT,
[277 lines hidden]

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

[16,580 lines hidden; enclosing context: bool RISCVTargetLowering::allowsMisalignedMemoryAccesses(]
   // e8 element type access. Given this, we effectively support all unmasked
   // misaligned accesses. TODO: Work through the codegen implications of
   // allowing such accesses to be formed, and considered fast.
   if (Fast)
     *Fast = 0;
   return Subtarget.enableUnalignedVectorMem();
 }

+Align RISCVTargetLowering::getPrefConstantPoolAlignment(const Constant *C,
+                                                        bool ShouldOptForSize,
+                                                        const DataLayout &DL) const {
+  // Generally, our goal here is to use the smallest alignment which will be
+  // performant. Note that this alignment is not ABI visible, and thus we
+  // are safe to use something smaller than ABI alignment here.
+
+  // For vectors, we can always use the element type's alignment.
+  if (isa<VectorType>(C->getType())) {
+    EVT VT = getValueType(DL, C->getType());
+    EVT ElemVT = VT.getVectorElementType();
+    Align ElemAlign(ElemVT.getStoreSize());
+#ifndef NDEBUG
+    unsigned Fast = 0;
+    assert(allowsMemoryAccess(C->getContext(), DL, VT, 0, ElemAlign,
+                              MachineMemOperand::MONone, &Fast) && Fast);
+#endif
+    return ElemAlign;
+  }
+
+  // Note: i64 is the only case which hits this in practice, and thus
+  // there's no profit to try to reduce the alignment.
+  return ShouldOptForSize ? DL.getABITypeAlign(C->getType()) :
+                            DL.getPrefTypeAlign(C->getType());
+}
+
 bool RISCVTargetLowering::splitValueIntoRegisterParts(
     SelectionDAG &DAG, const SDLoc &DL, SDValue Val, SDValue *Parts,
     unsigned NumParts, MVT PartVT, std::optional<CallingConv::ID> CC) const {
   bool IsABIRegCopy = CC.has_value();
   EVT ValueVT = Val.getValueType();
   if (IsABIRegCopy && (ValueVT == MVT::f16 || ValueVT == MVT::bf16) &&
       PartVT == MVT::f32) {
     // Cast the [b]f16 to i16, extend to i32, pad with ones to make a float
[481 lines hidden]

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-constant-pool-alignment.ll

 ; RUN: llc -mtriple=riscv64 -mattr=+v -verify-machineinstrs < %s | FileCheck %s

 define void @constantpool_v16xi8(ptr %x) {
 ; CHECK: .section .rodata.cst16,"aM",@progbits,16
-; CHECK: .p2align 4, 0x0
+; CHECK-NOT: .p2align
 ; CHECK: .byte
 ; CHECK: .globl constantpool_v16xi8
 ; CHECK: .p2align 2
   store <16 x i8> <i8 0, i8 1, i8 3, i8 3, i8 4, i8 5, i8 15, i8 7, i8 27, i8 9, i8 10, i8 11, i8 12, i8 13, i8 12, i8 15>, ptr %x
   ret void
 }

 define void @constantpool_v4xi32(ptr %x) {
 ; CHECK: .section .rodata.cst16,"aM",@progbits,16
-; CHECK: .p2align 4, 0x0
+; CHECK: .p2align 2, 0x0
 ; CHECK: .word
 ; CHECK: .globl constantpool_v4xi32
 ; CHECK: .p2align 2
   store <4 x i32> <i32 -27, i32 255, i32 3, i32 63>, ptr %x
   ret void
 }

 ; Note that to exercise the 64 bit alignment case, we need four elements
 ; as all of the two element small constant cases get optimized to some
 ; other sequence
 define void @constantpool_v4xi64(ptr %x) {
 ; CHECK: .section .rodata.cst32,"aM",@progbits,32
-; CHECK: .p2align 5, 0x0
+; CHECK: .p2align 3, 0x0
 ; CHECK: .quad
 ; CHECK: .globl constantpool_v4xi64
 ; CHECK: .p2align 2
   store <4 x i64> <i64 -27, i64 255, i64 3, i64 63>, ptr %x
   ret void
 }

 define void @constantpool_i64(ptr %x) {
[10 lines hidden]
