This is an archive of the discontinued LLVM Phabricator instance.

If I'm following correctly, the problem here is that we don't have paired load/store operations for byte operations, so the threshold is roughly double what it should be. StrictAlign isn't really a good proxy for that; it doesn't have anything to do with whether a particular memcpy will lower to paired operations. Ideally, we should probably come up with some callback to give a bonus to paired operations rather than penalize all operations in StrictAlign mode.

That said, just checking StrictAlign might be a good enough approximation for now; most code isn't built with strict alignment. Needs a comment explaining what you're doing, though.

Given you're just checking the subtarget, I don't think you need to make getMaxStoresPerMemcpy virtual; you can just modify AArch64TargetLowering::AArch64TargetLowering.

In D47349#1111849, @efriedma wrote:

If I'm following correctly, the problem here is that we don't have paired load/store operations for byte operations, so the threshold is roughly double what it should be. StrictAlign isn't really a good proxy for that; it doesn't have anything to do with whether a particular memcpy will lower to paired operations. Ideally, we should probably come up with some callback to give a bonus to paired operations rather than penalize all operations in StrictAlign mode.

The problem is less the lack of paired byte-wide loads and stores than having to use byte-wide loads and stores at all.
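To put rough numbers on that, here is a minimal sketch, not LLVM code, that counts the stores needed to copy a block when the widest usable store is limited. `countStores` is a hypothetical helper, and the greedy widest-first split is a simplifying assumption; the real memcpy lowering may choose differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of LLVM: counts the stores needed to copy
// `bytes` bytes when the widest usable store is `maxStoreBytes` wide.
// Assumption: a greedy split that always takes the widest store that fits.
unsigned countStores(std::uint64_t bytes, unsigned maxStoreBytes) {
  // The width must be a nonzero power of two (1, 2, 4, 8, 16, ...).
  assert(maxStoreBytes != 0 && (maxStoreBytes & (maxStoreBytes - 1)) == 0);
  unsigned stores = 0;
  for (unsigned width = maxStoreBytes; width != 0; width /= 2) {
    stores += static_cast<unsigned>(bytes / width);
    bytes %= width;
  }
  return stores;
}
```

Under these assumptions, a 16-byte copy is one store when 16-byte stores are usable, but 16 stores when only byte stores are (as with align 1 under strict alignment). A 4-byte copy is 4 byte stores.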

That said, just checking StrictAlign might be a good enough approximation for now; most code isn't built with strict alignment. Needs a comment explaining what you're doing, though.

Given you're just checking the subtarget, I don't think you need to make getMaxStoresPerMemcpy virtual; you can just modify AArch64TargetLowering::AArch64TargetLowering.

The sub-target doesn't exist in the base class though.

The AArch64Subtarget is explicitly passed as a parameter to AArch64TargetLowering::AArch64TargetLowering.

In D47349#1112584, @efriedma wrote:

The AArch64Subtarget is explicitly passed as a parameter to AArch64TargetLowering::AArch64TargetLowering.

But getMaxStoresPerMemcpy() is called in SelectionDAG as TargetLowering::getMaxStoresPerMemcpy(). So, unless it's virtual, the latter, instead of AArch64TargetLowering::getMaxStoresPerMemcpy(), will be called.
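The dispatch issue described here can be reproduced with a small stand-alone sketch. The types below are mocks for illustration only, not the LLVM classes, and the values 16 and 4 are placeholders.

```cpp
// Mock types for illustration only; these are not the LLVM classes.
struct BaseLowering {
  // Non-virtual: calls through a base reference always land here.
  unsigned plainLimit() const { return 16; }
  // Virtual: calls through a base reference reach the derived override.
  virtual unsigned virtualLimit() const { return 16; }
  virtual ~BaseLowering() = default;
};

struct DerivedLowering : BaseLowering {
  // Hides BaseLowering::plainLimit, but does not override it.
  unsigned plainLimit() const { return 4; }
  // Overrides BaseLowering::virtualLimit.
  unsigned virtualLimit() const override { return 4; }
};

// Stand-ins for target-independent code that only sees the base type.
unsigned queryPlain(const BaseLowering &TLI) { return TLI.plainLimit(); }
unsigned queryVirtual(const BaseLowering &TLI) { return TLI.virtualLimit(); }
```

Given a `DerivedLowering` object, `queryPlain` still returns the base value, while `queryVirtual` returns the derived one. That is why the getter would have to become virtual if it were to be overridden in the target.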

I meant that you can change the value of MaxStoresPerMemcpy in AArch64TargetLowering::AArch64TargetLowering, instead of overriding getMaxStoresPerMemcpy.

In D47349#1112719, @efriedma wrote:

I meant that you can change the value of MaxStoresPerMemcpy in AArch64TargetLowering::AArch64TargetLowering, instead of overriding getMaxStoresPerMemcpy.

If you mean decreasing the value of MaxStoresPerMemcpy, then that would also restrict the inlining of memcpy() when -mattr=+strict-align is not used.

MaxStoresPerMemcpy = STI.requiresStrictAlign() ? 4 : 16;?

In D47349#1112803, @efriedma wrote:

MaxStoresPerMemcpy = STI.requiresStrictAlign() ? 4 : 16;?

Oy, you mean in the constructor? Got it!

There's no need to make these methods virtual and override them in the sub target. Rather, modify the limits when the sub target is initialized, as @eli.friedman suggested.
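As a rough sketch of that suggestion, the limit can be picked once in the constructor so the base-class getter stays non-virtual. All types below are mocks, not the LLVM classes; the base-class defaults are placeholders, and 4 and 16 are simply the values floated earlier in this thread.

```cpp
// Mock types for illustration only; the real classes live in LLVM.
struct MockSubtarget {
  bool StrictAlign = false;
  bool requiresStrictAlign() const { return StrictAlign; }
};

class MockTargetLowering {
protected:
  unsigned MaxStoresPerMemcpy = 8;        // placeholder default
  unsigned MaxStoresPerMemcpyOptSize = 4; // placeholder default

public:
  // Non-virtual getter that only reads the stored limits.
  unsigned getMaxStoresPerMemcpy(bool OptSize) const {
    return OptSize ? MaxStoresPerMemcpyOptSize : MaxStoresPerMemcpy;
  }
};

class MockAArch64TargetLowering : public MockTargetLowering {
public:
  explicit MockAArch64TargetLowering(const MockSubtarget &STI) {
    // In case of strict alignment, avoid an excessive number of
    // byte-wide stores by lowering the limit up front.
    MaxStoresPerMemcpy = STI.requiresStrictAlign() ? 4 : 16;
  }
};
```

Because the value is stored in the base-class member, code that calls the getter through the base type sees the adjusted limit without any virtual dispatch.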

evandro removed a child revision: D45098: [AArch64] Fix PR32384: bump up the number of stores per memset and memcpy. May 25 2018, 2:14 PM

evandro mentioned this in D45098: [AArch64] Fix PR32384: bump up the number of stores per memset and memcpy. May 25 2018, 2:22 PM

Revision Contents

Path                                                           Size
llvm/include/llvm/CodeGen/TargetLowering.h                     8 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.h                  12 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp                13 lines
llvm/test/CodeGen/AArch64/arm64-misaligned-memcpy-inline.ll    13 lines

Diff 148499

llvm/include/llvm/CodeGen/TargetLowering.h

@@ ... 1,179 lines not shown ... @@ public:
   }
 
   /// Get maximum # of store operations permitted for llvm.memset
   ///
   /// This function returns the maximum number of store operations permitted
   /// to replace a call to llvm.memset. The value is set by the target at the
   /// performance threshold for such a replacement. If OptSize is true,
   /// return the limit for functions that have OptSize attribute.
-  unsigned getMaxStoresPerMemset(bool OptSize) const {
+  virtual unsigned getMaxStoresPerMemset(bool OptSize) const {
     return OptSize ? MaxStoresPerMemsetOptSize : MaxStoresPerMemset;
   }
 
   /// Get maximum # of store operations permitted for llvm.memcpy
   ///
   /// This function returns the maximum number of store operations permitted
   /// to replace a call to llvm.memcpy. The value is set by the target at the
   /// performance threshold for such a replacement. If OptSize is true,
   /// return the limit for functions that have OptSize attribute.
-  unsigned getMaxStoresPerMemcpy(bool OptSize) const {
+  virtual unsigned getMaxStoresPerMemcpy(bool OptSize) const {
     return OptSize ? MaxStoresPerMemcpyOptSize : MaxStoresPerMemcpy;
   }
 
   /// \brief Get maximum # of store operations to be glued together
   ///
   /// This function returns the maximum number of store operations permitted
   /// to glue together during lowering of llvm.memcpy. The value is set by
   // the target at the performance threshold for such a replacement.
   virtual unsigned getMaxGluedStoresPerMemcpy() const {
     return MaxGluedStoresPerMemcpy;
   }
 
   /// Get maximum # of load operations permitted for memcmp
   ///
   /// This function returns the maximum number of load operations permitted
   /// to replace a call to memcmp. The value is set by the target at the
   /// performance threshold for such a replacement. If OptSize is true,
   /// return the limit for functions that have OptSize attribute.
-  unsigned getMaxExpandSizeMemcmp(bool OptSize) const {
+  virtual unsigned getMaxExpandSizeMemcmp(bool OptSize) const {
     return OptSize ? MaxLoadsPerMemcmpOptSize : MaxLoadsPerMemcmp;
   }
 
   /// For memcmp expansion when the memcmp result is only compared equal or
   /// not-equal to 0, allow up to this number of load pairs per block. As an
   /// example, this may allow 'memcmp(a, b, 3) == 0' in a single block:
   ///   a0 = load2bytes &a[0]
   ///   b0 = load2bytes &b[0]
   ///   a2 = load1byte &a[2]
   ///   b2 = load1byte &b[2]
   ///   r = cmp eq (a0 ^ b0 | a2 ^ b2), 0
   virtual unsigned getMemcmpEqZeroLoadsPerBlock() const {
     return 1;
   }
 
   /// Get maximum # of store operations permitted for llvm.memmove
   ///
   /// This function returns the maximum number of store operations permitted
   /// to replace a call to llvm.memmove. The value is set by the target at the
   /// performance threshold for such a replacement. If OptSize is true,
   /// return the limit for functions that have OptSize attribute.
-  unsigned getMaxStoresPerMemmove(bool OptSize) const {
+  virtual unsigned getMaxStoresPerMemmove(bool OptSize) const {
     return OptSize ? MaxStoresPerMemmoveOptSize : MaxStoresPerMemmove;
   }
 
   /// Determine if the target supports unaligned memory accesses.
   ///
   /// This function returns true if the target allows unaligned memory accesses
   /// of the specified type in the given address space. If true, it also returns
   /// whether the unaligned memory access is "fast" in the last argument by
@@ ... 2,395 lines not shown ... @@

llvm/lib/Target/AArch64/AArch64ISelLowering.h

@@ ... 232 lines not shown ... @@
 }
 
 } // end anonymous namespace
 
 class AArch64Subtarget;
 class AArch64TargetMachine;
 
 class AArch64TargetLowering : public TargetLowering {
+  /// Keep a pointer to the AArch64Subtarget around so that we can
+  /// make the right decision when generating code for different targets.
+  const AArch64Subtarget *Subtarget;
+
 public:
   explicit AArch64TargetLowering(const TargetMachine &TM,
                                  const AArch64Subtarget &STI);
 
   /// Selects the correct CCAssignFn for a given CallingConvention value.
   CCAssignFn *CCAssignFnForCall(CallingConv::ID CC, bool IsVarArg) const;
 
   /// Selects the correct CCAssignFn for a given CallingConvention value.
@@ ... 12 lines not shown ... @@ public:
   MVT getScalarShiftAmountTy(const DataLayout &DL, EVT) const override;
 
   /// Returns true if the target allows unaligned memory accesses of the
   /// specified type.
   bool allowsMisalignedMemoryAccesses(EVT VT, unsigned AddrSpace = 0,
                                       unsigned Align = 1,
                                       bool *Fast = nullptr) const override;
 
+  // In case of strict alignment, avoid an excessive number of byte wide stores.
+  unsigned getMaxStoresPerMemset(bool OptSize) const override;
+  unsigned getMaxStoresPerMemcpy(bool OptSize) const override;
+
   /// Provide custom lowering hooks for some operations.
   SDValue LowerOperation(SDValue Op, SelectionDAG &DAG) const override;
 
   const char *getTargetNodeName(unsigned Opcode) const override;
 
   SDValue PerformDAGCombine(SDNode *N, DAGCombinerInfo &DCI) const override;
 
   /// Returns true if a cast between SrcAS and DestAS is a noop.
@@ ... 218 lines not shown ... @@ public:
   MachineMemOperand::Flags getMMOFlags(const Instruction &I) const override;
 
   bool functionArgumentNeedsConsecutiveRegisters(Type *Ty,
                                                  CallingConv::ID CallConv,
                                                  bool isVarArg) const override;
 private:
   bool isExtFreeImpl(const Instruction *Ext) const override;
 
-  /// Keep a pointer to the AArch64Subtarget around so that we can
-  /// make the right decision when generating code for different targets.
-  const AArch64Subtarget *Subtarget;
-
   void addTypeForNEON(MVT VT, MVT PromotedBitwiseVT);
   void addDRTypeForNEON(MVT VT);
   void addQRTypeForNEON(MVT VT);
 
   SDValue LowerFormalArguments(SDValue Chain, CallingConv::ID CallConv,
                                bool isVarArg,
                                const SmallVectorImpl<ISD::InputArg> &Ins,
                                const SDLoc &DL, SelectionDAG &DAG,
@@ ... 177 lines not shown ... @@

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

@@ ... 1,059 lines not shown ... @@ *Fast = !Subtarget->isMisaligned128StoreSlow() || VT.getStoreSize() != 16 ||
 
             // Disregard v2i64. Memcpy lowering produces those and splitting
             // them regresses performance on micro-benchmarks and olden/bh.
             VT == MVT::v2i64;
   }
   return true;
 }
 
+// In case of strict alignment, avoid an excessive number of byte wide stores
+// for memset() and...
+unsigned AArch64TargetLowering::getMaxStoresPerMemset(bool OptSize) const {
+  return OptSize || Subtarget->requiresStrictAlign()
+         ? MaxStoresPerMemsetOptSize : MaxStoresPerMemset;
+}
+
+// memcpy().
+unsigned AArch64TargetLowering::getMaxStoresPerMemcpy(bool OptSize) const {
+  return OptSize || Subtarget->requiresStrictAlign()
+         ? MaxStoresPerMemcpyOptSize : MaxStoresPerMemcpy;
+}
+
 FastISel *
 AArch64TargetLowering::createFastISel(FunctionLoweringInfo &funcInfo,
                                       const TargetLibraryInfo *libInfo) const {
   return AArch64::createFastISel(funcInfo, libInfo);
 }
 
 const char *AArch64TargetLowering::getTargetNodeName(unsigned Opcode) const {
   switch ((AArch64ISD::NodeType)Opcode) {
@@ ... 10,370 lines not shown ... @@

llvm/test/CodeGen/AArch64/arm64-misaligned-memcpy-inline.ll

 ; RUN: llc -mtriple=arm64-apple-ios -mattr=+strict-align < %s | FileCheck %s
 
-; Small (16-bytes here) unaligned memcpys should stay memcpy calls if
+; Small (16 bytes here) unaligned memcpy should stay memcpy call if
 ; strict-alignment is turned on.
 define void @t0(i8* %out, i8* %in) {
 ; CHECK-LABEL: t0:
 ; CHECK: orr w2, wzr, #0x10
 ; CHECK-NEXT: bl _memcpy
 entry:
   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %out, i8* %in, i64 16, i1 false)
   ret void
 }
 
+; Tiny (4 bytes here) unaligned memcpy could be inlined with byte sized
+; loads and stores if strict-alignment is turned on.
+define void @t1(i8* %out, i8* %in) {
+; CHECK-LABEL: t1:
+; CHECK: ldrb w{{[0-9]+}}, [x1]
+; CHECK: strb w{{[0-9]+}}, [x0]
+entry:
+  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %out, i8* %in, i64 4, i1 false)
+  ret void
+}
+
 declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture, i8* nocapture readonly, i64, i1)


[AArch64] Limit inlining string functions with strict alignment (Abandoned, Public)
