Details

Reviewers

t.p.northover
tstellarAMD
mcrosier

Commits

rG22e839f4b2d2: [AArch64] Improve code generation for logical instructions taking immediate…
rG19077aaee0e0: [AArch64] Improve code generation for logical instructions taking immediate…
rGe327f098329f: [AArch64] Improve code generation for logical instructions taking immediate…
rL301019: [AArch64] Improve code generation for logical instructions taking
rL300930: [AArch64] Improve code generation for logical instructions taking
rL300913: [AArch64] Improve code generation for logical instructions taking

Summary

LLVM currently turns the following code

void foo1(int a, char *p) {
  int t = a & 0xfd;
  *p = t;
}

into these instructions:

movz w8, #253
and w8, w0, w8
strb w8, [x1]

This can be done with just two instructions, since the store writes only the low 8 bits and we don't care what the upper 24 bits of the "and" result are:

and w8, w0, #0xfffffffd
strb w8, [x1]

This patch adds a target hook to TargetLowering::TargetLoweringOpt and overrides it in the AArch64 backend to sign-extend an immediate operand if the upper bits are not demanded and sign-extending enables folding the immediate into the instruction.
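As a quick sanity check of the claim above, here is a minimal C++ sketch (not part of the patch) showing that the zero-extended mask 0xfd and the sign-extended mask 0xfffffffd store the same byte. The function names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Byte stored by foo1 when the mask is the original zero-extended 0xfd.
inline std::uint8_t storedByteZeroExtendedMask(std::uint32_t a) {
  return static_cast<std::uint8_t>(a & 0x000000fdu);
}

// Byte stored when the mask is sign-extended from bit 7 to 0xfffffffd.
inline std::uint8_t storedByteSignExtendedMask(std::uint32_t a) {
  return static_cast<std::uint8_t>(a & 0xfffffffdu);
}

// Checks every 16-bit input; the two masks differ only in bits 8..31,
// which the truncation to 8 bits discards.
inline void checkMasksStoreSameByte() {
  for (std::uint32_t a = 0; a < (1u << 16); ++a)
    assert(storedByteZeroExtendedMask(a) == storedByteSignExtendedMask(a) &&
           "masks must agree on the demanded low 8 bits");
}
```

Only the low 8 bits are demanded by the store, so any mask that agrees with 0xfd on those bits is interchangeable here.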

This optimization speeds up 253.perlbmk by 5%.

Event Timeline

ahatanak updated this revision to Diff 14357. Oct 2 2014, 4:08 PM

ahatanak retitled this revision to "AArch64: Fold immediate into the immediate field of logical instructions".

ahatanak updated this object.

ahatanak edited the test plan for this revision.

ahatanak added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. Oct 2 2014, 4:08 PM

Nice gain!! Just a few nits.

include/llvm/Target/TargetLowering.h
2056	s/LTO/TLO ?
lib/Target/AArch64/AArch64ISelLowering.cpp
674	Remove vertical whitespace.
681	Please add an assert message: assert(cond && "msg");
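To illustrate the assert-with-message idiom requested above, here is a small standalone sketch. The helper name and the message text are hypothetical and are not taken from the patch.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: widens a scalar type size to the 32- or 64-bit
// operation size, the way the patch clamps with std::max(..., 32u).
inline unsigned operationSizeInBits(unsigned typeSizeInBits) {
  const unsigned size = std::max(typeSizeInBits, 32u);
  // The string literal is always true, so it never changes the condition;
  // it only appears in the failure message to explain what went wrong.
  assert(size <= 64 && "expected a scalar type of at most 64 bits");
  return size;
}
```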

aadg added a subscriber: aadg. Oct 3 2014, 2:20 AM

Address Chad's comments.

ping

lib/Target/AArch64/AArch64ISelLowering.cpp
685	The code I previously had here didn't make sense, so I cleaned it up. It was checking Size > 0 after Size = std::max(VT.getSizeInBits(), 32u). Also, I changed it to break if VT is a vector, just in case this function is called on a vector node.

ahatanak added a reviewer: t.p.northover. Oct 22 2014, 10:03 AM

Hi Akira,

I've got a couple of comments:

include/llvm/Target/TargetLowering.h
2058	Functions should usually start with a lower-case letter.
lib/Target/AArch64/AArch64ISelLowering.cpp
693–694	OK, so we've got an immediate "ab...iJK..Z" where lower-case digits aren't used. Sign extending converts this to "JJ...JJK...Z". Is there any particular reason to expect that's representable? It seems like a bit of a shot in the dark.
700	Shifting a negative number left is undefined behaviour.

Tim, I fixed the undefined behavior and renamed the function to start with lower-case letter.
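For reference, one way to sign-extend the low bits of an immediate without shifting a negative value is to stay in unsigned arithmetic. This is only a sketch of a well-known idiom, not necessarily the fix that went into the updated patch; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extends the low `bits` bits of `imm` to 64 bits. All arithmetic is
// on unsigned values, so neither the shifts nor the subtraction can
// trigger undefined behaviour.
inline std::uint64_t signExtendLowBits(std::uint64_t imm, unsigned bits) {
  assert(bits >= 1 && bits <= 64 && "bits must be in [1, 64]");
  if (bits == 64)
    return imm;
  const std::uint64_t mask = (std::uint64_t{1} << bits) - 1;
  const std::uint64_t signBit = std::uint64_t{1} << (bits - 1);
  const std::uint64_t low = imm & mask;
  // Flipping the sign bit and subtracting it back propagates it upward
  // (modulo 2^64) when it was set, and is a no-op when it was clear.
  return (low ^ signBit) - signBit;
}
```

With `bits = 8`, the immediate 0xfd from the summary becomes 0xfffffffffffffffd, whose low 32 bits are the 0xfffffffd used in the two-instruction sequence.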

The reason I only do sign-extension is that it seemed to be the cheapest way to get the most gain without hurting compile time or making the code overly complicated. I looked at the instructions llvm emits, and I saw many places where logical instructions were followed by truncating stores.

I can try changing it to do a more exhaustive search of the bit patterns to see how much further performance can be improved, if that's necessary.

Hi Akira,

> The reason I only do sign-extension is that it seemed to be the cheapest way to get the most gain without hurting compile time or making the code overly complicated. I looked at the instructions llvm emits, and I saw many places where logical instructions were followed by truncating stores.

I think the kind of store being done is largely orthogonal to the
calculation most likely to get you a valid immediate. It tells you
what bits can be ignored, but nothing about the best way to fill them.

So I think it's reasonable to start with just assuming that
DemandedBits has some number of low bits set and the rest ignored
based on your observations. I think it's harder to justify coming up
with a single NewImm value by sign extending the existing immediate
and giving up if that fails.

Cheers.

Tim.

Hi again,

> I think it's harder to justify coming up
> with a single NewImm value by sign extending the existing immediate
> and giving up if that fails.

I've been doing some more thinking here, and if we're willing to
assume the truncation is to a power of two type (I am, anyone using
i14 deserves whatever they get) then I think we can cover *all* valid
cases by instead replicating the demanded bits across the 32-bit
register. E.g. try 0xfdfdfdfd instead of 0xfffffffd for the 8-bit
0xfd.

The argument goes that if the input is morally contiguous, then there
are multiple representations involving sign extension to 4, 8, 16 or
32 followed by replication. Otherwise the replication width is less
than the demanded width so we're completely forced and have to
continue the replication that's already started.

Cheers.

Tim.
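The replication idea can be sketched in standalone C++ as follows. This is an illustration only: `replicate32` and `isBitmaskImm32` are hypothetical names, and the validity check is a simplified stand-in for what AArch64_AM::processLogicalImmediate decides, written from the rule that a logical immediate is a power-of-two-sized element, itself a rotated run of ones, repeated across the register.

```cpp
#include <cassert>
#include <cstdint>

// Repeats the low `width` bits of `imm` across a 32-bit value.
inline std::uint32_t replicate32(std::uint32_t imm, unsigned width) {
  assert(width >= 1 && width <= 32 && (width & (width - 1)) == 0 &&
         "width must be a power of two no larger than 32");
  if (width == 32)
    return imm;
  const std::uint32_t elem = imm & ((std::uint32_t{1} << width) - 1);
  std::uint32_t result = 0;
  for (unsigned shift = 0; shift < 32; shift += width)
    result |= elem << shift;
  return result;
}

// Simplified check for a 32-bit logical (bitmask) immediate.
inline bool isBitmaskImm32(std::uint32_t value) {
  if (value == 0 || value == 0xffffffffu)
    return false;
  for (unsigned size = 2; size <= 32; size *= 2) {
    if (replicate32(value, size) != value)
      continue; // not periodic at this element size, try a larger one
    const std::uint32_t mask =
        size == 32 ? 0xffffffffu : (std::uint32_t{1} << size) - 1;
    const std::uint32_t elem = value & mask;
    const std::uint32_t rotated = ((elem << 1) | (elem >> (size - 1))) & mask;
    // A rotated run of ones has exactly two circular 0/1 transitions.
    unsigned transitions = 0;
    for (std::uint32_t diff = elem ^ rotated; diff != 0; diff &= diff - 1)
      ++transitions;
    return transitions == 2;
  }
  return false;
}
```

Under this check, 0xfd on its own is rejected, while both `replicate32(0xfd, 8)`, which is 0xfdfdfdfd, and the sign-extended 0xfffffffd are accepted.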

In D5591#15, @t.p.northover wrote:

> The argument goes that if the input is morally contiguous, then there
> are multiple representations involving sign extension to 4, 8, 16 or
> 32 followed by replication. Otherwise the replication width is less
> than the demanded width so we're completely forced and have to
> continue the replication that's already started.

Sign-extension followed by replication enables converting constants like 0xfdfd (demanded = 16 bits) or 0x3dfd (demanded = 14 bits) to a bimm, where just sign-extending fails, but I think we should also try to handle cases like 0x19 (demanded = 5 bits). This isn't a bimm as it stands, and sign-extending doesn't make it a bimm either. In this case, we have to copy bits 1-3 to bits 5-7.

I think I can come up with a patch that does this.
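To make the 0x19 case concrete, here is a brute-force sketch that enumerates every 32-bit bitmask immediate (each element size, run length and rotation) and returns the first one that agrees with the original immediate on the demanded bits. It is deliberately naive and is not the algorithm used in the patch; `findBitmaskImm32` is a hypothetical name.

```cpp
#include <cstdint>

// Searches for a 32-bit logical immediate that matches `imm` on every bit
// set in `demanded`. Returns true and stores it in `found` on success.
inline bool findBitmaskImm32(std::uint32_t imm, std::uint32_t demanded,
                             std::uint32_t &found) {
  for (unsigned size = 2; size <= 32; size *= 2) {
    const std::uint32_t elemMask =
        size == 32 ? 0xffffffffu : (std::uint32_t{1} << size) - 1;
    for (unsigned ones = 1; ones < size; ++ones) {
      const std::uint32_t run = (std::uint32_t{1} << ones) - 1;
      for (unsigned rot = 0; rot < size; ++rot) {
        // Rotate the run left by `rot` within the element.
        const std::uint32_t elem =
            rot == 0 ? run
                     : ((run << rot) | (run >> (size - rot))) & elemMask;
        // Repeat the element across all 32 bits.
        std::uint32_t candidate = 0;
        for (unsigned shift = 0; shift < 32; shift += size)
          candidate |= elem << shift;
        if (((candidate ^ imm) & demanded) == 0) {
          found = candidate;
          return true;
        }
      }
    }
  }
  return false;
}
```

For 0x19 with the low five bits demanded, 0x99999999 is one value that satisfies the constraint: it is the 4-bit element 0b1001 repeated, which matches the "copy bits 1-3 to bits 5-7" observation above.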

Rewrote the algorithm for searching for a bitmask immediate based on Tim's feedback.

I had to make changes to DAGCombiner::visitAND because it was transforming the DAG in a way that made it impossible to do any optimization in optimizeConstant (the code here was originally committed in r97616). This change doesn't seem to have any noticeable impact on performance, but I'm still investigating.

Sorry for not updating the patch for a long time. Here is my new patch.

For the most part, the new patch takes the same approach as the previous patch to find an immediate operand, but there were a couple of changes made.

- Function optimizeLogicalImm emits AArch64 machine nodes to prevent the target-independent dagcombine from undoing the optimization. With this change, there is no need to make changes in DAGCombiner::visitOR as I did in my previous patch.
- In optimizeLogicalImm, rotation is used to avoid using branches and simplify the logic.
- A target-specific function object for optimizing nodes with immediates is passed to the constructor of TargetLoweringOpt, which gets called later in TargetLoweringOpt::ShrinkDemandedConstant.

Herald added a subscriber: rengolin. May 19 2015, 1:51 PM

Rebase and make a couple of changes:

- Add comments.
- In optimizeLogicalImm, return false instead of true when the immediate is already a bimm32 or bimm64 so that it doesn't inhibit the optimization done later in AArch64ISelDAGToDAG.cpp that emits BFXIL.
- Remove redundant instructions in bitreverse.ll.

Some drive-by comments; I haven't looked into the optimizeLogicalImm logic yet.

include/llvm/Target/TargetLowering.h
2252	Add more detail on when this is called and what the parameters are?
2278–2281	I don't think this is the best location for this: I'd rather have TargetLoweringOpt be "the result of an optimization", and TargetLowering be "how to do optimizations". What do you think of going back to the TLI virtual hook, and fixing the other TLO methods to do the same: https://reviews.llvm.org/differential/diff/73495/
lib/Target/AArch64/AArch64ISelLowering.cpp
745–746	Check this only once when initializing the std::function? (or leave it here if you remove the std::function, I suppose)
test/CodeGen/AArch64/optimize-imm.ll
2	Triple can be simplified to: -mtriple=aarch64--

ahatanak added inline comments. Oct 4 2016, 10:32 PM

include/llvm/Target/TargetLowering.h
2278–2281	I'm not sure what TargetLoweringOpt is supposed to do, but using TLI virtual hooks looks like a better approach.

Address review comments.

Herald added a reviewer: tstellarAMD. Oct 5 2016, 3:21 PM

Herald added subscribers: nhaehnle, arsenm.

ahatanak marked 2 inline comments as done. Oct 5 2016, 3:22 PM

Fix comment.

Herald edited edge metadata. Oct 5 2016, 7:21 PM

Herald added a subscriber: wdng.

Rebase. Move TargetLoweringOpt::SimplifyDemandedBits into TargetLowering.

Herald edited edge metadata. Nov 16 2016, 5:28 PM

Rebase.

Herald added a subscriber: igorb. Jan 24 2017, 3:46 PM

AsafBadouh added a subscriber: AsafBadouh. Jan 25 2017, 1:05 AM

evandro added a subscriber: evandro. Jan 27 2017, 1:18 PM

mcrosier added inline comments. Apr 12 2017, 11:19 AM

lib/Target/AArch64/AArch64ISelLowering.cpp
806	You should use AArch64_AM::isLogicalImmediate() here.
901	If you put this after the below switch, you'll guarantee Op has operand 1.
907	Can't we early exit if we demand all of the bits? E.g., if (Demanded.countPopulation() == Size) return false;
922	Default case goes at top of switch, per coding guidelines.
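The early-exit suggestion can be illustrated with a standalone sketch. It uses a plain 64-bit integer rather than LLVM's APInt, and the helper name is hypothetical: if every bit of the operation is demanded, no bit of the immediate is free to change, so there is nothing to optimize.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Returns true when all `size` low bits of `demanded` are set, which is the
// condition under which the immediate cannot be altered at all.
inline bool allBitsDemanded(std::uint64_t demanded, unsigned size) {
  assert((size == 32 || size == 64) && "size must be 32 or 64");
  const std::uint64_t mask =
      size == 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << size) - 1;
  return std::bitset<64>(demanded & mask).count() == size;
}
```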

FWIW, I think this is in pretty good shape. Also, I ran correctness tests on everything I've got (e.g., llvm-ts, SPEC200X, internal tests) and saw no correctness failures.

mcrosier added a reviewer: mcrosier. Apr 12 2017, 2:17 PM

Address Chad's comments.

Thanks for working on this, Akira. I believe you've addressed all of my concerns as well as Tim's and Arnaud's. LGTM, assuming you've done the due diligence to ensure this doesn't dramatically increase compile-time.

This revision is now accepted and ready to land. Apr 17 2017, 10:18 AM

I compiled the files in MultiSource/Applications of the test suite and didn't see any measurable increase in compile time. I'll commit this patch today.

Thank you for the review.

Thanks for working on this patch and for being *extremely* patient with the 2.5 year review, Akira!

Closed by commit rL300913: [AArch64] Improve code generation for logical instructions taking (authored by ahatanak). Apr 20 2017, 4:00 PM

This revision was automatically updated to reflect the committed changes.

Diff 14357

include/llvm/Target/TargetLowering.h

Context not available.
                            APInt &KnownZero, APInt &KnownOne,
                            TargetLoweringOpt &TLO, unsigned Depth = 0) const;

+  /// Do a target-specific optimization of Op's constant immediate operand using
+  /// the demanded bits information. For example, targets can sign-extend the
+  /// immediate in order to have it folded into the immediate field of the
+  /// instruction. Return true if the immediate node doesn't need further
+  /// optimization and set the New and Old fields of LTO if a new immediate
+  /// node was created.
+  virtual bool OptimizeConstant(SDValue Op, const APInt &Demanded,
+                                TargetLoweringOpt &TLO) const;

   /// Determine which of the bits specified in Mask are known to be either zero
   /// or one and return them in the KnownZero/KnownOne bitsets.
   virtual void computeKnownBitsForTargetNode(const SDValue Op,
Context not available.

lib/CodeGen/SelectionDAG/TargetLowering.cpp

Context not available.
                                    const APInt &Demanded) {
   SDLoc dl(Op);

+  if (Op.getOpcode() == ISD::XOR) {
+    ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op.getOperand(1));
+
+    // If an XOR already has all the bits set, nothing to change but don't
+    // shrink either.
+    if (!C || (C->getAPIntValue() | (~Demanded)).isAllOnesValue())
+      return false;
+  }
+
+  assert(!Old.getNode() && !New.getNode());
+
+  // Return if the constant operand doesn't need further optimization.
+  if (DAG.getTargetLoweringInfo().OptimizeConstant(Op, Demanded, *this))
+    return New.getNode();

   // FIXME: ISD::SELECT, ISD::SELECT_CC
   switch (Op.getOpcode()) {
   default: break;
Context not available.
     ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op.getOperand(1));
     if (!C) return false;

-    if (Op.getOpcode() == ISD::XOR &&
-        (C->getAPIntValue() | (~Demanded)).isAllOnesValue())
-      return false;

     // if we can expand it to have all bits set, do it
     if (C->getAPIntValue().intersects(~Demanded)) {
       EVT VT = Op.getValueType();
Context not available.
   return false;
 }

+bool TargetLowering::OptimizeConstant(SDValue Op, const APInt &Demanded,
+                                      TargetLoweringOpt &TLO) const {
+  return false;
+}

 /// computeKnownBitsForTargetNode - Determine which of the bits specified
 /// in Mask are known to be either zero or one and return them in the
 /// KnownZero/KnownOne bitsets.
Context not available.

lib/Target/AArch64/AArch64ISelLowering.h

Context not available.
   /// Selects the correct CCAssignFn for a given CallingConvention value.
   CCAssignFn *CCAssignFnForCall(CallingConv::ID CC, bool IsVarArg) const;

+  bool OptimizeConstant(SDValue Op, const APInt &Demanded,
+                        TargetLoweringOpt &TLO) const override;

   /// computeKnownBitsForTargetNode - Determine which of the bits specified in
   /// Mask are known to be either zero or one and return them in the
   /// KnownZero/KnownOne bitsets.
Context not available.

lib/Target/AArch64/AArch64ISelLowering.cpp

Context not available.
   return VT.changeVectorElementTypeToInteger();
 }

+bool AArch64TargetLowering::OptimizeConstant(SDValue Op, const APInt &Demanded,
+                                             TargetLoweringOpt &TLO) const {
+  // Delay this optimization to as late as possible.
+  if (!TLO.LegalOps)
+    return false;
+
+  switch (Op.getOpcode()) {
+  default: break;
+  case ISD::AND:
+  case ISD::OR:
+  case ISD::XOR:
+    ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op.getOperand(1));
+
+    if (!C)
+      break;
+
+    uint64_t Enc;
+    EVT VT = Op.getValueType();
+    unsigned Size = std::max(VT.getSizeInBits(), 32u);
+    assert(Size > 0 && Size <= 64);
+    uint64_t Mask = ((uint64_t)(-1LL) >> (64 - Size));
+    int64_t Imm = C->getSExtValue();
+
+    // Return if the immediate is already a bimm32 or bimm64.
+    if (AArch64_AM::processLogicalImmediate(Imm & Mask, Size, Enc))
+      return true;
+
+    // Try sign-extending the immediate and see if we can turn it into a bimm32
+    // or bimm64.
+    unsigned LZ = Demanded.countLeadingZeros() + (64 - Demanded.getBitWidth());
+
+    if (LZ == 0 || LZ == 64)
+      break;
+
+    int64_t NewImm = (Imm << LZ) >> LZ;
+
+    if (NewImm == Imm ||
+        !AArch64_AM::processLogicalImmediate(NewImm & Mask, Size, Enc))
+      break;
+
+    // Create the new constant immediate node.
+    SDValue New = TLO.DAG.getNode(Op.getOpcode(), SDLoc(Op), VT,
+                                  Op.getOperand(0),
+                                  TLO.DAG.getConstant(NewImm, VT));
+    return TLO.CombineTo(Op, New);
+  }
+
+  return false;
+}

 /// computeKnownBitsForTargetNode - Determine which of the bits specified in
 /// Mask are known to be either zero or one and return them in the
 /// KnownZero/KnownOne bitsets.
Context not available.

test/CodeGen/AArch64/optimize-imm.ll

This file was added.

+; RUN: llc -o - %s -march=aarch64 | FileCheck %s
+
+; CHECK-LABEL: _and1:
+; CHECK: and {{w[0-9]+}}, w0, #0xfffffffd
+
+define void @and1(i32 %a, i8* nocapture %p) {
+entry:
+  %and = and i32 %a, 253
+  %conv = trunc i32 %and to i8
+  store i8 %conv, i8* %p, align 1
+  ret void
+}
+
+; Make sure we don't shrink or optimize an XOR's immediate operand if the
+; immediate is -1. Instruction selection turns (and ((xor $mask, -1), $v0)) into
+; a BIC.
+
+; CHECK-LABEL: _xor1:
+; CHECK: orr [[R0:w[0-9]+]], wzr, #0x38
+; CHECK: bic {{w[0-9]+}}, [[R0]], w0, lsl #3
+
+define i32 @xor1(i32 %a) {
+entry:
+  %shl = shl i32 %a, 3
+  %xor = and i32 %shl, 56
+  %and = xor i32 %xor, 56
+  ret i32 %and
+}

This is an archive of the discontinued LLVM Phabricator instance.

AArch64: Fold immediate into the immediate field of logical instructions
Closed, Public
