This is an archive of the discontinued LLVM Phabricator instance.

Differential D51780

ARM: align loops to 4 bytes on Cortex-M3 and Cortex-M4.
ClosedPublic

Authored by t.p.northover on Sep 7 2018, 5:23 AM.


Details

Reviewers

javed.absar
SjoerdMeijer
dmgreen
t.p.northover

Summary

The Technical Reference Manuals for these two CPUs state that branching to an unaligned 32-bit instruction incurs an extra pipeline reload penalty. That's bad.

I also enable the optimization at -Os for just these two CPUs. My impression has been that it's a bit of a gamble with the bigger cores, and it also wastes more space. But for these two we're getting 1 cycle per iteration in return for 1 byte per loop (on average); that seems like it definitely fits into LLVM's quirky definition of -Os.
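The "1 byte per loop (on average)" figure can be checked with a little arithmetic. The sketch below assumes Thumb code, where every instruction starts on a 2-byte boundary, and assumes a loop header is equally likely to land at offset 0 or 2 modulo 4 (that uniformity is an assumption of this sketch, not something measured in this review). The helper names are hypothetical.

```cpp
#include <cstdint>

// Bytes of padding needed to move Offset up to the next multiple of Align.
// Align is assumed to be a power of two.
constexpr uint32_t paddingFor(uint32_t Offset, uint32_t Align) {
  return (Align - (Offset % Align)) % Align;
}

// Average padding in bytes when aligning to 4 bytes, taken over the two
// offsets (0 and 2, modulo 4) that a 2-byte-aligned Thumb instruction can
// occupy. Assumes both offsets are equally likely.
constexpr double averageThumbPadding() {
  return (paddingFor(0, 4) + paddingFor(2, 4)) / 2.0;
}

static_assert(paddingFor(0, 4) == 0, "already aligned: no padding");
static_assert(paddingFor(2, 4) == 2, "misaligned by 2: one 16-bit NOP");
static_assert(averageThumbPadding() == 1.0, "average of 0 and 2 bytes");
```

Under those assumptions, half of all loop headers need no padding and the other half need a single 16-bit NOP, which averages to 1 byte.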

I'm open to extending it to other processors, but my research indicates Cortex-M0 is too simple to benefit (it claims conditional branches are always 3 cycles if taken), and Cortex-M7 has no performance documentation.

Diff Detail

Repository: rL LLVM

Event Timeline

t.p.northover created this revision. Sep 7 2018, 5:23 AM

Herald added a reviewer: javed.absar. Sep 7 2018, 5:23 AM

Herald added subscribers: chrib, hiraditya, kristof.beyls, mcrosier.

kristof.beyls added a reviewer: SjoerdMeijer. Sep 7 2018, 5:38 AM

Good stuff. The downstream version of this we have uses Subtarget->isMClass() && Subtarget->hasThumb2(). (Don't ask me why it's downstream, I'm not sure. It's from before my time.) I think you are right about the M7 though, the results there are just different, not necessarily better.

We should also be aligning functions?

And we should include the M33 and its derivatives (but like you said, not M0+/M23)

llvm/lib/CodeGen/MachineBlockPlacement.cpp
2500–2501	This seems like an odd change to make at Os. It, by definition, increases code size. Like you said, this might be an llvm quirk though. Do you have a link to llvm's definition of -Os?
llvm/lib/Target/ARM/ARM.td
950	What do you think of using FeatureAlignBranchTargets or something like it?
llvm/lib/Target/ARM/ARMISelLowering.cpp
1204	Here with this gubbins may be a better place for setting the loop alignment.

> We should also be aligning functions?

Interesting question. There are a couple of reasons to think the tradeoffs are different there. First, I'd probably (based purely on intuition rather than data) expect a loop to be executed more times than most functions. Second, when r7 is the FP, a function is almost guaranteed to start with a 16-bit instruction (push {..., r7, lr}).

I do have a captive embedded developer who cares very much about individual cycles at the moment though, so I'll ask him to benchmark it.

Anyway, thanks for commenting. I'll get started on the obvious changes.

llvm/lib/CodeGen/MachineBlockPlacement.cpp
2500–2501	It's mostly in old threads, since our manuals are not great and a bit generic: http://clang-developers.42468.n3.nabble.com/RFC-Codifying-but-not-formalizing-the-optimization-levels-in-LLVM-and-Clang-tt4029685.html#a4029741 http://clang-developers.42468.n3.nabble.com/Meaning-of-LLVM-optimization-levels-td4032493.html#a4032496 The first, particularly, is the understanding I've got.
llvm/lib/Target/ARM/ARM.td
950	An excellent idea.
llvm/lib/Target/ARM/ARMISelLowering.cpp
1204	Yep.

> Interesting question. There are a couple of reasons to think the tradeoffs are different there. First, I'd probably (based purely on intuition rather than data) expect a loop to be executed more times than most functions. Second, when r7 is the FP, a function is almost guaranteed to start with a 16-bit instruction (push {..., r7, lr}).

The counterargument would be that with function alignment, you don't need to execute the added NOP. For loops, there's always the small chance that you could be running an outer loop more than the inner one, leading to executed NOPs that each take a cycle (unless it's the M33 and you get lucky with its dual issue). If you don't care about the code size, I think function alignment will always be a win because the extra padding is just in unused space.

But let us know how the benchmarks look.

I think I've implemented the suggested changes, except for the function alignment one.

The benchmarks came back as about a 0.2% difference in cycle count, and (crucially) there's no way when deciding function alignment to check for OptSize so we'd inevitably pessimize some cases.

> The benchmarks came back as about a 0.2% difference in cycle count, and (crucially) there's no way when deciding function alignment to check for OptSize so we'd inevitably pessimize some cases.

I think this is what getPrefFunctionAlignment is for? As opposed to MinFunctionAlignment? I agree that that's a different issue though, and doesn't need to be done with this.

Although I personally think the definition of Os is a bit odd, and this is increasing codesize, everyone else I've talked to agreed with you that this is fine. Like you said, in (almost) all cases we get a performance win for the bytes we spend.

So LGTM!

This revision is now accepted and ready to land. Sep 12 2018, 10:39 AM

Thanks. Committed as r342127.

dmgreen closed this revision. Oct 18 2018, 1:41 AM

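Revision Contents

Path	Size
llvm/include/llvm/CodeGen/TargetLowering.h	6 lines
llvm/lib/CodeGen/MachineBlockPlacement.cpp	3 lines
llvm/lib/Target/ARM/ARM.td	6 lines
llvm/lib/Target/ARM/ARMISelLowering.h	2 lines
llvm/lib/Target/ARM/ARMISelLowering.cpp	7 lines
llvm/lib/Target/ARM/ARMSubtarget.h	7 lines
llvm/test/CodeGen/ARM/loop-align-cortex-m.ll	49 lines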

Diff 165056

llvm/include/llvm/CodeGen/TargetLowering.h

   unsigned getPrefFunctionAlignment() const {
     return PrefFunctionAlignment;
   }

   /// Return the preferred loop alignment.
   virtual unsigned getPrefLoopAlignment(MachineLoop *ML = nullptr) const {
     return PrefLoopAlignment;
   }

+  /// Should loops be aligned even when the function is marked OptSize (but not
+  /// MinSize).
+  virtual bool alignLoopsWithOptSize() const {
+    return false;
+  }
+
   /// If the target has a standard location for the stack protector guard,
   /// returns the address of that location. Otherwise, returns nullptr.
   /// DEPRECATED: please override useLoadStackGuardNode and customize
   /// LOAD_STACK_GUARD, or customize \@llvm.stackguard().
   virtual Value *getIRStackGuard(IRBuilder<> &IRB) const;

   /// Inserts necessary declarations for SSP (stack protection) purpose.
   /// Should be used only when getIRStackGuard returns nullptr.

llvm/lib/CodeGen/MachineBlockPlacement.cpp

 }

 void MachineBlockPlacement::alignBlocks() {
   // Walk through the backedges of the function now that we have fully laid out
   // the basic blocks and align the destination of each backedge. We don't rely
   // exclusively on the loop info here so that we can align backedges in
   // unnatural CFGs and backedges that were introduced purely because of the
   // loop rotations done during this layout pass.
-  if (F->getFunction().optForSize())
+  if (F->getFunction().optForMinSize() ||
+      (F->getFunction().optForSize() && !TLI->alignLoopsWithOptSize()))
     return;
   BlockChain &FunctionChain = *BlockToChain[&F->front()];
   if (FunctionChain.begin() == FunctionChain.end())
     return; // Empty chain.

   const BranchProbability ColdProb(1, 5); // 20%
   BlockFrequency EntryFreq = MBFI->getBlockFreq(&F->front());
   BlockFrequency WeightedEntryFreq = EntryFreq * ColdProb;
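The early-return condition that this diff introduces in alignBlocks() can be read as a small truth table. The following is a minimal standalone sketch of that condition, not the LLVM code itself: the struct and function names are hypothetical stand-ins, with `OptForMinSize` and `OptForSize` modelling the results of `optForMinSize()` and `optForSize()`, and `TargetAlignsAtOptSize` modelling `TLI->alignLoopsWithOptSize()`.

```cpp
// Hypothetical stand-in for the two size-related queries on a function.
struct SizeQueries {
  bool OptForMinSize; // models F->getFunction().optForMinSize()
  bool OptForSize;    // models F->getFunction().optForSize()
};

// Returns true when loop alignment should be skipped, mirroring the
// condition in the diff: always skip at MinSize, and skip at OptSize
// unless the target has opted in.
constexpr bool skipLoopAlignment(SizeQueries F, bool TargetAlignsAtOptSize) {
  return F.OptForMinSize || (F.OptForSize && !TargetAlignsAtOptSize);
}

// MinSize wins even when the target opts in.
static_assert(skipLoopAlignment({true, true}, true), "minsize: skip");
// OptSize on a target that opts in: loops are still aligned.
static_assert(!skipLoopAlignment({false, true}, true), "optsize, opted in");
// OptSize on a target that does not opt in: previous behaviour is kept.
static_assert(skipLoopAlignment({false, true}, false), "optsize, default");
// No size attribute: alignment proceeds as before.
static_assert(!skipLoopAlignment({false, false}, false), "no size attrs");
```

The only row whose outcome changes relative to the old `optForSize()` check is the second one, which is the case the summary describes as enabling the optimization at -Os.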

llvm/lib/Target/ARM/ARM.td

 def FeatureVMLxForwarding : SubtargetFeature<"vmlx-forwarding",
                                              "HasVMLxForwarding", "true",
                                              "Has multiplier accumulator forwarding">;

 // Disable 32-bit to 16-bit narrowing for experimentation.
 def FeaturePref32BitThumb : SubtargetFeature<"32bit", "Pref32BitThumb", "true",
                                              "Prefer 32-bit Thumb instrs">;

+def FeaturePrefLoopAlign32 : SubtargetFeature<"loop-align", "PrefLoopAlignment","2",
+                                              "Prefer 32-bit alignment for loops">;
+
 /// Some instructions update CPSR partially, which can add false dependency for
 /// out-of-order implementation, e.g. Cortex-A9, unless each individual bit is
 /// mapped to a separate physical register. Avoid partial CPSR update for these
 /// processors.
 def FeatureAvoidPartialCPSR : SubtargetFeature<"avoid-partial-cpsr",
                                                "AvoidCPSRPartialUpdate", "true",
                                                "Avoid CPSR partial update for OOO execution">;

@@
 def : ProcessorModel<"cortex-r8", CortexA8Model, [ARMv7r,
                                                   FeatureMP,
                                                   FeatureSlowFPBrcc,
                                                   FeatureHWDivARM,
                                                   FeatureHasSlowFPVMLx,
                                                   FeatureAvoidPartialCPSR]>;

 def : ProcessorModel<"cortex-m3", CortexM3Model, [ARMv7m,
                                                   ProcM3,
+                                                  FeaturePrefLoopAlign32,
                                                   FeatureHasNoBranchPredictor]>;

 def : ProcessorModel<"sc300", CortexM3Model, [ARMv7m,
                                               ProcM3,
                                               FeatureHasNoBranchPredictor]>;

 def : ProcessorModel<"cortex-m4", CortexM3Model, [ARMv7em,
                                                   FeatureVFP4,
                                                   FeatureVFPOnlySP,
                                                   FeatureD16,
+                                                  FeaturePrefLoopAlign32,
                                                   FeatureHasNoBranchPredictor]>;

 def : ProcNoItin<"cortex-m7", [ARMv7em,
                                FeatureFPARMv8,
                                FeatureD16]>;

 def : ProcNoItin<"cortex-m23", [ARMv8mBaseline,
                                 FeatureNoMovt]>;

 def : ProcessorModel<"cortex-m33", CortexM3Model, [ARMv8mMainline,
                                                    FeatureDSP,
                                                    FeatureFPARMv8,
                                                    FeatureD16,
                                                    FeatureVFPOnlySP,
+                                                   FeaturePrefLoopAlign32,
                                                    FeatureHasNoBranchPredictor]>;

 def : ProcNoItin<"cortex-a32", [ARMv8a,
                                 FeatureHWDivThumb,
                                 FeatureHWDivARM,
                                 FeatureCrypto,
                                 FeatureCRC]>;

llvm/lib/Target/ARM/ARMISelLowering.h

@@ public:
   CCAssignFn *CCAssignFnForReturn(CallingConv::ID CC, bool isVarArg) const;

   /// Returns true if \p VecTy is a legal interleaved access type. This
   /// function checks the vector element type and the overall width of the
   /// vector.
   bool isLegalInterleavedAccessType(VectorType *VecTy,
                                     const DataLayout &DL) const;

+  bool alignLoopsWithOptSize() const override;
+
   /// Returns the number of interleaved accesses that will be generated when
   /// lowering accesses of the given type.
   unsigned getNumInterleavedAccesses(VectorType *VecTy,
                                      const DataLayout &DL) const;

   void finalizeLowering(MachineFunction &MF) const override;

   /// Return the correct alignment for the current calling convention.

llvm/lib/Target/ARM/ARMISelLowering.cpp

@@ ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM,

   // On ARM arguments smaller than 4 bytes are extended, so all arguments
   // are at least 4 bytes aligned.
   setMinStackArgumentAlignment(4);

   // Prefer likely predicted branches to selects on out-of-order cores.
   PredictableSelectIsExpensive = Subtarget->getSchedModel().isOutOfOrder();

+  setPrefLoopAlignment(Subtarget->getPrefLoopAlignment());
+
   setMinFunctionAlignment(Subtarget->isThumb() ? 1 : 2);
 }

 bool ARMTargetLowering::useSoftFloat() const {
   return Subtarget->useSoftFloat();
 }

 // FIXME: It might make sense to define the representative register class as the
 // nearest super-register that has a non-null superset. For example, DPR_VFP2 is
@@ Value *ARMTargetLowering::emitStoreConditional(IRBuilder<> &Builder, Value *Val,
   Function *Strex = Intrinsic::getDeclaration(M, Int, Tys);

   return Builder.CreateCall(
       Strex, {Builder.CreateZExtOrBitCast(
                   Val, Strex->getFunctionType()->getParamType(0)),
               Addr});
 }

+
+bool ARMTargetLowering::alignLoopsWithOptSize() const {
+  return Subtarget->isMClass();
+}
+
 /// A helper function for determining the number of interleaved accesses we
 /// will generate when lowering accesses of the given type.
 unsigned
 ARMTargetLowering::getNumInterleavedAccesses(VectorType *VecTy,
                                              const DataLayout &DL) const {
   return (DL.getTypeSizeInBits(VecTy) + 127) / 128;
 }

llvm/lib/Target/ARM/ARMSubtarget.h

@@ protected:
   /// What kind of timing do load multiple/store multiple have (double issue,
   /// single issue etc).
   ARMLdStMultipleTiming LdStMultipleTiming = SingleIssue;

   /// The adjustment that we need to apply to get the operand latency from the
   /// operand cycle returned by the itinerary data for pre-ISel operands.
   int PreISelOperandLatencyAdjustment = 2;

+  /// What alignment is preferred for loop bodies, in log2(bytes).
+  unsigned PrefLoopAlignment = 0;
+
   /// IsLittle - The target is Little Endian
   bool IsLittle;

   /// TargetTriple - What processor and OS we're targeting.
   Triple TargetTriple;

   /// SchedModel - Processor specific instruction costs.
   MCSchedModel SchedModel;
@@ public:
   }

   /// Allow movt+movw for PIC global address calculation.
   /// ELF does not have GOT relocations for movt+movw.
   /// ROPI does not use GOT.
   bool allowPositionIndependentMovt() const {
     return isROPI() || !isTargetELF();
   }
+
+  unsigned getPrefLoopAlignment() const {
+    return PrefLoopAlignment;
+  }
 };

 } // end namespace llvm

 #endif // LLVM_LIB_TARGET_ARM_ARMSUBTARGET_H
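Because PrefLoopAlignment is documented as log2(bytes), the value "2" set by FeaturePrefLoopAlign32 in ARM.td corresponds to 4-byte alignment, which is what the `.p2align 2` check in the test expects. A minimal sketch of that encoding follows; the helper names are hypothetical and are not LLVM APIs.

```cpp
#include <cstdint>

// Convert a log2-encoded alignment into a byte count.
constexpr uint32_t alignmentInBytes(uint32_t Log2Align) {
  return 1u << Log2Align;
}

// Round Addr up to the next address that satisfies the log2-encoded
// alignment. An address that is already aligned is returned unchanged.
constexpr uint64_t roundUpToAlignment(uint64_t Addr, uint32_t Log2Align) {
  return (Addr + alignmentInBytes(Log2Align) - 1) &
         ~(static_cast<uint64_t>(alignmentInBytes(Log2Align)) - 1);
}

static_assert(alignmentInBytes(0) == 1, "default of 0 means no extra alignment");
static_assert(alignmentInBytes(2) == 4, "\"2\" in ARM.td means 4 bytes");
static_assert(roundUpToAlignment(0x102, 2) == 0x104, "2 bytes of padding");
static_assert(roundUpToAlignment(0x104, 2) == 0x104, "already aligned");
```

The default of 0 for PrefLoopAlignment therefore means 1-byte alignment, so subtargets without the feature see no change.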

llvm/test/CodeGen/ARM/loop-align-cortex-m.ll

This file was added.

; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m3 -o - | FileCheck %s
; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m4 -o - | FileCheck %s
; RUN: llc -mtriple=thumbv7m-none-eabi %s -mcpu=cortex-m33 -o - | FileCheck %s

define void @test_loop_alignment(i32* %in, i32* %out) optsize {
; CHECK-LABEL: test_loop_alignment:
; CHECK: movs {{r[0-9]+}}, #0
; CHECK: .p2align 2

entry:
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %in.addr = getelementptr inbounds i32, i32* %in, i32 %i
  %lhs = load i32, i32* %in.addr, align 4
  %res = mul nsw i32 %lhs, 5
  %out.addr = getelementptr inbounds i32, i32* %out, i32 %i
  store i32 %res, i32* %out.addr, align 4
  %i.next = add i32 %i, 1
  %done = icmp eq i32 %i.next, 1024
  br i1 %done, label %end, label %loop

end:
  ret void
}

define void @test_loop_alignment_minsize(i32* %in, i32* %out) minsize {
; CHECK-LABEL: test_loop_alignment_minsize:
; CHECK: movs {{r[0-9]+}}, #0
; CHECK-NOT: .p2align

entry:
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %in.addr = getelementptr inbounds i32, i32* %in, i32 %i
  %lhs = load i32, i32* %in.addr, align 4
  %res = mul nsw i32 %lhs, 5
  %out.addr = getelementptr inbounds i32, i32* %out, i32 %i
  store i32 %res, i32* %out.addr, align 4
  %i.next = add i32 %i, 1
  %done = icmp eq i32 %i.next, 1024
  br i1 %done, label %end, label %loop

end:
  ret void
}
