This is an archive of the discontinued LLVM Phabricator instance.

Differential D113798

Add loop unrolling and peeling preferences for RISCV
ClosedPublic

Authored by mcberg2021 on Nov 12 2021, 12:46 PM.

Details

Reviewers

craig.topper
frasercrmck
evandro
jrtc27

Commits

rGf95ee6074aae: [RISCV] Add target specific loop unrolling and peeling preferences
rG8487981a7249: [RISCV] Add target specific loop unrolling and peeling preferences

Summary

This change adds initial support for both preference helper functions. The loop unrolling preferences start with settings that control the thresholds, size and attributes of loops to unroll, with some tuning already done. The peeling preferences may need tuning as well, since the initial support looks much like what other architectures use. An unrolling test is also added for RISCV to track how the preferences modify and control loop unrolling.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mcberg2021 created this revision. Nov 12 2021, 12:46 PM

Herald added subscribers: VincentWu, luke957, achieveartificialintelligence and 25 others. Nov 12 2021, 12:46 PM

mcberg2021 requested review of this revision. Nov 12 2021, 12:46 PM

Herald added a project: Restricted Project. Nov 12 2021, 12:46 PM

Herald added subscribers: llvm-commits, MaskRay.

craig.topper added inline comments. Nov 12 2021, 1:05 PM

llvm/test/CodeGen/RISCV/unroll.ll
9 ↗	(On Diff #386925)	Drop the FunctionAttrs comment and the dso_local and local_unnamed_addr
169 ↗	(On Diff #386925)	I doubt we need all these attributes and metadata

craig.topper added inline comments. Nov 12 2021, 1:06 PM

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
204	The MVE in this comment appears to have been copied from ARM.

This test screams of "I took C and shoved it into a test case" rather than actually taking the time to distill it down to a minimal example of IR free from irrelevant clutter

llvm/test/CodeGen/RISCV/unroll.ll
2 ↗	(On Diff #386925)	opt test in CodeGen?

jrtc27 added inline comments. Nov 12 2021, 1:08 PM

llvm/test/CodeGen/RISCV/unroll.ll
2 ↗	(On Diff #386925)	(this belongs in llvm/test/Transforms/LoopUnroll/RISCV)

Updated as per feedback.

mcberg2021 marked an inline comment as done. Nov 12 2021, 1:45 PM

mcberg2021 marked 4 inline comments as done.

mcberg2021 added inline comments.

llvm/test/CodeGen/RISCV/unroll.ll
2 ↗	(On Diff #386925)	Test moved to LoopUnroll/RISCV

Missed an update...

Harbormaster completed remote builds in B134028: Diff 386946. Nov 12 2021, 3:30 PM

jrtc27 added inline comments. Nov 13 2021, 12:18 PM

llvm/test/Transforms/LoopUnroll/RISCV/unroll.ll
3	Use -mtriple=riscv64 unless it's genuinely OS-dependent (which this is not)
5	I doubt you need all these pointer attributes
147	These comments do nothing
147	This would also be more natural as the final block in the function; presumably the current block schedule is an artefact of the order in which various optimisations happened on the original C and IR
164	Is the TBAA actually needed?

Updated with simplifications and formatting.

Marked tasks as done.

jrtc27 added inline comments. Nov 14 2021, 12:25 PM

llvm/test/Transforms/LoopUnroll/RISCV/unroll.ll
147	This comment is still there
164	!0 is only used self-referentially; only !1 is referenced from outside of the metadata itself. So I don't think this distinct does anything.
165	Is this needed (same for the !llvm.loop)? If yes, just inline it like !llvm.loop, if no delete them.

Harbormaster completed remote builds in B134164: Diff 387122. Nov 14 2021, 12:51 PM

More cleanup

mcberg2021 marked 3 inline comments as done. Nov 14 2021, 3:44 PM

mcberg2021 added a reviewer: jrtc27. Nov 14 2021, 4:22 PM

Harbormaster completed remote builds in B134180: Diff 387141. Nov 14 2021, 4:31 PM

frasercrmck added inline comments. Nov 15 2021, 2:00 AM

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
204	Is this truly checking "vectorized loops" or just loops containing vector instructions? We've already checked that the loop isn't vectorized according to metadata. What about code written using RVV intrinsics, or with OpenCL/SYCL/etc? We might want to unroll those loops, right?
217	Does the explicit size of `4` help much or should we just use `SmallVector<const Value*>`?

khchen added a subscriber: khchen. Nov 15 2021, 4:34 AM

mcberg2021 added inline comments. Nov 15 2021, 9:10 AM

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
204	I think for now I am going to mark this with a TODO for more tuning, I updated the comment for vectorized instructions, it will be uploaded soon...
217	This setting mirrors SLP's generic vector operand setting, which we utilize, so it does seem appropriate.

craig.topper added inline comments. Nov 15 2021, 9:11 AM

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
204	This part of the patch came from a change I made internally. I believe this entire loop was just naively copied from the ARM target.
217	This was also copied from ARM.

Updated comments as needed.

mcberg2021 marked 4 inline comments as done. Nov 15 2021, 9:20 AM

craig.topper added inline comments. Nov 15 2021, 9:21 AM

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
217	This shouldn't be mirroring anything. The ARM code I copied from pre-dates SmallVector's ability to automatically pick a value for the second template parameter. I think it will automatically pick a value more than 4. So I think the second parameter can be removed.
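
As a rough illustration of the point about the second template parameter, the sketch below models how a default inline element count could be derived when `N` is omitted. It is a standalone model, not LLVM code: `HeaderModel`, `PreferredSizeofModel` and `defaultInlineElements` are hypothetical names, and the 64-byte target and the pointer-plus-two-32-bit-fields header layout are assumptions about how `SmallVector` behaves.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the bookkeeping part of a SmallVector: a data
// pointer plus 32-bit size and capacity fields. This layout is an assumption.
struct HeaderModel {
  void *BeginX;
  std::uint32_t Size;
  std::uint32_t Capacity;
};

// Assumed target for sizeof(SmallVector<T>) when N is left unspecified.
constexpr std::size_t PreferredSizeofModel = 64;

// Inline element count such a heuristic would choose: use whatever space is
// left after the header, but always keep room for at least one element.
template <typename T>
constexpr std::size_t defaultInlineElements() {
  return (PreferredSizeofModel - sizeof(HeaderModel)) / sizeof(T) == 0
             ? 1
             : (PreferredSizeofModel - sizeof(HeaderModel)) / sizeof(T);
}
```

Under these assumptions a pointer-sized element such as `const Value *` gets more than 4 inline slots on common 32-bit and 64-bit hosts, which is consistent with the suggestion to drop the explicit `4`.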

Removed size constraint on initialization of Operands to be consumed in getUserCost.

mcberg2021 marked an inline comment as done. Nov 15 2021, 9:36 AM

Harbormaster completed remote builds in B134282: Diff 387298. Nov 15 2021, 10:09 AM

mcberg2021 updated this revision to Diff 387314. Nov 15 2021, 10:11 AM

Harbormaster completed remote builds in B134296: Diff 387314. Nov 15 2021, 10:55 AM

Are there any further concerns? If not can we progress towards approval?

LGTM

This revision is now accepted and ready to land. Nov 19 2021, 10:00 AM

Perhaps we should run this across a set of benchmarks we're interested in?

In D113798#3143299, @frasercrmck wrote:

Perhaps we should run this across a set of benchmarks we're interested in?

We've been using this internally on our SiFive 7 series and U8. Should we check the CPU?

In D113798#3143309, @craig.topper wrote:

In D113798#3143299, @frasercrmck wrote:

Perhaps we should run this across a set of benchmarks we're interested in?

We've been using this internally on our SiFive 7 series and U8. Should we check the CPU?

I know that @asb often runs benchmarks over prospective patches, so I thought he might have some thoughts about how we generally know this sort of thing is ready to go.

Ping, @asb can you take a look at this?

Sorry, I missed the previous ping and was out last week:

@luismarques reports this is performance neutral for Embench and Coremark on Ibex.
Finding representative benchmarks for anything like this is clearly difficult. If anyone has run e.g. SPEC on real RISC-V hardware, that would be interesting. Cases where the change is roughly performance neutral but may waste I$ might not show up on simple benchmarks.
Just as another datapoint, I ran this against the GCC torture suite. One case that stuck out to me was pr85169.c. It seems pretty unlikely that a huge number of unrolled stores of a zero byte is profitable (though maybe my intuition is wrong!). Could you please take a quick look at this case to see if there is any obvious tuning that can be done for it?

Otherwise, this looks good to me, and I don't think pr85169.c needs to be a blocker.

Alex, we have been using these unrolling preferences in house since mid summer for RISC-V. I will have a look at the outlier case too.

I altered the unrolling preferences for the indicated case; we meet the unroll criteria anyway, as the loop is small at the time of evaluation. Also, the abort had already been moved to the exit block by the time of the evaluation, so it is never encountered as a call in the loop. So I think we are OK for this case.

Closed by commit rG8487981a7249: [RISCV] Add target specific loop unrolling and peeling preferences (authored by mcberg2021). Dec 7 2021, 3:07 PM

This revision was automatically updated to reflect the committed changes.

mcberg2021 added a commit: rG8487981a7249: [RISCV] Add target specific loop unrolling and peeling preferences.

mcberg2021 added a reverting change: rG3e363f14e128: Revert "[RISCV] Add target specific loop unrolling and peeling preferences". Dec 7 2021, 3:14 PM

Please include why you're reverting a commit in the commit message, as per the developer policy https://llvm.org/docs/DeveloperPolicy.html#patch-reversion-policy

In D113798#3174400, @asb wrote:

@luismarques reports this is performance neutral for Embench and Coremark on Ibex.

I think that was with an old version of this patch. With the current patch (now reverted) the numbers are:

CoreMark O3: +11.4% perf, 2.88 times the size
CoreMark O2: +9.36% perf, 2.88 times the size
Embench O3: no perf change, 23.6% size increase
Embench O2: no perf change, 28.6% size increase

No changes for Os and Oz.
That's almost tripling the CoreMark size.

The initial checkin was reverted because a lit cfg was missing to exclude non target transform tests of RISCV loop unrolling, should be fixed shortly.

In D113798#3180380, @mcberg2021 wrote:

The initial checkin was reverted because a lit cfg was missing to exclude non target transform tests of RISCV loop unrolling, should be fixed shortly.

Let's discuss this patch tomorrow in the RISC-V sync-up call. Please don't merge it yet, even if that test issue is fixed.

luismarques reopened this revision. Dec 8 2021, 12:15 PM

This revision is now accepted and ready to land. Dec 8 2021, 12:15 PM

In D113798#3179913, @luismarques wrote:

In D113798#3174400, @asb wrote:

@luismarques reports this is performance neutral for Embench and Coremark on Ibex.

I think that was with an old version of this patch. With the current patch (now reverted) the numbers are:

CoreMark O3: +11.4% perf, 2.88 times the size
CoreMark O2: +9.36% perf, 2.88 times the size
Embench O3: no perf change, 23.6% size increase
Embench O2: no perf change, 28.6% size increase

No changes for Os and Oz.
That's almost tripling the CoreMark size.

CoreMark is a pretty tiny benchmark that fits in the L1 I$ of many processors, so that's probably not hugely surprising, though I suspect you could get a similar performance increase with a much more targeted segment of unrollings...

@luismarques and I were chatting about this patch some more. A few thoughts I'm writing down so we don't lose them (we should discuss on the call today too).

The key question is whether this unrolling should be enabled for all RISC-V targets or not. Looking at other backends:

AArch64: more aggressive unrolling options only enabled for in-order models
ARM: Most unrolling options only enabled for M-class cores

Is it your view that this transformation is worthwhile on all common RISC-V microarchitectures?

eopXD added a subscriber: eopXD. Dec 10 2021, 8:39 AM

A branch containing this patch was accidentally pushed to GitHub: https://github.com/llvm/llvm-project/tree/arcpatch-D113798

Can someone please remove it?

In D113798#3185992, @lbenes wrote:

A branch containing this patch was accidentally pushed to GitHub: https://github.com/llvm/llvm-project/tree/arcpatch-D113798

Can someone please remove it?

Gone.

mcberg2021 added a comment. Dec 15 2021, 2:11 PM

This comment was removed by mcberg2021.

The uploaded excel spreadsheet shows the difference between the default unroll preferences and the version presented in this review. The data is collected for Spec2k6 INT base.

unrolling_pref_rollup.xlsx (69 KB)

Updated as per request, SiFive SubTargets are now guarded and default preferences used otherwise.

craig.topper added inline comments. Dec 17 2021, 5:47 PM

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
175	I think this should use `getTuneCPU` not `getCPU`

Updated to use getTuneCPU in place of getCPU

mcberg2021 marked an inline comment as done. Dec 17 2021, 6:17 PM

Harbormaster completed remote builds in B139941: Diff 395249. Dec 17 2021, 6:52 PM

Looking at recent issues, the test/Bindings/Go failure is intermittent on premerge testing for most changes.

This revision was landed with ongoing or failed builds. Dec 18 2021, 12:55 PM

Closed by commit rGf95ee6074aae: [RISCV] Add target specific loop unrolling and peeling preferences (authored by mcberg2021).

This revision was automatically updated to reflect the committed changes.

mcberg2021 added a commit: rGf95ee6074aae: [RISCV] Add target specific loop unrolling and peeling preferences.

zixuan-wu added a subscriber: zixuan-wu. Dec 19 2021, 10:08 PM

zixuan-wu added inline comments.

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
179	How about getting UseDefaultPreferences flag from subtarget? And initialize the value in subtarget.

zixuan-wu added inline comments. Dec 22 2021, 1:55 AM

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
175	It's not a general principle to just enumerate tune cpus. Could we predict some feature or parameter from subtarget such as whether it's out-of-order. Or we need get UseDefaultPreferences value which has been initialized in subtarget directly.
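
The two suggestions above (line 179 and line 175) can be sketched in isolation as follows. This is only an illustration under stated assumptions: `TuneCPU`, `SubtargetModel`, `UseDefaultUnrollPreferences` and the helper functions are hypothetical stand-ins, not the real `RISCVSubtarget` interface, and an enum replaces the string matching on the tune CPU name used in the patch.

```cpp
// Hypothetical tune CPU identifiers; the patch matches substrings of the tune
// CPU name instead.
enum class TuneCPU { Generic, SiFiveE76, SiFiveS76, SiFiveU74, SiFive7 };

// Hypothetical subtarget model carrying a precomputed flag.
struct SubtargetModel {
  TuneCPU Tune;
  bool UseDefaultUnrollPreferences;
};

// Approach taken in the patch: the TTI hook itself enumerates the tune CPUs.
constexpr bool useDefaultsByCPUList(TuneCPU CPU) {
  return !(CPU == TuneCPU::SiFiveE76 || CPU == TuneCPU::SiFiveS76 ||
           CPU == TuneCPU::SiFiveU74 || CPU == TuneCPU::SiFive7);
}

// Suggested alternative: decide once when the subtarget is set up. In this
// sketch the list lives here; in a real backend it could instead come from a
// processor feature or tuning parameter.
constexpr SubtargetModel makeSubtarget(TuneCPU CPU) {
  return SubtargetModel{CPU, useDefaultsByCPUList(CPU)};
}

// The hook then only reads the flag and no longer knows about CPU names.
constexpr bool useDefaultsByFlag(const SubtargetModel &ST) {
  return ST.UseDefaultUnrollPreferences;
}
```

Both paths give the same answer in this model; the difference is only where the per-CPU knowledge lives, which is the design question the comments raise.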

Revision Contents

Path	Size
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h	7 lines
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp	91 lines
llvm/test/Transforms/LoopUnroll/RISCV/lit.local.cfg	5 lines
llvm/test/Transforms/LoopUnroll/RISCV/unroll.ll	162 lines

Diff 395306

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h

[67 lines not shown]
  TypeSize getRegisterBitWidth(TargetTransformInfo::RegisterKind K) const {
    [...]
    case TargetTransformInfo::RGK_ScalableVector:
      return TypeSize::getScalable(
          ST->hasVInstructions() ? RISCV::RVVBitsPerBlock : 0);
    }

    llvm_unreachable("Unsupported register kind");
  }

  void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                               TTI::UnrollingPreferences &UP,
                               OptimizationRemarkEmitter *ORE);

  void getPeelingPreferences(Loop *L, ScalarEvolution &SE,
                             TTI::PeelingPreferences &PP);

  unsigned getMinVectorRegisterBitWidth() const {
    return ST->hasVInstructions() ? ST->getMinRVVVectorSizeInBits() : 0;
  }

  InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
                                         const Value *Ptr, bool VariableMask,
                                         Align Alignment,
                                         TTI::TargetCostKind CostKind,
[104 lines not shown]

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

[156 lines not shown]
    return BaseT::getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
                                         Alignment, CostKind, I);

  auto *VTy = cast<FixedVectorType>(DataTy);
  unsigned NumLoads = VTy->getNumElements();
  InstructionCost MemOpCost =
      getMemoryOpCost(Opcode, VTy->getElementType(), Alignment, 0, CostKind, I);
  return NumLoads * MemOpCost;
}

void RISCVTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                           TTI::UnrollingPreferences &UP,
                                           OptimizationRemarkEmitter *ORE) {
  // TODO: More tuning on benchmarks and metrics with changes as needed
  // would apply to all settings below to enable performance.

  // Support explicit targets enabled for SiFive with the unrolling preferences
  // below
  bool UseDefaultPreferences = true;
  if (ST->getTuneCPU().contains("sifive-e76") ||
      ST->getTuneCPU().contains("sifive-s76") ||
      ST->getTuneCPU().contains("sifive-u74") ||
      ST->getTuneCPU().contains("sifive-7"))
    UseDefaultPreferences = false;

  if (UseDefaultPreferences)
    return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP, ORE);

  // Enable Upper bound unrolling universally, not dependant upon the conditions
  // below.
  UP.UpperBound = true;

  // Disable loop unrolling for Oz and Os.
  UP.OptSizeThreshold = 0;
  UP.PartialOptSizeThreshold = 0;
  if (L->getHeader()->getParent()->hasOptSize())
    return;

  SmallVector<BasicBlock *, 4> ExitingBlocks;
  L->getExitingBlocks(ExitingBlocks);
  LLVM_DEBUG(dbgs() << "Loop has:\n"
                    << "Blocks: " << L->getNumBlocks() << "\n"
                    << "Exit blocks: " << ExitingBlocks.size() << "\n");

  // Only allow another exit other than the latch. This acts as an early exit
  // as it mirrors the profitability calculation of the runtime unroller.
  if (ExitingBlocks.size() > 2)
    return;

  // Limit the CFG of the loop body for targets with a branch predictor.
  // Allowing 4 blocks permits if-then-else diamonds in the body.
  if (L->getNumBlocks() > 4)
    return;

  // Don't unroll vectorized loops, including the remainder loop
  if (getBooleanLoopAttribute(L, "llvm.loop.isvectorized"))
    return;

  // Scan the loop: don't unroll loops with calls as this could prevent
  // inlining.
  InstructionCost Cost = 0;
  for (auto *BB : L->getBlocks()) {
    for (auto &I : *BB) {
      // Initial setting - Don't unroll loops containing vectorized
      // instructions.
      if (I.getType()->isVectorTy())
        return;

      if (isa<CallInst>(I) || isa<InvokeInst>(I)) {
        if (const Function *F = cast<CallBase>(I).getCalledFunction()) {
          if (!isLoweredToCall(F))
            continue;
        }
        return;
      }

      SmallVector<const Value *> Operands(I.operand_values());
      Cost +=
          getUserCost(&I, Operands, TargetTransformInfo::TCK_SizeAndLatency);
    }
  }

  LLVM_DEBUG(dbgs() << "Cost of loop: " << Cost << "\n");

  UP.Partial = true;
  UP.Runtime = true;
  UP.UnrollRemainder = true;
  UP.UnrollAndJam = true;
  UP.UnrollAndJamInnerLoopThreshold = 60;

  // Force unrolling small loops can be very useful because of the branch
  // taken cost of the backedge.
  if (Cost < 12)
    UP.Force = true;
}

void RISCVTTIImpl::getPeelingPreferences(Loop *L, ScalarEvolution &SE,
                                         TTI::PeelingPreferences &PP) {
  BaseT::getPeelingPreferences(L, SE, PP);
}
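
To make the order of the early exits in `getUnrollingPreferences` easier to follow, here is a standalone model of its decision flow once a SiFive 7 series tune CPU has been matched. It is a simplified sketch, not LLVM code: `LoopSummary`, `UnrollDecision` and `decide` are hypothetical names, the per-instruction scan is flattened into precomputed booleans and a single cost value, and the size thresholds and `UnrollAndJamInnerLoopThreshold` are left out. It needs C++14 or later for the `constexpr` function body.

```cpp
// Hypothetical, flattened summary of the loop properties the hook inspects.
struct LoopSummary {
  bool OptForSize;          // enclosing function has an optsize attribute
  unsigned ExitingBlocks;   // number of exiting blocks
  unsigned NumBlocks;       // number of blocks in the loop body
  bool MarkedVectorized;    // llvm.loop.isvectorized metadata is set
  bool HasVectorTypedInst;  // some instruction produces a vector type
  bool HasLoweredCall;      // some call or invoke is lowered to a real call
  unsigned Cost;            // summed size-and-latency cost of the body
};

// Subset of the unrolling preferences that the hook toggles.
struct UnrollDecision {
  bool UpperBound;
  bool Partial;
  bool Runtime;
  bool UnrollRemainder;
  bool UnrollAndJam;
  bool Force;
};

constexpr UnrollDecision decide(const LoopSummary &L) {
  // Upper-bound unrolling is enabled before any of the early exits.
  UnrollDecision D{true, false, false, false, false, false};
  if (L.OptForSize)
    return D; // Os/Oz: nothing else is enabled
  if (L.ExitingBlocks > 2)
    return D; // at most one exit besides the latch
  if (L.NumBlocks > 4)
    return D; // keep the loop body CFG small
  if (L.MarkedVectorized || L.HasVectorTypedInst || L.HasLoweredCall)
    return D; // vector code and real calls are left alone
  D.Partial = true;
  D.Runtime = true;
  D.UnrollRemainder = true;
  D.UnrollAndJam = true;
  D.Force = L.Cost < 12; // small loops are unrolled forcibly
  return D;
}
```

For example, `decide(LoopSummary{false, 1, 1, false, false, false, 8})` enables every option including `Force`, while the same loop inside an optsize function only keeps `UpperBound`.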

llvm/test/Transforms/LoopUnroll/RISCV/lit.local.cfg

This file was added.

config.suffixes = ['.ll']

targets = set(config.root.targets_to_build.split())
if not 'RISCV' in targets:
    config.unsupported = True

llvm/test/Transforms/LoopUnroll/RISCV/unroll.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt %s -S -mtriple=riscv64 -loop-unroll -mcpu=sifive-7-rv64 | FileCheck %s

define dso_local void @saxpy(float %a, float* %x, float* %y) {
; CHECK-LABEL: @saxpy(
; CHECK-NEXT: entry:
; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:
; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT_15:%.*]], [[FOR_BODY]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[X:%.*]], i64 [[INDVARS_IV]]
; CHECK-NEXT: [[TMP0:%.*]] = load float, float* [[ARRAYIDX]], align 4
; CHECK-NEXT: [[MUL:%.*]] = fmul fast float [[TMP0]], [[A:%.*]]
; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, float* [[Y:%.*]], i64 [[INDVARS_IV]]
; CHECK-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX2]], align 4
; CHECK-NEXT: [[ADD:%.*]] = fadd fast float [[MUL]], [[TMP1]]
; CHECK-NEXT: store float [[ADD]], float* [[ARRAYIDX2]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT]]
; CHECK-NEXT: [[TMP2:%.*]] = load float, float* [[ARRAYIDX_1]], align 4
; CHECK-NEXT: [[MUL_1:%.*]] = fmul fast float [[TMP2]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT]]
; CHECK-NEXT: [[TMP3:%.*]] = load float, float* [[ARRAYIDX2_1]], align 4
; CHECK-NEXT: [[ADD_1:%.*]] = fadd fast float [[MUL_1]], [[TMP3]]
; CHECK-NEXT: store float [[ADD_1]], float* [[ARRAYIDX2_1]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_1:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT]], 1
; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_1]]
; CHECK-NEXT: [[TMP4:%.*]] = load float, float* [[ARRAYIDX_2]], align 4
; CHECK-NEXT: [[MUL_2:%.*]] = fmul fast float [[TMP4]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_1]]
; CHECK-NEXT: [[TMP5:%.*]] = load float, float* [[ARRAYIDX2_2]], align 4
; CHECK-NEXT: [[ADD_2:%.*]] = fadd fast float [[MUL_2]], [[TMP5]]
; CHECK-NEXT: store float [[ADD_2]], float* [[ARRAYIDX2_2]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_2:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_1]], 1
; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_2]]
; CHECK-NEXT: [[TMP6:%.*]] = load float, float* [[ARRAYIDX_3]], align 4
; CHECK-NEXT: [[MUL_3:%.*]] = fmul fast float [[TMP6]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_3:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_2]]
; CHECK-NEXT: [[TMP7:%.*]] = load float, float* [[ARRAYIDX2_3]], align 4
; CHECK-NEXT: [[ADD_3:%.*]] = fadd fast float [[MUL_3]], [[TMP7]]
; CHECK-NEXT: store float [[ADD_3]], float* [[ARRAYIDX2_3]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_3:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_2]], 1
; CHECK-NEXT: [[ARRAYIDX_4:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_3]]
; CHECK-NEXT: [[TMP8:%.*]] = load float, float* [[ARRAYIDX_4]], align 4
; CHECK-NEXT: [[MUL_4:%.*]] = fmul fast float [[TMP8]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_4:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_3]]
; CHECK-NEXT: [[TMP9:%.*]] = load float, float* [[ARRAYIDX2_4]], align 4
; CHECK-NEXT: [[ADD_4:%.*]] = fadd fast float [[MUL_4]], [[TMP9]]
; CHECK-NEXT: store float [[ADD_4]], float* [[ARRAYIDX2_4]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_4:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_3]], 1
; CHECK-NEXT: [[ARRAYIDX_5:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_4]]
; CHECK-NEXT: [[TMP10:%.*]] = load float, float* [[ARRAYIDX_5]], align 4
; CHECK-NEXT: [[MUL_5:%.*]] = fmul fast float [[TMP10]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_5:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_4]]
; CHECK-NEXT: [[TMP11:%.*]] = load float, float* [[ARRAYIDX2_5]], align 4
; CHECK-NEXT: [[ADD_5:%.*]] = fadd fast float [[MUL_5]], [[TMP11]]
; CHECK-NEXT: store float [[ADD_5]], float* [[ARRAYIDX2_5]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_5:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_4]], 1
; CHECK-NEXT: [[ARRAYIDX_6:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_5]]
; CHECK-NEXT: [[TMP12:%.*]] = load float, float* [[ARRAYIDX_6]], align 4
; CHECK-NEXT: [[MUL_6:%.*]] = fmul fast float [[TMP12]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_6:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_5]]
; CHECK-NEXT: [[TMP13:%.*]] = load float, float* [[ARRAYIDX2_6]], align 4
; CHECK-NEXT: [[ADD_6:%.*]] = fadd fast float [[MUL_6]], [[TMP13]]
; CHECK-NEXT: store float [[ADD_6]], float* [[ARRAYIDX2_6]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_6:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_5]], 1
; CHECK-NEXT: [[ARRAYIDX_7:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_6]]
; CHECK-NEXT: [[TMP14:%.*]] = load float, float* [[ARRAYIDX_7]], align 4
; CHECK-NEXT: [[MUL_7:%.*]] = fmul fast float [[TMP14]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_7:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_6]]
; CHECK-NEXT: [[TMP15:%.*]] = load float, float* [[ARRAYIDX2_7]], align 4
; CHECK-NEXT: [[ADD_7:%.*]] = fadd fast float [[MUL_7]], [[TMP15]]
; CHECK-NEXT: store float [[ADD_7]], float* [[ARRAYIDX2_7]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_7:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_6]], 1
; CHECK-NEXT: [[ARRAYIDX_8:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_7]]
; CHECK-NEXT: [[TMP16:%.*]] = load float, float* [[ARRAYIDX_8]], align 4
; CHECK-NEXT: [[MUL_8:%.*]] = fmul fast float [[TMP16]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_8:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_7]]
; CHECK-NEXT: [[TMP17:%.*]] = load float, float* [[ARRAYIDX2_8]], align 4
; CHECK-NEXT: [[ADD_8:%.*]] = fadd fast float [[MUL_8]], [[TMP17]]
; CHECK-NEXT: store float [[ADD_8]], float* [[ARRAYIDX2_8]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_8:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_7]], 1
; CHECK-NEXT: [[ARRAYIDX_9:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_8]]
; CHECK-NEXT: [[TMP18:%.*]] = load float, float* [[ARRAYIDX_9]], align 4
; CHECK-NEXT: [[MUL_9:%.*]] = fmul fast float [[TMP18]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_9:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_8]]
; CHECK-NEXT: [[TMP19:%.*]] = load float, float* [[ARRAYIDX2_9]], align 4
; CHECK-NEXT: [[ADD_9:%.*]] = fadd fast float [[MUL_9]], [[TMP19]]
; CHECK-NEXT: store float [[ADD_9]], float* [[ARRAYIDX2_9]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_9:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_8]], 1
; CHECK-NEXT: [[ARRAYIDX_10:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_9]]
; CHECK-NEXT: [[TMP20:%.*]] = load float, float* [[ARRAYIDX_10]], align 4
; CHECK-NEXT: [[MUL_10:%.*]] = fmul fast float [[TMP20]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_10:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_9]]
; CHECK-NEXT: [[TMP21:%.*]] = load float, float* [[ARRAYIDX2_10]], align 4
; CHECK-NEXT: [[ADD_10:%.*]] = fadd fast float [[MUL_10]], [[TMP21]]
; CHECK-NEXT: store float [[ADD_10]], float* [[ARRAYIDX2_10]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_10:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_9]], 1
; CHECK-NEXT: [[ARRAYIDX_11:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_10]]
; CHECK-NEXT: [[TMP22:%.*]] = load float, float* [[ARRAYIDX_11]], align 4
; CHECK-NEXT: [[MUL_11:%.*]] = fmul fast float [[TMP22]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_11:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_10]]
; CHECK-NEXT: [[TMP23:%.*]] = load float, float* [[ARRAYIDX2_11]], align 4
; CHECK-NEXT: [[ADD_11:%.*]] = fadd fast float [[MUL_11]], [[TMP23]]
; CHECK-NEXT: store float [[ADD_11]], float* [[ARRAYIDX2_11]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_11:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_10]], 1
; CHECK-NEXT: [[ARRAYIDX_12:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_11]]
; CHECK-NEXT: [[TMP24:%.*]] = load float, float* [[ARRAYIDX_12]], align 4
; CHECK-NEXT: [[MUL_12:%.*]] = fmul fast float [[TMP24]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_12:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_11]]
; CHECK-NEXT: [[TMP25:%.*]] = load float, float* [[ARRAYIDX2_12]], align 4
; CHECK-NEXT: [[ADD_12:%.*]] = fadd fast float [[MUL_12]], [[TMP25]]
; CHECK-NEXT: store float [[ADD_12]], float* [[ARRAYIDX2_12]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_12:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_11]], 1
; CHECK-NEXT: [[ARRAYIDX_13:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_12]]
; CHECK-NEXT: [[TMP26:%.*]] = load float, float* [[ARRAYIDX_13]], align 4
; CHECK-NEXT: [[MUL_13:%.*]] = fmul fast float [[TMP26]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_13:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_12]]
; CHECK-NEXT: [[TMP27:%.*]] = load float, float* [[ARRAYIDX2_13]], align 4
; CHECK-NEXT: [[ADD_13:%.*]] = fadd fast float [[MUL_13]], [[TMP27]]
; CHECK-NEXT: store float [[ADD_13]], float* [[ARRAYIDX2_13]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_13:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_12]], 1
; CHECK-NEXT: [[ARRAYIDX_14:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_13]]
; CHECK-NEXT: [[TMP28:%.*]] = load float, float* [[ARRAYIDX_14]], align 4
; CHECK-NEXT: [[MUL_14:%.*]] = fmul fast float [[TMP28]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_14:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_13]]
; CHECK-NEXT: [[TMP29:%.*]] = load float, float* [[ARRAYIDX2_14]], align 4
; CHECK-NEXT: [[ADD_14:%.*]] = fadd fast float [[MUL_14]], [[TMP29]]
; CHECK-NEXT: store float [[ADD_14]], float* [[ARRAYIDX2_14]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_14:%.*]] = add nuw nsw i64 [[INDVARS_IV_NEXT_13]], 1
; CHECK-NEXT: [[ARRAYIDX_15:%.*]] = getelementptr inbounds float, float* [[X]], i64 [[INDVARS_IV_NEXT_14]]
; CHECK-NEXT: [[TMP30:%.*]] = load float, float* [[ARRAYIDX_15]], align 4
; CHECK-NEXT: [[MUL_15:%.*]] = fmul fast float [[TMP30]], [[A]]
; CHECK-NEXT: [[ARRAYIDX2_15:%.*]] = getelementptr inbounds float, float* [[Y]], i64 [[INDVARS_IV_NEXT_14]]
; CHECK-NEXT: [[TMP31:%.*]] = load float, float* [[ARRAYIDX2_15]], align 4
; CHECK-NEXT: [[ADD_15:%.*]] = fadd fast float [[MUL_15]], [[TMP31]]
; CHECK-NEXT: store float [[ADD_15]], float* [[ARRAYIDX2_15]], align 4
; CHECK-NEXT: [[INDVARS_IV_NEXT_15]] = add nuw nsw i64 [[INDVARS_IV_NEXT_14]], 1
; CHECK-NEXT: [[EXITCOND_NOT_15:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT_15]], 64
; CHECK-NEXT: br i1 [[EXITCOND_NOT_15]], label [[EXIT_LOOP:%.*]], label [[FOR_BODY]]
; CHECK: exit_loop:
; CHECK-NEXT: ret void
;
entry:
  br label %for.body

for.body:
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds float, float* %x, i64 %indvars.iv
  %0 = load float, float* %arrayidx, align 4
  %mul = fmul fast float %0, %a
  %arrayidx2 = getelementptr inbounds float, float* %y, i64 %indvars.iv
  %1 = load float, float* %arrayidx2, align 4
  %add = fadd fast float %mul, %1
  store float %add, float* %arrayidx2, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 64
  br i1 %exitcond.not, label %exit_loop, label %for.body

exit_loop:
  ret void
}