This is an archive of the discontinued LLVM Phabricator instance.

Differential D40177

performance improvements for ThunderX2 T99
AbandonedPublic

Authored by steleman on Nov 17 2017, 7:14 AM.

Details

Reviewers

t.p.northover
MatzeB

Summary

This changeset causes performance improvements for the Cavium ThunderX2T99 micro-arch. The changeset is specific to T99. It does not affect any other micro-arch.

Tested with SPECcpu2017 and libquantum.

As an example, for the performance gains on libquantum, please see here:

https://docs.google.com/spreadsheets/d/1Lo1o2E1NjrpkwS7DvYYWsiVvPdd93h7KBaqeptMrZPY/edit?usp=sharing

Diff Detail

Repository: rL LLVM

Event Timeline

steleman created this revision. Nov 17 2017, 7:14 AM

Herald added a subscriber: javed.absar. Nov 17 2017, 7:14 AM

Could you please add a few tests? Also please remember to add llvm-commits as subscriber to your patches.

In D40177#928936, @mgrang wrote:

Could you please add a few tests? Also please remember to add llvm-commits as subscriber to your patches.

I'm not sure how I can add tests for this change. There is, in fact, no functional change.

I'm under the impression this is not NFC.
Presumably, the changes in the scheduler reflect in a different schedule being produced.
You can then check the IR (or MIR) with the new schedule.
If you feel motivated, you can check the test before (standalone), so that the commit will highlight the differences pre/post & will make the changeset more readable.
If no change in the schedule is produced, I'm confused.

Added test case.

davide added inline comments. Nov 21 2017, 8:48 PM

test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll
6	I don't understand why this change is related to loop-unrolling, can you please elaborate?

steleman added inline comments. Nov 21 2017, 9:02 PM

test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll
6	> I don't understand why this change is related to loop-unrolling, can you please elaborate?

I am not sure I understand your question. Are you referring to the entire test case, or just to the two lines above your comment? If you are asking about the entire test case, then the answer is: the change in the T99 scheduler is about loop unrolling, and only about loop unrolling. It's not about anything else.
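
As background for this exchange: as far as I understand LLVM's default unrolling preferences, the scheduling model's LoopMicroOpBufferSize is used as the size threshold for partial unrolling, which would explain why a change in the .td file shows up in the loop-unroll output of opt. The following is a simplified C++ model of that choice, not the code of the unroll pass itself; the function name, the back-edge overhead of 2 instructions, and the exact formula are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Simplified, hypothetical model of how a partial-unroll count could be
// derived from a size threshold. beInsns (assumed to be 2) stands for the
// loop overhead that is not duplicated when unrolling.
unsigned partialUnrollCount(unsigned loopSize, unsigned tripCount,
                            unsigned partialThreshold, unsigned beInsns = 2) {
  assert(loopSize > beInsns && tripCount > 0);

  // Estimated size of the loop body after unrolling it c times.
  auto unrolledSize = [&](unsigned c) {
    return static_cast<unsigned long long>(loopSize - beInsns) * c + beInsns;
  };

  unsigned count = tripCount; // start from full unrolling
  if (unrolledSize(count) > partialThreshold)
    count = (std::max(partialThreshold, beInsns + 1) - beInsns) /
            (loopSize - beInsns);

  // Shrink the count until it divides the trip count evenly.
  while (count != 0 && tripCount % count != 0)
    --count;
  return count;
}
```

With the values reported by the test (loop sizes 19 and 18, trip count 512), this model gives a count of 1 (the loop is left as is) for a threshold of 32 and a count of 4 for a threshold of 128, which agrees with the "UNROLLING loop ... by 4" CHECK lines in the test.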

davide added inline comments. Nov 21 2017, 9:13 PM

test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll
6	Can you please check in the test standalone and update the change to show the difference?

steleman added inline comments. Nov 21 2017, 9:44 PM

test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll
6	I can't check in because I don't have check-in privileges. :-) But I uploaded the relevant files to my Google Drive: https://drive.google.com/open?id=18GseEsH4XN6bjD3UAcSB06GkXhWzPSAf (Please let me know if you have problems accessing that directory; I don't think you should; I set it to world-readable.)

There are 3 files in there:

t99-loop-unrolling-without-scheduler-patch.txt
That's the output from opt from LLVM Git 2017/11/16 without my scheduler patch.

t99-loop-unrolling-with-scheduler-patch.txt
That's the output from opt from LLVM Git 2017/11/16 with my scheduler patch.

loop.c
That's a test/experiment that I wrote/used to test the changes in CodeGen triggered by the change in the scheduler. It's an interesting test. If you have the time, and the inclination, you can use loop.c to track what happens to the loop at different optimization levels (-O1, -O2, -O3).

Also, yes, it wasn't entirely clear to me this was related to unrolling.
This is because the commit message is really vague and doesn't really explain the change.
I think you might consider rephrasing it to be more explicative, thanks!

This change makes sense to me (after the comments are addressed), but please wait for another opinion (@MatzeB / @qcolombet are probably good people to take another look/sign off)

test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll
2	I don't think you need the debug output.

kristof.beyls added a subscriber: kristof.beyls. Nov 22 2017, 12:30 AM

kristof.beyls added inline comments.

lib/Target/AArch64/AArch64ISelLowering.cpp
10981–10984	I get the impression that the enabling of AggressiveFMAFusion is orthogonal to the change in the pipeline model. If so, I think it's better to commit these separately. Also, I think this needs a regression test to show that this does what you want.

fhahn added a subscriber: fhahn. Nov 22 2017, 1:39 AM

huntergr added a subscriber: huntergr. Nov 22 2017, 2:11 AM

fhahn added inline comments. Nov 22 2017, 8:57 AM

lib/Target/AArch64/AArch64SchedThunderX2T99.td
394	`Unsupported = 0` should be the default, so you should be able to just drop this line, see https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Target/TargetSchedule.td#L257 Also if I understand correctly, this should have an impact on scheduling instructions using WriteAtomic, like CASB. So it should be possible to create a test making sure the scheduler uses that info as expected. Although I am not sure if it's worth it and how much work that would be :) Finally, this seems unrelated to the LoopMicroOpBufferSize change and could be split out in a separate patch.

Also if I understand correctly, this should have an impact on scheduling instructions using WriteAtomic, like CASB.

It doesn't. Maybe it was supposed to, in theory, but in reality it makes no difference whatsoever.

The only noticeable difference is in instruction cost: when Unsupported == 1, the estimated instruction cost is higher than when Unsupported == 0. Which makes no difference whatsoever in practice, given that the real cost of LSE instructions is pretty high anyway.

We've been emitting LSE instructions with no problems for many months. Whether or not Unsupported was 0 or 1 made no difference whatsoever.

In D40177#932950, @steleman wrote:

Also if I understand correctly, this should have an impact on scheduling instructions using WriteAtomic, like CASB.

It doesn't. Maybe it was supposed to, in theory, but in reality it makes no difference whatsoever.

The only noticeable difference is in instruction cost: when Unsupported == 1, the estimated instruction cost is higher than when Unsupported == 0. Which makes no difference whatsoever in practice, given that the real cost of LSE instructions is pretty high anyway.

We've been emitting LSE instructions with no problems for many months. Whether or not Unsupported was 0 or 1 made no difference whatsoever.

It is probably unlikely that atomic instructions are in hot loops of most benchmarks, so I am not surprised that you do not see any impact on runtimes (assuming that's what you meant with 'it makes no difference whatsoever'). However, there should be cases where it makes a difference when scheduling, but it probably requires some time to come up with a test case.

I think going forward, it would be best to split up this patch in 3 independent parts:

enableAggressiveFMAFusion
WriteAtomic change
LoopMicroOpBufferSize

This would make it slightly easier to review/accept, to keep track of comments on independent things, and to revert in case something goes wrong.

MatzeB requested changes to this revision. Nov 27 2017, 12:49 PM

MatzeB added inline comments.

lib/Target/AArch64/AArch64ISelLowering.cpp
10981–10984	Please avoid `getProcFamily()` checks outside of AArch64SubtargetInfo. Use target features and transfer the logic into SubtargetInfo!

This revision now requires changes to proceed. Nov 27 2017, 12:49 PM

Please avoid getProcFamily() checks outside of AArch64SubtargetInfo. Use target features and transfer the logic into SubtargetInfo!

Can you please explain the rationale behind this restriction?

It's certainly not supported by the class interface design, as the Subtarget pointer is accessible and Subtarget->getProcFamily() is accessible as well.

It also does not appear to be a real restriction at all, considering that Subtarget->getProcFamily() is already being used in AArch64ISelLowering.cpp.

What you are asking here is the implementation of a new target feature. This seems (a) unnecessary and (b) introduces unnecessary code complexity.

For starters, aggressive FMA is a compiler-dependent subjective decision. It's not a target-dependent objective attribute, such as LSE or Neon. What's aggressive FMA to LLVM is not aggressive at all to GCC for example.

The design implication is that, for every single existing public interface we are going to implement an additional, functionally duplicative, target feature. Which really isn't a target feature to begin with.

The implementation implication is that, a one-line code change becomes a 10-line code change that accomplishes the same exact thing as the one-line change.

Thank you in advance for your explanation.

In D40177#936935, @steleman wrote:

Please avoid getProcFamily() checks outside of AArch64SubtargetInfo. Use target features and transfer the logic into SubtargetInfo!

Can you please explain the rationale behind this restriction?

It's not a strict rule but unless there is a good reason not to do it we should :)

It's certainly not supported by the class interface design, as the Subtarget pointer is accessible and Subtarget->getProcFamily() is accessible as well.

It also does not appear to be a real restriction at all, considering that Subtarget->getProcFamily() is already being used in AArch64ISelLowering.cpp.

What you are asking here is the implementation of a new target feature. This seems (a) unnecessary and (b) introduces unnecessary code complexity.

For starters, aggressive FMA is a compiler-dependent subjective decision. It's not a target-dependent objective attribute, such as LSE or Neon. What's aggressive FMA to LLVM is not aggressive at all to GCC for example.

The design implication is that, for every single existing public interface we are going to implement an additional, functionally duplicative, target feature. Which really isn't a target feature to begin with.

The implementation implication is that, a one-line code change becomes a 10-line code change that accomplishes the same exact thing as the one-line change.

Thank you in advance for your explanation.

I actively worked on changing those CPU specific checks/tweaks across the target code into target features, I'd like for this to stay this way.
The name "target feature" is certainly unfortunate, as those aren't really features but rather tuning objectives. Nevertheless they are target specific, so the target feature system works nicely here.
Some reasons as to why making feature bits out of them is a good thing:
- Things often start out as an inconspicuous if (CPU) doSomething(). But in time more CPUs are added, resulting in longish if (CPU || CPUversion2 || CPUversion3 || otherVendorsCPU) doSomething() checks scattered across the code.
- We suddenly start mixing code that deals with something like "more aggressive FMA folding" with small lists of CPUs that should use this. This is annoying when adding new CPUs: It is hard to spot all places where checks could/should be added. It feels a lot cleaner to do this in AArch64.td as a list of feature/tuning bits.
- As a bonus, target features can be enabled/disabled with -Xclang -target-feature=-xxx,+yyy which is sometimes nice for compiler developers to experiment with.
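
The contrast drawn in the list above can be sketched in a few lines of standalone C++. This is only an illustration of the two styles under discussion: every name in it (ProcFamily, SubtargetSketch, HasAggressiveFMA, and the functions) is hypothetical and does not correspond to the actual AArch64 backend classes or to the follow-up patch.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical stand-ins for the CPU list and the subtarget object.
enum class ProcFamily { Generic, CortexA57, ThunderX2T99 };

struct SubtargetSketch {
  ProcFamily Family = ProcFamily::Generic;
  bool HasAggressiveFMA = false; // tuning bit, decided once per CPU
};

// Single place that maps a CPU to its tuning bits (the role a list of
// feature/tuning bits in a .td file plays in the discussion above).
SubtargetSketch makeSubtarget(ProcFamily Family) {
  SubtargetSketch ST;
  ST.Family = Family;
  switch (Family) {
  case ProcFamily::ThunderX2T99:
    ST.HasAggressiveFMA = true;
    break;
  case ProcFamily::Generic:
  case ProcFamily::CortexA57:
    break;
  }
  return ST;
}

// Style objected to in the review: lowering code names a CPU directly.
bool enableAggressiveFMAByCPU(const SubtargetSketch &ST, bool IsFloatingPoint) {
  return ST.Family == ProcFamily::ThunderX2T99 && IsFloatingPoint;
}

// Style requested in the review: lowering code only asks for a feature bit.
bool enableAggressiveFMAByFeature(const SubtargetSketch &ST,
                                  bool IsFloatingPoint) {
  return ST.HasAggressiveFMA && IsFloatingPoint;
}

// Both styles give the same answer for the CPUs listed here.
void checkStylesAgree() {
  for (ProcFamily F : {ProcFamily::Generic, ProcFamily::CortexA57,
                       ProcFamily::ThunderX2T99}) {
    SubtargetSketch ST = makeSubtarget(F);
    assert(enableAggressiveFMAByCPU(ST, true) ==
           enableAggressiveFMAByFeature(ST, true));
  }
}
```

In this sketch, adding another CPU that wants the same tuning only touches makeSubtarget; the feature-based query stays unchanged, whereas the CPU-based query would need another comparison.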

In D40177#935813, @fhahn wrote:

In D40177#932950, @steleman wrote:

Also if I understand correctly, this should have an impact on scheduling instructions using WriteAtomic, like CASB.

It doesn't. Maybe it was supposed to, in theory, but in reality it makes no difference whatsoever.

The only noticeable difference is in instruction cost: when Unsupported == 1, the estimated instruction cost is higher than when Unsupported == 0. Which makes no difference whatsoever in practice, given that the real cost of LSE instructions is pretty high anyway.

We've been emitting LSE instructions with no problems for many months. Whether or not Unsupported was 0 or 1 made no difference whatsoever.

It is probably unlikely that atomic instructions are in hot loops of most benchmarks, so I am not surprised that you do not see any impact on runtimes (assuming that's what you meant with 'it makes no difference whatsoever'). However, there should be cases where it makes a difference when scheduling, but it probably requires some time to come up with a test case.

That's not what I meant. What I meant has nothing to do with hot loops in SPECcpu benchmarks.

You can convince yourself of what I am saying by compiling the following test program:

https://drive.google.com/file/d/1gZ_i738yJ1TlJcQQ_oo-vTghydPUU1ZR/view?usp=sharing

(lseadd.c)

You can change around the #define for ATOMIC_MEMORY_ORDERING in the test program.

clang { -O0|-O1|-O2|-O3 } -std=c99 -mcpu=thunderx2t99 -S lseadd.c -o lseadd-t99.S
clang { -O0|-O1|-O2|-O3 } -std=c99 -mcpu=cortex-a57 -S lseadd.c -o lseadd-a57.S

T99 supports LSE Atomics. Cortex-A57 does not. For Cortex-A57, LLVM will emit LL/SC for the atomics. For T99, LLVM will always emit LSE Atomics for this test program, regardless of whether Unsupported == 0 or Unsupported == 1 in the AArch64SchedThunderX2T99.td scheduler file.

When compiling for T99, no warning is being emitted about Unsupported == 1.

Also, when compiling LLVM, no warning is emitted by llvm-tblgen about Unsupported == 0 while having LSE Atomics available.
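
For readers without access to the linked file, a minimal stand-in for this kind of experiment might look as follows. It is not the author's lseadd.c (which is C and is not reproduced here); the names addN, checkAddN, and kOrdering are hypothetical, and kOrdering merely plays the role that ATOMIC_MEMORY_ORDERING is described as playing above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the linked test program, written in C++.
// Change the ordering here to experiment with different memory orderings.
constexpr std::memory_order kOrdering = std::memory_order_seq_cst;

// Performs n atomic increments and returns the final counter value.
// The fetch_add is the operation whose lowering differs between targets.
std::uint64_t addN(std::atomic<std::uint64_t> &counter, std::uint64_t n) {
  for (std::uint64_t i = 0; i < n; ++i)
    counter.fetch_add(1, kOrdering);
  return counter.load(kOrdering);
}

// Small self-check so the sketch can be exercised on any host.
void checkAddN() {
  std::atomic<std::uint64_t> counter{0};
  assert(addN(counter, 1000) == 1000);
}
```

Compiling this with clang++ and -S for -mcpu=thunderx2t99 and for -mcpu=cortex-a57, then comparing the code generated for fetch_add, should show the difference described above (an LSE atomic instruction versus an LL/SC sequence). Newer toolchains may instead call out-of-line helper routines when LSE is not enabled, so the non-LSE output can look different from what was seen in 2017.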

Superseded by:

D40694 - Remove Unsupported flag from T99 Scheduler
D40695 - Improve loop unrolling performance on T99
D40696 - Enable aggressive FMA on T99 and provide AArch64 option for other micro-arch's

MatzeB mentioned this in D40696: Enable aggressive FMA on T99 and provide AArch64 option for other micro-arch's. Nov 30 2017, 5:26 PM

Revision Contents

Path                                                     Size
lib/Target/AArch64/AArch64ISelLowering.h                 3 lines
lib/Target/AArch64/AArch64ISelLowering.cpp               5 lines
lib/Target/AArch64/AArch64SchedThunderX2T99.td           5 lines
test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll    124 lines

Diff 123874

lib/Target/AArch64/AArch64ISelLowering.h

Context not available.
     return VT == MVT::f32 || VT == MVT::f64;
   }

   bool supportSplitCSR(MachineFunction *MF) const override {
     return MF->getFunction()->getCallingConv() == CallingConv::CXX_FAST_TLS &&
            MF->getFunction()->hasFnAttribute(Attribute::NoUnwind);
   }
   void initializeSplitCSR(MachineBasicBlock *Entry) const override;
   void insertCopiesSplitCSR(
       MachineBasicBlock *Entry,
       const SmallVectorImpl<MachineBasicBlock *> &Exits) const override;

   bool supportSwiftError() const override {
     return true;
   }

+  /// Enable aggressive FMA fusion on targets that want it.
+  bool enableAggressiveFMAFusion(EVT VT) const override;
+
   /// Returns the size of the platform's va_list object.
   unsigned getVaListSizeInBits(const DataLayout &DL) const override;

   /// Returns true if \p VecTy is a legal interleaved access type. This
   /// function checks the vector element type and the overall width of the
   /// vector.
   bool isLegalInterleavedAccessType(VectorType *VecTy,
                                     const DataLayout &DL) const;

   /// Returns the number of interleaved accesses that will be generated when
   /// lowering accesses of the given type.
   unsigned getNumInterleavedAccesses(VectorType *VecTy,
                                      const DataLayout &DL) const;

   MachineMemOperand::Flags getMMOFlags(const Instruction &I) const override;

Context not available.

lib/Target/AArch64/AArch64ISelLowering.cpp

Context not available.
   }
 }

 bool AArch64TargetLowering::isIntDivCheap(EVT VT, AttributeList Attr) const {
   // Integer division on AArch64 is expensive. However, when aggressively
   // optimizing for code size, we prefer to use a div instruction, as it is
   // usually smaller than the alternative sequence.
   // The exception to this is vector division. Since AArch64 doesn't have vector
   // integer division, leaving the division as-is is a loss even in terms of
   // size, because it will have to be scalarized, while the alternative code
   // sequence can be performed in vector form.
   bool OptSize =
       Attr.hasAttribute(AttributeList::FunctionIndex, Attribute::MinSize);
   return OptSize && !VT.isVector();
 }

+bool AArch64TargetLowering::enableAggressiveFMAFusion(EVT VT) const {
+  return Subtarget->getProcFamily() == AArch64Subtarget::ThunderX2T99 &&
+         VT.isFloatingPoint();
+}
+
 unsigned
 AArch64TargetLowering::getVaListSizeInBits(const DataLayout &DL) const {
   if (Subtarget->isTargetDarwin() || Subtarget->isTargetWindows())
     return getPointerTy(DL).getSizeInBits();

   return 3 * getPointerTy(DL).getSizeInBits() + 2 * 32;
 }
Context not available.

lib/Target/AArch64/AArch64SchedThunderX2T99.td

Context not available.
 //
 // This file defines the scheduling model for Cavium ThunderX2T99
 // processors.
 // Based on Broadcom Vulcan.
 //
 //===----------------------------------------------------------------------===//

 //===----------------------------------------------------------------------===//
 // 2. Pipeline Description.

 def ThunderX2T99Model : SchedMachineModel {
   let IssueWidth = 4; // 4 micro-ops dispatched at a time.
   let MicroOpBufferSize = 180; // 180 entries in micro-op re-order buffer.
   let LoadLatency = 4; // Optimistic load latency.
   let MispredictPenalty = 12; // Extra cycles for mispredicted branch.
   // Determined via a mix of micro-arch details and experimentation.
-  let LoopMicroOpBufferSize = 32;
+  let LoopMicroOpBufferSize = 128;
   let PostRAScheduler = 1; // Using PostRA sched.
   let CompleteModel = 1;

   list<Predicate> UnsupportedFeatures = [HasSVE];
 }

 // Define the issue ports.

 (16 lines not shown)
   let NumMicroOps = 2;
 }

 def : WriteRes<WriteSys, []> { let Latency = 1; }
 def : WriteRes<WriteBarrier, []> { let Latency = 1; }
 def : WriteRes<WriteHint, []> { let Latency = 1; }

 def : WriteRes<WriteAtomic, []> {
-  let Unsupported = 1;
+  let Unsupported = 0;
+  let Latency = 4;
   let NumMicroOps = 2;
 }

 //---
 // Branch
 //---
 def : InstRW<[THX2T99Write_1Cyc_I2], (instrs B, BL, BR, BLR)>;
 def : InstRW<[THX2T99Write_1Cyc_I2], (instrs RET)>;
 def : InstRW<[THX2T99Write_1Cyc_I2], (instregex "^B..$")>;
 def : InstRW<[THX2T99Write_1Cyc_I2],
             (instregex "^CBZ", "^CBNZ", "^TBZ", "^TBNZ")>;

 //---
 // 3.2 Arithmetic and Logical Instructions
 // 3.3 Move and Shift Instructions
 //---
Context not available.

test/CodeGen/AArch64/loop-micro-op-buffer-size-t99.ll

				; REQUIRES: asserts
				; RUN: opt -mcpu=thunderx2t99 -loop-unroll --debug-only=loop-unroll -S -unroll-allow-partial < %s 2>&1 | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				; CHECK: Loop Unroll: F[foo] Loop %loop.2.header
				; CHECK: Loop Size = 19
				; CHECK: Trip Count = 512
				; CHECK: Trip Multiple = 512
				; CHECK: UNROLLING loop %loop.2.header by 4 with a breakout at trip 0
				; CHECK: Merging:
				; CHECK: Loop Unroll: F[foo] Loop %loop.header
				; CHECK: Loop Size = 18
				; CHECK: Trip Count = 512
				; CHECK: Trip Multiple = 512
				; CHECK: UNROLLING loop %loop.header by 4 with a breakout at trip 0
				; CHECK: Merging:
				; CHECK: %counter = phi i32 [ 0, %entry ], [ %inc.3, %loop.inc.3 ]
				; CHECK: %val = add nuw nsw i32 %counter, 5
				; CHECK: %val1 = add nuw nsw i32 %counter, 6
				; CHECK: %val2 = add nuw nsw i32 %counter, 7
				; CHECK: %val3 = add nuw nsw i32 %counter, 8
				; CHECK: %val4 = add nuw nsw i32 %counter, 9
				; CHECK: %val5 = add nuw nsw i32 %counter, 10
				; CHECK-NOT: %val = add i32 %counter, 5
				; CHECK-NOT: %val = add i32 %counter, 6
				; CHECK-NOT: %val = add i32 %counter, 7
				; CHECK-NOT: %val = add i32 %counter, 8
				; CHECK-NOT: %val = add i32 %counter, 9
				; CHECK-NOT: %val = add i32 %counter, 10
				; CHECK: %counter.2 = phi i32 [ 0, %exit.0 ], [ %inc.2.3, %loop.2.inc.3 ]

				define void @foo(i32 * %out) {
				entry:
				%0 = alloca [1024 x i32]
				%x0 = alloca [1024 x i32]
				%x01 = alloca [1024 x i32]
				%x02 = alloca [1024 x i32]
				%x03 = alloca [1024 x i32]
				%x04 = alloca [1024 x i32]
				%x05 = alloca [1024 x i32]
				%x06 = alloca [1024 x i32]
				br label %loop.header

				loop.header:
				%counter = phi i32 [0, %entry], [%inc, %loop.inc]
				br label %loop.body

				loop.body:
				%ptr = getelementptr [1024 x i32], [1024 x i32]* %0, i32 0, i32 %counter
				store i32 %counter, i32* %ptr
				%val = add i32 %counter, 5
				%xptr = getelementptr [1024 x i32], [1024 x i32]* %x0, i32 0, i32 %counter
				store i32 %val, i32* %xptr
				%val1 = add i32 %counter, 6
				%xptr1 = getelementptr [1024 x i32], [1024 x i32]* %x01, i32 0, i32 %counter
				store i32 %val1, i32* %xptr1
				%val2 = add i32 %counter, 7
				%xptr2 = getelementptr [1024 x i32], [1024 x i32]* %x02, i32 0, i32 %counter
				store i32 %val2, i32* %xptr2
				%val3 = add i32 %counter, 8
				%xptr3 = getelementptr [1024 x i32], [1024 x i32]* %x03, i32 0, i32 %counter
				store i32 %val3, i32* %xptr3
				%val4 = add i32 %counter, 9
				%xptr4 = getelementptr [1024 x i32], [1024 x i32]* %x04, i32 0, i32 %counter
				store i32 %val4, i32* %xptr4
				%val5 = add i32 %counter, 10
				%xptr5 = getelementptr [1024 x i32], [1024 x i32]* %x05, i32 0, i32 %counter
				store i32 %val5, i32* %xptr5
				br label %loop.inc

				loop.inc:
				%inc = add i32 %counter, 2
				%1 = icmp sge i32 %inc, 1023
				br i1 %1, label %exit.0, label %loop.header

				exit.0:
				%2 = getelementptr [1024 x i32], [1024 x i32]* %0, i32 0, i32 5
				%3 = load i32, i32* %2
				store i32 %3, i32 * %out
				br label %loop.2.header


				loop.2.header:
				%counter.2 = phi i32 [0, %exit.0], [%inc.2, %loop.2.inc]
				br label %loop.2.body

				loop.2.body:
				%ptr.2 = getelementptr [1024 x i32], [1024 x i32]* %0, i32 0, i32 %counter.2
				store i32 %counter.2, i32* %ptr.2
				%val.2 = add i32 %counter.2, 5
				%xptr.2 = getelementptr [1024 x i32], [1024 x i32]* %x0, i32 0, i32 %counter.2
				store i32 %val.2, i32* %xptr.2
				%val1.2 = add i32 %counter.2, 6
				%xptr1.2 = getelementptr [1024 x i32], [1024 x i32]* %x01, i32 0, i32 %counter.2
				store i32 %val1, i32* %xptr1.2
				%val2.2 = add i32 %counter.2, 7
				%xptr2.2 = getelementptr [1024 x i32], [1024 x i32]* %x02, i32 0, i32 %counter.2
				store i32 %val2, i32* %xptr2.2
				%val3.2 = add i32 %counter.2, 8
				%xptr3.2 = getelementptr [1024 x i32], [1024 x i32]* %x03, i32 0, i32 %counter.2
				store i32 %val3.2, i32* %xptr3.2
				%val4.2 = add i32 %counter.2, 9
				%xptr4.2 = getelementptr [1024 x i32], [1024 x i32]* %x04, i32 0, i32 %counter.2
				store i32 %val4.2, i32* %xptr4.2
				%val5.2 = add i32 %counter.2, 10
				%xptr5.2 = getelementptr [1024 x i32], [1024 x i32]* %x05, i32 0, i32 %counter.2
				store i32 %val5.2, i32* %xptr5.2
				%xptr6.2 = getelementptr [1024 x i32], [1024 x i32]* %x06, i32 0, i32 %counter.2
				store i32 %val5.2, i32* %xptr6.2
				br label %loop.2.inc

				loop.2.inc:
				%inc.2 = add i32 %counter.2, 2
				%4 = icmp sge i32 %inc.2, 1023
				br i1 %4, label %exit.2, label %loop.2.header

				exit.2:
				%x2 = getelementptr [1024 x i32], [1024 x i32]* %0, i32 0, i32 6
				%x3 = load i32, i32* %x2
				%out2 = getelementptr i32, i32 * %out, i32 1
				store i32 %3, i32 * %out2
				ret void
				}