This is an archive of the discontinued LLVM Phabricator instance.

Differential D102496

[Passes] Run vector-combine early with -fenable-matrix.
Closed, Public

Authored by fhahn on May 14 2021, 6:50 AM.

Details

Reviewers

anemet
spatel
RKSimon

Commits

rGa7c6471a8538: [Passes] Run vector-combine early with -fenable-matrix.

Summary

IR with matrix intrinsics is likely to also contain large vector
operations, which can benefit from early simplifications.

This is the last step in a series of changes to improve code-gen for
code using matrix subscript operators with the C/C++ matrix extension in
Clang, such as:

using matrix_t = double __attribute__((matrix_type(15, 15)));

void foo(unsigned i, matrix_t &A, matrix_t &B) {
  for (unsigned j = 0; j < 4; ++j)
    for (unsigned k = 0; k < i; k++)
      B[k][j] -= A[k][j] * B[i][j];
}

https://clang.godbolt.org/z/6dKxK1Ed7
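
To make the access pattern concrete, here is a minimal sketch of how the subscripts in the example map onto a flat 225-element vector, assuming the column-major layout that the Clang matrix extension uses by default. The helper `flatIndex`, the alias `FlatMatrix` and the bounds assertion are illustrative assumptions and not part of the patch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: flat index of element (row, col) in a column-major
// matrix with `Rows` rows. For illustration only; not part of the patch.
constexpr std::size_t flatIndex(std::size_t row, std::size_t col,
                                std::size_t Rows = 15) {
  return col * Rows + row;
}

// A 15x15 matrix of doubles stored as one flat array of 225 elements.
using FlatMatrix = std::array<double, 15 * 15>;

// Scalar model of the loop from the summary:
//   B[k][j] -= A[k][j] * B[i][j];
void foo(unsigned i, const FlatMatrix &A, FlatMatrix &B) {
  // Assumption of this sketch: the row index must be valid.
  assert(i < 15 && "row index out of range");
  for (unsigned j = 0; j < 4; ++j)
    for (unsigned k = 0; k < i; ++k)
      B[flatIndex(k, j)] -= A[flatIndex(k, j)] * B[flatIndex(i, j)];
}
```

With this layout every access touches a single `double` at offset `j * 15 + row`, which is the index arithmetic (`mul ... 15` followed by `add`) visible in the CHECK lines of the phase ordering test.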

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. May 14 2021, 6:50 AM

Herald added a subscriber: hiraditya. May 14 2021, 6:50 AM

fhahn requested review of this revision. May 14 2021, 6:50 AM

Herald added a project: Restricted Project. May 14 2021, 6:50 AM

fhahn added parent revisions: D102478: [Matrix] Emit assumption that matrix indices are valid., D102476: [VectorCombine] Use constant range info for index scalarization legality.. May 14 2021, 6:51 AM

Any testcase to ensure that passes together now produce what you want?

Harbormaster completed remote builds in B104492: Diff 345429. May 14 2021, 8:12 AM

Some test coverage (phaseordering?) would be useful.

Thanks for taking a look! I added a phase ordering test and updated the pipeline tests as well.

Harbormaster completed remote builds in B104705: Diff 345704. May 16 2021, 5:52 AM

LGTM - anyone else have any comments?

Over in D102002, I am looking at divergence between the regular and LTO pipelines...
If I'm seeing this correctly, we will not alter the LTO pipeline in this patch. Is that intentional?

In D102496#2772019, @spatel wrote:

Over in D102002, I am looking at divergence between the regular and LTO pipelines...
If I'm seeing this correctly, we will not alter the LTO pipeline in this patch. Is that intentional?

Yes that's intentional. The motivation to run vector-combine early here is to catch combine & scalarization opportunities before operations are moved too much by GVN, unrolling & co. At the LTO stage, those should already be covered by the pre-LTO steps, so there's no need to do another run during the LTO stage I think.

In D102496#2773930, @fhahn wrote:

Yes that's intentional. The motivation to run vector-combine early here is to catch combine & scalarization opportunities before operations are moved too much by GVN, unrolling & co. At the LTO stage, those should already be covered by the pre-LTO steps, so there's no need to do another run during the LTO stage I think.

Ah, I still haven't made sense of all the pipeline stages for LTO.
LGTM.

This revision is now accepted and ready to land. May 21 2021, 9:28 AM

fhahn added a parent revision: D110171: [VectorCombine] Switch to using a worklist.. Sep 21 2021, 2:50 PM

Herald added a subscriber: ormris. Sep 21 2021, 2:50 PM

Rebased. I am planning on landing this after D110171 lands.

Harbormaster completed remote builds in B124991: Diff 374041. Sep 21 2021, 2:52 PM

This revision was landed with ongoing or failed builds. Sep 22 2021, 4:49 AM

Closed by commit rGa7c6471a8538: [Passes] Run vector-combine early with -fenable-matrix. (authored by fhahn).

This revision was automatically updated to reflect the committed changes.

fhahn added a commit: rGa7c6471a8538: [Passes] Run vector-combine early with -fenable-matrix..

spatel mentioned this in D138353: [Passes][VectorCombine] enable early run generally and try load folds. Nov 19 2022, 7:47 AM

spatel mentioned this in rG8f337f8ffe36: [VectorCombine] generalize pass param name for early combines; NFC. Nov 21 2022, 10:58 AM

spatel mentioned this in rG163bb6d64e5f: [Passes][VectorCombine] enable early run generally and try load folds.

Revision Contents

Path                                                                  Size
llvm/lib/Passes/PassBuilderPipelines.cpp                              5 lines
llvm/lib/Transforms/IPO/PassManagerBuilder.cpp                        5 lines
llvm/test/Other/new-pm-defaults.ll                                    1 line
llvm/test/Transforms/PhaseOrdering/AArch64/matrix-extract-insert.ll   111 lines

Diff 374041

llvm/lib/Passes/PassBuilderPipelines.cpp

[494 lines not shown; context: PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,]
   // All loop passes must preserve it, in order to be able to use it.
   FPM.addPass(createFunctionToLoopPassAdaptor(std::move(LPM2),
                                               /*UseMemorySSA=*/false,
                                               /*UseBlockFrequencyInfo=*/false));

   // Delete small array after loop unroll.
   FPM.addPass(SROA());

+  // The matrix extension can introduce large vector operations early, which can
+  // benefit from running vector-combine early on.
+  if (EnableMatrix)
+    FPM.addPass(VectorCombinePass());
+
   // Eliminate redundancies.
   FPM.addPass(MergedLoadStoreMotionPass());
   if (RunNewGVN)
     FPM.addPass(NewGVNPass());
   else
     FPM.addPass(GVN());

   // Sparse conditional constant propagation.
[1,220 lines not shown]
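
As a rough illustration of the ordering this hunk establishes, the following toy model builds a list of pass names instead of a real pass manager. The pass names are taken from the diff; the function `buildToyFunctionPipeline` and its parameters are hypothetical and do not use the actual PassBuilder API.

```cpp
#include <string>
#include <vector>

// Toy model of the pass ordering around the new early run. This is a sketch
// for illustration only; it does not call into LLVM.
std::vector<std::string> buildToyFunctionPipeline(bool EnableMatrix,
                                                  bool RunNewGVN) {
  std::vector<std::string> FPM;

  // Delete small array after loop unroll.
  FPM.push_back("SROA");

  // Schedule the extra early run only when the matrix extension is enabled,
  // so it sees the large vector operations before GVN moves them around.
  if (EnableMatrix)
    FPM.push_back("VectorCombinePass");

  // Eliminate redundancies.
  FPM.push_back("MergedLoadStoreMotionPass");
  FPM.push_back(RunNewGVN ? "NewGVNPass" : "GVN");
  return FPM;
}
```

Without the matrix flag the modeled sequence is unchanged, which mirrors the intent that only matrix-enabled builds pay for the additional run.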

llvm/lib/Transforms/IPO/PassManagerBuilder.cpp

[431 lines not shown; context: if (SizeLevel == 0)]
     MPM.add(createPGOMemOPSizeOptLegacyPass());

   // TODO: Investigate the cost/benefit of tail call elimination on debugging.
   if (OptLevel > 1)
     MPM.add(createTailCallEliminationPass()); // Eliminate tail calls
   MPM.add(createCFGSimplificationPass()); // Merge & remove BBs
   MPM.add(createReassociatePass()); // Reassociate expressions

+  // The matrix extension can introduce large vector operations early, which can
+  // benefit from running vector-combine early on.
+  if (EnableMatrix)
+    MPM.add(createVectorCombinePass());
+
   // Begin the loop pass pipeline.
   if (EnableSimpleLoopUnswitch) {
     // The simple loop unswitch pass relies on separate cleanup passes. Schedule
     // them first so when we re-process a loop they run before other loop
     // passes.
     MPM.add(createLoopInstSimplifyPass());
     MPM.add(createLoopSimplifyCFGPass());
   }
[879 lines not shown]

llvm/test/Other/new-pm-defaults.ll

[164 lines not shown]
 ; CHECK-O-NEXT: Running pass: LCSSAPass
 ; CHECK-O-NEXT: Running pass: LoopIdiomRecognizePass
 ; CHECK-O-NEXT: Running pass: IndVarSimplifyPass
 ; CHECK-EP-LOOP-LATE-NEXT: Running pass: NoOpLoopPass
 ; CHECK-O-NEXT: Running pass: LoopDeletionPass
 ; CHECK-O-NEXT: Running pass: LoopFullUnrollPass
 ; CHECK-EP-LOOP-END-NEXT: Running pass: NoOpLoopPass
 ; CHECK-O-NEXT: Running pass: SROA on foo
+; CHECK-MATRIX: Running pass: VectorCombinePass
 ; CHECK-O23SZ-NEXT: Running pass: MergedLoadStoreMotionPass
 ; CHECK-O23SZ-NEXT: Running pass: GVN
 ; CHECK-O23SZ-NEXT: Running analysis: MemoryDependenceAnalysis
 ; CHECK-O23SZ-NEXT: Running analysis: PhiValuesAnalysis
 ; CHECK-O1-NEXT: Running pass: MemCpyOptPass
 ; CHECK-O-NEXT: Running pass: SCCPPass
 ; CHECK-O-NEXT: Running pass: BDCEPass
 ; CHECK-O-NEXT: Running analysis: DemandedBitsAnalysis
[104 lines not shown]

llvm/test/Transforms/PhaseOrdering/AArch64/matrix-extract-insert.ll

[20 lines not shown]
 ; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP6]])
 ; CHECK-NEXT: [[TMP7:%.*]] = bitcast [225 x double]* [[B:%.*]] to <225 x double>*
 ; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP7]], i64 0, i64 [[TMP5]]
 ; CHECK-NEXT: [[MATRIXEXT4:%.*]] = load double, double* [[TMP8]], align 8
 ; CHECK-NEXT: [[MUL:%.*]] = fmul double [[MATRIXEXT]], [[MATRIXEXT4]]
 ; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP7]], i64 0, i64 [[TMP1]]
 ; CHECK-NEXT: [[MATRIXEXT7:%.*]] = load double, double* [[TMP9]], align 8
 ; CHECK-NEXT: [[SUB:%.*]] = fsub double [[MATRIXEXT7]], [[MUL]]
-; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP7]], i64 0, i64 [[TMP1]]
-; CHECK-NEXT: store double [[SUB]], double* [[TMP10]], align 8
+; CHECK-NEXT: store double [[SUB]], double* [[TMP9]], align 8
 ; CHECK-NEXT: ret void
 ;
 entry:
   %i.addr = alloca i32, align 4
   %k.addr = alloca i32, align 4
   %j.addr = alloca i32, align 4
   %A.addr = alloca [225 x double]*, align 8
   %B.addr = alloca [225 x double]*, align 8
[49 lines not shown]
 }
 define void @matrix_extract_insert_loop(i32 %i, [225 x double]* nonnull align 8 dereferenceable(1800) %A, [225 x double]* nonnull align 8 dereferenceable(1800) %B) {
 ; CHECK-LABEL: @matrix_extract_insert_loop(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[TMP0:%.*]] = bitcast [225 x double]* [[A:%.*]] to <225 x double>*
 ; CHECK-NEXT: [[CONV6:%.*]] = zext i32 [[I:%.*]] to i64
 ; CHECK-NEXT: [[TMP1:%.*]] = bitcast [225 x double]* [[B:%.*]] to <225 x double>*
 ; CHECK-NEXT: [[CMP212_NOT:%.*]] = icmp eq i32 [[I]], 0
-; CHECK-NEXT: br i1 [[CMP212_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_COND1_PREHEADER_US_PREHEADER:%.*]]
-; CHECK: for.cond1.preheader.us.preheader:
-; CHECK-NEXT: [[DOTPRE_PRE:%.*]] = load <225 x double>, <225 x double>* [[TMP1]], align 8
-; CHECK-NEXT: br label [[FOR_COND1_PREHEADER_US:%.*]]
+; CHECK-NEXT: br i1 [[CMP212_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_COND1_PREHEADER_US:%.*]]
 ; CHECK: for.cond1.preheader.us:
-; CHECK-NEXT: [[DOTPRE:%.*]] = phi <225 x double> [ [[MATINS_US:%.*]], [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US:%.*]] ], [ [[DOTPRE_PRE]], [[FOR_COND1_PREHEADER_US_PREHEADER]] ]
-; CHECK-NEXT: [[J_014_US:%.*]] = phi i32 [ [[INC13_US:%.*]], [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US]] ], [ 0, [[FOR_COND1_PREHEADER_US_PREHEADER]] ]
-; CHECK-NEXT: [[CONV5_US:%.*]] = zext i32 [[J_014_US]] to i64
-; CHECK-NEXT: [[TMP2:%.*]] = mul nuw nsw i64 [[CONV5_US]], 15
-; CHECK-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[TMP2]], [[CONV6]]
-; CHECK-NEXT: [[TMP4:%.*]] = icmp ult i64 [[TMP3]], 225
-; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP4]])
+; CHECK-NEXT: [[TMP2:%.*]] = icmp ult i32 [[I]], 225
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP2]])
+; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP1]], i64 0, i64 [[CONV6]]
 ; CHECK-NEXT: br label [[FOR_BODY4_US:%.*]]
 ; CHECK: for.body4.us:
-; CHECK-NEXT: [[TMP5:%.*]] = phi <225 x double> [ [[DOTPRE]], [[FOR_COND1_PREHEADER_US]] ], [ [[MATINS_US]], [[FOR_BODY4_US]] ]
 ; CHECK-NEXT: [[K_013_US:%.*]] = phi i32 [ 0, [[FOR_COND1_PREHEADER_US]] ], [ [[INC_US:%.*]], [[FOR_BODY4_US]] ]
 ; CHECK-NEXT: [[CONV_US:%.*]] = zext i32 [[K_013_US]] to i64
-; CHECK-NEXT: [[TMP6:%.*]] = add nuw nsw i64 [[TMP2]], [[CONV_US]]
-; CHECK-NEXT: [[TMP7:%.*]] = icmp ult i64 [[TMP6]], 225
-; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP7]])
-; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP0]], i64 0, i64 [[TMP6]]
-; CHECK-NEXT: [[MATRIXEXT_US:%.*]] = load double, double* [[TMP8]], align 8
-; CHECK-NEXT: [[MATRIXEXT8_US:%.*]] = extractelement <225 x double> [[TMP5]], i64 [[TMP3]]
+; CHECK-NEXT: [[TMP4:%.*]] = icmp ult i32 [[K_013_US]], 225
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP4]])
+; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP0]], i64 0, i64 [[CONV_US]]
+; CHECK-NEXT: [[MATRIXEXT_US:%.*]] = load double, double* [[TMP5]], align 8
+; CHECK-NEXT: [[MATRIXEXT8_US:%.*]] = load double, double* [[TMP3]], align 8
 ; CHECK-NEXT: [[MUL_US:%.*]] = fmul double [[MATRIXEXT_US]], [[MATRIXEXT8_US]]
-; CHECK-NEXT: [[MATRIXEXT11_US:%.*]] = extractelement <225 x double> [[TMP5]], i64 [[TMP6]]
+; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP1]], i64 0, i64 [[CONV_US]]
+; CHECK-NEXT: [[MATRIXEXT11_US:%.*]] = load double, double* [[TMP6]], align 8
 ; CHECK-NEXT: [[SUB_US:%.*]] = fsub double [[MATRIXEXT11_US]], [[MUL_US]]
-; CHECK-NEXT: [[MATINS_US]] = insertelement <225 x double> [[TMP5]], double [[SUB_US]], i64 [[TMP6]]
-; CHECK-NEXT: store <225 x double> [[MATINS_US]], <225 x double>* [[TMP1]], align 8
-; CHECK-NEXT: [[INC_US]] = add nuw i32 [[K_013_US]], 1
+; CHECK-NEXT: store double [[SUB_US]], double* [[TMP6]], align 8
+; CHECK-NEXT: [[INC_US]] = add nuw nsw i32 [[K_013_US]], 1
 ; CHECK-NEXT: [[CMP2_US:%.*]] = icmp ult i32 [[INC_US]], [[I]]
-; CHECK-NEXT: br i1 [[CMP2_US]], label [[FOR_BODY4_US]], label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US]]
+; CHECK-NEXT: br i1 [[CMP2_US]], label [[FOR_BODY4_US]], label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US:%.*]]
 ; CHECK: for.cond1.for.cond.cleanup3_crit_edge.us:
-; CHECK-NEXT: [[INC13_US]] = add nuw nsw i32 [[J_014_US]], 1
-; CHECK-NEXT: [[CMP_US:%.*]] = icmp ult i32 [[J_014_US]], 3
-; CHECK-NEXT: br i1 [[CMP_US]], label [[FOR_COND1_PREHEADER_US]], label [[FOR_COND_CLEANUP]]
+; CHECK-NEXT: [[TMP7:%.*]] = add nuw nsw i64 [[CONV6]], 15
+; CHECK-NEXT: [[TMP8:%.*]] = icmp ult i32 [[I]], 210
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP8]])
+; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP1]], i64 0, i64 [[TMP7]]
+; CHECK-NEXT: br label [[FOR_BODY4_US_1:%.*]]
 ; CHECK: for.cond.cleanup:
 ; CHECK-NEXT: ret void
+; CHECK: for.body4.us.1:
+; CHECK-NEXT: [[K_013_US_1:%.*]] = phi i32 [ 0, [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US]] ], [ [[INC_US_1:%.*]], [[FOR_BODY4_US_1]] ]
+; CHECK-NEXT: [[NARROW:%.*]] = add nuw nsw i32 [[K_013_US_1]], 15
+; CHECK-NEXT: [[TMP10:%.*]] = zext i32 [[NARROW]] to i64
+; CHECK-NEXT: [[TMP11:%.*]] = icmp ult i32 [[K_013_US_1]], 210
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP11]])
+; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP0]], i64 0, i64 [[TMP10]]
+; CHECK-NEXT: [[MATRIXEXT_US_1:%.*]] = load double, double* [[TMP12]], align 8
+; CHECK-NEXT: [[MATRIXEXT8_US_1:%.*]] = load double, double* [[TMP9]], align 8
+; CHECK-NEXT: [[MUL_US_1:%.*]] = fmul double [[MATRIXEXT_US_1]], [[MATRIXEXT8_US_1]]
+; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP1]], i64 0, i64 [[TMP10]]
+; CHECK-NEXT: [[MATRIXEXT11_US_1:%.*]] = load double, double* [[TMP13]], align 8
+; CHECK-NEXT: [[SUB_US_1:%.*]] = fsub double [[MATRIXEXT11_US_1]], [[MUL_US_1]]
+; CHECK-NEXT: store double [[SUB_US_1]], double* [[TMP13]], align 8
+; CHECK-NEXT: [[INC_US_1]] = add nuw nsw i32 [[K_013_US_1]], 1
+; CHECK-NEXT: [[CMP2_US_1:%.*]] = icmp ult i32 [[INC_US_1]], [[I]]
+; CHECK-NEXT: br i1 [[CMP2_US_1]], label [[FOR_BODY4_US_1]], label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US_1:%.*]]
+; CHECK: for.cond1.for.cond.cleanup3_crit_edge.us.1:
+; CHECK-NEXT: [[TMP14:%.*]] = add nuw nsw i64 [[CONV6]], 30
+; CHECK-NEXT: [[TMP15:%.*]] = icmp ult i32 [[I]], 195
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP15]])
+; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP1]], i64 0, i64 [[TMP14]]
+; CHECK-NEXT: br label [[FOR_BODY4_US_2:%.*]]
+; CHECK: for.body4.us.2:
+; CHECK-NEXT: [[K_013_US_2:%.*]] = phi i32 [ 0, [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US_1]] ], [ [[INC_US_2:%.*]], [[FOR_BODY4_US_2]] ]
+; CHECK-NEXT: [[NARROW16:%.*]] = add nuw nsw i32 [[K_013_US_2]], 30
+; CHECK-NEXT: [[TMP17:%.*]] = zext i32 [[NARROW16]] to i64
+; CHECK-NEXT: [[TMP18:%.*]] = icmp ult i32 [[K_013_US_2]], 195
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP18]])
+; CHECK-NEXT: [[TMP19:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP0]], i64 0, i64 [[TMP17]]
+; CHECK-NEXT: [[MATRIXEXT_US_2:%.*]] = load double, double* [[TMP19]], align 8
+; CHECK-NEXT: [[MATRIXEXT8_US_2:%.*]] = load double, double* [[TMP16]], align 8
+; CHECK-NEXT: [[MUL_US_2:%.*]] = fmul double [[MATRIXEXT_US_2]], [[MATRIXEXT8_US_2]]
+; CHECK-NEXT: [[TMP20:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP1]], i64 0, i64 [[TMP17]]
+; CHECK-NEXT: [[MATRIXEXT11_US_2:%.*]] = load double, double* [[TMP20]], align 8
+; CHECK-NEXT: [[SUB_US_2:%.*]] = fsub double [[MATRIXEXT11_US_2]], [[MUL_US_2]]
+; CHECK-NEXT: store double [[SUB_US_2]], double* [[TMP20]], align 8
+; CHECK-NEXT: [[INC_US_2]] = add nuw nsw i32 [[K_013_US_2]], 1
+; CHECK-NEXT: [[CMP2_US_2:%.*]] = icmp ult i32 [[INC_US_2]], [[I]]
+; CHECK-NEXT: br i1 [[CMP2_US_2]], label [[FOR_BODY4_US_2]], label [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US_2:%.*]]
+; CHECK: for.cond1.for.cond.cleanup3_crit_edge.us.2:
+; CHECK-NEXT: [[TMP21:%.*]] = add nuw nsw i64 [[CONV6]], 45
+; CHECK-NEXT: [[TMP22:%.*]] = icmp ult i32 [[I]], 180
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP22]])
+; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP1]], i64 0, i64 [[TMP21]]
+; CHECK-NEXT: br label [[FOR_BODY4_US_3:%.*]]
+; CHECK: for.body4.us.3:
+; CHECK-NEXT: [[K_013_US_3:%.*]] = phi i32 [ 0, [[FOR_COND1_FOR_COND_CLEANUP3_CRIT_EDGE_US_2]] ], [ [[INC_US_3:%.*]], [[FOR_BODY4_US_3]] ]
+; CHECK-NEXT: [[NARROW17:%.*]] = add nuw nsw i32 [[K_013_US_3]], 45
+; CHECK-NEXT: [[TMP24:%.*]] = zext i32 [[NARROW17]] to i64
+; CHECK-NEXT: [[TMP25:%.*]] = icmp ult i32 [[K_013_US_3]], 180
+; CHECK-NEXT: tail call void @llvm.assume(i1 [[TMP25]])
+; CHECK-NEXT: [[TMP26:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP0]], i64 0, i64 [[TMP24]]
+; CHECK-NEXT: [[MATRIXEXT_US_3:%.*]] = load double, double* [[TMP26]], align 8
+; CHECK-NEXT: [[MATRIXEXT8_US_3:%.*]] = load double, double* [[TMP23]], align 8
+; CHECK-NEXT: [[MUL_US_3:%.*]] = fmul double [[MATRIXEXT_US_3]], [[MATRIXEXT8_US_3]]
+; CHECK-NEXT: [[TMP27:%.*]] = getelementptr inbounds <225 x double>, <225 x double>* [[TMP1]], i64 0, i64 [[TMP24]]
+; CHECK-NEXT: [[MATRIXEXT11_US_3:%.*]] = load double, double* [[TMP27]], align 8
+; CHECK-NEXT: [[SUB_US_3:%.*]] = fsub double [[MATRIXEXT11_US_3]], [[MUL_US_3]]
+; CHECK-NEXT: store double [[SUB_US_3]], double* [[TMP27]], align 8
+; CHECK-NEXT: [[INC_US_3]] = add nuw nsw i32 [[K_013_US_3]], 1
+; CHECK-NEXT: [[CMP2_US_3:%.*]] = icmp ult i32 [[INC_US_3]], [[I]]
+; CHECK-NEXT: br i1 [[CMP2_US_3]], label [[FOR_BODY4_US_3]], label [[FOR_COND_CLEANUP]]
 ;
 entry:
   %i.addr = alloca i32, align 4
   %A.addr = alloca [225 x double]*, align 8
   %B.addr = alloca [225 x double]*, align 8
   %j = alloca i32, align 4
   %cleanup.dest.slot = alloca i32, align 4
   %k = alloca i32, align 4
[112 lines not shown]
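
The updated CHECK lines in the loop test replace a whole-vector `load`, `extractelement`, `insertelement` and whole-vector `store` with a `getelementptr` plus a scalar `load` and `store`. The following is a rough C++ model of why the two forms compute the same result when the index is known to be in range, which is what the `llvm.assume` calls express. The names `Vec225`, `updateWholeVector` and `updateScalarized` are hypothetical and exist only for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec225 = std::array<double, 225>;

// Old form: copy the whole vector in, change one lane, copy it back out.
void updateWholeVector(Vec225 &Mem, std::size_t Idx, double Mul) {
  assert(Idx < Mem.size() && "index must be known to be in range");
  Vec225 V = Mem;      // load <225 x double>
  double Elt = V[Idx]; // extractelement
  V[Idx] = Elt - Mul;  // fsub + insertelement
  Mem = V;             // store <225 x double>
}

// New form: address the single element and update it in place.
void updateScalarized(Vec225 &Mem, std::size_t Idx, double Mul) {
  assert(Idx < Mem.size() && "index must be known to be in range");
  double *P = &Mem[Idx]; // getelementptr inbounds
  *P = *P - Mul;         // load double, fsub, store double
}
```

Both functions leave memory in the same state, but the second touches 8 bytes instead of 1800 per update, which is the kind of saving the test is meant to lock in.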