This is an archive of the discontinued LLVM Phabricator instance.

[Polly] Full/partial tile separation for vectorization
ClosedPublic

Authored by gareevroman on Oct 15 2015, 11:48 AM.

Details

Reviewers

grosser
jdoerfert

Commits

rGca7f5bb7674f: Full/partial tile separation for vectorization
rPLO250809: Full/partial tile separation for vectorization
rL250809: Full/partial tile separation for vectorization

Summary

We isolate full tiles from partial tiles to be able to, for example, vectorize loops with parametric lower and/or upper bounds.

With -polly-vectorizer=stripmine, execution time improves: correlation from 1m7361s to 0m5720s (-67.05 %), covariance from 1m5561s to 0m5680s (-63.50 %), ary3 from 2m3201s to 1m2361s (-46.72 %), CrystalMk from 8m5565s to 7m4285s (-13.18 %). However, there is a compile-time regression, for example for 3mm from 0m6320s to 0m9881s (+56.34 %), which should be eliminated in the future.
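To illustrate what separating full tiles from partial tiles buys, here is a minimal hand-written sketch, not the code Polly emits. The names `scaleSeparated`, `scaleReference` and the fixed `VectorWidth` of 4 are assumptions made for this example. The loop over a parametric range `[Lo, Hi)` is split so that the inner loop of the full tiles has a constant trip count, which is the shape a vectorizer can handle easily, while the leftover iterations run in a separate remainder loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vector width, chosen only for this illustration.
constexpr std::size_t VectorWidth = 4;

// Reference version: one loop with parametric bounds Lo and Hi.
void scaleReference(std::vector<double> &A, std::size_t Lo, std::size_t Hi,
                    double F) {
  for (std::size_t I = Lo; I < Hi; ++I)
    A[I] *= F;
}

// Separated version: full tiles first, then the partial tile.
void scaleSeparated(std::vector<double> &A, std::size_t Lo, std::size_t Hi,
                    double F) {
  std::size_t I = Lo;
  // Full tiles: the inner loop always runs exactly VectorWidth times.
  for (; I + VectorWidth <= Hi; I += VectorWidth)
    for (std::size_t V = 0; V < VectorWidth; ++V)
      A[I + V] *= F;
  // Partial tile: fewer than VectorWidth iterations remain.
  for (; I < Hi; ++I)
    A[I] *= F;
}
```

Both functions touch the same elements; only the loop structure differs.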

Diff Detail

Repository: rL LLVM

Event Timeline

gareevroman updated this revision to Diff 37503. Oct 15 2015, 11:48 AM

gareevroman retitled this revision to [Polly] Full/partial tile separation for vectorization.

gareevroman updated this object.

gareevroman added reviewers: grosser, jdoerfert.

gareevroman added a subscriber: Restricted Project.

Hi Roman,

the patch looks good so far. I just have a set of minor documentation comments.

Best,
Tobias

lib/Transform/ScheduleOptimizer.cpp
165 ↗	(On Diff #37503)	Maybe add a brief comment what the function is doing.
181 ↗	(On Diff #37503)	Maybe add a brief comment what the function is doing.
193 ↗	(On Diff #37503)	"an every" ?
202 ↗	(On Diff #37503)	No empty line in between. Also it is unclear what prefixes you are talking about both in the function name and in the @brief header. Maybe call it 'getPartialTilePrefixes()' and describe early on in the comment what kind of prefixes you are calculating.
221–222 ↗	(On Diff #37503)	Again, 'prefixes' is rather generic. Maybe use 'partial tile prefixes'?
232 ↗	(On Diff #37503)	Can we make this an assert?

Some more comments from my side, but mostly style and simplification remarks/questions.

lib/Transform/ScheduleOptimizer.cpp
166 ↗	(On Diff #37503)	Isn't this equivalent to isl_set_params(IsolateDomain)? If not, can somebody explain the difference to me or maybe show me an example where it is not?
174 ↗	(On Diff #37503)	Can't we write all of the above as: auto *IsolateRelation = isl_map_from_domain(IsolateDomain); Btw. please add the * to the auto types as it helps to get at least that part of information and is consistent with the rest of Polly.
210 ↗	(On Diff #37503)	As Tobias once told me, drop_constraints_* is a dangerous function and it is usually not what we want. I do not fully understand why it is needed here but maybe we can use project_out or nothing?
228 ↗	(On Diff #37503)	This line is too much though ;)
229 ↗	(On Diff #37503)	Can we make this a (static) member of the optimizer class? I would probably use it without the scheduling pass at some point.
test/ScheduleOptimizer/full_partial_tile_separation.ll
21 ↗	(On Diff #37503)	Out of curiosity, why do we have 3 cases? The first and second seem clear but I don't understand the third.

Thanks Johannes for the comments. All remarks are clearly useful.

[Resend with "Johannes Doerfert w r o t e" line dropped to ensure
phabricator does not skip the inline comments]

jdoerfert added a comment.

Some more comments from my side, but mostly style and simplification remarks/questions.

Comment at: lib/Transform/ScheduleOptimizer.cpp:166
@@ +165,3 @@
+ auto Dims = isl_set_dim(IsolateDomain, isl_dim_set);
+ auto IsolateRange =
+ isl_set_project_out(isl_set_copy(IsolateDomain), isl_dim_set, 0, Dims);

Isn't this equivalent to isl_set_params(IsolateDomain)? If not, can somebody explain the difference to me or maybe show me an example where it is not?

Seems like, indeed.

Comment at: lib/Transform/ScheduleOptimizer.cpp:174
@@ +173,3 @@
+ IsolateRelation = isl_map_intersect_domain(IsolateRelation, IsolateDomain);
+ IsolateRelation = isl_map_intersect_range(IsolateRelation, IsolateRange);
+ IsolateRelation = isl_map_move_dims(IsolateRelation, isl_dim_out, 0,

Can't we write all of the above as:
auto *IsolateRelation = isl_map_from_domain(IsolateDomain);
Btw. please add the * to the auto types as it helps to get at least that part of information and is consistent with the rest of Polly.

Good point.

Comment at: test/ScheduleOptimizer/full_partial_tile_separation.ll:21
@@ +20,3 @@
+; CHECK: } else if (32 * c1 + 3 >= nj)
+; CHECK: for (int c5 = 0; c5 <= min(31, nk - 32 * c2 - 1); c5 += 1)
+; CHECK: #pragma simd

Out of curiosity, why do we have 3 cases? The first and second seem clear but I don't understand the third.

Isl is distinguishing four cases:

1. The part to isolate
2. The part before the isolated part
3. The part after the isolated part
4. The part that is executed if there is no isolated part

In this case 1, 3 & 4 are generated while 2 is empty and left out. isl
should probably merge parts 3/4 or 2/4 if possible and beneficial, but
this optimization has not yet been implemented (I remember there were
some issues). I am going to send a test case to isl-dev for Sven to have a look.

Best,
Tobias
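The three generated branches can be checked by replaying the CHECK lines of the test for the `nj` dimension only. The sketch below is a simplification under stated assumptions: the function name `visitedJ` is made up, the surrounding `c0`, `c2`, `c3` and `c5` loops are dropped, `nj` is assumed positive so that C++ integer division matches `floord`, and the mapping of the `if` branch to "the part after the isolated part" and of the `else if` branch to "no isolated part" is an interpretation of the explanation above rather than something isl reports.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Replays the c1/c4/c6 structure of the expected AST and records, in order,
// the second argument (the j index) of every statement instance.
std::vector<int> visitedJ(int nj) {
  std::vector<int> Js;
  for (int c1 = 0; c1 <= (nj - 1) / 32; ++c1) {
    // Isolated part: full vectors with exactly 4 iterations of c6.
    for (int c4 = 0; c4 <= std::min(7, -8 * c1 + nj / 4 - 1); ++c4)
      for (int c6 = 0; c6 <= 3; ++c6)
        Js.push_back(32 * c1 + 4 * c4 + c6);
    if (nj >= 32 * c1 + 4 && 32 * c1 + 31 >= nj) {
      // Remainder of a tile that also contains at least one full vector.
      for (int c6 = 0; c6 < nj % 4; ++c6)
        Js.push_back(-((nj - 1) % 4) + nj + c6 - 1);
    } else if (32 * c1 + 3 >= nj) {
      // Tile with fewer than 4 iterations, so no full vector exists in it.
      for (int c6 = 0; c6 < nj - 32 * c1; ++c6)
        Js.push_back(32 * c1 + c6);
    }
  }
  return Js;
}
```

For every positive `nj` the three pieces together should enumerate `0 .. nj - 1` exactly once, which is a quick way to convince oneself that the two remainder branches never overlap.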

Hi Roman,

I also have a couple of comments regarding the compile-time increase. There are two reasons for it:

1. isl is generating two branches instead of one

This means we are generating more IR and as a result the LLVM backends have more code-generation work to do. I will submit a test case for Sven to have a look.

2. We are spending a lot of time in IslAst, more than isl_codegen takes to generate the AST

Surprisingly, when calling isl from Polly we spend significantly more time on AST generation than when generating the very same AST on the command line. Some of this time is due to us doing parallelism checks, but even if these checks are commented out we somehow still lose a notable amount of time for unknown reasons.

There are two steps to address this:

a) We can use 'mark' nodes to already annotate the SIMD loop during ScheduleTransformation, skip the parallelism checks and generate SIMD code/Parallelism annotations in the IslNodeBuilder when the SIMD marker is found

b) Find out where else compile time is lost.

My feeling is that at least 50% of the compile time increase is unnecessary and could be avoided.

Thank you for the comments!

lib/Transform/ScheduleOptimizer.cpp
193 ↗	(On Diff #37503)	Is "for any" better?
210 ↗	(On Diff #37503)	I’ve tried to get rid of this by explicit allocation of constraints in a new version of the patch.
229 ↗	(On Diff #37503)	Should we make this a public static member?

gareevroman updated this revision to Diff 37777. Oct 19 2015, 11:01 AM

gareevroman edited edge metadata.

LGTM.

lib/Transform/ScheduleOptimizer.cpp
210 ↗	(On Diff #37777)	That looks better and provides a constructive description of what is happening (instead of the destructive one before). thx!
229 ↗	(On Diff #37777)	Would be great yes.

isolateFullPartialTiles was made a public static member of the optimizer class.

@grosser, @jdoerfert, could we commit this patch, if there are no more issues, except for the compile-time regression?

Closed by commit rL250809: Full/partial tile separation for vectorization (authored by grosser). Oct 20 2015, 2:14 AM

This revision was automatically updated to reflect the committed changes.

@grosser, thanks!

Revision Contents

Path	Size
polly/trunk/include/polly/ScheduleOptimizer.h	11 lines
polly/trunk/lib/Transform/ScheduleOptimizer.cpp	99 lines
polly/trunk/test/ScheduleOptimizer/full_partial_tile_separation.ll	92 lines

Diff 37838

polly/trunk/include/polly/ScheduleOptimizer.h

...
 public:
   ///
   /// @param S The SCoP we optimize.
   /// @param NewSchedule The new schedule we computed.
   ///
   /// @return True, if we believe @p NewSchedule is an improvement for @p S.
   static bool isProfitableSchedule(polly::Scop &S,
                                    __isl_keep isl_union_map *NewSchedule);

+  /// @brief Isolate a set of partial tile prefixes.
+  ///
+  /// This set should ensure that it contains only partial tile prefixes that
+  /// have exactly VectorWidth iterations.
+  ///
+  /// @param Node A schedule node band, which is a parent of a band node,
+  ///             that contains a vector loop.
+  /// @return Modified isl_schedule_node.
+  static __isl_give isl_schedule_node *
+  isolateFullPartialTiles(__isl_take isl_schedule_node *Node, int VectorWidth);

 private:
   /// @brief Tile a schedule node.
   ///
   /// @param Node The node to tile.
   /// @param Identifier An name that identifies this kind of tiling and
   ///                   that is used to mark the tiled loops in the
   ///                   generated AST.
   /// @param TileSizes A vector of tile sizes that should be used for
...

polly/trunk/lib/Transform/ScheduleOptimizer.cpp

...

 static cl::list<int>
     RegisterTileSizes("polly-register-tile-sizes",
                       cl::desc("A tile size for each loop dimension, filled "
                                "with --polly-register-tile-size"),
                       cl::Hidden, cl::ZeroOrMore, cl::CommaSeparated,
                       cl::cat(PollyCategory));

+/// @brief Create an isl_union_set, which describes the isolate option based
+///        on IsolateDomain.
+///
+/// @param IsolateDomain An isl_set whose last dimension is the only one that
+///                      should belong to the current band node.
+static __isl_give isl_union_set *
+getIsolateOptions(__isl_take isl_set *IsolateDomain) {
+  auto Dims = isl_set_dim(IsolateDomain, isl_dim_set);
+  auto *IsolateRelation = isl_map_from_domain(IsolateDomain);
+  IsolateRelation = isl_map_move_dims(IsolateRelation, isl_dim_out, 0,
+                                      isl_dim_in, Dims - 1, 1);
+  auto *IsolateOption = isl_map_wrap(IsolateRelation);
+  auto *Id = isl_id_alloc(isl_set_get_ctx(IsolateOption), "isolate", NULL);
+  return isl_union_set_from_set(isl_set_set_tuple_id(IsolateOption, Id));
+}

+/// @brief Create an isl_union_set, which describes the atomic option for the
+///        dimension of the current node.
+///
+/// It may help to reduce the size of generated code.
+///
+/// @param Ctx An isl_ctx, which is used to create the isl_union_set.
+static __isl_give isl_union_set *getAtomicOptions(__isl_take isl_ctx *Ctx) {
+  auto *Space = isl_space_set_alloc(Ctx, 0, 1);
+  auto *AtomicOption = isl_set_universe(Space);
+  auto *Id = isl_id_alloc(Ctx, "atomic", NULL);
+  return isl_union_set_from_set(isl_set_set_tuple_id(AtomicOption, Id));
+}

+/// @brief Make the last dimension of Set to take values
+///        from 0 to VectorWidth - 1.
+///
+/// @param Set A set, which should be modified.
+/// @param VectorWidth A parameter, which determines the constraint.
+static __isl_give isl_set *addExtentConstraints(__isl_take isl_set *Set,
+                                                int VectorWidth) {
+  auto Dims = isl_set_dim(Set, isl_dim_set);
+  auto Space = isl_set_get_space(Set);
+  auto *LocalSpace = isl_local_space_from_space(Space);
+  auto *ExtConstr =
+      isl_constraint_alloc_inequality(isl_local_space_copy(LocalSpace));
+  ExtConstr = isl_constraint_set_constant_si(ExtConstr, 0);
+  ExtConstr =
+      isl_constraint_set_coefficient_si(ExtConstr, isl_dim_set, Dims - 1, 1);
+  Set = isl_set_add_constraint(Set, ExtConstr);
+  ExtConstr = isl_constraint_alloc_inequality(LocalSpace);
+  ExtConstr = isl_constraint_set_constant_si(ExtConstr, VectorWidth - 1);
+  ExtConstr =
+      isl_constraint_set_coefficient_si(ExtConstr, isl_dim_set, Dims - 1, -1);
+  return isl_set_add_constraint(Set, ExtConstr);
+}

+/// @brief Build the desired set of partial tile prefixes.
+///
+/// We build a set of partial tile prefixes, which are prefixes of the vector
+/// loop that have exactly VectorWidth iterations.
+///
+/// 1. Get all prefixes of the vector loop.
+/// 2. Extend it to a set, which has exactly VectorWidth iterations for
+///    any prefix from the set that was built on the previous step.
+/// 3. Subtract loop domain from it, project out the vector loop dimension and
+///    get a set of prefixes, which don’t have exactly VectorWidth iterations.
+/// 4. Subtract it from all prefixes of the vector loop and get the desired
+///    set.
+///
+/// @param ScheduleRange A range of a map, which describes a prefix schedule
+///                      relation.
+static __isl_give isl_set *
+getPartialTilePrefixes(__isl_take isl_set *ScheduleRange, int VectorWidth) {
+  auto Dims = isl_set_dim(ScheduleRange, isl_dim_set);
+  auto *LoopPrefixes = isl_set_project_out(isl_set_copy(ScheduleRange),
+                                           isl_dim_set, Dims - 1, 1);
+  auto *ExtentPrefixes =
+      isl_set_add_dims(isl_set_copy(LoopPrefixes), isl_dim_set, 1);
+  ExtentPrefixes = addExtentConstraints(ExtentPrefixes, VectorWidth);
+  auto *BadPrefixes = isl_set_subtract(ExtentPrefixes, ScheduleRange);
+  BadPrefixes = isl_set_project_out(BadPrefixes, isl_dim_set, Dims - 1, 1);
+  return isl_set_subtract(LoopPrefixes, BadPrefixes);
+}

+__isl_give isl_schedule_node *ScheduleTreeOptimizer::isolateFullPartialTiles(
+    __isl_take isl_schedule_node *Node, int VectorWidth) {
+  assert(isl_schedule_node_get_type(Node) == isl_schedule_node_band);
+  Node = isl_schedule_node_child(Node, 0);
+  Node = isl_schedule_node_child(Node, 0);
+  auto *SchedRelUMap = isl_schedule_node_get_prefix_schedule_relation(Node);
+  auto *ScheduleRelation = isl_map_from_union_map(SchedRelUMap);
+  auto *ScheduleRange = isl_map_range(ScheduleRelation);
+  auto *IsolateDomain = getPartialTilePrefixes(ScheduleRange, VectorWidth);
+  auto *AtomicOption = getAtomicOptions(isl_set_get_ctx(IsolateDomain));
+  auto *IsolateOption = getIsolateOptions(IsolateDomain);
+  Node = isl_schedule_node_parent(Node);
+  Node = isl_schedule_node_parent(Node);
+  auto *Options = isl_union_set_union(IsolateOption, AtomicOption);
+  Node = isl_schedule_node_band_set_ast_build_options(Node, Options);
+  return Node;
+}

 __isl_give isl_schedule_node *
 ScheduleTreeOptimizer::prevectSchedBand(__isl_take isl_schedule_node *Node,
                                         unsigned DimToVectorize,
                                         int VectorWidth) {
   assert(isl_schedule_node_get_type(Node) == isl_schedule_node_band);

   auto Space = isl_schedule_node_band_get_space(Node);
   auto ScheduleDimensions = isl_space_dim(Space, isl_dim_set);
   isl_space_free(Space);
   assert(DimToVectorize < ScheduleDimensions);

   if (DimToVectorize > 0) {
     Node = isl_schedule_node_band_split(Node, DimToVectorize);
     Node = isl_schedule_node_child(Node, 0);
   }
   if (DimToVectorize < ScheduleDimensions - 1)
     Node = isl_schedule_node_band_split(Node, 1);
   Space = isl_schedule_node_band_get_space(Node);
   auto Sizes = isl_multi_val_zero(Space);
   auto Ctx = isl_schedule_node_get_ctx(Node);
   Sizes =
       isl_multi_val_set_val(Sizes, 0, isl_val_int_from_si(Ctx, VectorWidth));
   Node = isl_schedule_node_band_tile(Node, Sizes);
+  Node = isolateFullPartialTiles(Node, VectorWidth);
   Node = isl_schedule_node_child(Node, 0);
   // Make sure the "trivially vectorizable loop" is not unrolled. Otherwise,
   // we will have troubles to match it in the backend.
   Node = isl_schedule_node_band_set_ast_build_options(
       Node, isl_union_set_read_from_str(Ctx, "{ unroll[x]: 1 = 0 }"));
   Node = isl_schedule_node_band_sink(Node);
   Node = isl_schedule_node_child(Node, 0);
   return Node;
...
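The four steps documented for getPartialTilePrefixes can be mimicked on small explicit sets, which may help when reading the isl calls. This is only a model under assumptions: it uses `std::set` instead of isl sets, a single integer stands in for the whole loop prefix, and the names `Point`, `makeRange` and `fullVectorPrefixes` are invented for the illustration.

```cpp
#include <cassert>
#include <set>
#include <utility>

// (prefix, iterator of the vector loop); the prefix is one integer here.
using Point = std::pair<int, int>;

// Hypothetical schedule range of a loop 0 <= i < N strip-mined by VectorWidth:
// i = VectorWidth * prefix + v.
std::set<Point> makeRange(int N, int VectorWidth) {
  std::set<Point> Range;
  for (int I = 0; I < N; ++I)
    Range.insert(Point(I / VectorWidth, I % VectorWidth));
  return Range;
}

std::set<int> fullVectorPrefixes(const std::set<Point> &ScheduleRange,
                                 int VectorWidth) {
  // Step 1: project out the vector loop dimension to get all prefixes.
  std::set<int> LoopPrefixes;
  for (const Point &P : ScheduleRange)
    LoopPrefixes.insert(P.first);
  // Steps 2 and 3: extend every prefix by 0 .. VectorWidth - 1, subtract the
  // real domain, and project again. A prefix is "bad" if any point is missing.
  std::set<int> BadPrefixes;
  for (int Prefix : LoopPrefixes)
    for (int V = 0; V < VectorWidth; ++V)
      if (ScheduleRange.count(Point(Prefix, V)) == 0)
        BadPrefixes.insert(Prefix);
  // Step 4: subtract the bad prefixes from all prefixes.
  std::set<int> Result;
  for (int Prefix : LoopPrefixes)
    if (BadPrefixes.count(Prefix) == 0)
      Result.insert(Prefix);
  return Result;
}
```

With `N = 10` and a vector width of 4, prefixes 0 and 1 each own four iterations while prefix 2 owns only two, so only 0 and 1 remain in the result.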

polly/trunk/test/ScheduleOptimizer/full_partial_tile_separation.ll

				; RUN: opt -S %loadPolly -polly-vectorizer=stripmine -polly-opt-isl -polly-ast -analyze < %s | FileCheck %s

				; CHECK: // 1st level tiling - Tiles
				; CHECK: #pragma known-parallel
				; CHECK: for (int c0 = 0; c0 <= floord(ni - 1, 32); c0 += 1)
				; CHECK: for (int c1 = 0; c1 <= floord(nj - 1, 32); c1 += 1)
				; CHECK: for (int c2 = 0; c2 <= floord(nk - 1, 32); c2 += 1) {
				; CHECK: // 1st level tiling - Points
				; CHECK: for (int c3 = 0; c3 <= min(31, ni - 32 * c0 - 1); c3 += 1) {
				; CHECK: for (int c4 = 0; c4 <= min(7, -8 * c1 + nj / 4 - 1); c4 += 1)
				; CHECK: for (int c5 = 0; c5 <= min(31, nk - 32 * c2 - 1); c5 += 1)
				; CHECK: #pragma simd
				; CHECK: for (int c6 = 0; c6 <= 3; c6 += 1)
				; CHECK: Stmt_for_body_6(32 * c0 + c3, 32 * c1 + 4 * c4 + c6, 32 * c2 + c5);
				; CHECK: if (nj >= 32 * c1 + 4 && 32 * c1 + 31 >= nj) {
				; CHECK: for (int c5 = 0; c5 <= min(31, nk - 32 * c2 - 1); c5 += 1)
				; CHECK: #pragma simd
				; CHECK: for (int c6 = 0; c6 < nj % 4; c6 += 1)
				; CHECK: Stmt_for_body_6(32 * c0 + c3, -((nj - 1) % 4) + nj + c6 - 1, 32 * c2 + c5);
				; CHECK: } else if (32 * c1 + 3 >= nj)
				; CHECK: for (int c5 = 0; c5 <= min(31, nk - 32 * c2 - 1); c5 += 1)
				; CHECK: #pragma simd
				; CHECK: for (int c6 = 0; c6 < nj - 32 * c1; c6 += 1)
				; CHECK: Stmt_for_body_6(32 * c0 + c3, 32 * c1 + c6, 32 * c2 + c5);
				; CHECK: }
				; CHECK: }

				; Function Attrs: nounwind uwtable
				define void @kernel_gemm(i32 %ni, i32 %nj, i32 %nk, double %alpha, double %beta, [1024 x double]* %C, [1024 x double]* %A, [1024 x double]* %B) #0 {
				entry:
				%cmp.27 = icmp sgt i32 %ni, 0
				br i1 %cmp.27, label %for.cond.1.preheader.lr.ph, label %for.end.22

				for.cond.1.preheader.lr.ph: ; preds = %entry
				br label %for.cond.1.preheader

				for.cond.1.preheader: ; preds = %for.cond.1.preheader.lr.ph, %for.inc.20
				%indvars.iv33 = phi i64 [ 0, %for.cond.1.preheader.lr.ph ], [ %indvars.iv.next34, %for.inc.20 ]
				%cmp2.25 = icmp sgt i32 %nj, 0
				br i1 %cmp2.25, label %for.cond.4.preheader.lr.ph, label %for.inc.20

				for.cond.4.preheader.lr.ph: ; preds = %for.cond.1.preheader
				br label %for.cond.4.preheader

				for.cond.4.preheader: ; preds = %for.cond.4.preheader.lr.ph, %for.inc.17
				%indvars.iv29 = phi i64 [ 0, %for.cond.4.preheader.lr.ph ], [ %indvars.iv.next30, %for.inc.17 ]
				%cmp5.23 = icmp sgt i32 %nk, 0
				br i1 %cmp5.23, label %for.body.6.lr.ph, label %for.inc.17

				for.body.6.lr.ph: ; preds = %for.cond.4.preheader
				br label %for.body.6

				for.body.6: ; preds = %for.body.6.lr.ph, %for.body.6
				%indvars.iv = phi i64 [ 0, %for.body.6.lr.ph ], [ %indvars.iv.next, %for.body.6 ]
				%arrayidx8 = getelementptr inbounds [1024 x double], [1024 x double]* %A, i64 %indvars.iv33, i64 %indvars.iv
				%0 = load double, double* %arrayidx8, align 8
				%arrayidx12 = getelementptr inbounds [1024 x double], [1024 x double]* %B, i64 %indvars.iv, i64 %indvars.iv29
				%1 = load double, double* %arrayidx12, align 8
				%mul = fmul double %0, %1
				%arrayidx16 = getelementptr inbounds [1024 x double], [1024 x double]* %C, i64 %indvars.iv33, i64 %indvars.iv29
				%2 = load double, double* %arrayidx16, align 8
				%add = fadd double %2, %mul
				store double %add, double* %arrayidx16, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp ne i32 %lftr.wideiv, %nk
				br i1 %exitcond, label %for.body.6, label %for.cond.4.for.inc.17_crit_edge

				for.cond.4.for.inc.17_crit_edge: ; preds = %for.body.6
				br label %for.inc.17

				for.inc.17: ; preds = %for.cond.4.for.inc.17_crit_edge, %for.cond.4.preheader
				%indvars.iv.next30 = add nuw nsw i64 %indvars.iv29, 1
				%lftr.wideiv31 = trunc i64 %indvars.iv.next30 to i32
				%exitcond32 = icmp ne i32 %lftr.wideiv31, %nj
				br i1 %exitcond32, label %for.cond.4.preheader, label %for.cond.1.for.inc.20_crit_edge

				for.cond.1.for.inc.20_crit_edge: ; preds = %for.inc.17
				br label %for.inc.20

				for.inc.20: ; preds = %for.cond.1.for.inc.20_crit_edge, %for.cond.1.preheader
				%indvars.iv.next34 = add nuw nsw i64 %indvars.iv33, 1
				%lftr.wideiv35 = trunc i64 %indvars.iv.next34 to i32
				%exitcond36 = icmp ne i32 %lftr.wideiv35, %ni
				br i1 %exitcond36, label %for.cond.1.preheader, label %for.cond.for.end.22_crit_edge

				for.cond.for.end.22_crit_edge: ; preds = %for.inc.20
				br label %for.end.22

				for.end.22: ; preds = %for.cond.for.end.22_crit_edge, %entry
				ret void
				}