This is an archive of the discontinued LLVM Phabricator instance.

[Polly] Full/partial tile separation for vectorization
ClosedPublic

Authored by gareevroman on Oct 15 2015, 11:48 AM.

Details

Summary

We isolate full tiles from partial tiles to be able to, for example, vectorize loops with parametric lower and/or upper bounds.

If we use -polly-vectorizer=stripmine, we can see execution-time improvements: correlation from 1m7361s to 0m5720s (-67.05 %), covariance from 1m5561s to 0m5680s (-63.50 %), ary3 from 2m3201s to 1m2361s (-46.72 %), CrystalMk from 8m5565s to 7m4285s (-13.18 %). However, there is a compile-time regression, for example, for 3mm from 0m6320s to 0m9881s (56.34%), which should be eliminated in future.

Diff Detail

Repository
rL LLVM

Event Timeline

gareevroman retitled this revision from to [Polly] Full/partial tile separation for vectorization.
gareevroman updated this object.
gareevroman added reviewers: grosser, jdoerfert.
gareevroman added a subscriber: Restricted Project.
grosser edited edge metadata.Oct 15 2015, 12:18 PM

Hi Roman,

the patch looks good so far. I just have a set of minor documentation comments.

Best,
Tobias

lib/Transform/ScheduleOptimizer.cpp
165 ↗(On Diff #37503)

Maybe add a brief comment what the function is doing.

181 ↗(On Diff #37503)

Maybe add a brief comment what the function is doing.

193 ↗(On Diff #37503)

"an every" ?

202 ↗(On Diff #37503)

No empty line in between.

Also it is unclear what prefixes you are talking about both in the function name and in the @brief header. Maybe call it 'getPartialTilePrefixes()' and describe early on in the comment what kind of prefixes you are calculating.

221–222 ↗(On Diff #37503)

Again, 'prefixes' is rather generic. Maybe use 'partial tile prefixes'?

232 ↗(On Diff #37503)

Can we make this an assert?

jdoerfert edited edge metadata.Oct 15 2015, 2:27 PM

Some more comments from my side, but mostly style and simplification remarks/questions.

lib/Transform/ScheduleOptimizer.cpp
166 ↗(On Diff #37503)

isn't this equivalent to isl_set_params(IsolateDomain)? If not can sb explain me the difference or maybe show me an example where it is not.

174 ↗(On Diff #37503)

Can't we write all of the above as:

auto *IsolateRelation = isl_map_from_domain(IsolateDomain);

Btw. please add the * to the auto types as it helps to get at least that part of information and is consistent with the rest of Polly.

210 ↗(On Diff #37503)

As Tobias once told me, drop_constraints_* is a dangerous function and it is usally not what we want. I do not fully understand why it is needed here but maybe we can use project_out or nothing?

228 ↗(On Diff #37503)

This line is to much though ;)

229 ↗(On Diff #37503)

Can we make this a (static) member of the optimizer class? I would probably use it without the scheduling pass at some point.

test/ScheduleOptimizer/full_partial_tile_separation.ll
21 ↗(On Diff #37503)

Out of curiosity, why do we have 3 cases? The first and second seem clear but I don't understand the third.

Thanks Johannes for the comments. All remarks are clearly useful.

[Resend with "Johannes Doerfert w r o t e" line dropped to ensure
phabricator does not skip the inline comments]

Thanks Johannes for the comments. All remarks are clearly useful.

jdoerfert added a comment.

Some more comments from my side, but mostly style and simplification remarks/questions.

Comment at: lib/Transform/ScheduleOptimizer.cpp:166
@@ +165,3 @@
+ auto Dims = isl_set_dim(IsolateDomain, isl_dim_set);
+ auto IsolateRange =

+ isl_set_project_out(isl_set_copy(IsolateDomain), isl_dim_set, 0, Dims);

isn't this equivalent to isl_set_params(IsolateDomain)? If not can sb explain me the difference or maybe show me an example where it is not.

Seems like, indeed.

Comment at: lib/Transform/ScheduleOptimizer.cpp:174
@@ +173,3 @@
+ IsolateRelation = isl_map_intersect_domain(IsolateRelation, IsolateDomain);
+ IsolateRelation = isl_map_intersect_range(IsolateRelation, IsolateRange);

+ IsolateRelation = isl_map_move_dims(IsolateRelation, isl_dim_out, 0,

Can't we write all of the above as:

auto *IsolateRelation = isl_map_from_domain(IsolateDomain);

Btw. please add the * to the auto types as it helps to get at least that part of information and is consistent with the rest of Polly.

Good point.

Comment at: test/ScheduleOptimizer/full_partial_tile_separation.ll:21
@@ +20,3 @@
+; CHECK: } else if (32 * c1 + 3 >= nj)
+; CHECK: for (int c5 = 0; c5 <= min(31, nk - 32 * c2 - 1); c5 += 1)

+; CHECK: #pragma simd

Out of curiosity, why do we have 3 cases? The first and second seem clear but I don't understand the third.

Isl is distinguishing four cases:

  1. The part to isolate
  2. The part before the isolated part
  3. The part after the isolated part
  4. The part that is executed if there is no isolated part

In this case 1, 3 & 4 are generated while 2 is empty and left out. isl
should probably merge parts 3/4 or 2/4 if possible and beneficial, but
this optimization has not yet been implemented (I remember there were
some issues). I gonna send a test case to isl-dev for Sven to have a look.

Best,
Tobias

Hi Roman,

I also have a couple of comment regarding the compile time increase. There are two reasons for the compile-time increase:

  1. isl is generating two branches instead of one

This means we are generating more IR and as a result the LLVM backends have more code-generation work to do. I will submit a test case for Sven to have a look.

  1. We are spending a lot of time in IslAst, more than isl_codegen takes to generate the AST

Surprisingly when calling isl from Polly we spend significantly more time on AST generation then when generating the very same AST on the command line. Some of this time is due to us doing parallelism checks, but even if these checks are commented out we somehow still loose a notable amount of time for unknown reasons.

There are two steps to address this:

a) We can use 'mark' nodes to already annotate the SIMD loop during ScheduleTransformation, skip the parallelism checks and generate SIMD code/Parallelism annotations in the IslNodeBuilder when the SIMD marker is found

b) Find out where else compile time is lost.

My feeling is that at least 50% of the compile time increase is unnecessary and could be avoided.

Thank you for the comments!

lib/Transform/ScheduleOptimizer.cpp
193 ↗(On Diff #37503)

Is "for any" better?

210 ↗(On Diff #37503)

I’ve tried to get rid of this by explicit allocation of constraints in a new version of the patch.

229 ↗(On Diff #37503)

Should we make this a public static member?

gareevroman edited edge metadata.

LGTM.

lib/Transform/ScheduleOptimizer.cpp
210 ↗(On Diff #37777)

That looks better and provides a constructive desciption of what is happening (instead of the destructive one before). thx!

229 ↗(On Diff #37777)

Would be great yes.

gareevroman marked 15 inline comments as done.

isolateFullPartialTiles was made a public static member of the optimizer class.

@grosser, @jdoerfert, could we commit this patch, if there are no more issues, except for the compile-time regression?

This revision was automatically updated to reflect the committed changes.