This is an archive of the discontinued LLVM Phabricator instance.

[Polly] Full/partial tile separation for vectorization
ClosedPublic

Authored by gareevroman on Oct 15 2015, 11:48 AM.

Details

Summary

We isolate full tiles from partial tiles to be able to, for example, vectorize loops with parametric lower and/or upper bounds.

If we use -polly-vectorizer=stripmine, we can see execution-time improvements: correlation from 1m7361s to 0m5720s (-67.05 %), covariance from 1m5561s to 0m5680s (-63.50 %), ary3 from 2m3201s to 1m2361s (-46.72 %), CrystalMk from 8m5565s to 7m4285s (-13.18 %). However, there is a compile-time regression, for example, for 3mm from 0m6320s to 0m9881s (56.34%), which should be eliminated in future.

Diff Detail

Event Timeline

gareevroman retitled this revision from to [Polly] Full/partial tile separation for vectorization.
gareevroman updated this object.
gareevroman added reviewers: grosser, jdoerfert.
gareevroman added a subscriber: Restricted Project.
grosser edited edge metadata.Oct 15 2015, 12:18 PM

Hi Roman,

the patch looks good so far. I just have a set of minor documentation comments.

Best,
Tobias

lib/Transform/ScheduleOptimizer.cpp
165

Maybe add a brief comment what the function is doing.

181

Maybe add a brief comment what the function is doing.

193

"an every" ?

202

No empty line in between.

Also it is unclear what prefixes you are talking about both in the function name and in the @brief header. Maybe call it 'getPartialTilePrefixes()' and describe early on in the comment what kind of prefixes you are calculating.

221–222

Again, 'prefixes' is rather generic. Maybe use 'partial tile prefixes'?

232

Can we make this an assert?

jdoerfert edited edge metadata.Oct 15 2015, 2:27 PM

Some more comments from my side, but mostly style and simplification remarks/questions.

lib/Transform/ScheduleOptimizer.cpp
166

isn't this equivalent to isl_set_params(IsolateDomain)? If not can sb explain me the difference or maybe show me an example where it is not.

174

Can't we write all of the above as:

auto *IsolateRelation = isl_map_from_domain(IsolateDomain);

Btw. please add the * to the auto types as it helps to get at least that part of information and is consistent with the rest of Polly.

210

As Tobias once told me, drop_constraints_* is a dangerous function and it is usally not what we want. I do not fully understand why it is needed here but maybe we can use project_out or nothing?

228

This line is to much though ;)

229

Can we make this a (static) member of the optimizer class? I would probably use it without the scheduling pass at some point.

test/ScheduleOptimizer/full_partial_tile_separation.ll
22

Out of curiosity, why do we have 3 cases? The first and second seem clear but I don't understand the third.

Thanks Johannes for the comments. All remarks are clearly useful.

[Resend with "Johannes Doerfert w r o t e" line dropped to ensure
phabricator does not skip the inline comments]

Thanks Johannes for the comments. All remarks are clearly useful.

jdoerfert added a comment.

Some more comments from my side, but mostly style and simplification remarks/questions.

Comment at: lib/Transform/ScheduleOptimizer.cpp:166
@@ +165,3 @@
+ auto Dims = isl_set_dim(IsolateDomain, isl_dim_set);
+ auto IsolateRange =

+ isl_set_project_out(isl_set_copy(IsolateDomain), isl_dim_set, 0, Dims);

isn't this equivalent to isl_set_params(IsolateDomain)? If not can sb explain me the difference or maybe show me an example where it is not.

Seems like, indeed.

Comment at: lib/Transform/ScheduleOptimizer.cpp:174
@@ +173,3 @@
+ IsolateRelation = isl_map_intersect_domain(IsolateRelation, IsolateDomain);
+ IsolateRelation = isl_map_intersect_range(IsolateRelation, IsolateRange);

+ IsolateRelation = isl_map_move_dims(IsolateRelation, isl_dim_out, 0,

Can't we write all of the above as:

auto *IsolateRelation = isl_map_from_domain(IsolateDomain);

Btw. please add the * to the auto types as it helps to get at least that part of information and is consistent with the rest of Polly.

Good point.

Comment at: test/ScheduleOptimizer/full_partial_tile_separation.ll:21
@@ +20,3 @@
+; CHECK: } else if (32 * c1 + 3 >= nj)
+; CHECK: for (int c5 = 0; c5 <= min(31, nk - 32 * c2 - 1); c5 += 1)

+; CHECK: #pragma simd

Out of curiosity, why do we have 3 cases? The first and second seem clear but I don't understand the third.

Isl is distinguishing four cases:

  1. The part to isolate
  2. The part before the isolated part
  3. The part after the isolated part
  4. The part that is executed if there is no isolated part

In this case 1, 3 & 4 are generated while 2 is empty and left out. isl
should probably merge parts 3/4 or 2/4 if possible and beneficial, but
this optimization has not yet been implemented (I remember there were
some issues). I gonna send a test case to isl-dev for Sven to have a look.

Best,
Tobias

Hi Roman,

I also have a couple of comment regarding the compile time increase. There are two reasons for the compile-time increase:

  1. isl is generating two branches instead of one

This means we are generating more IR and as a result the LLVM backends have more code-generation work to do. I will submit a test case for Sven to have a look.

  1. We are spending a lot of time in IslAst, more than isl_codegen takes to generate the AST

Surprisingly when calling isl from Polly we spend significantly more time on AST generation then when generating the very same AST on the command line. Some of this time is due to us doing parallelism checks, but even if these checks are commented out we somehow still loose a notable amount of time for unknown reasons.

There are two steps to address this:

a) We can use 'mark' nodes to already annotate the SIMD loop during ScheduleTransformation, skip the parallelism checks and generate SIMD code/Parallelism annotations in the IslNodeBuilder when the SIMD marker is found

b) Find out where else compile time is lost.

My feeling is that at least 50% of the compile time increase is unnecessary and could be avoided.

Thank you for the comments!

lib/Transform/ScheduleOptimizer.cpp
193

Is "for any" better?

210

I’ve tried to get rid of this by explicit allocation of constraints in a new version of the patch.

229

Should we make this a public static member?

gareevroman edited edge metadata.

LGTM.

lib/Transform/ScheduleOptimizer.cpp
210

That looks better and provides a constructive desciption of what is happening (instead of the destructive one before). thx!

229

Would be great yes.

gareevroman marked 15 inline comments as done.

isolateFullPartialTiles was made a public static member of the optimizer class.

@grosser, @jdoerfert, could we commit this patch, if there are no more issues, except for the compile-time regression?

This revision was automatically updated to reflect the committed changes.