[Unroll/UnrollAndJam/Vectorizer/Distribute] Add followup loop attributes.
Needs ReviewPublic

Authored by Meinersbur on Jul 12 2018, 10:15 PM.

Details

Summary

When multiple loop transformation are defined in a loop's metadata, their order of execution is defined by the order of their respective passes in the pass pipeline. For instance, e.g.

#pragma clang loop unroll_and_jam(enable)
#pragma clang loop distribute(enable)

is the same as

#pragma clang loop distribute(enable)
#pragma clang loop unroll_and_jam(enable)

and will try to loop-distribute before Unroll-And-Jam because the LoopDistribute pass is scheduled after UnrollAndJam pass. UnrollAndJamPass only supports one inner loop, i.e. it will necessarily fail after loop distribution. It is not possible to specify another execution order. Also,t the order of passes in the pipeline is subject to change between versions of LLVM, optimization options and which pass manager is used.

This patch adds 'followup' attributes to various loop transformation passes. These attributes define which attributes the resulting loop of a transformation should have. For instance,

!0 = !{!0, !1, !2}
!1 = !{!"llvm.loop.unroll_and_jam.enable"}
!2 = !{!"llvm.loop.unroll_and_jam.followup_inner", !3}
!3 = !{!"llvm.loop.distribute.enable"}

defines a loop ID (!0) to be unrolled-and-jammed (!1) and then the attribute !3 to be added to the jammed inner loop, which contains the instruction to distribute the inner loop.

Currently, in both pass managers, pass execution is in a fixed order and UnrollAndJamPass will not execute again after LoopDistribute. We hope to fix this in the future by allowing pass managers to run passes until a fixpoint is reached, use Polly to perform these transformations, or add a loop transformation pass which takes the order issue into account.

For mandatory/forced transformations (e.g. by having been declared by #pragma omp simd), the user must be notified when a transformation could not be performed. It is not possible that the responsible pass emits such a warning because the transformation might be 'hidden' in a followup attribute when it is executed, or it is not present in the pipeline at all. For this reason, this patche introduces a WarnMissedTransformations pass, to warn about orphaned transformations.

To ensure that no other transformation is executed before the intended one, the attribute llvm.loop.transformations.disable_nonforced can be added which should disable transformation heuristics before the intended transformation is applied. E.g. it would be surprising if a loop is distributed before a #pragma unroll_and_jam is applied.

With more supported code transformations (loop fusion, interchange, stripmining, offloading, etc.), transformations can be used as building blocks for more complex transformations (e.g. stripmining+stripmining+interchange -> tiling).

This approach deviates from proposal in the RFC at http://lists.llvm.org/pipermail/llvm-dev/2018-May/123690.html. There are three reasons:

  • In the RFC approach, when a pass wants to determine whether it should perform a transformation, it must search the list of transformations for the first transformation on its loop. When inlining, the two transformation list must be combined.
  • For compatibility, the current approach using loop IDs must still be supported, either each pass looks up both metadata formats, or using the AutoUpgrade mechanism. This patch's approach keeps the current mechanism.
  • The "loop IDs" can change. When a attribute is added/removed from the Loop ID metata (e.g. adding "llvm.loop.isvectorized" after vectorizing), the loop is assigned a new, distrinct MDNode. Every reference to these nodes needs to be updated as well, and metadata nodes do not support RAUW. Some of the current passes (e.g. LoopDistribution) keep the same loop ID for multiple output loops such that the loop ID is not uniquely identifying loops anymore. This patch's approach does not reference loop IDs in the attribute values.

Diff Detail

There are a very large number of changes, so older changes are hidden. Show Older Changes
... As such, how to fall back when the transformation doesn't happen is almost equally important as what to do next when the transformation happens.

Hideki, I think that LLVM does the right thing here: We provide a separation between the hint and the mandate (i.e., using assume_safety or not). By default, the compiler should still provide safety conditions. The alternative to providing the functionality proposed here is, in reality, source-to-source code generators (which often do semantically-incorrect things and are hard to use in production for a large number of reasons). Also, the client for this functionality is not just programmers directly, but other tools (e.g., autotuners and other higher-level languages), which is also why generating safety predicates, or at least having that option, is important. I think that LLVM also currently does the right thing regarding transformations that don't apply: we issue a warning (which bubbles up from the optimizer via the optimization-remark interface). We should certainly continue to do that (and I know that Michael has experimented with ways of making this continue to happen reliably).

The idea behind this is so powerful such that even if we start from "best effort basis", programmers will quickly jump on and say make this more robust/predictable. We'd rather spend time to design this as a robust/predictable feature from the beginning than having to work on it under the customer pressure.

I don't think that we can ever really having something in this space which isn't best effort, but, I think that providing a warning is both necessary (because silent failure is poor user experience) and sufficient (it's not clear to me what kind of fallback we could provide that would be more robust in practice). If people start providing us with bug reports about loops that we couldn't transform, but should have transformed, that will be great data on what to improve. That having been said, if you can suggest pragmas that have semantics that allow us to control loop transformations in a way that's more robust than the current ones, then please, of course, suggest them.

I feel obligated to note, however, that the motivation for this work comes entirely from our experience supporting HPC users. I'm convinced that it will provide a significantly better user experience over the current state of the art. I'm sure that you've seen code that comes out of higher-level code generators and the like. These tend to be hard to maintain, inflexible, and buggy, and the code produced is difficult for both humans and compilers to understand. I'm sure you've also seen cases where people implement these transformations by hand (do I need to go on?). The compiler can perform these transformations and having the user able to direct the compiler to do so if a much better option. To some extent, a compiler can have cost models and heuristics to apply these automatically, but only occasionally do we actually have enough static information to do so (even with some PGO capabilities).

Some extra tests for nonforced + a pragma would be good to see.

Any transformations in particular?

I'm not much of an expert on the vectoriser changes here.

There is a call for a vectorizer person, and here I am,

Thank you!!!

but before going there, ever since this patch was uploaded, I've been thinking whether the original intent of this patch can be really accomplished ---- especially so with the set of "hints" being used here. So, I'd like to go back and start there if that's okay. I think this patch is in some sense based on the optimism of "programmer won't abuse this and transformation will happen". Reality is more like programmer will try using those pragmas to arm-twist the compiler to get the set of transformations he/she wants w/o thinking deep enough about what happens at each step of the way. As such, how to fall back when the transformation doesn't happen is almost equally important as what to do next when the transformation happens. From what I've read --- granted that I haven't gone through this very deep, the fall back aspect isn't handled well. If we don't start from "programmer specified transformation may fail to kick-in", providing this feature to the programmers would quickly backfire and we'll get tons of this doesn't work that doesn't work problem reports ---- which is a big mess/disaster. If we are doing this for research purposes, that may be fine. I'm looking at this from a production compiler development perspective.

There are two different approaches we can think of.

  1. Start from defining the transformation directives that actually transforms in most cases. For example, Intel compiler's implementation of OpenMP SIMD is in this position, and we are trying to bring the same position to LLVM LV. Then, failing to transform is a compiler bug (or a programmer has a bug in the source code to fix). OpenMP SIMD is defined in such a way for compilers to be able to take this positioning. In this approach, we can stop processing transformation directives after the first failed transformation. Even in this case, for example, programmer probably won't know under certain circumstances, vectorizer won't produce remainder loop, or under certain cases, vectorizer completely unrolls the loop so that there aren't any loops left after vectorization. So, controlling a single transformation in a programmer predictable manner --- enough to describe what should happen next --- is a big task.
  2. Based on transformation hint directives. Since what the programmer is using is a hint, for each transformation, programmer needs to tell the next step 1) if the transformation kicks in and also 2) if the transformation does not kick-in. This will be very messy and I'm not sure how practical it would be for the programmer to specify the behaviors for all situations.

    If these kinds of stuff have been already discussed, please give me the pointers and I'll try to digest. Else, we can talk about this in this review or in llvm-dev. Hope my argument makes sense to you/others, but please feel free to ask for clarification. I've been working on making SIMD directive programmer predictable for many years. So, I may be skipping some explanations that became natural enough to me over the years.

All these issues apply to the current loop metadata/#pragma clang loop as well. A user can specify #pragma clang loop distribute(enable) vectorize_width(4) with the expectation that these will be carried out. If not, this can be a bug (or semantically incorrect) just as if the distribution/vecorization was explicitly orderder using followup-attributes.
In contrast, currently there is no way to even express that the transformations should be carried out in there reverse order (first vectorization, then distribution; let's ignore for the moment whether this makes sense, it definitely makes sense with loop transformations other than those for which LLVM currently has metadata). Our longer-term plans are to support this in LLVM and the first step is to make sequences of transformations expressible in IR. Of course our goal is also to make transformations more applicable/robust. I see these as two orthogonal problems.

I don't think we need a fallback loop transformation. If a transformation cannot be applied, the user's reactions is probably not "let's do a different transformation then" (which will in most cases also fail for the same reasons the primary transformation failed), but "unfortunately the compiler cannot do my transformation I need to get the best performance, will implement it by hand then." Even if the first option is what the programmer wantsm they will implement it using preprocessor switches as it is common today with different compilers that support/do not support specific pragmas.

I had already some discussions on the reliability of transformations in the compiler. Some groups do not want to rely at all on the compiler being able to do a specific optimization and user libraries instead. Those libraries will 'miscompile' the input if certain preconditions are not met which is the desired outcome where slow execution just is no option. With incorrect results it at least becomes obvious that there is a problem. However, this is not an option for a compiler where correctness is the most important aspect (and only relaxed by attributes such as assume_safety). For some use cases, such as autotuning, being able to rely on the compiler producing correct output makes it possible in first place.
We could argue about whether we want high-level transformations in a low-level compiler in the first place. I think this has been answered a long time ago: LoopVectorizer, LoopUnroll, LoopInterchange, LoopDistribution, LoopUnswitch, etc. Since we have this kind of transformation, why not mkaing them as good as possible?

In short, doing this even for one transformation (in my case SIMD) is difficult enough.

I'd be happy to extract-out attributes for specific transformations in separate reviews. However, to be come useful, I think we need the entire set.

If we are trying to expand to multiple transformations, we should try doing so in baby steps.

I understood that you are working on making the loop vectorizer predictable, i.e. we are working towards the same goal.

The idea behind this is so powerful such that even if we start from "best effort basis", programmers will quickly jump on and say make this more robust/predictable. We'd rather spend time to design this as a robust/predictable feature from the beginning than having to work on it under the customer pressure.

I think the current one-pass-per-transformation is indeed very fragile and I am working on something that should be more robust.

lib/Transforms/Scalar/LoopUnrollPass.cpp
771

If llvm.loop.unroll.enable is not set, interpretation here is that the transformation is 'non-forced', that is, llvm.loop.unroll.count is a hint to the compiler that if it unrolls, then it should unroll by that amount. llvm.loop.disable_nonforced overrride the decision whether to unroll, i.e. the unroll factor does not matter.

I am aware that the concept of 'forced' transformations is not consistent between passes, but I am trying to give it some consistency. Passes could query shared code such as hasUnrollTransformation in LoopUtils. hasUnrollTransformation currently follows your interpretation of llvm.loop.unroll.count / llvm.loop.disable_nonforced. I am happy to implement either definition, as long as we find a consistent rule.

test/Transforms/LoopUnroll/unroll-pragmas_transform.ll
2

This is a copy of unroll-pragmas.ll and any ambiguous metadata replaced by follow-up attributes.

An hope is to generally make 'multiple transformation attributes on the same loop' illegal and rejected by the IR verifier (since the result depends on an implementation detail: the order in the pass manager). In this case this file would replace unroll-pragmas.ll.

But my expectation is that we cannot break backwards-compatibility this way.

6

Yes: When no follow-up attributes are specified, the default ones are added (here: llvm.loop.unroll.disable to disable further unrolling). In case there are follow-up attribute lists, there is no default and the transformation-disabling must be added explicitly (MDNode !18) and of course added after unrolling and recognized by the second LoopUnroll.

I agree that this is a very powerful idea (something I wish I'd had back when I was writing psuedo-hpc applications). I think it's well worth having, but equally worth making sure we get it right. The clang side is very important for that.

Some extra tests for nonforced + a pragma would be good to see.

Any transformations in particular?

There seem to be tests that nonforced disables things, but not that nonforced + an attribute keeps it enabled. e.g. the unroll.count metadata.

lib/Transforms/Scalar/LoopUnrollPass.cpp
771

I would expect that if a loop has any metadata for a pass, that would mean disable_nonforced doesn't apply. As if the user has specified some metadata, it likely wants something to happen.

I think in this specific case llvm.loop.unroll.count implies llvm.loop.unroll.enable, and we wouldn't put both on a loop for "#pragma unroll(4)" or "#pragma clang loop unroll_count(4)"

test/Transforms/LoopUnroll/unroll-pragmas_transform.ll
2

Ah, I missed the "followup" here. Is it worth replicating this entire file, or should it just be an extra test in the old file.
The "followup" on unroll_1 seems to be the only test changed here? To add unroll.disable as a followup attribute? I'm not sure I see why. Would we expect "#pragma unroll(1)" to not work as it did before? (disable unroll)

... As such, how to fall back when the transformation doesn't happen is almost equally important as what to do next when the transformation happens.

Hideki, I think that LLVM does the right thing here: We provide a separation between the hint and the mandate (i.e., using assume_safety or not).

That, I know. I wasn't questioning about that.

What I'm not seeing from this RFC/patch is that, if the programmer specifies transformation behavior A -> B -> C, what happens if transformation A does not kick-in? Should we just warn that "A did not happen" and stop processing the request B and C?
Also, if the programmer requests that the loop to be distribute in three ways and specify different transformations for each, what should the latter transformation do if the loop is distributed in two ways or four ways? If we are serious about introducing this kind of features, we should clearly define what should happen when the programmer intention cannot be satisfied well enough ---- when we should continue honoring and when we should stop honoring. If we say "we should stop in all those circumstances", that should simplify the problem a lot. If we say "we should allow to continue on subset of those cases", we should clearly state which subset and why. If there are any prior discussions (or descriptions within this patch) along this lines, please point me to that.

... As such, how to fall back when the transformation doesn't happen is almost equally important as what to do next when the transformation happens.

Hideki, I think that LLVM does the right thing here: We provide a separation between the hint and the mandate (i.e., using assume_safety or not).

That, I know. I wasn't questioning about that.

What I'm not seeing from this RFC/patch is that, if the programmer specifies transformation behavior A -> B -> C, what happens if transformation A does not kick-in? Should we just warn that "A did not happen" and stop processing the request B and C?
Also, if the programmer requests that the loop to be distribute in three ways and specify different transformations for each, what should the latter transformation do if the loop is distributed in two ways or four ways? If we are serious about introducing this kind of features, we should clearly define what should happen when the programmer intention cannot be satisfied well enough ---- when we should continue honoring and when we should stop honoring. If we say "we should stop in all those circumstances", that should simplify the problem a lot. If we say "we should allow to continue on subset of those cases", we should clearly state which subset and why. If there are any prior discussions (or descriptions within this patch) along this lines, please point me to that.

I certainly agree that we should document this.

What I'm not seeing from this RFC/patch is that, if the programmer specifies transformation behavior A -> B -> C, what happens if transformation A does not kick-in? Should we just warn that "A did not happen" and stop processing the request B and C?

Yes. A warning will be emitted by the -transform-warning pass (Please see Passes.rst). B and C cannot apply on a loop that does not exist.

Also, if the programmer requests that the loop to be distribute in three ways and specify different transformations for each, what should the latter transformation do if the loop is distributed in two ways or four ways?

The current LoopDistribution pass unfortunately does not support this, by a goal is to make the user able to define what code should become their own loop. See [[ A Proposal for Loop-Transformation Pragmas | https://arxiv.org/abs/1805.03374 ]].
For the current LoopDistribution pass, only two categories of followup-attributes can be defined noncyclic and cyclic. The noncyclic category can be added to multiple loops.

If we are serious about introducing this kind of features, we should clearly define what should happen when the programmer intention cannot be satisfied well enough ---- when we should continue honoring and when we should stop honoring. If we say "we should stop in all those circumstances", that should simplify the problem a lot. If we say "we should allow to continue on subset of those cases", we should clearly state which subset and why. If there are any prior discussions (or descriptions within this patch) along this lines, please point me to that.

Documented in TransformMetadata.rst line 57ff.

docs/TransformMetadata.rst
164

Yes, maybe, but they are also already documented in the LangRef.rst. Please understand that the goal in this patch is to define a transformation model for each pass such that it is clear what are those followup-loops, not to write an exhaustive documentation.

What I'm not seeing from this RFC/patch is that, if the programmer specifies transformation behavior A -> B -> C, what happens if transformation A does not kick-in? Should we just warn that "A did not happen" and stop processing the request B and C?

Yes. A warning will be emitted by the -transform-warning pass (Please see Passes.rst).

This part, I know you did.

B and C cannot apply on a loop that does not exist.

I don't think this is explicitly written. Here's an example. Suppose A is vectorization and B is unroll. If a loop is somehow not vectorized. Unrolling can still happen to the non-vectorized loop. Whether we stop unrolling in this situation is what I'd like to see us being explicit about.

Also, if the programmer requests that the loop to be distribute in three ways and specify different transformations for each, what should the latter transformation do if the loop is distributed in two ways or four ways?

The current LoopDistribution pass unfortunately does not support this, by a goal is to make the user able to define what code should become their own loop. See [[ A Proposal for Loop-Transformation Pragmas | https://arxiv.org/abs/1805.03374 ]].
For the current LoopDistribution pass, only two categories of followup-attributes can be defined noncyclic and cyclic. The noncyclic category can be added to multiple loops.

Whether distribution currently supports that is a different issue. I'm sure we will be expanding the features in the futures. This composability discussion should encapsulate the baseline behaviors for enough of possible future situations ---- else we have to keep revising baseline behaviors, which is very bad.

If we are serious about introducing this kind of features, we should clearly define what should happen when the programmer intention cannot be satisfied well enough ---- when we should continue honoring and when we should stop honoring. If we say "we should stop in all those circumstances", that should simplify the problem a lot. If we say "we should allow to continue on subset of those cases", we should clearly state which subset and why. If there are any prior discussions (or descriptions within this patch) along this lines, please point me to that.

Documented in TransformMetadata.rst line 57ff.

I only see the warning behavior there. I'd like to see us explicitly saying that any subsequent explicit transformation metadata will be ignored for the given loop ---- if that's what we'll agree on, or be explicit about something else we'll agree on in the terms that can be clearly explainable to the programmers. "Compiler will skip all remaining transformations after the first failed transform" is pretty straightforward to the programmers. If anyone is proposing other behaviors, I'd like to also see how to explain those behaviors to the programmers.

Meinersbur updated this revision to Diff 160051.Aug 9 2018, 8:42 PM
Meinersbur marked 2 inline comments as done.
  • Explicitly document followup of not applied transformations to be ignored
  • Unroll/UnrollAndJam: Interpret enable/count/full as forced
  • Unroll/UnrollAndJam: Add tests for disable_nonforced combined with enable/count/full
  • Reduce size of unroll-pragmas_transform.ll

I'd like to see us explicitly saying that any subsequent explicit transformation metadata will be ignored for the given loop ---- if that's what we'll agree on, or be explicit about something else we'll agree on in the terms that can be clearly explainable to the programmers. "Compiler will skip all remaining transformations after the first failed transform" is pretty straightforward to the programmers. If anyone is proposing other behaviors, I'd like to also see how to explain those behaviors to the programmers.

I added a paragraph to TransformMetadata.rst. (I was assuming it was obvious from the definition: A transformation in a followup-attribute only becomes assigned to a loop by the loop transformation pass. Before that, it is not associated with any loop)

I'd like to see us explicitly saying that any subsequent explicit transformation metadata will be ignored for the given loop ---- if that's what we'll agree on, or be explicit about something else we'll agree on in the terms that can be clearly explainable to the programmers. "Compiler will skip all remaining transformations after the first failed transform" is pretty straightforward to the programmers. If anyone is proposing other behaviors, I'd like to also see how to explain those behaviors to the programmers.

I added a paragraph to TransformMetadata.rst. (I was assuming it was obvious from the definition: A transformation in a followup-attribute only becomes assigned to a loop by the loop transformation pass. Before that, it is not associated with any loop)

The added paragraph looks good to me on the implementation side specification. Looking forward to see the programmers (i.e., compiler users, not compiler writers) side pragma description, but that will not gate my review of this patch. There is a difference between specification forcing one behavior versus implementation choice ends up in the same behavior. I wanted the former, not the latter. With this specification, we can have another implementation choice ---- attaching all those metadata to the loop, to be updated by the successful transformation, and let failed transform drop subsequent ones. I'm not saying it's better to go that way. What I'm saying is that if, for some reason, we later choose to implement this differently, there is a specification to guide how to implement the feature correctly. Hope I don't sound too picky. I just want to provide consistent experiences to the programmers.

The added paragraph looks good to me on the implementation side specification. Looking forward to see the programmers (i.e., compiler users, not compiler writers) side pragma description, but that will not gate my review of this patch. There is a difference between specification forcing one behavior versus implementation choice ends up in the same behavior. I wanted the former, not the latter.

Different behavior of different implementations is also a serious concern for me. I have three different implementations in mind (the current loop transformations, an extension to Polly to use this metadata, and an idealized loop-transformation pass; the latter two being more powerful is one of the motivtions for this path). Given the prototypical transformations in TransformMetadata.rst, I think the model is applicable to other implementations as well.

With this specification, we can have another implementation choice ---- attaching all those metadata to the loop, to be updated by the successful transformation, and let failed transform drop subsequent ones. I'm not saying it's better to go that way. What I'm saying is that if, for some reason, we later choose to implement this differently, there is a specification to guide how to implement the feature correctly.

This would unfortunately break existing behavior. E.g. llvm.loop.distribute.enable and llvm.loop.vectorize,enable can both be specified in the same loop attributes. Currently, if LoopDistribution fails, the attribute llvm.loop.vectorize.enable will be left untouched. If we change LoopDistribution to remove it, the loop would not be vectorized anymore (assuming the heuristic does not deem it profitable).
It also does do what motivates this patch: Neither the order of transformations be specified, nor can the same transformation be applied multiple times.

I was testing the code and ran into some problems with debug metadata on the loop nodes (actually using -Rpass=unroll in that case). Can you make sure that works as expected?

docs/TransformMetadata.rst
111

Nit: never be added

include/llvm/Transforms/Utils/LoopUtils.h
176

Nit: transformations->transformation

184

Nit: inherit

196

Nit: choose

226

Nit: warning

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
206

This code will need rebasing. There is a check earlier that looks for disable metadata that could be replaced by this. Look for HasUnrollDisablePragma/HasUnrollAndJamDisablePragma. If the same was done for unrolling, I think that would remove the need for the IgnoreUser (although your comment about it is probably still true).

lib/Transforms/Utils/LoopUtils.cpp
285

Maybe InheritSomeAttrs -> InheritNonExceptAttrs?

hiraditya added inline comments.Aug 14 2018, 5:13 AM
lib/Transforms/Scalar/LoopUnrollPass.cpp
1101

nit: maybe put the string literals as a separate declaration?

lib/Transforms/Utils/LoopUnrollRuntime.cpp
542

What is the rationale of using pointer to a pointer here? If we want to assign to ResultLoop, then maybe we can just return ResultLoop and bool as a pair.

925

nit: space

I am thinking about adding a LoopMetadataTacker (sort of a combination of LoopVectorizeHints and AssumptionTracker) analysis pass which would centralize the interpretation of that metadata and avoid the linear search through the metadata list when looking up a specific attribute.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
206

I'd push towards refectoring-out the common parts of computeUnrollCount used by LoopUnroll and LoopUnrollAndJam. Currently computeUnrollCount uses lots of settings meant for LoopUnroll (llvm.loop.unroll. metadata which should not exist anymore, OptimizationRemarkMissed specific to LoopUnroll, -unroll-count, PartialThreshold, handling of full unroll, loop peeling that UnrollAndJam does not support, being used in a single call by UnrollAndJam for two different things: determining ExplicitUnroll (i.e. is normal unroll is forced) and the unroll-and-jam count). It's hard to understand the subtleties between those codes.

I gave up at some point and added the IgnoreUser flag to make test cases pass.

lib/Transforms/Utils/LoopUnrollRuntime.cpp
542

If the Result loop is not needed, one can pass nullptr (which is the default argument). Returning std::pair will require more changes.

925

This is done by clang-format. It try not to fight its decisions and hope for future improvement.

lib/Transforms/Utils/LoopUtils.cpp
285

Avoiding double negation here.

Meinersbur added inline comments.Aug 14 2018, 3:46 PM
lib/Transforms/Utils/LoopUnrollRuntime.cpp
925

I will try to remove the space in patch updates, but may sneak in again when I re-run clang-format and forget about it before submission.

hsaito added inline comments.Aug 15 2018, 5:33 PM
docs/LangRef.rst
5185

I understand that the RST file update should talk about what happens today, but for the sake of code review, it's good to discuss what could happen in reasonably foreseeable future so that we don't under-design things.

I think we should be thinking ahead about

  1. vectorizer peeling the loop, for example, for alignment optimization. Such peeled loop could be fully unrolled if the trip count is known, or vectorized with mask.
  2. main vector loop could be fully unrolled
  3. there may be more than one remainder loop, e.g., vectorized remainder followed by scalar remainder.
  4. remainder loop may be fully unrolled.

All those situations could happen w/o programmer knowing it'll happen that way.

Some of the questions we want to think before the real need arises:

Will the loop attribute get dropped if the "loop" is fully unrolled?
How do we designate more than one remainder loop?
Will the loop attribute applicable for vectorized peel/remainder?
Should we have a way to designate runtime-DD non-vectorizable loop separately from remainder?
5282

Remainder here may be unrolled again or fully unrolled (see the comments on vectorize metadata). What do we do for that?

5342

Is there an assumption of unroll_and_jam operating only on a double loop and/or a perfect loop? Technically speaking, we can unroll_and_jam a loop if we can legally outerloop-vectorize. So, there can be multiple inner loops.

5411

Looks rather centric to distribute-for-vectorization.

Loop distribution can happen for many reasons (and it may be more than one reasons). Are we going to define followup_ Metadata for each of those reasons? What'll happen if a loop matches the characteristics of more than one Metadata?

Meinersbur marked 15 inline comments as done.
  • Report unroll-and-jam as not applied even if unroll is present as well.
    • rename followup_cyclic/followup_noncyclic to followup_sequential/followup_coincident
    • Move hasUnrollAndJamTransformation in LoopUnrollAndJamPass to different place
    • Remove some unrelated whitespace changes made by clang-format
    • Extract followup attribute names into constant
docs/LangRef.rst
5185
  1. I think a prologue/peel is analogous to epilogue/remainder. That is, a new llvm.loop.vectorize.followup_peel can be added.
  2. Should be handled as two separate transformations (such as vectorize/interleave). That is llvm.loop.disable_nonforced would ensure that a loop does not unexpectedly disappear
  3. llvm.loop.followup_remainder should apply on any of the remainder loops. If a finer distinction is required, we can add more specific attributes.
  4. This can already happen, at least with LoopUnroll/LoopUnrollandJam. The docs mentions that in this case the followup_remainder is dropped.

However, changing the model a transformation transform to can indeed raise some backward-compatibility issues. This also applies to the user-interface. If a programmer added

#pragma clang loop vectorize(enable)

do they expect it to be unrolled as well? Loop peeling? D50480 is interesting here: At -Os, it uses masking instead of an epilogue to avoid a code copy. In this case followup_remainder explicitly states that there's not necessarily a remainder loop, so I don't see a problem here. But a programmer might expect more control over what the output structure is.

We can add more attributes to control this behaviour, such as llvm.loop.vectorize.peel.enable, llvm.loop.vectorize.remainder.enable, llvm.loop.vectorize.allow_versioning. The interesting question is, what is the default setting?
If we go by the current behavior to maximize backwards-compatibility, remainder and versioning would be enabled by default (if not in -Os), peeling disabled because it is not yet implemented.

On the other side we probably do not want frontends to emit the most recent enable-metadata to get the best vectorization. So we would enable all features by default, but the output loops might be different from what the programmer intended before the feature is introduced.

We can enable all features unless the transformation is forced, in which case all deviations from the current transformation model needs to be explicitly enabled.

IMHO, we can decide this case-by-case, weighting compatibility concerns and optimization levels. Then again, such transformations does not influence the correctness of the output.

To be less concerned about compatibility issues, I could for now remove all followup-attributes except those that are 'central' to the transformation, and followup_all. For vectorization, there will always be the performance-critical vectorized loop (i.e. followup_vectorized), independent of whether there is a prologue, epilogue or fallback. For partial unrolling, it is always a unrolled loop.

  • Will the loop attribute get dropped if the "loop" is fully unrolled?

Yes. But it should not happen if llvm.loop.disable_nonforced is used and the unroll is not explicitly specified.

  • How do we designate more than one remainder loop?

Using different attributes. Like followup_all it is possibly to address a group of loops.

  • Will the loop attribute applicable for vectorized peel/remainder?

Only for the followup that addresses them

  • Should we have a way to designate runtime-DD non-vectorizable loop separately from remainder?

As mentioned sometime before, the typical reaction to 'loop not vectorized' is not 'ok, let's unroll it instead', but 'how can I make it vectorize'. So I don't think fallbacks are necessary (unless we can apply a sequence of transformation to multiple loop), but I am open if you think there is a need for such.

5282

followup_remainder is ignored. If this it is not clear from the section in TransformMetadata.rst, please tell me.

5296–5297

@dmgreen This directly contradicts the nounroll_plus_unroll_and_jam test case in Transforms/LoopUnrollAndJam/pragma.ll

5342

There's still an outermost (unrolled) loop and an innermost (jammed) loop. We could also adds followups for middle loops.

If it is the naming that concerns you: Would you prefer followup_unrolled and followup_jammed?

5411

There is no overlap between cyclic and noncyclic. For the extended loop-transformations, the user would name the loops they want distributed.

Indeed, these followup are are specific to the current distribution pass. However, I think it is easy for any distribution to determine whether a loop has cyclic dependences and and those attributes to any output loop that matches.

makeFollowupLoopID can already combine attributes from multiple followups.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
206

IgnoreUser exists for the nounroll_plus_unroll_and_jam in LoopUnrollAndJam\pragma.ll.

llvm.loop.unroll.disable causeshasUnrollTransformation` in computeUnrollCount to return TM_Disable. Unrolling inside computeUnrollCount is disabled setting the unroll factor to 0. UnrollAndJam then tries to use the that unroll factor.

dmgreen added inline comments.Aug 19 2018, 2:35 AM
docs/LangRef.rst
5296–5297

The way this should be working at the moment is:

If there is any unroll_and_jam metadata
  do that thing             (the user explicitly asked for a thing -> do it)
if there is any unroll metadata
  disable unrollandjam      (leave the loop to the unroller)
normal heiristics

I think with "but no `llvm.loop.unroll_and_jam` metadata", that is what this is saying. Correct me if I'm wrong and it's not working like this. Or feel free to update it if it's unclear. Or if you think this should work another way...? ;)

I originally invisioned unrollandjam as an extension to the unroll pass, so I sometimes see the two things as interrelated. If a user specifies loop.unroll.disable, they almost certainly wanted to disable all unrolling, not just that in the unroll pass.

5342

I believe he meant this being unrolled and jammed:

for i {
  for j
    A(i,j)
  for k
    B(i,k)
}

This is not something we currently support as I didn't think it would ever be likely to be profitable. Users specifying metadata might change that. The pass could be able to be expanded to work on this (I think), but it's not something that it currently does.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
206

Have you considered moving the "disable" check out of computeUnrollCount and into tryUnrollLoop, where the existing "HasUnrollDisablePragma" check is? Hopefully that could be replaced with the new method, much like HasUnrollAndJamDisablePragma has been, and would mean this computeUnrollCount function would just work as it used to.

Meinersbur added inline comments.Aug 20 2018, 10:16 AM
docs/LangRef.rst
5296–5297
If there is any unroll_and_jam metadata
  do that thing             (the user explicitly asked for a thing -> do it)

There is a comment in LoopUnrollAndJamPass.cpp:

// We have already checked that the loop has no unroll.* pragmas.

According to this, this is not true (since only checked afterwards) and computeUnrollCount will consider e.g. llvm.loop.unroll.count even when used for unroll-and-jam.

I was concerned about the last phrase plus llvm.loop.unroll.disable metadata will disable unroll and jam too., but it might be a misinterpretation in that it will disable unroll-and-jam, but only if unroll-and-jam is not explicitly enabled.

5342

The j- and k-loops are both inner loops, so followup_inner should apply to both of them.
distinguishing them might be possible when introducing a mechanism like for naming the output loops of loop distribution.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
206

There are multiple mechanisms in computeUnrollCount that disable unrolling (such as UnrollCountPragmaValue returning zero).

If I was to fix this method, I'd do it cleanly by refactoring-out the mechanism that computes the unroll factor when pragmas/options are absent (and not emit any LoopUnroll-specific diagnostics).

hfinkel added inline comments.Sep 21 2018, 8:44 AM
docs/LangRef.rst
5117

This is too strong (see comment below).

5196

preserving -> preserve

5196

e.g., because two ...

5198

vector lane -> set of vector lanes

5204

added to both the vectorized and remainder loop

docs/TransformMetadata.rst
15

By default, transformation passes use heuristics to determine whether or not to perform transformations, and when doing so, other details of how the transformations are applied (e.g., which vectorization factor to select).

18

As stated, this is untrue (for -O3). For -O3, we only require a likely speedup across many workloads (and slowdowns be unlikely). This is why, for example, under -O3, we can vectorize with runtime checks. How about this wording:

Unless the optimizer is otherwise directed, transformations are applied conservatively. This conservatism generally allows the optimizer to avoid unprofitable transformations, but in practice, this results in the optimizer not applying transformations that would be highly profitable.

23

it -> they

25

for -> of

52

Unrolling, etc. - no need to capitalize.

55

We should be careful with the language here. As any of these can be dropped without changing the semantics of the code, nothing here is "mandatory". How about saying, ", or convey a specific request from the user"

59

optimization-missed warning

61

I know what you mean by "separate", but I think it's better to say:

is separate. -> can be unused by the optimizer without generating a warning.

69

This wording is too strong. I think that we need to draw a distinction here between (at least) three classes of transformations:

  1. canonicalizing transformations
  2. (cost-model-driven) restructuring transformations
  3. low-level (target-specific) transformations (e.g., using ctr-register-based loops on PowerPC)

I believe that this pragma should only affect those in class (2). Canonizalizing transformations are always performed (when optimizing at all), and low-level transformations are beyond the reach of this kind of metadata.

I'd recommend using the wording that this metadata disables, "optional, high-level, restructuring transformations."

71

avoids that the loop is altered -> avoids the loop being altered

88

Why is this a useful feature? Should we allow only one transformation per node?

91

loop being vectorized -> loop to be vectorized

104

This leaves open the question of whether the vectorizer adds the 'isvectorized' attribute when a follow-up is specific. It should, right?

121

for -> to

122

comma after following

153

Why would isvectorized not always be provided?

301

where 'rtc' is the generated runtime safety check.

391

must be -> should be

392

responsible for this reporting

393

they might -> there might

393

being able - > able

395

is -> may be

(keep the entire list in the hypothetical)

411

in a fixedpoint loop -> using a dynamic ordering

(not to be too prescriptive)

include/llvm/Transforms/Utils/LoopUtils.h
223

We can't have metadata necessary for correctness. 'ForcedByUser' is fine to indicate that the user should receive a warning if the transformation cannot be performed.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
365

So each inner loop gets the same id? That doesn't sound right.

Meinersbur marked 26 inline comments as done.Fri, Sep 28, 4:21 AM
Meinersbur added inline comments.
docs/TransformMetadata.rst
88

It's not a new feature, but effectively the current behavior. ("is possible for compatibility
reasons"). I'd indeed prefer to crash if multiple transformations are applied to avoid undefined behavior, which unfortunately is a breaking change.

104

I think it should not, but the metadata gives complete control over which attributes will be in the new loop to the metadata. If a frontend wants to apply a transformation twice, it should be able to do so.

I think the paragraph says that no such additional implicit attributes are added when a followup is specified:

If no followup is specified, ...

I added "If, and only if," to make it even clearer.

153

I don't know how LoopVectorize behaves when encountering vector instructions, but I want to avoid the vagueness of some passes adding implicit metadata in some situation. The IR should have the control over whether a transformation is applied multiple times.

include/llvm/Transforms/Utils/LoopUtils.h
223

[comment] This indicates that using metadata for user-directed loop-transformation #pragmas is a leaky abstraction.

lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
365

LoopID is a misnomer. A LoopID is neither unique (multiple loops having the same LoopID, e.g. because LoopVersioning of any pass not aware of loops copied the BBs of a loop; there is even a regression test for the behaviour with non-unique LoopID) nor identifying (adding/removing attributes in the LoopID MDNode will create a new MDNode; see D52116 for a fix for llvm.loop.parallel_accesses assuming this property).

Fixing LoopID to be identifier-like is not possible with the current MDNode structure and would require to make any pass that copies code to be aware of LoopIDs. It would be easier to not assume that LoopID has any identifying properties. I am open to rename 'LoopID' to something else.

Meinersbur marked an inline comment as done.