
[LoopDist] Add llvm.loop.distribute.enable loop metadata
ClosedPublic

Authored by anemet on Apr 22 2016, 12:24 PM.

Details

Summary

D19403 adds a new pragma for loop distribution. This change adds
support for the corresponding metadata that the pragma is translated to
by the frontend (FE).
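For illustration, a hypothetical source-level use of the pragma; the FE lowers distribute(enable) to the loop metadata !{!"llvm.loop.distribute.enable", i1 1} attached to the loop. The function and array names below are made up; the loop shape is the classic distribution example where one statement carries an unsafe dependence and the other is independently vectorizable:

```c
/* Hypothetical example: S1 carries a loop-carried dependence on A
 * (distance 1), which blocks vectorizing the loop as a whole; S2 is
 * independent.  distribute(enable) asks LoopDistribute to split the
 * loop so that at least the S2 loop can be vectorized. */
void kernel(int *A, const int *B, int *C, const int *D, const int *E, int n) {
#pragma clang loop distribute(enable)
  for (int i = 0; i < n; ++i) {
    A[i + 1] = A[i] * B[i]; /* S1: stays in a scalar loop after the split */
    C[i] = D[i] * E[i];     /* S2: vectorizable once separated from S1    */
  }
}
```

Without the pragma, the unsafe dependence on A makes the whole loop non-vectorizable; after distribution the second loop is a straightforward vectorization candidate.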

As part of this I had to rethink the flag -enable-loop-distribute. My
goal was to be backward compatible with the existing behavior:

A1. pass is off by default from the optimization pipeline
unless -enable-loop-distribute is specified

A2. pass is on when invoked directly from opt (e.g. for unit-testing)

The new pragma/metadata overrides these defaults so the new behavior is:

B1. A1 + enable distribution for individual loop with the pragma/metadata

B2. A2 + disable distribution for individual loop with the pragma/metadata

The default value of whether the pass is on or off comes from the
initiator of the pass: from the PassManagerBuilder the default is off;
from opt it's on.

I moved -enable-loop-distribute under the pass. If the flag is
specified, it overrides the default from above.

The pragma/metadata can then further modify this per loop.

As a side-effect, we can now also use -enable-loop-distribute=0 from opt
to emulate the default from the optimization pipeline. So to be precise
this is the new behavior:

C1. pass is off by default from the optimization pipeline
unless -enable-loop-distribute or the pragma/metadata enables it

C2. pass is on when invoked directly from opt
unless -enable-loop-distribute=0 or the pragma/metadata disables it
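As a sketch, the C1/C2 defaults and overrides could be exercised from the command line like this (the legacy pass name -loop-distribute and the file name test.ll are illustrative placeholders; the flag spellings are the ones discussed above):

```shell
# C2: running the pass directly from opt, it is on by default
opt -loop-distribute -S test.ll

# C2 override: emulate the optimization-pipeline default (off) from opt
opt -loop-distribute -enable-loop-distribute=0 -S test.ll

# C1 override: turn the pass on inside the full pipeline
opt -O3 -enable-loop-distribute -S test.ll
```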

Diff Detail

Repository
rL LLVM

Event Timeline

anemet updated this revision to Diff 54702.Apr 22 2016, 12:24 PM
anemet retitled this revision from to [LoopDist] Add llvm.loop.distribute.enable loop metadata.
anemet updated this object.
anemet added a reviewer: hfinkel.
anemet added a subscriber: llvm-commits.
hfinkel edited edge metadata.Apr 26 2016, 9:53 AM

I have no problem with adding metadata to control this, just as we do with the vectorizer. However:

  1. What is preventing us from enabling this pass by default?
  2. Do we actually need to key it this way? Given that this metadata does not force anything, and is intended to enable vectorization, why not just key this off of !{!"llvm.loop.vectorize.enable", i1 1}? That way, should the user explicitly request vectorization, and we detect that this won't be possible, we try extra hard to provide it. That is, is there a use case where a user might want #pragma clang loop distribute(enable) without also adding vectorize(enable)?

> I have no problem with adding metadata to control this, just as we do with the vectorizer. However:
>
>   1. What is preventing us from enabling this pass by default?

The big piece is profitability. We need to prove that no ILP or MLP is impeded by distribution. Without this we can actually regress performance even if we vectorize some of the resulting loops (in my DevMeeting talk, the hmmer loop in SPECint without LoopLoadElimination is an example of the ILP case). I have some ideas for MLP (essentially, proving that the loop is HW-prefetcher friendly, so that misses occur the same way before and after distribution).

ILP is trickier because I would have to estimate the length of the critical path and check that we still have enough independent instructions in the resulting loops to avoid exposing the critical path by much. This may again be easier for the simple cases, but for hmmer, for example, it needs to be fairly accurate. Obviously we can start with a simple model initially.

We also need to ensure that the loop without the unsafe deps will be vectorized in the end, i.e., factor out the legality and profitability checks from the vectorizer. (There are some other efforts in this area, so this may happen independently.)

And of course, independently of all this, the user may still want to force distribution because it's known to be profitable. So even if the pass is on by default, this feature is still necessary in some form.

>   2. Do we actually need to key it this way? Given that this metadata does not force anything, and is intended to enable vectorization, why not just key this off of !{!"llvm.loop.vectorize.enable", i1 1}? That way, should the user explicitly request vectorization, and we detect that this won't be possible, we try extra hard to provide it. That is, is there a use case where a user might want #pragma clang loop distribute(enable) without also adding vectorize(enable)?

That's an interesting idea, but I think in the end we also want to give the user full control and knowledge of what transformations are taking place. Imagine we had many transformations that could turn a loop from non-vectorizable into vectorizable (e.g. peeling, data-layout modifications, etc.); the user may want to choose the particular transformation rather than just "vectorize at any cost".

Also, as I mentioned under your first point, we don't actually know whether we are going to vectorize the resulting loop, so tying this to vectorize.enable can be misleading. At this point I am more comfortable going with the more low-level control for distribution. What do you think?

>>   1. What is preventing us from enabling this pass by default?
>
> The big piece is profitability. We need to prove that no ILP or MLP is impeded by distribution. Without this we can actually regress performance even if we vectorize some of the resulting loops (in my DevMeeting talk, the hmmer loop in SPECint without LoopLoadElimination is an example of the ILP case). I have some ideas for MLP (essentially, proving that the loop is HW-prefetcher friendly, so that misses occur the same way before and after distribution).

Can you describe this in greater detail? There are definitely cases where we want to split loops based on available hardware prefetcher resource exhaustion, but I suspect that's a separate matter. Are you concerned here with finding likely high-latency loads and making sure we can hide them (to the extent possible) with other work?

> ILP is trickier because I would have to estimate the length of the critical path and check that we still have enough independent instructions in the resulting loops to avoid exposing the critical path by much. This may again be easier for the simple cases, but for hmmer, for example, it needs to be fairly accurate. Obviously we can start with a simple model initially.

This all makes sense. I suspect that we should really be doing the same thing for the vectorizer's interleaving.

> We also need to ensure that the loop without the unsafe deps will be vectorized in the end, i.e., factor out the legality and profitability checks from the vectorizer. (There are some other efforts in this area, so this may happen independently.)

Makes sense.

> And of course, independently of all this, the user may still want to force distribution because it's known to be profitable. So even if the pass is on by default, this feature is still necessary in some form.

I agree.

>>   2. Do we actually need to key it this way? Given that this metadata does not force anything, and is intended to enable vectorization, why not just key this off of !{!"llvm.loop.vectorize.enable", i1 1}? That way, should the user explicitly request vectorization, and we detect that this won't be possible, we try extra hard to provide it. That is, is there a use case where a user might want #pragma clang loop distribute(enable) without also adding vectorize(enable)?
>
> That's an interesting idea, but I think in the end we also want to give the user full control and knowledge of what transformations are taking place. Imagine we had many transformations that could turn a loop from non-vectorizable into vectorizable (e.g. peeling, data-layout modifications, etc.); the user may want to choose the particular transformation rather than just "vectorize at any cost".
>
> Also, as I mentioned under your first point, we don't actually know whether we are going to vectorize the resulting loop, so tying this to vectorize.enable can be misleading. At this point I am more comfortable going with the more low-level control for distribution. What do you think?

For the vectorizer, we have a much higher limit on runtime checks when the user explicitly requests vectorization. Should we do the same here? [this is the only question here actually pertinent to this change; the other discussion can be moved elsewhere as desired].

>> I have some ideas for MLP (essentially, proving that the loop is HW-prefetcher friendly, so that misses occur the same way before and after distribution).
>
> Can you describe this in greater detail? There are definitely cases where we want to split loops based on available hardware prefetcher resource exhaustion, but I suspect that's a separate matter. Are you concerned here with finding likely high-latency loads and making sure we can hide them (to the extent possible) with other work?

Yes, that, or overlapping misses from the different loads. If you split the missing loads into separate loops, they are all exposed, whereas originally some were executed in the shadow of other misses.

Are you also asking about the ILP case? I can quickly describe it here if you haven't seen the slides or the talk.
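To make the overlapping-miss point concrete, a toy sketch (hypothetical code; both functions compute the same sum and differ only in when the misses on p[] and q[] are issued):

```c
/* Hypothetical sketch of the MLP concern.  Assume p[] and q[] are two
 * large arrays that miss in cache. */
long fused(const int *p, const int *q, int n) {
  long s = 0;
  for (int i = 0; i < n; ++i)
    s += p[i] + q[i]; /* misses on p[i] and q[i] can overlap in flight */
  return s;
}

long distributed(const int *p, const int *q, int n) {
  long s = 0;
  for (int i = 0; i < n; ++i)
    s += p[i]; /* first miss stream, fully exposed */
  for (int i = 0; i < n; ++i)
    s += q[i]; /* second miss stream, no longer hidden behind the first */
  return s;
}
```

In the fused loop each iteration's two misses can be serviced in parallel; after splitting, each loop exposes its miss stream on its own, which is the kind of regression a profitability model would have to rule out.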

> For the vectorizer, we have a much higher limit on runtime checks when the user explicitly requests vectorization. Should we do the same here?

Currently there is no limit on the number of memchecks we emit for loop distribution. That said, I certainly agree that when we add the threshold, we should drastically increase it with the pragma, just like we do for the vectorizer.

> Currently there is no limit on the number of memchecks we emit for loop distribution. That said, I certainly agree that when we add the threshold, we should drastically increase it with the pragma, just like we do for the vectorizer.

Isn't DistributeSCEVCheckThreshold just such a limit?

> Isn't DistributeSCEVCheckThreshold just such a limit?

Ah, good point; I was overly focused on memchecks ;). Let me update the patch.

Are you OK with the overall reasoning that we should have a dedicated loop-distribution pragma/metadata?

Thanks for the review.

Adam

> Are you OK with the overall reasoning that we should have a dedicated loop-distribution pragma/metadata?

Yes, I think this makes sense.


anemet updated this revision to Diff 55144.Apr 26 2016, 5:48 PM
anemet edited edge metadata.
  • Added a new SCEVCheckThreshold for the pragma-enable case
  • Rebased on top of rL267643, which makes it easier to cache whether distribution was forced in the new per-loop class

hfinkel accepted this revision.Apr 26 2016, 6:14 PM
hfinkel edited edge metadata.

LGTM.

docs/LangRef.rst
4709 ↗(On Diff #55144)

You should probably add a sentence or two here, at least, explaining what "loop distribution" means in this context.

This revision is now accepted and ready to land.Apr 26 2016, 6:14 PM
anemet added inline comments.Apr 26 2016, 6:19 PM
docs/LangRef.rst
4709 ↗(On Diff #55144)

Thanks, will do.

This revision was automatically updated to reflect the committed changes.