This is an archive of the discontinued LLVM Phabricator instance.

Add a Scalarize pass
ClosedPublic

Authored by rsandifo on Nov 6 2013, 4:04 AM.
Tokens
"Like" token, awarded by pekka.jaaskelainen.

Details

Summary

This patch adds a pass to decompose vectors into scalar operations
at the IR level. I wanted this for SystemZ because the architecture
does not have vector instructions and because scalarising at the IR
level exposes more optimisation opportunities.
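For readers unfamiliar with the transformation, here is an illustrative sketch (not taken from the patch; all value names invented, using the typed-pointer IR syntax of the era) of what scalarising a vector operation looks like:

```llvm
; Before: one vector add.
%sum = fadd <4 x float> %a, %b

; After: extract each lane and operate on scalars...
%a0 = extractelement <4 x float> %a, i32 0
%b0 = extractelement <4 x float> %b, i32 0
%sum0 = fadd float %a0, %b0
; ...likewise for lanes 1-3...

; ...and rebuild the vector only if a vector use remains.
%tmp0 = insertelement <4 x float> undef, float %sum0, i32 0
; ...insertelement for the remaining lanes.
```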

It sounded like others are interested too, e.g.:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066790.html

and:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066798.html

where the pass could be used to scalarise OpenCL kernels and then
revectorise them to match the target.

In this initial version the patch does nothing unless -enable-decompose-vectors
is passed or unless the target says that it is beneficial. At the moment,
no targets say that, but I have a follow-on patch to make it the default
for SystemZ. You can force the pass off using -enable-decompose-vectors=false.
(This was all based on the -combiner-alias-analysis option.)

This is my first IR pass, so please let me know if I'm doing it wrong.

Diff Detail

Event Timeline

rsandifo added a subscriber: Unknown Object (MLST).Nov 6 2013, 4:08 AM

This looks good to me. A previous version of this pass is now in use in pocl and passes all tests (and accelerates many benchmarks) in its test suite when enabled. I will sync this version to the pocl kernel compiler soon.

include/llvm/Analysis/TargetTransformInfo.h
327

This is now a target choice, which should be OK for starters. Later we might want to extend this to look into the code as well. In the OpenCL case, for example, sometimes the original vector code is better than the autovectorized decomposed code, sometimes not. The same goes for all code that uses vector datatypes.

lib/Transforms/Scalar/DecomposeVectors.cpp
286 ↗(On Diff #5376)

I'm not sure if it's safe to blindly transfer all metadata. Isn't it against the MD principles? If a pass doesn't understand the metadata's purpose, it should drop it, and that should be the safe thing to do. I don't have an example in mind, but it might be safer to whitelist the known-OK MD (which you already list in the function comment)?

Does this handle vector GEPs? %A = getelementptr <4 x i8*> %ptrs, <4 x i64> %offsets

lib/Transforms/Scalar/DecomposeVectors.cpp
286 ↗(On Diff #5376)

I'd like to second this. Only metadata that is currently understood can be transferred.

nadav added a comment.Nov 6 2013, 9:52 AM

Hi Pekka,

The name of the pass should be "scalarizer" because it scalarizes vectors. Can you add more tests?

Did you try to run it with llvm-stress? It may expose bugs.

Do you have plans for scalarizing function calls? You won't get very far in OpenCL without scalarizing function calls.

Nadav

nadav added a comment.Nov 6 2013, 10:02 AM

It looks like the alignment of loads and stores is not handled correctly. The vector load may have an alignment of 1 byte, so using the element alignment is incorrect. The new alignment should be the min of the original alignment and the element alignment.
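The rule Nadav describes can be illustrated with a hedged IR sketch (invented names, era-appropriate typed-pointer syntax):

```llvm
; Original: 4-byte elements, but the vector itself is only 1-byte aligned.
%v = load <4 x float>* %p, align 1

; Scalarised: each element load must use min(original, element) alignment,
; i.e. align 1 here; blindly using the natural element alignment (align 4)
; would overstate what the address guarantees.
%q  = bitcast <4 x float>* %p to float*
%p0 = getelementptr float* %q, i32 0
%v0 = load float* %p0, align 1
; ...likewise for elements 1-3.
```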

rsandifo updated this revision to Unknown Object (????).Nov 7 2013, 6:34 AM

Thanks for the reviews. New in this version:

  • The pass is called Scalarizer rather than DecomposeVectors.
  • Alignment is taken into account.
  • Only known metadata is kept.
  • Loads and stores are kept as-is if the elements are not a whole number of bytes in size. It's still useful to decompose vectors of <N x i1> comparison results though -- in fact that was one of the original motivations.
  • Vector GEPs are scalarised too. (I hadn't realised they were allowed, thanks.)
  • There are more tests. Let me know if there's something else I should be testing for though.
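The <N x i1> point above can be sketched as follows (illustrative IR, not from the patch): even though an <N x i1> vector cannot be split into byte-sized loads and stores, comparison results held in registers scalarise cleanly:

```llvm
; A vector compare produces <4 x i1>, whose elements are sub-byte...
%cmp = fcmp olt <4 x float> %a, %b

; ...but scalarised, each lane is an independent i1 that can feed
; scalar selects and branches directly (names invented):
%a0 = extractelement <4 x float> %a, i32 0
%b0 = extractelement <4 x float> %b, i32 0
%cmp0 = fcmp olt float %a0, %b0
%sel0 = select i1 %cmp0, float %a0, float %b0
```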

I tried adapting the llvm-stress wrapper in:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-March/047876.html

to run opt at -O3 with -enable-scalarizer -scalarize-load-store. The number of runs and instruction sizes were both 1000. No problems were reported.

The pass doesn't yet try to scalarise function calls. At the moment we just gather/reassemble vectors whenever we have a use that isn't understood. That's enough for my target but it could be extended in future. (In the SystemZ llvmpipe code that I'm working with, the only function call is to sqrt.v4f32, but these testcases wouldn't benefit by scalarising that early.)
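The gather/reassemble behaviour described above might look roughly like this for the sqrt.v4f32 case (a hedged sketch; names invented, era syntax):

```llvm
; Lanes that were computed as scalars...
%v0 = insertelement <4 x float> undef, float %s0, i32 0
%v1 = insertelement <4 x float> %v0, float %s1, i32 1
%v2 = insertelement <4 x float> %v1, float %s2, i32 2
%v3 = insertelement <4 x float> %v2, float %s3, i32 3
; ...are gathered back into a vector for the call the pass
; doesn't understand, rather than scalarising the call itself.
%r = call <4 x float> @llvm.sqrt.v4f32(<4 x float> %v3)
```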

nadav added a comment.Nov 7 2013, 10:17 AM

On Nov 7, 2013, at 6:34, Richard Sandiford <rsandifo@linux.vnet.ibm.com> wrote:

> Thanks for the reviews. New in this version:
>
>   • The pass is called Scalarizer rather than DecomposeVectors.
>   • Alignment is taken into account.
>   • Only known metadata is kept.
>   • Loads and stores are kept as-is if the elements are not a whole number of bytes in size. It's still useful to decompose vectors of <N x i1> comparison results though -- in fact that was one of the original motivations.

Excellent. Notice that we don't support bit packing of vectors to memory yet. So I am glad that you forbid vectors of sub-byte elements.

>   • Vector GEPs are scalarised too. (I hadn't realised they were allowed, thanks.)
>   • There are more tests. Let me know if there's something else I should be testing for though.
>
>   I tried adapting the llvm-stress wrapper in:
>
>   http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-March/047876.html
>
>   to run opt at -O3 with -enable-scalarizer -scalarize-load-store. The number of runs and instruction sizes were both 1000. No problems were reported.

Great!

> The pass doesn't yet try to scalarise function calls. At the moment we just gather/reassemble vectors whenever we have a use that isn't understood. That's enough for my target but it could be extended in future. (In the SystemZ llvmpipe code that I'm working with, the only function call is to sqrt.v4f32, but these testcases wouldn't benefit by scalarising that early.)

Thanks for making the changes. I am currently at the llvm dev mtg but I will try to review the patch ASAP.

Nadav

http://llvm-reviews.chandlerc.com/D2112

CHANGE SINCE LAST DIFF
http://llvm-reviews.chandlerc.com/D2112?vs=5376&id=5399#toc

Files:
include/llvm-c/Transforms/Scalar.h
include/llvm/Analysis/TargetTransformInfo.h
include/llvm/InitializePasses.h
include/llvm/LinkAllPasses.h
include/llvm/Transforms/Scalar.h
lib/Analysis/TargetTransformInfo.cpp
lib/Transforms/IPO/PassManagerBuilder.cpp
lib/Transforms/Scalar/CMakeLists.txt
lib/Transforms/Scalar/Scalar.cpp
lib/Transforms/Scalar/Scalarizer.cpp
test/Transforms/Scalarizer/basic.ll
test/Transforms/Scalarizer/dbginfo.ll
test/Transforms/Scalarizer/no-data-layout.ll
<D2112.2.patch>

nadav added a comment.Nov 9 2013, 8:13 PM

Hi Richard,

I think that it is a good idea to have a scalarizer pass for people who want to build llvm-based compilers, but I don't think that this pass should be a part of the default pass manager. Targets that want to scalarize the code should do it as part of instruction selection (just declare the types as illegal). Why do you want to control scalarization from the target? IMHO scalarization is only useful for undoing domain-specific input IR.

Nadav

I can see use for this pass in two cases:

  1. when the original code uses vectors that the target does not support, for example when compiling OpenCL into CPU code, or SIMD intrinsics into scalar CPU code
  2. when some optimization breaks the IR for your specific target/language/options combination, but you still need some of the benefits of those passes

Number 2 should be fixed elsewhere, but number 1 is a valid and very important case.

In the OpenCL case, the front-end should add this pass, since it's already changing the passes anyway. I'd guess that after complete scalarization, the vectorizers should take care of putting the SIMD code back in shape by rebuilding CPU-valid vectors.

In the SIMD-to-scalar case, it might not be that trivial. With this pass, it should be possible to compile SIMD intrinsics even though the specific CPU doesn't support them, supposing all intrinsics are converted to vector code, not intrinsic calls. In this case, inspecting the target info would be the right place to take the decision. BUT there are other things to take into account first: for instance, if we hit an intrinsic that cannot be expanded, it'll be an assert rather than a user error, a problem that will have to be fixed first.

If you're interested in other cases where the TTI is the right place to add this info, other than the one above, please let us know.

--renato

Hi,

It doesn't have to be a "domain specific language" for one to want to use vector datatypes/intrinsics.
One point of view is that if the programmer has used them, he/she has done a target-specific optimization, assuming it is profitable to use those -- and it might be so for the original target at hand. Or the vector datatypes might be natural for the algorithm/problem but not map optimally to the current platform.

To port the performance of that code to a machine with different SIMD or other fine-grained parallel hardware (VLIW/superscalar), it might be profitable to first undo the user's vector code to map the code better to the parallel resources of the target at hand. In that case, the idea is to scalarize the explicit vector intrinsics and then revectorize with the target-specific properties.

In the current set of LLVM passes, I think this could be done selectively from the vectorizers: if they see that the programmer's vectorization decisions render the code less vectorizable as a whole (e.g. if the loop vectorizer cannot horizontally vectorize the loops just because of it, or if the local vectorizers could find better pairings from scalar code), they can try to scalarize the code first and then vectorize more efficiently for the target at hand.

I agree this should not be on by default. I see it most beneficial when done selectively based on both the code and the target at hand. But it should not be left as a "backend pass" either because it might help the vectorizers.

BR,
Pekka

Hi Pekka and Renato,

The proposed "scalarizer" pass is only useful for domain-specific languages - as part of a non-traditional optimization pipe. For example, scalarization in LLVM IR allows re-vectorization in OpenCL. However, the current implementation of the pass is not very useful for OpenCL because it does not scalarize function calls (such as DOT). But maybe someone can add this missing functionality one day. I am in favor of adding it to LLVM because LLVM is a collection of modular and reusable compiler libraries. I am worried about the increase in code size for people who don't care about this functionality, but that's a different story.

The scalarizer pass is not useful for the traditional optimization pipe because the LLVM codegen can already scalarize vectors. It happens automatically for targets that don’t support vectors. The vectorizer will not generate new vector instructions for processors with no vector instructions. People who use intrinsics or inline assembly are responsible for their optimizations and the compiler is not expected to save them when they port their code to targets that don’t support SIMD.

Some people use LLVM as a portable bytecode (SPIR, PNaCL). Vectorization is not something that should be done without target information, just like other lowering phases in the compiler (like LSR, CGP, legalization).

Thanks,
Nadav

IMO this could be committed, especially because it's not enabled by default. I added this latest version to the pocl's kernel compiler where it seems to work fine.

Hi Pekka,

The number of tests seems good, but it is enabled by default on the PassManagerBuilder, and that's not a good thing.

I think what you need here is to enable that as a command-line option (opt, later clang, etc), use opt to test the pass, and let specific front-ends enable them by default if they so wish, in their private implementations.

Renato Golin <renato.golin@linaro.org> writes:

> Hi Pekka,

Not sure if this was really directed at Pekka or me, but...

> The number of tests seems good, but it is enabled by default on the
> PassManagerBuilder, and that's not a good thing.
>
> I think what you need here is to enable that as a command-line option
> (opt, later clang, etc), use opt to test the pass, and let specific
> front-ends enable them by default if they so wish, in their private
> implementations.

...the point is that we never want vector operations to be left around
on SystemZ, regardless of the frontend. In my use case, this is a
target-specific rather than a frontend-specific property. We're always
going to scalarise eventually, whether it's at the IR level or at the
CodeGen level, and doing it at the IR level is preferable because it
exposes more optimisation opportunities.

Although the pass is added to the PassManagerBuilder by default, it
doesn't do anything unless the target wants it to.

Thanks,
Richard

Renato Golin <renato.golin@linaro.org> writes:

>> We're always going to scalarise eventually, whether it's at the IR level
>> or at the CodeGen level, and doing it at the IR level is preferable
>> because it exposes more optimisation opportunities.
>
> This is a good point, but still strikes me as a front-end option. There are
> many front-end options that are enabled by default on certain targets.

Do you mean front ends enable the options by default for specific targets,
or that LLVM presents the options to front ends but sets their defaults
based on the target? If the latter, that'd certainly be fine with me.
If the former, then it seems like we'd have to duplicate knowledge about
the (lack of) vector capabilities of a target in LLVM and in any front
end that might need to handle vectorised input.

Since scalarisation happens anyway, I think this is more about knowing the
best order for doing things rather than deciding whether something is done
at all.

>> Although the pass is added to the PassManagerBuilder by default, it
>> doesn't do anything unless the target wants it to.
>
> It should be a matter of features, not targets. Anyway, is there any other
> pass that has this kind of target-specific behaviour?

Like you say, isn't the vectoriser itself one of them? We have passes
that convert scalar operations to vector operations based on target
properties, and this patch is adding a pass that converts vector
operations to scalar operations based on target properties.

Thanks,
Richard

rsandifo updated this revision to Unknown Object (????).Nov 14 2013, 3:53 AM

Add a test with a variable insertelement index.

rsandifo updated this revision to Unknown Object (????).Nov 15 2013, 7:15 AM

Remove use of TargetTransformInfo. Do not add the pass to PassManagerBuilder.

We also talked about whether to use a reverse post-order traversal, but I don't think it's worth it in the use cases I have. It could easily be added later if we find a counterexample though.

rsandifo closed this revision.Nov 22 2013, 9:03 AM

r195471, thanks