This is an archive of the discontinued LLVM Phabricator instance.

Scalarizer: limit scalarization for small element types
ClosedPublic

Authored by nhaehnle on May 4 2023, 4:42 AM.

Details

Summary

Scalarization can expose optimization opportunities for the individual
elements of a vector, and can therefore be beneficial on targets like
GPUs that tend to operate on scalars anyway.

However, notably with 16-bit operations it is often beneficial to keep
<2 x i16 / half> vectors around since there are packed instructions for
those.

Refactor the code to operate on "fragments" of split vectors. The
fragments are usually scalars, but may themselves be smaller vectors
when the scalarizer-min-bits option is used. If the split is uneven,
the last fragment is a shorter remainder.

This is almost NFC when the new option is unused, but it happens to
clean up some code in the fully scalarized case as well.

Diff Detail

Event Timeline

nhaehnle created this revision.May 4 2023, 4:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 4 2023, 4:42 AM
nhaehnle requested review of this revision.May 4 2023, 4:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 4 2023, 4:42 AM
Herald added a subscriber: wdng. · View Herald Transcript
fhahn added inline comments.May 4 2023, 5:48 AM
llvm/test/Transforms/Scalarizer/basic-inseltpoison.ll
11–14

Would it be possible to split off the type change i32 -> i64 to reduce the diff?

llvm/test/Transforms/Scalarizer/min-bits.ll
2

Those are the main tests for the new functionality, right? Would it be possible to pre-commit those and then just show the diff here without the new flag?

nhaehnle added inline comments.May 4 2023, 6:21 AM
llvm/test/Transforms/Scalarizer/basic-inseltpoison.ll
11–14

I'll see what I can do.

llvm/test/Transforms/Scalarizer/min-bits.ll
2

Yes, these are the main tests. The intention was to show both versions side-by-side with different prefixes, but I can also go the precommit route.

arsenm added inline comments.May 4 2023, 7:56 AM
llvm/lib/Transforms/Scalar/Scalarizer.cpp
67–69

Should this go based on legal vector operations from TTI?

Alternatively, would we just scalarize everything and then run SLPVectorizer?

439

No else after return

nhaehnle updated this revision to Diff 519785.May 5 2023, 2:49 AM
nhaehnle marked an inline comment as done.
  • rebase on top of preparatory commits
  • remove else after return
  • add an explicit test for binary ops
nhaehnle marked 2 inline comments as done.
nhaehnle added inline comments.
llvm/lib/Transforms/Scalar/Scalarizer.cpp
67–69

I wouldn't want to rely on SLPVectorizer. Even assuming that it is able to undo all the scalarization (which would need to be investigated), I expect that it would increase compile-time cost.

As for TTI -- I'm not a fan of it in general. Going by legal instructions is tricky, because which legal instructions do you look at? And adding yet another hook isn't great. Also, the ScalarizerPass is not included in any default pass builder pipeline as far as I can see. We use it in LLPC, where we are well aware of the target and can just explicitly pass options to the pass constructor, and I assume all other users are analogous.

bjope added a subscriber: bjope.May 6 2023, 12:38 AM

ping again after two weeks

bjope added inline comments.Jun 1 2023, 1:02 AM
llvm/lib/Transforms/Scalar/Scalarizer.cpp
66

I think I misunderstood this at first.

My interpretation was that if setting this to 16 it would not scalarize vectors with element sizes up to 16 bits.
So it wouldn't scalarize <16 x i8> or <4 x i16> while it would scalarize <2 x i24> and <2 x i32>.

But this size is not mapping to the element sizes, right?
We could not get some kind of vector split/re-partition from <16 x i8> to <2 x i8>. So it is not really scalarizing as the value still will be a vector.

Not sure exactly how to rephrase it to make that clearer (considering that I misunderstood this to be an element size).

Maybe I got fooled by the slogan for this patch "limit scalarization for small element types". I actually thought that I would see something that prevented scalarization from happening when the element size was smaller than a threshold. But what the patch actually seem to do is to prevent scalarization to happen for large vector factors (it just splits it up into using smaller vectors instead of scalarizing).

So everywhere in this pass when it says "scalarize" I guess one should read it as "split" (or "resize" or something similar). For example code comments saying "Perform actual scalarization" could be followed by code that emit vector operations.

204

nit: You've changed I to Frag in other places where we no longer refer to a vector index. I think it doesn't hurt to do it here as well for consistency.

389–390

So this is now "component Frag"?

Although generally it's a little bit messy with terminology. Is a "component" also a "fragment"?

424

I think this should say "looking for element Frag".

Thank you for taking a look. I'm going to address the low-level issues you pointed out immediately.

llvm/lib/Transforms/Scalar/Scalarizer.cpp
66

Yeah, this is a fair point and naming is difficult. This is related to the fact that this pass is really meant for GPUs, where we use vector types in a way that's a bit different from CPUs.

On CPUs, the intention of vector types is that they ultimately get mapped to dedicated vector registers.

On GPUs (at least all modern GPUs that I'm aware of), there are no CPU-style vector registers. Instead, the intention of vector types is that they get mapped to contiguous sequences of "scalar" registers (itself a somewhat problematic term because of SIMT, but let's go with that for now).

What this change aims to do with min-bits=32 is essentially scalarization in that sense: vector types are broken up until they are either scalar types or they are vectors that fit in a single "scalar" register.

Does that make sense?

204

Sure, makes sense.

389–390

Yeah, I'm changing this comment.

424

Yes.

nhaehnle updated this revision to Diff 527335.Jun 1 2023, 2:41 AM

Address some review comments

bjope added inline comments.Jun 1 2023, 3:04 AM
llvm/lib/Transforms/Scalar/Scalarizer.cpp
66

Ok, I see. And replacing all "scalarize" by "reduce vector factor" would be a rather large change. Maybe not worth it as long as it is obvious that the pass is splitting vectors by reducing the vector factor. And sometimes it stops "before reaching vector factor 1" (which kind of would be the same as having fully scalarized the vector).

We use the Scalarizer downstream. And we run it in beginning of llc to scalarize most operations while for example leaving wide loads/stores. Otherwise we would need to for example deal with legalizing lots of vector operations at ISel instead (although I think it would also impact passes run before ISel in the backend so it's a bit). So our goal with the scalarizer is just to get rid of (most) vector operations in beginning of the backend.
We could perhaps make benefit from this new functionality in the future, for example leaving <2 x iN> around for certain operations when that would match with the target instruction set.

foad accepted this revision.Jun 6 2023, 1:11 AM

LGTM overall, though I have only reviewed some of it carefully and skimmed the rest.

llvm/lib/Transforms/Scalar/Scalarizer.cpp
66

I think a comment here would help, along the lines of "split vectors larger than this size into fragments, where each fragment is either a vector no larger than this size or a scalar".

Also can you say something (either here or somewhere else prominent) about the implications for whether or not we split an operation with different vector sized operands or result, like zext <4 x i8> %val to <4 x i32>?

608

Do you have any test cases for when this division is not exact?

This revision is now accepted and ready to land.Jun 6 2023, 1:11 AM
piotr added a comment.Jun 6 2023, 7:16 AM

Having looked at some real-world graphics content on AMDGPU (with ScalarizeMinBits = 32), I can confirm the usefulness of this patch. I can see more packed instructions generated (for example v_pk_add_f16, v_pk_mul_f16, v_pk_fma_f16).

arsenm accepted this revision.Jun 8 2023, 4:19 PM

LGTM with nit.

Also a TTI control and / or pass parameter would be better than cl::opt

llvm/lib/Transforms/Scalar/Scalarizer.cpp
220

I swear this function exists somewhere else but I can't seem to find it

409

Don't need pointer bitcast anymore

nhaehnle marked 2 inline comments as done.Jun 13 2023, 12:14 PM

Thank you all for the reviews. I've addressed the remaining small comments as part of the commit.

Also a TTI control and / or pass parameter would be better than cl::opt

All of the existing cl::opts as well as this new one can be set as a pass parameter.

llvm/lib/Transforms/Scalar/Scalarizer.cpp
220

Yeah. I was looking for it but couldn't find one. A version exists for SelectionDAG and I believe for G-MIR

608

I have now :)

This revision was automatically updated to reflect the committed changes.