This is an archive of the discontinued LLVM Phabricator instance.

Add more efficient vector bitcast for AArch64
ClosedPublic

Authored by lawben on Mar 4 2023, 1:45 AM.

Details

Summary

Adds a DAG combine that checks for vector comparisons followed by a bitcast or truncating store to a scalar value. Previously, this resulted in an expand. Now, this is done with a constant number of instructions that keep one bit per vector lane (via an AND mask) and perform a horizontal add to collapse them into a single scalar value. This is especially useful for Clang's __builtin_convertvector() to a bool vector.
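As a rough illustration with NEON intrinsics (a sketch only; the function and variable names are illustrative and not taken from the patch), the combine turns "compare, then bitcast to scalar" into an AND with one bit per lane followed by a horizontal add:

#include <arm_neon.h>

// Sketch of compare-to-bitmask for an 8-lane vector (assumes little endian).
uint8_t compare_to_bitmask(uint16x8_t a, uint16x8_t b) {
    const uint16x8_t bits = {1, 2, 4, 8, 16, 32, 64, 128}; // one bit per lane
    uint16x8_t matches = vceqq_u16(a, b);                  // all-ones/all-zeros per lane
    uint16x8_t masked = vandq_u16(matches, bits);
    return (uint8_t)vaddvq_u16(masked);                    // horizontal add -> scalar bitmask
}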

Issue: https://github.com/llvm/llvm-project/issues/59829

Diff Detail

Event Timeline

lawben created this revision.Mar 4 2023, 1:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2023, 1:45 AM
lawben requested review of this revision.Mar 4 2023, 1:45 AM

@t.p.northover As this is my first patch submitted to LLVM, this is just a short ping to check if there is something that I have missed or forgotten to do. I'm not yet familiar with the procedure.

lawben updated this revision to Diff 506884.Mar 21 2023, 2:40 AM

Apply the bitcast combine to truncating stores.

The first commit only covers the case with an explicit bitcast to a scalar value. When directly storing the comparison result as a scalar value, this can be done via a truncating store. In that case, we also want to apply the combine step to get the more efficient conversion.

lawben edited the summary of this revision. (Show Details)Mar 21 2023, 2:40 AM
david-arm added inline comments.Mar 23 2023, 6:20 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-and-store.ll
1 (On Diff #506884)

Please can you re-run the update_llc_test_checks.py script for the new version and let it generate the CHECK lines for the new test?

Also, it would be good if you can pre-commit these tests before this patch, so that in the diff here we can see the changes in generated code? That helps the reviewers to see what effect your new DAG combines have.

llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
1

Please can you re-run the update_llc_test_checks.py script for the new version and let it generate the CHECK lines for the new test?

I think this needs updating to include the whole patch, similar to the other one in D146212. I think multiple commits need to be squashed into one by the look of it.

lawben updated this revision to Diff 508032.Mar 24 2023, 3:50 AM

Rebase onto main

@david-arm @dmgreen I rebased this into a single commit. I hope the changes are shown correctly now. All changes here are new, I did not modify any tests. I guess this was just shown incorrectly because I used multiple commits.

Sp00ph added a subscriber: Sp00ph.Mar 25 2023, 7:45 AM

Does "vector comparison followed by bitcast" mean that something like this would not get optimized?

define i16 @cast(<16 x i1> %vec) {
    %ret = bitcast <16 x i1> %vec to i16
    ret i16 %ret
}

Forgive me if I'm wrong, but I don't see why the preceding comparison is necessary. Couldn't you just do the bit-and trick on any mask vector, regardless of how it was created?

lawben added a comment.EditedMar 25 2023, 7:56 AM

@Sp00ph Your example would not be optimized. The issue with that example is: how is a bitcast to i1 defined? The current logic in LLVM uses the least significant bit. But this trick does not work in that case, as we use bits 0 to n for lanes 0 to n, so we only use the least significant one for lane 0. If we have a comparison, we know that all bits are 1 or all bits are 0, so the least significant one is equal to all others. Without a comparison, we could shift the least significant bit and then do the rest, but that would need an extra instruction. Maybe this could be added in a follow-up? I'm happy to discuss options here. The current approach is a bit more conservative.

If we have a comparison, we know that all bits are 1 or all bits are 0, so the least significant one is equal to all others.

Aren't the elements in a <N x i1> guaranteed to be 0 or -1 (so all zeros or all ones) anyways? And even if there was always an extra instruction emitted so that for compare + bitcast the flow would look like this: <initial compare> -> <compare returned bitmask> -> <use and-trick on the result of that>, I would assume that LLVM would just trivially optimize out the second compare if it knows that the result of the first compare already contains all zeros/all ones.

That's a fair point and might actually be a cleaner solution, given that two consecutive comparisons are actually "merged". I've been looking at this primarily from the Clang/C++ side to optimize the __builtin_convertvector() function, which always adds the comparison. I did not know that <n x i1> guarantees that all bits are 0 or 1 if the physical type is larger than i1.

I'll have a look into this next week. I'm not 100% sure yet where this optimization would need to be located (maybe in LowerBITCAST or some bitcast combine). I played around with a few options when writing this code, and depending on where in the optimization I was, the vector type was different, as the <n x i1> vector is not a legal type and gets promoted. It may be a bit tricky to find the correct time to detect the bitcast from <n x i1>. If you have some suggestions/ideas where this could be done, feel free to share. Otherwise, I'll just dig around a bit.

I unfortunately have barely any experience with LLVM internals and its codebase, so I probably won't be of too much help here. I believe that the main culprits behind the bad codegen are load/store operations on vectors whose elements are smaller than one byte. From what I understand, bitcasts most of the time get lowered to alloca; store as SrcType; load as DstType, and the loads/stores are fully scalarized for 1-bit elements. So optimizing the loads/stores would then probably also fix the bitcasts as a byproduct. If I'm not mistaken, the locations to introduce the store/load optimizations would be here: https://github.com/llvm/llvm-project/blob/cb96eba27cd18ecf8041bf1b9a5c7e197f7a2749/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp#L6236 and here: https://github.com/llvm/llvm-project/blob/cb96eba27cd18ecf8041bf1b9a5c7e197f7a2749/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp#L5148

Actually no, those functions I linked to might not be doing what I think they're doing. There's also this, which might be the right one? https://github.com/llvm/llvm-project/blob/cb96eba27cd18ecf8041bf1b9a5c7e197f7a2749/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp#L692-L701

Either way, the right move is probably to either add a target specific lowering that uses the bit-and method to AArch64ISel or to add it as a general fallback to all targets.

I haven't looked into the details here, just the high level. If we can custom legalize the bitcast then that sounds like it could handle quite a few cases. Although it might not automatically optimize quite so nicely in all of them.

One thing worth remembering is that bitcasts work a bit funny under big-endian. They are defined as storing the value then loading it again, which can result in a different lane order.

lawben updated this revision to Diff 510006.EditedMar 31 2023, 6:03 AM

Changed large parts of where this conversion takes place.

It is now located in a) lowering BITCASTs and b) combining truncating stores. This is now more generic than my old approach. In most cases, there is no difference in the generated code. In one or two places, we lose information about the original vector type, so the SETCC is truncated and we then perform the conversion on that truncated vector. So while this adds a vector extract instruction in some cases, it is more general overall and handles more cases.

Now also handles the cases described by @Sp00ph, i.e., we don't need a comparison for this to work, as we might add one ourselves.

This still needs handling of big endian systems. But I think this is in a stage that we can discuss the overall design: @dmgreen @Sp00ph @david-arm

Sp00ph added inline comments.Mar 31 2023, 6:34 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19685–19688

Does this always produce the same result as the current implementation? If in general a trunc store would get lowered to something like this:

%trunc = trunc <N x i8> %vec to <N x i1>
store <N x i1> %trunc, ptr %dst

then I think the result would differ if e.g. the input was <1 x i8> <i8 2> (current implementation would store a 0, this PR would store a 1). This should be easily fixable though by doing vec & splat(1) != splat(0) instead of just vec != splat(0).
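A minimal sketch of that suggested fix with NEON intrinsics (the helper name is made up; this is not code from the patch):

#include <arm_neon.h>

// Turn an arbitrary i8 vector into a proper mask by testing only bit 0,
// i.e. "vec & splat(1) != splat(0)"; a lane value of 2 then yields 0,
// matching the trunc-store semantics.
uint8x16_t mask_from_bit0(uint8x16_t vec) {
    return vtstq_u8(vec, vdupq_n_u8(1)); // all-ones iff (vec & 1) != 0
}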

I think in general it would be way more convenient if LLVM just stored a flag for every vector that remembers whether the vector is guaranteed to be a "mask vector" where all elements are either 0 or -1 (where the flag would be true for all i1 vectors and would be preserved by trunc or sext operations). Then we wouldn't need to manually search whether the vector is a comparison result or not. Adding such a flag seems way out of scope for this PR though, so I think your strategy seems good considering the current constraints.

Sp00ph added inline comments.Mar 31 2023, 6:49 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19649

There's probably quite a few more operations that preserve the mask property (xor, min and max come to mind).

Can we split the store part into a separate patch? As far as I understand they can be treated separately and that would help keep the two parts simpler.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19672

Could we use computeNumSignBits to check if all the bits will be the same? It might not work very well at this point in the lowering, but LLVM has fairly decent support for NumSignBits.

Or always include the setcc and optimize it away if it is not required, which we might know more clearly after everything has been legalized.

There are some other tests that need updating too, by the look of it.

@dmgreen I can split this into two patches. I'll remove the truncate store part and only focus on the bitcast for now.

Re: tests. I'm not sure what the "LLVM-way" to proceed is. For example in

  • In dag-combine-setcc.ll, there are explicit tests to cover the "old" expanded case. I guess I can just delete those, as I have new tests somewhere else? Or should I also change them there?
  • In vec_umulo.ll, there are also cases with this old pattern, but they are interleaved with other logic that I am not familiar with. I can try to change them to the best of my knowledge. Would it make sense to add the original author of that test to review this patch?

@Sp00ph there seems to be a way to get this meta-information about all bits being -1 or 0. See my inline comment about computeNumSignBits. Unfortunately, it does not seem to work well here.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19672

I looked into ComputeNumSignBits a bit but I am going around in circles. We only have a <n x i1> type here, so we get 1, which is not helpful. We need the "original" vector type/size for this to work, which we only have if we traverse through the logical ops and get to the SETCC. If there are any other helper functions that can do this, please let me know. I did not find any, but there are so many that I might have missed one.

I've tried to always add the comparison, but it does not get removed again. So I'd need to add some logic that combines arbitrary SETCCs with boolean operators in-between. I'd like to avoid that, because it seems (way) out of scope for the change I'm proposing here.

19685–19688

Actually, I don't think we need it here. For us to get here, we either go through:

a) replaceBoolVectorBitcast, which guarantees that the target type is <n x i1>
b) combineVectorCompareAndTruncateStore, which explicitly does what you mentioned.

In a), we know that the type is a vector of i1s, so there must have been some operation to generate/truncate that. If it is truncated, the code for AND 1 is added in the truncate.

I can still add this defensively, but I think we don't need it. What do you think?

@dmgreen I can split this into two patches. I'll remove the truncate store part and only focus on the bitcast for now.

Re: tests. I'm not sure what the "LLVM-way" to proceed is. For example in

  • In dag-combine-setcc.ll, there are explicit tests to cover the "old" expanded case. I guess I can just delete those, as I have new tests somewhere else? Or should I also change them there?
  • In vec_umulo.ll, there are also cases with this old pattern, but they are interleaved with other logic that I am not familiar with. I can try to change them to the best of my knowledge. Would it make sense to add the original author of that test to review this patch?

I think you can just update the test checks in them (with update_llc_test_checks.py). So long as the code looks better (smaller, but still correct), all is good.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19672

OK, I was worried that might not work. How sure are we that vXi1 types will always become vectors with all-ones or all-zeros lanes? It feels like we might be hoping they will become them, as opposed to proving it is true (and then adding an optimization for the cases where that isn't the case, at least for the bitcast).

19685–19688

What happens if the isChainOfComparesAndLogicalOps path is removed? I'm not sure all the tests would still look OK.

lawben updated this revision to Diff 511335.Apr 6 2023, 2:28 AM

Addressed some review comments.

Main changes:

  • Removed the truncate store code. I'll add this in a follow-up patch.
  • Defensively added an AND 1 check if we don't have a SETCC chain, for correct truncation.

The failing tests from before are solved for now. Most were related to truncating stores, which is not part of this patch anymore.
The dag-combine-setcc tests were adapted in a recent patch, so they no longer apply.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19672

With this "chain" check, we can guarantee that all bits are 1s or 0s as the chain _must_ start with SETCC. Adding AND/OR/XORs with other vectors that _must_ start with SETCC does not change that property. If we don't find this chain (either is is not there or too long), we defensively add a check for != 0, which again guarantees that all bits are 1s or 0s. So after both the if and else branch, we know that all bits are 1s or 0s. I've added another test that checks if this check is added for a "bad" chain.

19685–19688

I'm not saying that the isChainOf check should be removed. I was arguing for not needing the AND 1 truncation. I've decided to add it defensively anyway to avoid producing wrong results. It has no impact on the current 19 new tests, as it is either created here or somewhere else during lowering, so there is no additional cost in all cases I've looked at.

This also allows me to remove the somewhat duplicated logic in combineVectorCompareAndTruncateStore (see follow-up patch), so this is probably the cleaner way anyways.

lawben added inline comments.Apr 6 2023, 2:31 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19649

Adding XOR is reasonable. I'd not add MIN/MAX, as they are not logical operators. So while they technically apply and preserve the all-1/0 property, I feel like they don't "belong" here.

dmgreen added inline comments.Apr 9 2023, 8:07 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19685–19688

Sorry - my point was that if we take this patch and remove the if (isChainOfComparesAndLogicalOps(..)) part, some of the tests do not look correct any more, which suggests this path isn't always valid.

As far as I understand, VecVT is always a vXi1 type, which makes the And with 1 a bit odd, and we are assuming that a setcc vXi1 N, vXi1 0 will always become a, say, v16i8 vector with all the lanes either 0 or 1. I think instead it would be better to produce a sext vXi1 N to vXi?, where the new type is a legal size. This is essentially what getBoolExtOrTrunc is doing already, but it might be better to be explicit as this transform really requires the lanes to be all-one or all-zero.

The best type (vXi?) needs to be guessed at, which might best come from the setcc if there is one (the same as the existing isChainOfComparesAndLogicalOps, but it only needs to be a guess now to get the best type).

This way we make sure that even in isolation, this transform is "correct" and is not making assumptions about what will happen in other places with vXi1 types.

llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
7

It may be good to add more tests for the combinations of [2, 4, 8, 16] x [i8, i16, i32], to make sure they are performing OK and not running into any problems.

lawben updated this revision to Diff 512204.Apr 10 2023, 11:42 AM

Changed approach as suggested by @dmgreen. We now use an explicit sign-extend and ignore the vector compare. The sign-extend is removed in later steps if there is a vector compare, so there is no overhead. This change allows us to determine the original type in more cases, as we can detect both SETCC and TRUNC.

Added a few more tests, also explicitly checking <64-bit vectors and <8-bit elements.

Just as an FYI for the tests: the sign-extend is implemented as a shift left by size-1 bits and a signed comparison <0, which is true if the MSB is set.
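For illustration, a minimal sketch of that sign-extend lowering with NEON intrinsics (assuming i8 lanes holding 0/1 values; the function name is made up and this is not code from the patch):

#include <arm_neon.h>

// shl #7 moves the interesting bit into the MSB; cmlt #0 (signed "< 0")
// then turns it into an all-ones/all-zeros lane.
uint8x16_t sign_extend_bit0(int8x16_t v) {
    return vcltzq_s8(vshlq_n_s8(v, 7));
}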

lawben added inline comments.Apr 10 2023, 11:45 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19685–19688

I've implemented it like you suggested with a sign-extend. I'm "guessing" the vector type to be 64-bit. This may add an xtn instruction for 128-bit vectors, but in a few cases, the vector is reduced to 64-bit anyway, so we don't lose anything there. The alternative is to assume 128-bit vectors, but that resulted in a few expansions that we don't need.

dmgreen accepted this revision.Apr 11 2023, 5:10 AM

LGTM

Thanks for the patch. Let me know if you want me to submit it.

This revision is now accepted and ready to land.Apr 11 2023, 5:10 AM

Thanks for your effort in reviewing this patch. I think this solution is nicer than my original approach.

Could you please merge it with the author set to "Lawrence Benson <github@lawben.com>".

FYI: Once this is merged, I'll create the follow-up for the truncating store.

Sorry - me again. I was running some precommit tests, compiling ffmpeg, and it ran into problems (it is quite good at that with vector operations). I think it is because there is a bitcast(fcmp v4f32). Can you add a test for that case too?

lawben updated this revision to Diff 512857.Apr 12 2023, 8:52 AM

Add test for float vector. This required a single-line change to convert the VecVT to an integer vector for sign-extend to work.

This revision was automatically updated to reflect the committed changes.
efriedma added inline comments.Apr 21 2023, 12:53 PM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42

Instead of addv.8b+addv.8b+fmov+fmov+orr, you could use zip1+addv.8h+fmov, I think?

lawben added inline comments.Apr 22 2023, 4:05 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42

I did a quick implementation with NEON intrinsics. Your idea is correct, but it is combined into a different set of instructions in the end.

The gist of it is: if we use vzip_u8 to combine both halves, this returns a uint8x8x2_t, which we need to combine into a uint8x16_t for the vadd.8h. But this is essentially the same as just shuffling the input bytes of the original comparison result in the form 0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15. As far as I know, there is no instruction to zip two "smaller" vectors into a "larger" one, so we need the shuffle (as tbl) here.
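A rough sketch of this tbl-based variant with NEON intrinsics (illustrative only; the function name is made up, little-endian byte order is assumed, and this is not the committed lowering):

#include <arm_neon.h>

uint16_t bitmask_v16i8_tbl(uint8x16_t a, uint8x16_t b) {
    const uint8x16_t bits = {1, 2, 4, 8, 16, 32, 64, 128,
                             1, 2, 4, 8, 16, 32, 64, 128};
    // Interleave bytes 0..7 with bytes 8..15 so each u16 lane pairs one byte
    // from each half; a single addv.8h then yields the full 16-bit mask.
    const uint8x16_t idx = {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15};
    uint8x16_t matches = vceqq_u8(a, b);
    uint8x16_t masked = vandq_u8(matches, bits);
    uint8x16_t zipped = vqtbl1q_u8(masked, idx);
    return vaddvq_u16(vreinterpretq_u16_u8(zipped));
}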

On my M1 MacBook Pro, this is actually ~50% faster than my original code with two addv. We are replacing an extract + addv + fmov + or with adrp + ldr + tbl. This seems to be a good trade-off, at least on an M1. I read somewhere that addv is quite expensive, so maybe trading one for a tbl is good.

@efriedma @dmgreen What are your thoughts on this? I'm currently building on this patch in D148316. I would suggest merging that one first and then updating the v16i8 strategy.

efriedma added inline comments.Apr 24 2023, 9:35 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42

ext+zip1 vs. tbl isn't a huge difference in most cases. (Maybe we're combining the NEON intrinsics a little too aggressively, though? tbl is sort of slow on some chips.)

Fixing this as a followup to D148316 seems fine.

lawben added inline comments.Apr 28 2023, 7:19 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42

I did some benchmarks on my M1, Graviton 2, and Graviton 3. The results indicate the following things:

  • Doing this for v2i64 is beneficial on M1 and Graviton 2. On Graviton 3, it's slightly slower. So this puts us in a bit of a pickle. The gains on M1 and Graviton 2 are larger (+50% and +20%) than the loss on Graviton 3 (-10%), so I think it makes sense to keep this as is. What do you think?
  • The tbl variant (the combination that I mentioned based on the zip approach proposed by @efriedma) is faster by ~50% on all three CPUs. So I think it makes sense to change the implementation to use a shufflevector + addv instead of two addv, as addv seems to be more expensive across the board.
  • As soon as there are 4 elements or more, it pays off on all measured CPUs.

I'll submit a patch in the next few days to update the v16i8 implementation, and possibly the v2i64 one, depending on your comments.

ext+zip1 vs. tbl isn't a huge difference in most cases

I'm still not quite sure if I fully understood your suggested approach though. I don't see how we can get this down to a single zip1 instruction. As far as I can tell, we need 2x ext, zip1 and zip2 + ins to combine the two again. In that case, tbl is probably a decent choice. But if you have a suggestion to actually get this down to a single zip, please let me know so I can test it.

efriedma added inline comments.Apr 28 2023, 9:14 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42
#include <arm_neon.h>

uint16_t ext_zip(uint8x16_t a, uint8x16_t b) {
    constexpr uint8x16_t mask = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};
    auto matches = vceqq_u8(a, b);
    auto masked_matches = vandq_u8(matches, mask);
    // ext brings the high half down; zip1 interleaves it with the low half,
    // so a single addv.8h sums the two halves into the two result bytes.
    auto zipped = vzip1q_u8(masked_matches, vextq_u8(masked_matches, masked_matches, 8));
    return vaddvq_u16(vreinterpretq_u16_u8(zipped));
}

https://godbolt.org/z/P9b8qWT3Y

Apparently there's a bug somewhere that makes the "vzip1q_u8" not produce the right instruction with clang, but it works fine with gcc.

efriedma added inline comments.Apr 28 2023, 9:25 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42

Filed https://github.com/llvm/llvm-project/issues/62434 for the odd vzip1q_u8 behavior.