This is an archive of the discontinued LLVM Phabricator instance.

Add more efficient vector bitcast for AArch64
ClosedPublic

Authored by lawben on Mar 4 2023, 1:45 AM.

Details

Summary

Adds a DAG combine that checks for vector comparisons followed by a bitcast or truncating store to a scalar value. Previously, this resulted in an expand. Now, this is done with a constant number of instructions that keep one bit per vector lane (via an AND mask) and perform a horizontal add to collapse them into a single scalar value. This is especially useful for Clang's __builtin_convertvector() to a bool vector.
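As a rough illustration with NEON intrinsics (a sketch only; the function and variable names are illustrative and not taken from the patch), the combine turns "compare, then bitcast to scalar" into an AND with one bit per lane followed by a horizontal add:

#include <arm_neon.h>

// Sketch of compare-to-bitmask for an 8-lane vector (assumes little endian).
uint8_t compare_to_bitmask(uint16x8_t a, uint16x8_t b) {
    const uint16x8_t bits = {1, 2, 4, 8, 16, 32, 64, 128}; // one bit per lane
    uint16x8_t matches = vceqq_u16(a, b);                  // all-ones/all-zeros per lane
    uint16x8_t masked = vandq_u16(matches, bits);
    return (uint8_t)vaddvq_u16(masked);                    // horizontal add -> scalar bitmask
}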

Issue: https://github.com/llvm/llvm-project/issues/59829

Diff Detail

Event Timeline

lawben created this revision.Mar 4 2023, 1:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2023, 1:45 AM
lawben requested review of this revision.Mar 4 2023, 1:45 AM

@t.p.northover As this is my first patch submitted to LLVM, this is just a short ping to check if there is something that I have missed or forgotten to do. I'm not yet familiar with the procedure.

lawben updated this revision to Diff 506884.Mar 21 2023, 2:40 AM

Apply the bitcast combine to truncating stores.

The first commit only covers the case with an explicit bitcast to a scalar value. When directly storing the comparison result as a scalar value, this can be done via a truncating store. In that case, we also want to apply the combine step to get the more efficient conversion.

lawben edited the summary of this revision. (Show Details)Mar 21 2023, 2:40 AM
david-arm added inline comments.Mar 23 2023, 6:20 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-and-store.ll
1 (On Diff #506884)

Please can you re-run the update_llc_test_checks.py script for the new version and let it generate the CHECK lines for the new test?

Also, it would be good if you can pre-commit these tests before this patch, so that in the diff here we can see the changes in generated code? That helps the reviewers to see what effect your new DAG combines have.

llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
1

Please can you re-run the update_llc_test_checks.py script for the new version and let it generate the CHECK lines for the new test?

I think this needs updating to include the whole patch, similar to the other one in D146212. I think multiple commits need to be squashed into one by the look of it.

lawben updated this revision to Diff 508032.Mar 24 2023, 3:50 AM

Rebase onto main

@david-arm @dmgreen I rebased this into a single commit. I hope the changes are shown correctly now. All changes here are new, I did not modify any tests. I guess this was just shown incorrectly because I used multiple commits.

Sp00ph added a subscriber: Sp00ph.Mar 25 2023, 7:45 AM

Does "vector comparison followed by bitcast" mean that something like this would not get optimized?

define i16 @cast(<16 x i1> %vec) {
    %ret = bitcast <16 x i1> %vec to i16
    ret i16 %ret
}

Forgive me if I'm wrong, but I don't see why the preceding comparison is necessary. Couldn't you just do the bit-and trick on any mask vector, regardless of how it was created?

lawben added a comment.EditedMar 25 2023, 7:56 AM

@Sp00ph Your example would not be optimized. The issue with that example is: how is a bitcast to i1 defined? The current logic in LLVM uses the least significant bit. But this trick does not work in that case, as we use bits 0 to n for lanes 0 to n, so we only use the least significant one for lane 0. If we have a comparison, we know that all bits are 1 or all bits are 0, so the least significant one is equal to all others. Without a comparison, we could shift the least significant bit and then do the rest, but that would need an extra instruction. Maybe this could be added in a follow-up? I'm happy to discuss options here. The current approach is a bit more conservative.

If we have a comparison, we know that all bits are 1 or all bits are 0, so the least significant one is equal to all others.

Aren't the elements in a <N x i1> guaranteed to be 0 or -1 (so all zeros or all ones) anyways? And even if there was always an extra instruction emitted so that for compare + bitcast the flow would look like this: <initial compare> -> <compare returned bitmask> -> <use and-trick on the result of that>, I would assume that LLVM would just trivially optimize out the second compare if it knows that the result of the first compare already contains all zeros/all ones.

That's a fair point and might actually be a cleaner solution, given that two consecutive comparisons are actually "merged". I've been looking at this primarily from the Clang/C++ side to optimize the __builtin_convertvector() function, which always adds the comparison. I did not know that <n x i1> guarantees that all bits are 0 or 1 if the physical type is larger than i1.

I'll have a look into this next week. I'm not 100% sure yet where this optimization would need to be located (maybe in LowerBITCAST or some bitcast combine). I played around with a few options when writing this code, and depending on where in the optimization I was, the vector type was different, as the <n x i1> vector is not a legal type and gets promoted. It may be a bit tricky to find the correct time to detect the bitcast from <n x i1>. If you have some suggestions/ideas where this could be done, feel free to share. Otherwise, I'll just dig around a bit.

I unfortunately have barely any experience with LLVM internals and its codebase, so I probably won't be of too much help here. I believe that the main culprits behind the bad codegen are load/store operations on vectors whose elements are smaller than one byte. From what I understand, bitcasts most of the time get lowered to alloca; store as SrcType; load as DstType, and the loads/stores are fully scalarized for 1-bit elements. So optimizing the loads/stores would then probably also fix the bitcasts as a byproduct. If I'm not mistaken, the locations to introduce the store/load optimizations would be here: https://github.com/llvm/llvm-project/blob/cb96eba27cd18ecf8041bf1b9a5c7e197f7a2749/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp#L6236 and here: https://github.com/llvm/llvm-project/blob/cb96eba27cd18ecf8041bf1b9a5c7e197f7a2749/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp#L5148

Actually no, those functions I linked to might not be doing what I think they're doing. There's also this, which might be the right one? https://github.com/llvm/llvm-project/blob/cb96eba27cd18ecf8041bf1b9a5c7e197f7a2749/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp#L692-L701

Either way, the right move is probably to either add a target specific lowering that uses the bit-and method to AArch64ISel or to add it as a general fallback to all targets.

I haven't looked into the details here, just the high level. If we can custom legalize the bitcast then that sounds like it could handle quite a few cases. Although it might not automatically optimize quite so nicely in all of them.

One thing worth remembering is that bitcasts work a bit funny under big-endian. They are defined as storing the value then loading it again, which can result in a different lane order.

lawben updated this revision to Diff 510006.EditedMar 31 2023, 6:03 AM

Changed large parts of where this conversion takes place.

It is now located in a) lowering BITCASTs and b) combining truncating stores. This is now more generic than my old approach. In most cases, there is no difference in the generated code. In one or two places, we lose information about the original vector type, so the SETCC is truncated and we then perform the conversion on that truncated vector. So while this adds a vector extract instruction in some cases, it is more general overall and handles more cases.

Now also handles the cases described by @Sp00ph, i.e., we don't need a comparison for this to work, as we might add one ourselves.

This still needs handling of big endian systems. But I think this is in a stage that we can discuss the overall design: @dmgreen @Sp00ph @david-arm

Sp00ph added inline comments.Mar 31 2023, 6:34 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19685–19688

Does this always produce the same result as the current implementation? If in general a trunc store would get lowered to something like this:

%trunc = trunc <N x i8> %vec to <N x i1>
store <N x i1> %trunc, ptr %dst

then I think the result would differ if e.g. the input was <1 x i8> <i8 2> (current implementation would store a 0, this PR would store a 1). This should be easily fixable though by doing vec & splat(1) != splat(0) instead of just vec != splat(0).
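A minimal sketch of that suggested fix with NEON intrinsics (the helper name is made up; this is not code from the patch):

#include <arm_neon.h>

// Turn an arbitrary i8 vector into a proper mask by testing only bit 0,
// i.e. "vec & splat(1) != splat(0)"; a lane value of 2 then yields 0,
// matching the trunc-store semantics.
uint8x16_t mask_from_bit0(uint8x16_t vec) {
    return vtstq_u8(vec, vdupq_n_u8(1)); // all-ones iff (vec & 1) != 0
}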

I think in general it would be way more convenient if LLVM just stored a flag for every vector that remembers whether the vector is guaranteed to be a "mask vector" where all elements are either 0 or -1 (where the flag would be true for all i1 vectors and would be preserved by trunc or sext operations). Then we wouldn't need to manually search whether the vector is a comparison result or not. Adding such a flag seems way out of scope for this PR though, so I think your strategy seems good considering the current constraints.

Sp00ph added inline comments.Mar 31 2023, 6:49 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19649

There's probably quite a few more operations that preserve the mask property (xor, min and max come to mind).

Can we split the store part into a separate patch? As far as I understand they can be treated separately and that would help keep the two parts simpler.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19672

Could we use computeNumSignBits to check if all the bits will be the same? It might not work very well at this point in the lowering, but LLVM has fairly decent support for NumSignBits.

Or always include the setcc and optimize it away if it is not required, which we might know more clearly after everything has been legalized.

There are some other tests that need updating too, by the look of it.

@dmgreen I can split this into two patches. I'll remove the truncate store part and only focus on the bitcast for now.

Re: tests. I'm not sure what the "LLVM-way" to proceed is. For example in

  • In dag-combine-setcc.ll, there are explicit tests to cover the "old" expanded case. I guess I can just delete those, as I have new tests somewhere else? Or should I also change them there?
  • In vec_umulo.ll, there are also cases with this old pattern, but they are interleaved with other logic that I am not familiar with. I can try to change them to the best of my knowledge. Would it make sense to add the original author of that test to review this patch?

@Sp00ph there seems to be a way to get this meta-information about all bits being -1 or 0. See my inline comment about computeNumSignBits. Unfortunately, it does not seem to work well here.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19672

I looked into ComputeNumSignBits a bit but I am going around in circles. We only have a <n x i1> type here, so we get 1, which is not helpful. We need the "original" vector type/size for this to work, which we only have if we traverse through the logical ops and get to the SETCC. If there are any other helper functions that can do this, please let me know. I did not find any, but there are so many that I might have missed one.

I've tried to always add the comparison, but it does not get removed again. So I'd need to add some logic that combines arbitrary SETCCs with boolean operators in-between. I'd like to avoid that, because it seems (way) out of scope for the change I'm proposing here.

19685–19688

Actually, I don't think we need it here. For us to get here, we either go through:

a) replaceBoolVectorBitcast, which guarantees that the target type is <n x i1>
b) combineVectorCompareAndTruncateStore, which explicitly does what you mentioned.

In a), we know that the type is a vector of i1s, so there must have been some operation to generate/truncate that. If it is truncated, the code for AND 1 is added in the truncate.

I can still add this defensively, but I think we don't need it. What do you think?

@dmgreen I can split this into two patches. I'll remove the truncate store part and only focus on the bitcast for now.

Re: tests. I'm not sure what the "LLVM-way" to proceed is. For example in

  • In dag-combine-setcc.ll, there are explicit tests to cover the "old" expanded case. I guess I can just delete those, as I have new tests somewhere else? Or should I also change them there?
  • In vec_umulo.ll, there are also cases with this old pattern, but they are interleaved with other logic that I am not familiar with. I can try to change them to the best of my knowledge. Would it make sense to add the original author of that test to review this patch?

I think you can just update the test checks in them (with update_llc_test_checks.py). So long as the code looks better (smaller, but still correct), all is good.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19672

OK, I was worried that might not work. How sure are we that vXi1 types will always become vectors with all-ones or all-zeros lanes? It feels like we might be hoping they will become them, as opposed to proving it is true (and then adding an optimization for the cases where that isn't the case, at least for the bitcast).

19685–19688

What happens if the isChainOfComparesAndLogicalOps path is removed? I'm not sure all the tests would still look OK.

lawben updated this revision to Diff 511335.Apr 6 2023, 2:28 AM

Addressed some review comments.

Main changes:

  • Removed the truncate store code. I'll add this in a follow-up patch.
  • Defensively added an AND 1 check if we don't have a SETCC chain, for correct truncation.

The failing tests from before are solved for now. Most were related to truncating stores, which is not part of this patch anymore.
The dag-combine-setcc tests were adapted in a recent patch, so they no longer apply.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19672

With this "chain" check, we can guarantee that all bits are 1s or 0s as the chain _must_ start with SETCC. Adding AND/OR/XORs with other vectors that _must_ start with SETCC does not change that property. If we don't find this chain (either is is not there or too long), we defensively add a check for != 0, which again guarantees that all bits are 1s or 0s. So after both the if and else branch, we know that all bits are 1s or 0s. I've added another test that checks if this check is added for a "bad" chain.

19685–19688

I'm not saying that the isChainOf check should be removed. I was arguing for not needing the AND 1 truncation. I've decided to add it defensively anyway to avoid producing wrong results. It has no impact on the current 19 new tests, as it is either created here or somewhere else during lowering, so there is no additional cost in all cases I've looked at.

This also allows me to remove the somewhat duplicated logic in combineVectorCompareAndTruncateStore (see follow-up patch), so this is probably the cleaner way anyways.

lawben added inline comments.Apr 6 2023, 2:31 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19649

Adding XOR is reasonable. I'd not add MIN/MAX, as they are not logical operators. So while they technically apply and preserve the all-1/0 property, I feel like they don't "belong" here.

dmgreen added inline comments.Apr 9 2023, 8:07 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19685–19688

Sorry - my point was that if we take this patch and remove the if (isChainOfComparesAndLogicalOps(..)) part, some of the tests do not look correct any more, which suggests this path isn't always valid.

As far as I understand, VecVT is always a vXi1 type, which makes the And with 1 a bit odd, and we are assuming that a setcc vXi1 N, vXi1 0 will always become a, say, v16i8 vector with all the lanes either 0 or 1. I think instead it would be better to produce a sext vXi1 N to vXi?, where the new type is a legal size. This is essentially what getBoolExtOrTrunc is doing already, but it might be better to be explicit as this transform really requires the lanes to be all-one or all-zero.

The best type (vXi?) needs to be guessed at, which might best come from the setcc if there is one (the same as the existing isChainOfComparesAndLogicalOps, but it only needs to be a guess now to get the best type).

This way we make sure that even in isolation, this transform is "correct" and is not making assumptions about what will happen in other places with vXi1 types.

llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
7

It may be good to add more tests for the combinations of [2, 4, 8, 16] x [i8, i16, i32], to make sure they are performing OK and not running into any problems.

lawben updated this revision to Diff 512204.Apr 10 2023, 11:42 AM

Changed approach as suggested by @dmgreen. We now use an explicit sign-extend and ignore the vector compare. The sign-extend is removed in later steps if there is a vector compare, so there is no overhead. This change allows us to determine the original type in more cases, as we can detect both SETCC and TRUNC.

Added a few more tests, also explicitly checking <64-bit vectors and <8-bit elements.

Just as an FYI for the tests: the sign-extend is implemented as a shift left by size-1 bits and a signed comparison <0, which is true if the MSB is set.
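For illustration, a minimal sketch of that sign-extend lowering with NEON intrinsics (assuming i8 lanes holding 0/1 values; the function name is made up and this is not code from the patch):

#include <arm_neon.h>

// shl #7 moves the interesting bit into the MSB; cmlt #0 (signed "< 0")
// then turns it into an all-ones/all-zeros lane.
uint8x16_t sign_extend_bit0(int8x16_t v) {
    return vcltzq_s8(vshlq_n_s8(v, 7));
}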

lawben added inline comments.Apr 10 2023, 11:45 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19685–19688

I've implemented it like you suggested with a sign-extend. I'm "guessing" the vector type to be 64-bit. This may add an xtn instruction for 128-bit vectors, but in a few cases, the vector is reduced to 64-bit anyway, so we don't lose anything there. The alternative is to assume 128-bit vectors, but that resulted in a few expansions that we don't need.

dmgreen accepted this revision.Apr 11 2023, 5:10 AM

LGTM

Thanks for the patch. Let me know if you want me to submit it.

This revision is now accepted and ready to land.Apr 11 2023, 5:10 AM

Thanks for your effort in reviewing this patch. I think this solution is nicer than my original approach.

Could you please merge it with the author set to "Lawrence Benson <github@lawben.com>".

FYI: Once this is merged, I'll create the follow-up for the truncating store.

Sorry - me again. I was running some precommit tests, compiling ffmpeg, and it ran into problems (it is quite good at that with vector operations). I think it is because there is a bitcast(fcmp v4f32). Can you add a test for that case too?

lawben updated this revision to Diff 512857.Apr 12 2023, 8:52 AM

Add test for float vector. This required a single-line change to convert the VecVT to an integer vector for sign-extend to work.

This revision was automatically updated to reflect the committed changes.
efriedma added inline comments.Apr 21 2023, 12:53 PM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42

Instead of addv.8b+addv.8b+fmov+fmov+orr, you could use zip1+addv.8h+fmov, I think?

lawben added inline comments.Apr 22 2023, 4:05 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42

I did a quick implementation with NEON intrinsics. Your idea is correct, but it is combined into a different set of instructions in the end.

The gist of it is: if we use vzip_u8 to combine both halves, this returns a uint8x8x2_t, which we need to combine into a uint8x16_t for the vadd.8h. But this is essentially the same as just shuffling the input bytes of the original comparison result in the form 0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15. As far as I know, there is no instruction to zip two "smaller" vectors into a "larger" one, so we need the shuffle (as tbl) here.
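A rough sketch of this tbl-based variant with NEON intrinsics (illustrative only; the function name is made up, little-endian byte order is assumed, and this is not the committed lowering):

#include <arm_neon.h>

uint16_t bitmask_v16i8_tbl(uint8x16_t a, uint8x16_t b) {
    const uint8x16_t bits = {1, 2, 4, 8, 16, 32, 64, 128,
                             1, 2, 4, 8, 16, 32, 64, 128};
    // Interleave bytes 0..7 with bytes 8..15 so each u16 lane pairs one byte
    // from each half; a single addv.8h then yields the full 16-bit mask.
    const uint8x16_t idx = {0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15};
    uint8x16_t matches = vceqq_u8(a, b);
    uint8x16_t masked = vandq_u8(matches, bits);
    uint8x16_t zipped = vqtbl1q_u8(masked, idx);
    return vaddvq_u16(vreinterpretq_u16_u8(zipped));
}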

On my M1 MacBook Pro, this is actually ~50% faster than my original code with two addv. We are replacing an extract + addv + fmov + or with adrp + ldr + tbl. This seems to be a good trade-off, at least on an M1. I read somewhere that addv is quite expensive, so maybe trading one for a tbl is good.

@efriedma @dmgreen What are your thoughts on this? I'm currently building on this patch in D148316. I would suggest merging that one first and then updating the v16i8 strategy.

efriedma added inline comments.Apr 24 2023, 9:35 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42

ext+zip1 vs. tbl isn't a huge difference in most cases. (Maybe we're combining the NEON intrinsics a little too aggressively, though? tbl is sort of slow on some chips.)

Fixing this as a followup to D148316 seems fine.

lawben added inline comments.Apr 28 2023, 7:19 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42

I did some benchmarks on my M1, Graviton 2, and Graviton 3. The results indicate the following things:

  • Doing this for v2i64 is beneficial on M1 and Graviton 2. On Graviton 3, it's slightly slower. So this puts us in a bit of a pickle. The gains on M1 and Graviton 2 are larger (+50% and +20%) than the loss on Graviton 3 (-10%), so I think it makes sense to keep this as is. What do you think?
  • The tbl variant (the combination that I mentioned based on the zip approach proposed by @efriedma) is faster by ~50% on all three CPUs. So I think it makes sense to change the implementation to use a shufflevector + addv instead of two addv, as addv seems to be more expensive across the board.
  • As soon as there are 4 elements or more, it pays off on all measured CPUs.

I'll submit a patch in the next few days to update the v16i8 implementation, and possibly the v2i64 one, depending on your comments.

ext+zip1 vs. tbl isn't a huge difference in most cases

I'm still not quite sure if I fully understood your suggested approach though. I don't see how we can get this down to a single zip1 instruction. As far as I can tell, we need 2x ext, zip1 and zip2 + ins to combine the two again. In that case, tbl is probably a decent choice. But if you have a suggestion to actually get this down to a single zip, please let me know so I can test it.

efriedma added inline comments.Apr 28 2023, 9:14 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42
#include <arm_neon.h>

uint16_t ext_zip(uint8x16_t a, uint8x16_t b) {
    constexpr uint8x16_t mask = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};
    auto matches = vceqq_u8(a, b);
    auto masked_matches = vandq_u8(matches, mask);
    // ext brings the high half down; zip1 interleaves it with the low half,
    // so a single addv.8h sums the two halves into the two result bytes.
    auto zipped = vzip1q_u8(masked_matches, vextq_u8(masked_matches, masked_matches, 8));
    return vaddvq_u16(vreinterpretq_u16_u8(zipped));
}

https://godbolt.org/z/P9b8qWT3Y

Apparently there's a bug somewhere that makes the "vzip1q_u8" not produce the right instruction with clang, but it works fine with gcc.

efriedma added inline comments.Apr 28 2023, 9:25 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-to-bitmask.ll
42

Filed https://github.com/llvm/llvm-project/issues/62434 for the odd vzip1q_u8 behavior.