Fixes #57502
Details
Diff Detail
Event Timeline
Thanks for working on this! I got my hands tight on working on this [1] but I'm more than glad to collaborate on the review side as well!
A high level question, https://reviews.llvm.org/D133280 has some other test cases for the same pattern, also on big-endian and little-endian systems to show the potential difference. It'd be great to generalize the current solution for those test cases. For simplicity, I'd probably start from solving the problem on little-endian first.
Also, I found it helpful to keep these in mind while working on an implementation
- the caveats of BITCAST (across vectors, or across vectors and scalars)
- memory layout of LLVM IR vectors
https://reviews.llvm.org/D94964 answers the above two questions perfectly.
[1] A preview of unfinished work in https://reviews.llvm.org/differential/diff/460138/ (not polished in terms of code style, and not generalized enough yet)
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
18060–18065 | nit: If I read correctly, this uses MVT::Other as a sentinel. To limit the scope of the problem to be solved, an alternative option (without multiplexing the semantics of MVT::Other) // This won't generalize the solution as commented, but just to illustrate another way of limiting the scope if OrinOp.getValueType().getSimpleVT() != MVT::v2i64 return false; | |
18062 | nit: Bail early (return false) if OrigOp0.getValueType() is not a simple type (getSimpleVT() will assert if OrigOp0.getValueType() is not a simple type. |
In this context, D133495 may also be interesting. It's using tbl to lower i32->i8 truncation and could also be extended to handle i64->i8. This would allow to do the conversion with one instructions plus a load that materializes the mask, which is why D133495 limits this to cases in loops, where the load can be hoisted out.
Your implementation is more comprehensive than mine. Can I proceed the implementation with yours ?
Feel free to go ahead (we won't step on each others toes as long as only one of us is working on this and let the other people know). I think it could more general than the unfinished work (only optimizing two test cases).
I have a question : how does the SIMD instruction view the vector register in big endian mode ?
This question is raised from the confusing execution result of qemu-aarch64_be with the following instructions
fmov d0, x0 mov v0.d[1], x1
Suppose the content of $x0 and $x1 is 0x102030405060708 and 0x90a0b0c0e0f00 respectively. Here is the content of the $v0 after the above instructions is executed:
(gdb) p $v0.b $14 = {u = {9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8}...
This confuses me. The $x1 is stored to the last element, but it shows up in the first. In my understanding, it should be:
{u = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0}...
What goes wrong here ? Does the SIMD instruction view the last element of the vector register as the 0th one in big endian mode ? Where can I find information related to this ?
Thanks :)
I have no idea regarding how gdb decides to print it... but for the general issue, hopefully https://llvm.org/docs/BigEndianNEON.html helps?
bitcast is handled in this diff.
To handle bitcast, we need this observation: uzp1 is just a xtn that operates on two registers simultaneously.
For example, given the following register with type v2i64:
LSB______MSB
x0 x1 | x2 x3 |
Applying xtn on it we get:
x0 | x2 |
This is equivalent to bitcast it to v4i32, and then applying uzp1 on it:
x0 | x1 | x2 | x3 |
=== uzp1 ===>
x0 | x2 | <value from other register> |
We can transform xtn to uzp1 by this observation, and vice versa.
This observation only works on little endian target. Big endian target has a problem: the uzp1 cannot be replaced by xtn since there is a discrepancy in the behavior of uzp1 between the little endian and big endian. To illustrate, take the following for example:
LSB________MSB
x0 | x1 | x2 | x3 |
On little endian, uzp1 grabs x0 and x2, which is right; on big endian, it grabs x3 and x1, which doesn't match what I saw on the document. But, since I'm new to AArch64, take my word with a pinch of salt. This bevavior is observed on gdb, maybe there's issue in the order of the value printed by it ?
Whatever the reason is, the execution result given by qemu just doesn't match. So I disable this on big endian target temporarily until we find the crux.
Take this with a grain of salt
My understanding is that, 'BITCAST' on little-endian works in this context since the element order and byte order is consistent that 'bitcast' won't change the relative order of bytes before and after the cast.
Use LLVM IR <2 x i64> as an example, we refer to element 0 as A0 and element 1 as A1, refer to the higher half (MSB) as A0H, and lower half as A0L
For little-endian,
- A0 is in lane 0 of the register and A1 is in lane1 of the register, with memory representation as
0x0 0x4 0x8 0xc A0L A0H A1L A1H
- After bitcast <2 x i64> to <4 x i32> (which is a store followed by a load), the q0 register is still A0L A0H A1L A1H and LLVM IR <4 x i32> element 0 is A0L
For big-endian, the memory layout of <2 x i64> is
0x0 0x4 0x8 0xc A0H A0L A1H A1L
So after a bitcast to <4 x i32>, q0 register becomes A0H A0L A1H A1L -> for LLVM IR <4 x i32>, element 0 is A0H -> this changes the shuffle result.
p.s. I use small functions like https://godbolt.org/z/63h9xja5e and https://gcc.godbolt.org/z/EsE3eWW71 to wrap my head around the mapping among {LLVM IR, register lanes, memory layout}.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
18058–18059 | nit: maybe use DataLayout::isLittleEndian [1] directly, even if they expands to the same code if isLittleEndian is inlined. | |
18063–18065 | (I wrote assert in the previous work as well) On a second thought, it's more future proof to bail out if type is not one of {v4i116, v2i32, v8i8} in this context, given that UZP1 SDNode definition doesn't require vector element type to be integer (i.e. v4f16 is ok for compilation) Something like Type val; switch (SimpleVT) { case valid-case1: val = ...; break; case valid-case2; val = ... break; default: break; } if val is not set bail out | |
18068 | nit pick: it's more idiomatic to call getOpcode() in the LLVM codebase (even if compiler should do CSE) const unsigned Opcode = Operand.getOpcode(); if (Opcode == ISD::TRUNCATE) ... if (Opcode == ISD::BITCAST) ... | |
18104–18107 | If we see through BITCAST for Op0 and Op1 respectively but UzpOp0 and UzpOp1 have different return type (that could be casted to the same type), the current approach will feed AArch64ISD::UZP1 with two SDValue of different type(see test case in https://godbolt.org/z/TT4ErT5Mf) On the other hand, AArch64ISD::UZP1 expects two operands have the same value type (SDTypeProfile) For little-endian, BITCAST both operands to the type of return value here should work. |
Address kindly feedback.
Most of them are minor issues. No major change regarding the core algorithm.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
18063–18065 | I use switch to implement this logic at line 17892 in the latest diff. | |
18104–18107 | My algorithm requires the two trunc operates on the same type. If not, it won't work correctly. Take this for example: x0 x1 x2 x3 x4 x5 x6 x7 y0 y1 y2 y3 y4 y5 y6 y7 Assume we trunc the left one from v2i32 to v2i16, the right one from v4i16 to v4i18, we get these: x0 x1 x2 x3 x4 x5 x6 x7 ______ y0 y1 y2 y3 y4 y5 y6 y7 These two are asymmetric, we cannot reproduce the same result with uzp1 since it operates on the two operands with the same action, i.e. it can only produce this: x0 x1 x2 x3 x4 x5 x6 x7______y0 y1 y2 y3 y4 y5 y6 y7 or this: x0 x1 x2 x3 x4 x5 x6 x7______y0 y1 y2 y3 y4 y5 y6 y7 But not both at the same time. I bail out if the type of UzpOp0 and UzpOp1 has discrepancy. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
18063–18065 | Correct: 17896. |
Just for curious: This optimization involves a lot of bitcasts. Does the benefit of less xtn outweigh the copious bitcast instructions, i.e. rev(16|32|64) and ext ?
If no, maybe we can just implement this only on little endian ?
Thanks for the work. This LGTM overall. However, I don't consider myself a qualified reviewer as activity history shows, but willing to share and help by reviewing :-)
Following https://llvm.org/docs/CodeReview.html#lgtm-how-a-patch-is-accepted, I think we could wait a little bit more for feedback (or approval :)) from other reviewers.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
18059–18060 | nit pick about style: It's more idiomatic to bail out early, see https://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code | |
18092–18097 | nittest nit: I'd probably move these before auto HalfElementSize = lambda function, and move 'HalfElementSize' closer to where they are used. | |
18099–18103 | nit pick: This is effectively require Uzp1 Op0 and Op1 have the same result type for correctness. To avoid nested if and reduce indentation, probably move this out of if (HalfElementSize(SourceOp0, UzpOp0) &&HalfElementSize(SourceOp1, UzpOp1)), something like: if (SourceOp0.getSimpleValueType() != SourceOp1.getSimpleValueType()) return SDValue(); if (HalfElementSize(Op0..) && HalfElementSize(Op1..)) { assert (UzpOp0.getValueType() == UzpOp1.getValueType()); ... } | |
18104–18107 | Thanks for the example. The update to require the same operand type sounds reasonable. | |
18112 | I wonder if it makes code more readable and succinct by combining this switch with the switch inside HalfElementSize above, given that BitcastResultTy and Uzp1ResultTy are co-related (i.e. 'BitcastResultTy' == 'TruncOperandTy', and 'TruncOperandTy' == bitcast 'Uzp1ResultTy' to double-element-size) | |
18122–18123 | nit pick: this default should be 'llvm_unreachable' based on the context (since HalfElementSize switch rules out invalid cases). (See the other comment) I wonder if merging two switches makes the code shorter without harming readability. |
I myself haven't thought deeply into how to fixing this particular issue in big-endian, but wanted to know mapping of {llvm ir, register lane, memory layout} thereby the paragraph above -> that's partly why I suggest fixing little-endian for simplicity earlier :-)
The idea to combine the two switches sounds good if we can. It is a bit hard to follow if this would always be valid for all truncates through a bitcast. We should be protected against most of that because I don't think it can currently come up. Unfortunately that can make testing it difficult too.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
18054 | VT can be replaced by ResVT. | |
18059 | We only generate AArch64ISD::UZP1 from certain types, which will always be simple. I believe that because we only generate them post-lowering, we can currently rely on the truncates begin legal types too, but that might not always be true if things change. | |
18094 | DestVT == VT == ResVT |
Address kindly feedbacks. No major change in the core algorithm.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
18112 | Combining the two switches to make it more terse is possible, but to make it more readable is beyond my ability, since the two switches have different responsibilities. That is, the first one is responsible for half the element size of a 128 bit vector to a 128bit vector : v2i64 => v4i32 on the other hand, the second one is responsible for double the element size of a 64bit vector to a 128bit vector: v2i32 => v2i64 Putting two different responsibilities on a single switch may turn it into a enigma. So I prefer to maintain the status quo. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
18061–18062 | This doesn't need to check for Simple, given it is checking for specific MVT's. | |
18064 | If we check for specific MVT's below, we are already checking for 64BitVector. | |
18067 | Same again.. | |
18070 | This can just check the ResVT directly if (ResVT != MVT::v2i32 && ResVT != MVT::v4i16 && ResVT != MVT::v8i8) | |
18112 | That makes sense, if the two types are independent and we handle all combinations through the bicast. If we know that SourceOp0 == SourceOp1 though, we needn't call HalfElementSize twice. Just calculate the ResultTy once and bitcast both results to the same type. It should check that the truncate input VT is simple though, if it is used in a switch. Just to be safe. | |
18118 | This looks like it should be UzpOp0. | |
llvm/test/CodeGen/AArch64/aarch64-uzp1-combine.ll | ||
4–5 | This comment can now be updated. |
Thanks. With a couple more nitpicks, this LGTM.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
18077–18084 | These can be combined into a single if. | |
18104 | In llvm it is customary to add messages to asserts. In this case it is likely correct by construction, as we have just created the two ops with the same type above. | |
18121 | should -> Should |
Hi, I realize D133280 is a diffbase from the 'stack' UI, and actually I never figured out how to send stack reviews over others' patches, so didn't know if a) or b) is a simpler procedure
a) for me to submit D133280 first b) you could just commit all changes reviewed in this patch.
Since this patch supersedes the test-only patch, I'm fine with a) or b), whichever makes the procedure simpler. Just let me know :)
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
18104 | I remove it since --- as you indicate --- this is likely to be correct. Even if not, it must be a bug at somewhere else. |
I think it is best to do the same reviewing procedure as this on D133280, and commit it first.
VT can be replaced by ResVT.