This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Improve codegen for "trunc <4 x i64> to <4 x i8>" for all cases
ClosedPublic

Authored by 0x59616e on Sep 14 2022, 4:00 AM.

Diff Detail

Event Timeline

0x59616e created this revision. · Sep 14 2022, 4:00 AM
Herald added a project: Restricted Project. · View Herald Transcript · Sep 14 2022, 4:00 AM
0x59616e requested review of this revision. · Sep 14 2022, 4:00 AM

Thanks for working on this! I've already gotten my hands dirty working on this myself [1], but I'm more than glad to collaborate on the review side as well!

A high-level question: https://reviews.llvm.org/D133280 has some other test cases for the same pattern, on both big-endian and little-endian systems, to show the potential difference. It'd be great to generalize the current solution for those test cases. For simplicity, I'd probably start by solving the problem on little-endian first.

Also, I found it helpful to keep these in mind while working on an implementation:

  1. the caveats of BITCAST (across vectors, or across vectors and scalars)
  2. memory layout of LLVM IR vectors

https://reviews.llvm.org/D94964 answers the above two questions perfectly.

[1] A preview of unfinished work in https://reviews.llvm.org/differential/diff/460138/ (not polished in terms of code style, and not generalized enough yet)

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18039–18044

nit:

If I read correctly, this uses MVT::Other as a sentinel.

To limit the scope of the problem to be solved, an alternative option (without multiplexing the semantics of MVT::Other)

// This won't generalize the solution as commented; it just illustrates
// another way of limiting the scope.
if (OrigOp0.getValueType().getSimpleVT() != MVT::v2i64)
  return false;
18041

nit:

Bail early (return false) if OrigOp0.getValueType() is not a simple type (getSimpleVT() will assert if OrigOp0.getValueType() is not a simple type).

mingmingl added a reviewer: fhahn.

Took the liberty to loop in a few more folks for AArch64.

In this context, D133495 may also be interesting. It's using tbl to lower i32->i8 truncations and could also be extended to handle i64->i8. This would allow doing the conversion with one instruction plus a load that materializes the mask, which is why D133495 limits this to cases in loops, where the load can be hoisted out.

Thanks for the eye-opening information! I'll embark on this ASAP ;)

0x59616e added a comment (edited). · Sep 15 2022, 7:56 PM


Your implementation is more comprehensive than mine. May I proceed with your implementation?


Feel free to go ahead (we won't step on each other's toes, as long as only one of us works on this at a time and lets the other people know). I think it could be more general than the unfinished work (which only optimizes two test cases).


Huge thanks for your kindness

0x59616e added a comment (edited). · Sep 22 2022, 9:53 AM

I have a question: how does a SIMD instruction view the vector register in big-endian mode?

This question arises from the confusing execution result of qemu-aarch64_be with the following instructions:

fmov d0, x0
mov v0.d[1], x1

Suppose the content of $x0 and $x1 is 0x102030405060708 and 0x90a0b0c0d0e0f00 respectively. Here is the content of $v0 after the above instructions are executed:

(gdb) p $v0.b
$14 = {u = {9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8}...

This confuses me. $x1 was stored to the last element, but it shows up first. In my understanding, it should be:

{u = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0}...

What went wrong here? Does the SIMD instruction view the last element of the vector register as the 0th one in big-endian mode? Where can I find information related to this?

Thanks :)

I have no idea regarding how gdb decides to print it... but for the general issue, hopefully https://llvm.org/docs/BigEndianNEON.html helps?


This is helpful. It contains some information that I didn't know before. Thanks :)

0x59616e updated this revision to Diff 462775 (edited). · Sep 25 2022, 3:44 PM

bitcast is handled in this diff.

To handle bitcast, we need this observation: uzp1 is just an xtn that operates on two registers simultaneously.

For example, given the following register with type v2i64, viewed as four 32-bit chunks:

LSB ______________ MSB
x0 | x1 | x2 | x3

Applying xtn on it, we get:

x0 | x2

This is equivalent to bitcasting it to v4i32 and then applying uzp1 on it:

x0 | x1 | x2 | x3

=== uzp1 ===>

x0 | x2 | <values from the other register>

We can transform xtn to uzp1 by this observation, and vice versa.

This observation only holds on little-endian targets. Big-endian targets have a problem: uzp1 cannot replace xtn there, since uzp1 behaves differently between little endian and big endian. To illustrate, take the following for example:

LSB ______________ MSB
x0 | x1 | x2 | x3

On little endian, uzp1 grabs x0 and x2, which is right; on big endian, it grabs x3 and x1, which doesn't match what I saw in the documentation. But since I'm new to AArch64, take my words with a pinch of salt. This behavior was observed in gdb; maybe there's an issue with the order in which it prints values?

Whatever the reason is, the execution result given by qemu just doesn't match, so I've disabled this on big-endian targets temporarily until we find the crux.

0x59616e marked 2 inline comments as done. · Sep 25 2022, 3:45 PM
mingmingl added a comment (edited). · Sep 26 2022, 11:47 AM


Take this with a grain of salt

My understanding is that 'BITCAST' on little-endian works in this context because the element order and the byte order are consistent, so 'bitcast' won't change the relative order of bytes before and after the cast.

Using LLVM IR <2 x i64> as an example, we refer to element 0 as A0 and element 1 as A1, and to the higher half (MSB) of A0 as A0H and the lower half as A0L.

For little-endian,

  1. A0 is in lane 0 of the register and A1 is in lane 1, with memory representation:
     0x0 0x4 0x8 0xc
     A0L A0H A1L A1H
  2. After bitcast <2 x i64> to <4 x i32> (which is a store followed by a load), the q0 register is still A0L A0H A1L A1H, and LLVM IR <4 x i32> element 0 is A0L.

For big-endian, the memory layout of <2 x i64> is

0x0 0x4 0x8 0xc
A0H A0L A1H A1L

So after a bitcast to <4 x i32>, q0 register becomes A0H A0L A1H A1L -> for LLVM IR <4 x i32>, element 0 is A0H -> this changes the shuffle result.

p.s. I use small functions like https://godbolt.org/z/63h9xja5e and https://gcc.godbolt.org/z/EsE3eWW71 to wrap my head around the mapping among {LLVM IR, register lanes, memory layout}.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18037–18038

nit: maybe use DataLayout::isLittleEndian [1] directly, even if they expand to the same code once isLittleEndian is inlined.

[1] https://github.com/llvm/llvm-project/blob/1f451a8bd6f32465b6ff26c30ba7fb6fc7e0e689/llvm/include/llvm/IR/DataLayout.h#L244

18042–18044

(I wrote assert in the previous work as well)

On second thought, it's more future-proof to bail out if the type is not one of {v4i16, v2i32, v8i8} in this context, given that the UZP1 SDNode definition doesn't require the vector element type to be an integer (i.e. v4f16 compiles fine).

Something like

MVT Val;
switch (SimpleVT.SimpleTy) {
  case valid-case-1:
    Val = ...;
    break;
  case valid-case-2:
    Val = ...;
    break;
  default:
    break;
}

if (Val is not set)
  return false; // bail out
18047

nit pick: it's more idiomatic in the LLVM codebase to call getOpcode() once and reuse the result (even if the compiler should CSE the calls):

const unsigned Opcode = Operand.getOpcode();
if (Opcode == ISD::TRUNCATE) 
  ...
if (Opcode == ISD::BITCAST)
  ...
18083–18086

If we see through BITCAST for Op0 and Op1 respectively, but UzpOp0 and UzpOp1 have different return types (that could be cast to the same type), the current approach will feed AArch64ISD::UZP1 two SDValues of different types (see the test case in https://godbolt.org/z/TT4ErT5Mf).

On the other hand, AArch64ISD::UZP1 expects its two operands to have the same value type (SDTypeProfile).

For little-endian, BITCASTing both operands to the type of the return value here should work.

0x59616e updated this revision to Diff 463409. · Sep 27 2022, 8:12 PM
0x59616e marked 3 inline comments as done.

Addressed the kind feedback.

Most of the comments were minor issues; no major change regarding the core algorithm.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18042–18044

I used a switch to implement this logic, at line 17892 in the latest diff.

18083–18086

My algorithm requires that the two truncs operate on the same type; if not, it won't work correctly. Take this for example (each of x0..x7 and y0..y7 is one byte):

x0 x1 x2 x3 x4 x5 x6 x7    y0 y1 y2 y3 y4 y5 y6 y7

Assume we trunc the left one from v2i32 to v2i16 and the right one from v4i16 to v4i8. The bytes kept are:

x0 x1 x2 x3 x4 x5 x6 x7    y0 y1 y2 y3 y4 y5 y6 y7
 ^  ^        ^  ^           ^     ^     ^     ^

These two are asymmetric; we cannot reproduce this result with uzp1, since it performs the same action on both operands, i.e. it can only produce this:

x0 x1 x2 x3 x4 x5 x6 x7    y0 y1 y2 y3 y4 y5 y6 y7
 ^  ^        ^  ^           ^  ^        ^  ^

or this:

x0 x1 x2 x3 x4 x5 x6 x7    y0 y1 y2 y3 y4 y5 y6 y7
 ^     ^     ^     ^        ^     ^     ^     ^

but not both at the same time.

I now bail out if the types of UzpOp0 and UzpOp1 differ.

0x59616e added inline comments. · Sep 27 2022, 8:13 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18042–18044

Correct: 17896.

0x59616e added a comment (edited). · Sep 27 2022, 8:16 PM


Just out of curiosity: this optimization involves a lot of bitcasts. Does the benefit of fewer xtns outweigh the cost of the copious bitcast instructions, i.e. rev(16|32|64) and ext?

If not, maybe we should implement this only on little endian?

Thanks for the work. This LGTM overall. However, I don't consider myself a qualified reviewer, as my activity history shows, but I'm willing to share and help by reviewing :-)

Following https://llvm.org/docs/CodeReview.html#lgtm-how-a-patch-is-accepted, I think we could wait a little longer for feedback (or approval :)) from other reviewers.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18038–18039

nit pick about style:

It's more idiomatic to bail out early, see https://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code

18071–18076

nittest nit:

I'd probably move these before the auto HalfElementSize = ... lambda, and move 'HalfElementSize' closer to where it is used.

18078–18082

nit pick:

This effectively requires that Uzp1's Op0 and Op1 have the same result type for correctness.

To avoid nested ifs and reduce indentation, probably move this out of if (HalfElementSize(SourceOp0, UzpOp0) && HalfElementSize(SourceOp1, UzpOp1)), something like:

if (SourceOp0.getSimpleValueType() != SourceOp1.getSimpleValueType())
  return SDValue();

if (HalfElementSize(Op0...) && HalfElementSize(Op1...)) {
  assert(UzpOp0.getValueType() == UzpOp1.getValueType());
  ...
}
18083–18086

Thanks for the example. The update to require the same operand type sounds reasonable.

18091

I wonder if it would make the code more readable and succinct to combine this switch with the switch inside HalfElementSize above, given that BitcastResultTy and Uzp1ResultTy are correlated (i.e. 'BitcastResultTy' == 'TruncOperandTy', and 'TruncOperandTy' == 'Uzp1ResultTy' bitcast to double the element size).

18101–18102

nit pick: this default should be 'llvm_unreachable' given the context (since the HalfElementSize switch rules out invalid cases).

(See the other comment.) I wonder if merging the two switches makes the code shorter without harming readability.


Just out of curiosity: this optimization involves a lot of bitcasts. Does the benefit of fewer xtns outweigh the cost of the copious bitcast instructions, i.e. rev(16|32|64) and ext?

If not, maybe we should implement this only on little endian?

I myself haven't thought deeply about how to fix this particular issue on big-endian, but I wanted to understand the mapping among {LLVM IR, register lanes, memory layout}, hence the paragraph above; that's partly why I suggested fixing little-endian first, for simplicity :-)

The idea of combining the two switches sounds good, if we can. It is a bit hard to follow whether this would always be valid for all truncates through a bitcast. We should be protected against most of that, because I don't think it can currently come up. Unfortunately, that can also make it difficult to test.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18033

VT can be replaced by ResVT.

18038

We only generate AArch64ISD::UZP1 from certain types, which will always be simple. I believe that, because we only generate them post-lowering, we can currently rely on the truncates being legal types too, but that might not always be true if things change.

18073

DestVT == VT == ResVT

0x59616e updated this revision to Diff 464462. · Sep 30 2022, 8:11 PM
0x59616e marked 7 inline comments as done.

Addressed the kind feedback. No major change in the core algorithm.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18091

Combining the two switches to make the code more terse is possible, but making it more readable is beyond my ability, since the two switches have different responsibilities.

That is, the first one is responsible for halving the element size of a 128-bit vector, yielding a 128-bit vector:

v2i64 => v4i32
v4i32 => v8i16
v8i16 => v16i8

The second one, on the other hand, is responsible for doubling the element size of a 64-bit vector, yielding a 128-bit vector:

v2i32 => v2i64
v4i16 => v4i32
v8i8 => v8i16

Putting two different responsibilities in a single switch may turn it into an enigma, so I prefer to maintain the status quo.

dmgreen added inline comments. · Oct 9 2022, 3:11 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18040–18041

This doesn't need to check for Simple, given it is checking for specific MVTs.

18043

If we check for specific MVT's below, we are already checking for 64BitVector.

18046

Same again..

18049

This can just check the ResVT directly

if (ResVT != MVT::v2i32 && ResVT != MVT::v4i16 && ResVT != MVT::v8i8)
18091

That makes sense, if the two types are independent and we handle all combinations through the bitcast. If we know that SourceOp0 == SourceOp1, though, we needn't call HalfElementSize twice; just calculate the ResultTy once and bitcast both results to the same type.

It should check that the truncate input VT is simple, though, if it is used in a switch. Just to be safe.

18097

This looks like it should be UzpOp0.

llvm/test/CodeGen/AArch64/aarch64-uzp1-combine.ll
4–5

This comment can now be updated.

Oh, also make sure you update all the tests that need it.

0x59616e updated this revision to Diff 466441. · Oct 10 2022, 12:13 AM
0x59616e marked 9 inline comments as done.

Addressed the friendly feedback and updated all the affected tests.

dmgreen accepted this revision. · Oct 11 2022, 1:29 AM

Thanks. With a couple more nitpicks, this LGTM.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18056–18063

These can be combined into a single if.

18083

In LLVM it is customary to add messages to asserts. In this case it is likely correct by construction, as we have just created the two ops with the same type above.

18100

should -> Should

This revision is now accepted and ready to land. · Oct 11 2022, 1:29 AM

Hi, I realized D133280 is a diffbase from the 'stack' UI, and I never figured out how to send stacked reviews over others' patches, so I didn't know which is the simpler procedure:
a) I submit D133280 first, or b) you just commit all the changes reviewed in this patch.

Since this patch supersedes the test-only patch, I'm fine with either a) or b), whichever makes the procedure simpler. Just let me know :)

0x59616e updated this revision to Diff 466984. · Oct 11 2022, 5:59 PM
0x59616e marked 3 inline comments as done.

Addressed the friendly feedback.

0x59616e added inline comments. · Oct 11 2022, 6:01 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
18083

I removed it since, as you indicated, it is likely correct by construction.

Even if not, the bug must be somewhere else.


I think it is best to follow the same review procedure on D133280 as on this patch, and commit it first.

Thanks for all of your kind help ;)

Time to send this off upstream.