This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Improve codegen for "trunc <4 x i64> to <4 x i8>" for all cases
ClosedPublic

Authored by 0x59616e on Sep 14 2022, 4:00 AM.

Diff Detail

Event Timeline

0x59616e created this revision. Sep 14 2022, 4:00 AM
Herald added a project: Restricted Project. · View Herald Transcript · Sep 14 2022, 4:00 AM
0x59616e requested review of this revision. Sep 14 2022, 4:00 AM

Thanks for working on this! I had started working on this myself [1], but I'm more than glad to collaborate on the review side as well!

A high-level question: https://reviews.llvm.org/D133280 has some other test cases for the same pattern, on both big-endian and little-endian systems, to show the potential difference. It'd be great to generalize the current solution to those test cases. For simplicity, I'd probably start by solving the problem on little-endian first.

Also, I found it helpful to keep these in mind while working on an implementation:

  1. the caveats of BITCAST (across vectors, or across vectors and scalars)
  2. memory layout of LLVM IR vectors

https://reviews.llvm.org/D94964 answers the above two questions perfectly.

[1] A preview of unfinished work in https://reviews.llvm.org/differential/diff/460138/ (not polished in terms of code style, and not generalized enough yet)

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17586–17591

nit:

If I read correctly, this uses MVT::Other as a sentinel.

To limit the scope of the problem to be solved, an alternative option (without multiplexing the semantics of MVT::Other) would be:

// This won't generalize the solution as commented; it just illustrates
// another way of limiting the scope.
if (OrigOp0.getValueType().getSimpleVT() != MVT::v2i64)
  return false;
17588

nit:

Bail out early (return false) if OrigOp0.getValueType() is not a simple type, since getSimpleVT() will assert if it isn't.
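
A minimal sketch of that early bail-out (EVT::isSimple is the existing predicate for this check):

if (!OrigOp0.getValueType().isSimple())
  return false;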

mingmingl added a reviewer: fhahn.

Took the liberty to loop in a few more folks for AArch64.

In this context, D133495 may also be interesting. It uses tbl to lower i32->i8 truncation and could also be extended to handle i64->i8. This would allow doing the conversion with one instruction plus a load that materializes the mask, which is why D133495 limits this to cases in loops, where the load can be hoisted out.
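
For intuition, here is a rough sketch of the tbl idea at the intrinsics level (an assumed illustration of the concept on little endian, not D133495's actual code): a single tbl with an index vector picks the low byte of each 32-bit lane, and out-of-range indices yield zeros.

#include <arm_neon.h>

// trunc <4 x i32> to <4 x i8>, widened to a 16-byte result: indices 0,4,8,12
// select the low byte of each lane (little endian); index 16 is out of range
// for a one-register tbl, so those lanes become 0.
uint8x16_t trunc_v4i32_to_v4i8(uint32x4_t V) {
  static const uint8_t Idx[16] = {0, 4, 8, 12, 16, 16, 16, 16,
                                  16, 16, 16, 16, 16, 16, 16, 16};
  return vqtbl1q_u8(vreinterpretq_u8_u32(V), vld1q_u8(Idx));
}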

Thanks for the eye-opening information! I'll embark on this ASAP ;)

0x59616e added a comment. Edited Sep 15 2022, 7:56 PM

Your implementation is more comprehensive than mine. Can I proceed with your implementation?

Feel free to go ahead (we won't step on each other's toes as long as only one of us is working on this and lets the other know). I think it could be more general than the unfinished work (which only optimizes two test cases).

Huge thanks for your kindness

0x59616e added a comment. Edited Sep 22 2022, 9:53 AM

I have a question: how do the SIMD instructions view the vector register in big-endian mode?

This question arose from the confusing execution results of qemu-aarch64_be with the following instructions:

fmov d0, x0
mov v0.d[1], x1

Suppose the contents of $x0 and $x1 are 0x0102030405060708 and 0x090a0b0c0d0e0f00 respectively. Here are the contents of $v0 after the above instructions are executed:

(gdb) p $v0.b
$14 = {u = {9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8}...

This confuses me. $x1 was stored to the last element, but it shows up first. In my understanding, it should be:

{u = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0}...

What is going wrong here? Do the SIMD instructions view the last element of the vector register as the 0th one in big-endian mode? Where can I find information related to this?

Thanks :)

I have no idea regarding how gdb decides to print it... but for the general issue, hopefully https://llvm.org/docs/BigEndianNEON.html helps?

This is helpful. It contains some information that I didn't know before. Thanks :)

0x59616e updated this revision to Diff 462775. Edited Sep 25 2022, 3:44 PM

bitcast is handled in this diff.

To handle bitcast, we need this observation: uzp1 is just an xtn that operates on two registers simultaneously.

For example, given the following register of type v2i64, drawn as four 32-bit chunks with the LSB on the left:

x0 | x1 | x2 | x3

Applying xtn to it, we get:

x0 | x2

This is equivalent to bitcasting it to v4i32 and then applying uzp1 to it:

x0 | x1 | x2 | x3

=== uzp1 ===>

x0 | x2 | <values from the other register>

By this observation we can transform xtn into uzp1, and vice versa.
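
In SelectionDAG terms, the equivalence looks roughly like this (a hedged sketch; Vec and DL are assumed from a surrounding combine, this is not the patch's actual code):

// Vec has type v2i64, chunks | x0 x1 | x2 x3 | as drawn above.
SDValue Xtn = DAG.getNode(ISD::TRUNCATE, DL, MVT::v2i32, Vec);   // x0 | x2
SDValue Cast = DAG.getBitcast(MVT::v4i32, Vec);                  // x0 | x1 | x2 | x3
SDValue Uzp = DAG.getNode(AArch64ISD::UZP1, DL, MVT::v4i32, Cast, Cast);
// The low half of Uzp equals Xtn: x0 | x2 | x0 | x2 (little endian only).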

This observation only works on little-endian targets. Big-endian targets have a problem: the uzp1 cannot be replaced by xtn, since there is a discrepancy in the behavior of uzp1 between little endian and big endian. To illustrate, take the following for example (LSB on the left):

x0 | x1 | x2 | x3

On little endian, uzp1 grabs x0 and x2, which is right; on big endian, it grabs x3 and x1, which doesn't match what I saw in the documentation. But since I'm new to AArch64, take my words with a pinch of salt. This behavior was observed in gdb; maybe there's an issue with the order of the values it prints?

Whatever the reason is, the execution result given by qemu just doesn't match, so I disable this on big-endian targets temporarily until we find the crux.

0x59616e marked 2 inline comments as done. Sep 25 2022, 3:45 PM
mingmingl added a comment. Edited Sep 26 2022, 11:47 AM

Take this with a grain of salt

My understanding is that BITCAST on little-endian works in this context because the element order and the byte order are consistent, so 'bitcast' won't change the relative order of bytes before and after the cast.

Using LLVM IR <2 x i64> as an example: we refer to element 0 as A0 and element 1 as A1, and refer to the higher half (MSB) of A0 as A0H and the lower half as A0L.

For little-endian,

  1. A0 is in lane 0 of the register and A1 is in lane 1 of the register, with memory representation:

     0x0  0x4  0x8  0xc
     A0L  A0H  A1L  A1H

  2. After bitcast <2 x i64> to <4 x i32> (which behaves like a store followed by a load), the q0 register still holds A0L A0H A1L A1H, and element 0 of the LLVM IR <4 x i32> is A0L.

For big-endian, the memory layout of <2 x i64> is

0x0  0x4  0x8  0xc
A0H  A0L  A1H  A1L

So after a bitcast to <4 x i32>, the q0 register becomes A0H A0L A1H A1L; element 0 of the LLVM IR <4 x i32> is then A0H, which changes the shuffle result.

p.s. I use small functions like https://godbolt.org/z/63h9xja5e and https://gcc.godbolt.org/z/EsE3eWW71 to wrap my head around the mapping among {LLVM IR, register lanes, memory layout}.
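
In the same spirit as those godbolt links, a small standalone C++ sketch of "bitcast is a store followed by a load" (my own illustration, not from the patch): the 16 bytes stay put, only the element interpretation changes, so endianness decides which half of each i64 becomes element 0 of the i32 view.

#include <cstdint>
#include <cstring>

// Reinterpret <2 x i64> as <4 x i32> through memory, exactly like a
// store/load pair: no byte moves, only the element boundaries change.
void bitcast_2xi64_to_4xi32(const uint64_t In[2], uint32_t Out[4]) {
  std::memcpy(Out, In, 16);
}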

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17584–17585

nit: maybe use DataLayout::isLittleEndian [1] directly, even if both expand to the same code once isLittleEndian is inlined.

[1] https://github.com/llvm/llvm-project/blob/1f451a8bd6f32465b6ff26c30ba7fb6fc7e0e689/llvm/include/llvm/IR/DataLayout.h#L244
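
A minimal sketch of that guard (assuming DAG is the surrounding SelectionDAG, as in the other combines in this file):

if (!DAG.getDataLayout().isLittleEndian())
  return SDValue();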

17589–17591

(I wrote an assert in the previous work as well.)

On second thought, it's more future-proof to bail out if the type is not one of {v4i16, v2i32, v8i8} in this context, given that the UZP1 SDNode definition doesn't require the vector element type to be integer (i.e. v4f16 is ok for compilation).

Something like

std::optional<MVT> Val;
switch (SimpleVT.SimpleTy) {
case /* valid case 1 */:
  Val = ...;
  break;
case /* valid case 2 */:
  Val = ...;
  break;
default:
  break;
}

if (!Val)
  return ...; // bail out
17594

nit pick: it's more idiomatic in the LLVM codebase to call getOpcode() once and reuse the result (even if the compiler should CSE the calls anyway):

const unsigned Opcode = Operand.getOpcode();
if (Opcode == ISD::TRUNCATE) 
  ...
if (Opcode == ISD::BITCAST)
  ...
17630–17633

If we see through BITCAST for Op0 and Op1 respectively, but UzpOp0 and UzpOp1 have different result types (which could be bitcast to the same type), the current approach will feed AArch64ISD::UZP1 two SDValues of different types (see the test case in https://godbolt.org/z/TT4ErT5Mf).

On the other hand, AArch64ISD::UZP1 expects its two operands to have the same value type (per its SDTypeProfile).

For little-endian, BITCASTing both operands to the type of the return value here should work.
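
A minimal sketch of that little-endian fix (ResVT, UzpOp0/UzpOp1, and DL are assumed from the surrounding patch): normalize both operands to the result type so the same-type constraint of AArch64ISD::UZP1 is satisfied.

UzpOp0 = DAG.getBitcast(ResVT, UzpOp0);
UzpOp1 = DAG.getBitcast(ResVT, UzpOp1);
SDValue Uzp1 = DAG.getNode(AArch64ISD::UZP1, DL, ResVT, UzpOp0, UzpOp1);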

0x59616e updated this revision to Diff 463409. Sep 27 2022, 8:12 PM
0x59616e marked 3 inline comments as done.

Address review feedback.

Most of the changes are minor; no major change to the core algorithm.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17589–17591

I used a switch to implement this logic, at line 17892 in the latest diff.

17630–17633

My algorithm requires that the two truncs operate on the same type; if not, it won't work correctly. Take this for example:

x0 x1 x2 x3 x4 x5 x6 x7    y0 y1 y2 y3 y4 y5 y6 y7

Assume we trunc the left one from v2i32 to v2i16 and the right one from v4i16 to v4i8. The left trunc keeps bytes x0 x1 x4 x5, while the right one keeps bytes y0 y2 y4 y6.

These two selections are asymmetric; we cannot reproduce them with a single uzp1, since it performs the same action on both operands, i.e. it can only produce this:

x0 x1 x4 x5    y0 y1 y4 y5

or this:

x0 x2 x4 x6    y0 y2 y4 y6

but not both at the same time.

I bail out if the types of UzpOp0 and UzpOp1 differ.

0x59616e added inline comments. Sep 27 2022, 8:13 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17589–17591

Correction: 17896.

0x59616e added a comment. Edited Sep 27 2022, 8:16 PM

Just out of curiosity: this optimization involves a lot of bitcasts. Does the benefit of fewer xtn instructions outweigh the copious bitcast instructions, i.e. rev(16|32|64) and ext?

If not, maybe we should implement this only on little endian?

Thanks for the work. This LGTM overall. However, I don't consider myself a qualified reviewer (as my activity history shows), but I'm willing to share and help by reviewing :-)

Following https://llvm.org/docs/CodeReview.html#lgtm-how-a-patch-is-accepted, I think we could wait a little bit more for feedback (or approval :)) from other reviewers.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17585–17586

nit pick about style:

It's more idiomatic to bail out early, see https://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code
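
For illustration only, the early-exit shape that coding standard recommends (the condition here is hypothetical):

// Instead of nesting the main logic inside an if, invert the condition and bail out.
if (!IsProfitableToCombine)
  return SDValue();
// ... main combine logic continues at the top indentation level ...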

17618–17623

nittiest nit:

I'd probably move these before the auto HalfElementSize = ... lambda, and move HalfElementSize closer to where it is used.

17625–17629

nit pick:

This effectively requires that Uzp1's Op0 and Op1 have the same result type for correctness.

To avoid nested ifs and reduce indentation, probably move this out of if (HalfElementSize(SourceOp0, UzpOp0) && HalfElementSize(SourceOp1, UzpOp1)), something like:

if (SourceOp0.getSimpleValueType() != SourceOp1.getSimpleValueType())
   return SDValue();

if (HalfElementSize(Op0..) && HalfElementSize(Op1..)) {
  assert(UzpOp0.getValueType() == UzpOp1.getValueType());
  ...
}
17630–17633

Thanks for the example. The update to require the same operand type sounds reasonable.

17638

I wonder whether combining this switch with the switch inside HalfElementSize above would make the code more readable and succinct, given that BitcastResultTy and Uzp1ResultTy are correlated (i.e. BitcastResultTy == TruncOperandTy, and TruncOperandTy == Uzp1ResultTy bitcast to double the element size).

17648–17649

nit pick: this default should be llvm_unreachable based on the context (since the HalfElementSize switch rules out invalid cases).

(See the other comment.) I wonder if merging the two switches makes the code shorter without harming readability.

I myself haven't thought deeply about how to fix this particular issue on big-endian, but I wanted to understand the mapping among {LLVM IR, register lanes, memory layout}, hence the paragraph above. That's partly why I suggested fixing little-endian first for simplicity :-)

The idea of combining the two switches sounds good if we can. It is a bit hard to tell whether it would always be valid for all truncates through a bitcast. We should be protected against most of that, because I don't think those cases can currently come up. Unfortunately, that can make testing it difficult too.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17580

VT can be replaced by ResVT.

17585

We only generate AArch64ISD::UZP1 from certain types, which will always be simple. I believe that because we only generate them post-lowering, we can currently rely on the truncates being legal types too, but that might not always hold if things change.

17620

DestVT == VT == ResVT

0x59616e updated this revision to Diff 464462. Sep 30 2022, 8:11 PM
0x59616e marked 7 inline comments as done.

Address review feedback. No major change in the core algorithm.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17638

Combining the two switches to make the code terser is possible, but making it more readable is beyond my ability, since the two switches have different responsibilities.

That is, the first one is responsible for halving the element size of a 128-bit vector, yielding another 128-bit vector:

v2i64 => v4i32
v4i32 => v8i16
v8i16 => v16i8

The second one, on the other hand, is responsible for doubling the element size of a 64-bit vector, yielding a 128-bit vector:

v2i32 => v2i64
v4i16 => v4i32
v8i8 => v8i16

Putting two different responsibilities on a single switch may turn it into an enigma, so I prefer to maintain the status quo.
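
For reference, the two mappings written out as hypothetical helpers (a sketch of what is described above, not the patch's actual code):

// Halve the element size of a 128-bit vector (128-bit -> 128-bit).
static MVT halveElementSize(MVT VT) {
  switch (VT.SimpleTy) {
  case MVT::v2i64: return MVT::v4i32;
  case MVT::v4i32: return MVT::v8i16;
  case MVT::v8i16: return MVT::v16i8;
  default: llvm_unreachable("unexpected 128-bit vector type");
  }
}

// Double the element size of a 64-bit vector (64-bit -> 128-bit).
static MVT doubleElementSize(MVT VT) {
  switch (VT.SimpleTy) {
  case MVT::v2i32: return MVT::v2i64;
  case MVT::v4i16: return MVT::v4i32;
  case MVT::v8i8:  return MVT::v8i16;
  default: llvm_unreachable("unexpected 64-bit vector type");
  }
}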

dmgreen added inline comments. Oct 9 2022, 3:11 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17611–17612

This doesn't need to check for Simple, given it is checking for specific MVT's.

17614

If we check for specific MVT's below, we are already checking for 64BitVector.

17617

Same again.

17620

This can just check the ResVT directly

if (ResVT != MVT::v2i32 && ResVT != MVT::v4i16 && ResVT != MVT::v8i8)
17638

That makes sense, if the two types are independent and we handle all combinations through the bitcast. If we know that SourceOp0 == SourceOp1, though, we needn't call HalfElementSize twice; just calculate the ResultTy once and bitcast both results to the same type.

It should check that the truncate input VT is simple if it is used in a switch, though, just to be safe.

17668

This looks like it should be UzpOp0.

llvm/test/CodeGen/AArch64/aarch64-uzp1-combine.ll
5 ↗(On Diff #464462)

This comment can now be updated.

Oh, also make sure you update all the tests that need it.

0x59616e updated this revision to Diff 466441. Oct 10 2022, 12:13 AM
0x59616e marked 9 inline comments as done.

Address friendly feedback and update all the affected tests

dmgreen accepted this revision. Oct 11 2022, 1:29 AM

Thanks. With a couple more nitpicks, this LGTM.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17627–17634

These can be combined into a single if.

17654

In llvm it is customary to add messages to asserts. In this case it is likely correct by construction, as we have just created the two ops with the same type above.
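
For illustration, the messaged form would look something like this (operand names taken from the patch context):

assert(UzpOp0.getValueType() == UzpOp1.getValueType() &&
       "UZP1 operands are expected to have the same value type");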

17671

should -> Should

This revision is now accepted and ready to land. Oct 11 2022, 1:29 AM

Hi, I realize D133280 is a diffbase from the 'stack' UI, and I actually never figured out how to send stacked reviews on top of others' patches, so I don't know whether a) or b) is the simpler procedure:
a) I submit D133280 first, or
b) you just commit all the changes reviewed in this patch.

Since this patch supersedes the test-only patch, I'm fine with a) or b), whichever makes the procedure simpler. Just let me know :)

0x59616e updated this revision to Diff 466984. Oct 11 2022, 5:59 PM
0x59616e marked 3 inline comments as done.

Address friendly feedback

0x59616e added inline comments. Oct 11 2022, 6:01 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17654

I removed it since, as you indicated, it is likely correct by construction.

Even if not, it would be a bug somewhere else.

I think it is best to do the same reviewing procedure as this on D133280, and commit it first.

Thanks for all of your kind help ;)

Time to take off to upstream.