This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add custom lowering for v4i8 trunc store
ClosedPublic

Authored by zatrazz on Jun 19 2018, 12:38 PM.

Details

Summary

This patch adds a custom truncating-store lowering for the v4i8 vector type.
Since there is no v.4b register, v4i8 is promoted to v4i16 (v.4h), and the
default action for a v4i8 store is to extract each element and issue four
byte stores.

A better strategy is to extend the promoted v4i16 to v8i16 (with undef
elements) and extract and store the word lane that represents the v4i8
subvector. With this, the function:

define void @foo(<4 x i16> %x, i8* nocapture %p) {
  %0 = trunc <4 x i16> %x to <4 x i8>
  %1 = bitcast i8* %p to <4 x i8>*
  store <4 x i8> %0, <4 x i8>* %1, align 4
  ret void
}

can be optimized from:

umov    w8, v0.h[3]
umov    w9, v0.h[2]
umov    w10, v0.h[1]
umov    w11, v0.h[0]
strb    w8, [x0, #3]
strb    w9, [x0, #2]
strb    w10, [x0, #1]
strb    w11, [x0]
ret

to:

xtn     v0.8b, v0.8h
str     s0, [x0]
ret
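
In SelectionDAG terms, the lowering amounts to the sequence below. This is a
minimal sketch of the strategy described above, written in the style of
AArch64ISelLowering.cpp; the function name and exact node sequence are
assumptions, not the committed code:

// Sketch: lower a v4i16 -> v4i8 truncating store by widening to v8i16,
// truncating to v8i8 (xtn), and storing the low 32-bit lane (str s0).
static SDValue lowerV4I8TruncStore(StoreSDNode *ST, const SDLoc &DL,
                                   SelectionDAG &DAG) {
  SDValue Value = ST->getValue(); // the promoted v4i16 operand

  // Widen v4i16 to v8i16, filling the upper lanes with undef.
  SDValue Undef = DAG.getUNDEF(MVT::i16);
  SDValue UndefVec =
      DAG.getBuildVector(MVT::v4i16, DL, {Undef, Undef, Undef, Undef});
  SDValue Wide =
      DAG.getNode(ISD::CONCAT_VECTORS, DL, MVT::v8i16, Value, UndefVec);

  // Truncate to v8i8 (xtn); the low four lanes hold the v4i8 payload.
  SDValue Trunc = DAG.getNode(ISD::TRUNCATE, DL, MVT::v8i8, Wide);

  // Reinterpret as v2i32 and store lane 0 as a single 32-bit store.
  SDValue Cast = DAG.getNode(ISD::BITCAST, DL, MVT::v2i32, Trunc);
  SDValue Lane0 = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i32, Cast,
                              DAG.getConstant(0, DL, MVT::i64));
  return DAG.getStore(ST->getChain(), DL, Lane0, ST->getBasePtr(),
                      ST->getMemOperand());
}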

The patch also adjusts the memory operation cost for autovectorization, so
that the C code:

void foo (const int *src, int width, unsigned char *dst)
{
  for (int i = 0; i < width; i++)
     *dst++ = *src++;
}

can be vectorized to:

.LBB0_4:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
      ldr     q0, [x0], #16
      subs    x12, x12, #4            // =4
      xtn     v0.4h, v0.4s
      xtn     v0.8b, v0.8h
      st1     { v0.s }[0], [x2], #4
      b.ne    .LBB0_4

instead of scalar byte operations.
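
For context, routing these stores to the custom lowering is a one-line hook
in the AArch64TargetLowering constructor; this is a sketch of the assumed
registration, not the exact committed diff:

// Mark v4i16 -> v4i8 truncating stores as Custom so the custom store
// entry point sees them instead of the default expansion into
// per-element byte stores.
setTruncStoreAction(MVT::v4i16, MVT::v4i8, Custom);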

Event Timeline

zatrazz created this revision.Jun 19 2018, 12:38 PM

I wonder if we should prefer to widen <2 x i8> and <4 x i8> to <8 x i8> instead of promoting to <4 x i16>. It would make stores like this a bit cheaper. Maybe an interesting experiment at some point (mostly just modifying AArch64TargetLowering::getPreferredVectorAction, I think, and seeing what happens to the generated code).
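
(For reference, the experiment would amount to overriding the hook roughly as
in this sketch; the shape and per-type choices are assumptions, not committed
code:

// Sketch: prefer widening small i8 vectors to v8i8 instead of
// promoting their elements to i16.
TargetLoweringBase::LegalizeTypeAction
AArch64TargetLowering::getPreferredVectorAction(EVT VT) const {
  if (VT == MVT::v2i8 || VT == MVT::v4i8)
    return TypeWidenVector; // widen to v8i8
  return TargetLoweringBase::getPreferredVectorAction(VT);
}
)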

Do we need similar handling to this patch for <2 x i16> or <2 x i8>?

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
636–637

It looks like we still scalarize extloads?

test/CodeGen/AArch64/neon-truncStore-extLoad.ll
27

Why does this CHECK line have two possible lowerings?

rengolin added inline comments.Jun 19 2018, 1:33 PM
lib/Target/AArch64/AArch64ISelLowering.cpp
2711

LowerSTORE is too generic a name for this specific function, but I get that it's the pattern in the custom lowering.

You can keep the generic name, as this will be the entry point for *all* custom store lowering, even if it only implements one type right now. But I'd add a longer comment explaining that, for now, this only lowers truncating vector stores, but that it would be the place to add *any* custom store lowering (vector or not, truncating or not).

2714

Assert that StoreNode is not null.

2715

Declare AS close to its usage to make it clear it's not leftover dead code.

test/CodeGen/AArch64/neon-truncStore-extLoad.ll
27

Looks like it's a copy of the pattern around it... weird...

What's the instruction actually generated?

I wonder if we should prefer to widen <2 x i8> and <4 x i8> to <8 x i8> instead of promoting to <4 x i16>. It would make stores like this a bit cheaper. Maybe an interesting experiment at some point (mostly just modifying AArch64TargetLowering::getPreferredVectorAction, I think, and seeing what happens to the generated code).

I tried your suggestion, but without further tuning of the vector lowering it does not yield much gain for vector store operations. The operation:

%0 = trunc <4 x i32> %a to <4 x i8>
store <4 x i8> %0, <4 x i8>* %p, align 4

is scalarized because LowerBUILD_VECTOR can't find a good pattern to use for it:

Custom lowering: t49: v8i8 = BUILD_VECTOR t37, t40, t43, t46, undef:i32, undef:i32, undef:i32, undef:i32
AArch64TargetLowering::ReconstructShuffle
Reshuffle failed: span too large for a VEXT to cope
LowerBUILD_VECTOR: alternatives failed, creating sequence of INSERT_VECTOR_ELT

Maybe if we handled v4i8 as v4i32 we could get better code generation, but it would also require more tuning in generic code. I do see better code generation for the v2i32-to-v2i8 truncating store, but I am not convinced that this vector type should be tuned.

Do we need similar handling to this patch for <2 x i16> or <2 x i8>?

The truncating stores from v2i16 to v2i8 and from v4i32 to v4i8 can indeed be optimized, but I think that is orthogonal to this optimization.

zatrazz added inline comments.Jun 20 2018, 11:52 AM
lib/Target/AArch64/AArch64ISelLowering.cpp
2711

Ack, I will add a comment.

2714

Ack.

2715

Ack.

test/CodeGen/AArch64/neon-truncStore-extLoad.ll
27

Indeed, I used the previous function (truncStore.v4i32) as a template, and with the current patch the instruction generated is 'str'.

I tried your suggestion, but without further tuning in vector lowering this does not yield much gain on a vector store operation.

Yes, it's not really an alternative to this patch, just something to explore if you care about the lowering of <4 x i8> in general.

zatrazz updated this revision to Diff 152134.Jun 20 2018, 12:50 PM

Updated patch based on previous comments.

rengolin added inline comments.Jun 20 2018, 1:11 PM
lib/Target/AArch64/AArch64ISelLowering.cpp
2717

Last nit: we usually add a message at the end; in this case, something saying these are not the droids we're looking for... e.g.:

assert(StoreNode && "Can only custom lower store nodes");

assert(VT.isVector() && "Can only custom lower vector store types");

zatrazz updated this revision to Diff 152266.Jun 21 2018, 5:49 AM

Updated patch based on the previous comment.

Thanks! I'm happy if Eli is happy. :)

efriedma added inline comments.Jun 21 2018, 11:27 AM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
636–637

I'm still not sure this change belongs in this patch, given that we still scalarize <4 x i8> loads.

test/CodeGen/AArch64/neon-truncStore-extLoad.ll
27

Please get rid of the "|" in all the patterns in this file.

zatrazz added inline comments.Jun 21 2018, 2:01 PM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
636–637

I will add this modification in another patch, then.

test/CodeGen/AArch64/neon-truncStore-extLoad.ll
27

Right (for some reason I thought I had already fixed it).

zatrazz updated this revision to Diff 152458.Jun 22 2018, 5:25 AM

Updated patch based on previous comments. Indeed, changing AArch64TTIImpl::getMemoryOpCost for both loads and stores was wrong, since <4 x i8> loads are still scalarized. I have changed it to adjust the cost for stores only.
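
(A store-only adjustment in AArch64TTIImpl::getMemoryOpCost could look roughly
like this fragment; the guard and the returned cost are assumptions, not the
committed change:

// Sketch: charge a small fixed cost for a v4i8 store, which now lowers
// to xtn + str instead of four element extracts plus four strb stores,
// returning early before the generic cost computation.
if (Opcode == Instruction::Store && Src->isVectorTy() &&
    Src->getVectorNumElements() == 4 && Src->getScalarSizeInBits() == 8)
  return 2; // xtn + str
)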

I also think widening <2 x i8> and <4 x i8> to <8 x i8> instead of promoting to <4 x i16> is a more comprehensive approach, and I will check what kind of adjustments would be required to make it profitable in all cases.

efriedma added inline comments.Jun 22 2018, 11:12 AM
test/CodeGen/AArch64/neon-truncStore-extLoad.ll
27

Still not fixed...?

zatrazz updated this revision to Diff 152528.Jun 22 2018, 11:43 AM

Updated patch based on previous comments.

This revision is now accepted and ready to land.Jun 26 2018, 11:04 AM
zatrazz closed this revision.Jun 27 2018, 7:04 AM