This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] fix incorrect vectorization of abs() on POWER9
ClosedPublic

Authored by inouehrs on Apr 11 2018, 8:33 AM.

Details

Summary

Vectorized loops containing abs() return incorrect results on POWER9. This patch fixes that.
For example, the following code returns a negative result when the input values are negative, even though it sums the absolute values of the inputs. This problem causes test failures for libvpx.

#include <stdint.h>
#include <stdlib.h>

int vpx_satd_c(const int16_t *coeff, int length) {
  int satd = 0;
  for (int i = 0; i < length; ++i) satd += abs(coeff[i]);
  return satd;
}

For vector absolute value and vector absolute difference on POWER9, LLVM generates the VABSDUW (Vector Absolute Difference Unsigned Word) instruction or its variants.
Since these instructions operate on unsigned integers, an adjustment is needed for signed integers.
For abs(sub(a, b)), we generate VABSDUW(a+0x80000000, b+0x80000000); otherwise, abs(sub(-1, 0)) would return 0xFFFFFFFF (= -1) instead of 1. For abs(a), we generate VABSDUW(a+0x80000000, 0x80000000).

Diff Detail

Event Timeline

inouehrs created this revision. Apr 11 2018, 8:33 AM
inouehrs updated this revision to Diff 142023. Apr 11 2018, 8:52 AM
  • fix for halfword and byte cases

I imagine the constant materialization and splatting direct moves would not cost anything beyond what I suggested, in an amortized sense (i.e. if LICM hoists them out of the loop). However, I think it's still useful to produce code with a lower path length.

Also, I can't really think of a better way to handle the halfword version.
Finally, for consistency, I would just use vxor instead of the various add opcodes. I believe that is semantically equivalent to the modulo adds (correct me if I'm wrong here).

lib/Target/PowerPC/PPCISelDAGToDAG.cpp
4811

Shouldn't this check ZERO_EXTEND_VECTOR_INREG as well? Or is that a node we can't have this late?

4816

It seems that for the v4i32 type, we should be able to just use xvnegsp rather than loading the immediate, moving and adding.

4834

We should just be able to do something like:

xxspltib 35, 128 # Mask
vxor 0, 3, 2    # Flip sign
vabsduh ...      # The actual absdiff
inouehrs added inline comments. Apr 18 2018, 12:39 AM
lib/Target/PowerPC/PPCISelDAGToDAG.cpp
4811

In my understanding, ZERO_EXTEND_VECTOR_INREG is created during the legalization phase, while this code runs during initial selection. So I think this code will not encounter a ZERO_EXTEND_VECTOR_INREG node here.

4816

Do you know whether it is safe to use a floating-point instruction on integer data if the bit pattern happens to be a NaN or Inf?

4834

Good catch. I will update the patch to use xxspltib.
The VSX splat immediate supports an 8-bit immediate, while the older VMX splat immediate supports only 5 bits.

nemanjai added inline comments. Apr 18 2018, 3:40 AM
lib/Target/PowerPC/PPCISelDAGToDAG.cpp
4816

I think it's OK according to the ISA since it doesn't modify any special registers or do anything special for NaN/Inf. The description just says that it copies the contents with the high bit of each word element complemented, so I think this is just a bitwise operation rather than a vector fp operation.

inouehrs updated this revision to Diff 142965. Apr 18 2018, 10:52 AM

addressed comments from @nemanjai

inouehrs added inline comments. Apr 18 2018, 10:53 AM
lib/Target/PowerPC/PPCISelDAGToDAG.cpp
4816

As far as I tested, it works at least on POWER9.

Sorry, I find the code kind of difficult to follow now. This is exacerbated by the fact that we end up creating nodes with AddOpcode that produce unary operations, which is very counterintuitive. I think it would be much easier to follow if you made it flow more naturally. Perhaps something along the lines of:

MachineSDNode *flipSign(...)
  if (Type == v4i32)
    <produce and return XVNEGSP>
  else if (Type == v8i16)
    <produce the { 0x8000, 0x8000, ... } vector> // The implicit CSE in the DAG will ensure we don't get multiple nodes
  else if (Type == v16i8)
    <produce the { 0x80, 0x80, ... } vector>     // The implicit CSE in the DAG will ensure we don't get multiple nodes
  if (InputOp == Zero)
    <return vector from above>

  <produce and return the add/xor>
...
if (SkipAdjust)
  <just produce VABSDU[BHW], replace, return>
if (Opcode == SUB)
  Op1 = flipSign(Operand1)
  Op2 = flipSign(Operand2)
else
  Op1 = flipSign(Operand1)
  Op2 = flipSign(Zero)
<produce VABSDU[BHW] with Op1/Op2, replace, return>

I think that if it's structured in a similar way, it is much easier to follow the code and see exactly what is going on.

inouehrs updated this revision to Diff 143252. Apr 20 2018, 1:39 AM
  • refactoring based on the suggestion from @nemanjai
nemanjai accepted this revision. Apr 20 2018, 6:01 AM

I think this flows much more cleanly now. And if you reorder it to not create the bool temporary and return early, I think it'll be even better. Feel free to do that on the commit, I don't think this needs another review cycle.

BTW, using V_SET0 or XXLXORz instead of XXSPLTIB to produce a zero vector might be a better choice, since it has a one-cycle lower maximum latency (though in practice this shouldn't be noticeable).

lib/Target/PowerPC/PPCISelDAGToDAG.cpp
336

This is a member function and you're passing CurDAG which is a data member of the same class, right? If so, please remove the argument. Also, the SDLoc is easy enough to get from N.

3981

Since you're setting the output parameter rather than returning it...
s/returns/sets it to

3996

I suppose we could do this with:

vspltish 5, 1
vspltish 6, 15
vslh 5, 5, 6

since the two splats can be issued in parallel. But this might very well be worse, since it uses an extra vector register and is subject to the dispatch rules for vector operations.
Up to you, of course.

4808

Can you just move this condition below where you set the opcodes and then we don't need SkipAdjust (since we now use it in only one place)?

4822

Can you please return here so the remaining code does not need to be nested into an else?

This revision is now accepted and ready to land. Apr 20 2018, 6:01 AM
This revision was automatically updated to reflect the committed changes.