This is an archive of the discontinued LLVM Phabricator instance.

[PATCH][ARM] Fix issue with UMLAL (unsigned multiple accumulate long) lowering
AbandonedPublic

Authored by jyoti.allur on May 20 2015, 8:15 AM.

Details

Summary

[ARM] This patch addresses following issue.

unsigned long long
foo (unsigned long long a, unsigned char *b, unsigned short *c)
{
  return a + *b * *c;
}

Should compile to use UMLAL ((unsigned multiple accumulate long) which multiplies two unsigned 32-bit values to produce a 64-bit value, and accumulates this with a 64-bit value.

foo:
        ldrb    r2, [r2]
        ldrh    r3, [r3]
        mul     r2, r3, r2
        adds    r0, r2, r0
        adc     r1, r1, #0
        bx      lr

The above is reduced to following with this patch:

foo: 
        ldrb    r2, [r2]
        ldrh    r3, [r3]
        umlal   r0, r1, r3, r2
        bx      lr

Diff Detail

Repository
rL LLVM

Event Timeline

jyoti.allur retitled this revision from to [PATCH][ARM] Fix issue with UMLAL (unsigned multiple accumulate long) lowering.
jyoti.allur updated this object.
jyoti.allur edited the test plan for this revision. (Show Details)
jyoti.allur added reviewers: jmolloy, t.p.northover.
jyoti.allur set the repository for this revision to rL LLVM.
jyoti.allur added a subscriber: Unknown Object (MLST).
jyoti.allur abandoned this revision.May 20 2015, 8:16 AM

creating another revision . something went wrong here.

MatzeB added a subscriber: MatzeB.May 20 2015, 10:58 AM
  • Please upload your patches with context, see http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface
  • First: I think this patch is incorrect: I haven't tried it but reading the code it looks like it does not check that the UMUL/SMUL does not overflow, in which case using an UMLAL/SMLAL changes the semantics. It is not obvious to me how the code prohibits that, it looks like it accepts any Constant, SRA, CopyFromReg as input to the UMUL/SMUL. But as a simple example (UMUL 0xffffffff, 0xffffffff) - let's assume it is not constant folded - is accepted here because it has two constant inputs but does overflow and cannot be combined with an addc/adde to an UMLAL.
  • A test for SMLAL would be nice.
../../llvm/lib/Target/ARM/ARMISelLowering.cpp
7946–7951

These changes are unrelated and should be left out of this patch.

7992

unrelated change.

8013

This should be set (and is) on every path before use, leaving out the unnecessary initialization will get warnings to people that might refactor this and miss a case.

8014–8017

Move all these down towards their definitions some of them can even be merged with their definitions because they only have one. This esp. avoids the strange shadowing of the 2nd SDValue MULOp in the then case.

8029–8031

Remove declaration of IsLeftOperandCopyFromReg and do:
bool IsLeftOperandCopyFromReg = (AddeOp0->getOpcode() == ISD::CopyFromReg);
You could also inline into the if directly, but as that variable name has documentation value I'm fine with not doing that as well.

8050–8051

This could be moved into the then part of the preceding if.

8061–8064

Maybe don't bother introducing the IsLeftOperandMul variable and simply Set HiAdd = &AddeOp1/&AddeOp0 above instead of IsLeftOperandMul = true/false; (although I realize that was already strange in the previous code).

8092

unrelated change. More unrelated changed following.

Hi Matthias,
I had abandoned this change due to out of context changes shown after applying clang-format and some other issue.
To clarify, Constant, SRA, CopyFromReg are possible input nodes to ADDE node.
I was trying to cover cases where ISD::MUL, ISD::ADDC, ISD::ADDE can be combined to a S/UMLAL where in ISD::MUL operands are less 8-bit or 16-bits operands. In these cases result can stored in a in single 32-bit register. Hence in such cases, none of ADDE 's operand will store result from ISD:MUL. Instead, one of the ADDE's operand will hold 32:63 bits of 64-bit accumulate value, while the 0:31 bits of the accumulate value is stored in one of the operands of ADDC.

for example: in the assembly below,
R1 - ACC [ 32:63 ]
R0 - ACC [ 0:31 ]

foo:
        ldrb    r2, [r2]
        ldrh    r3, [r3]
        mul     r2, r3, r2
        adds    r0, r2, r0
        adc     r1, r1, #0
        bx      lr

I agree, overflow cases were not handled yet. Could you suggest how to detect overflow ?

../../llvm/lib/Target/ARM/ARMISelLowering.cpp
8013

ok.

8014–8017

ok

8029–8031

ok

8061–8064

IsLeftOperandMul was probably added for code readability and understanding pruposes.

Hi Matthias,
To clarify, Constant, SRA, CopyFromReg are possible input nodes to ADDE node.

Hmm, why would you need to restrict the possible nodes here? My understanding is that UMLAL does a full 64bit addition, so I don't see why you would need to restrict those.

I was trying to cover cases where ISD::MUL, ISD::ADDC, ISD::ADDE can be combined to a S/UMLAL where in ISD::MUL operands are less 8-bit or 16-bits operands. In these cases result can stored in a in single 32-bit register. Hence in such cases, none of ADDE 's operand will store result from ISD:MUL. Instead, one of the ADDE's operand will hold 32:63 bits of 64-bit accumulate value, while the 0:31 bits of the accumulate value is stored in one of the operands of ADDC.

for example: in the assembly below,
R1 - ACC [ 32:63 ]
R0 - ACC [ 0:31 ]

foo:
        ldrb    r2, [r2]
        ldrh    r3, [r3]
        mul     r2, r3, r2
        adds    r0, r2, r0
        adc     r1, r1, #0
        bx      lr

I agree, overflow cases were not handled yet. Could you suggest how to detect overflow ?

Well you need to match patterns that are obviously fine and reject everything else.

Good things would probably be querying for the NoUnsignedWrap flag, or using SelectionDAG::computeKnownBits and test if both MUL operands have their upper 16bit known to be zero. One can probably think of more patterns so I'd factor it out into a function so people can add more patterns later.

Hi Matthias,
To clarify, Constant, SRA, CopyFromReg are possible input nodes to ADDE node.

Hmm, why would you need to restrict the possible nodes here? My understanding is that UMLAL does a full 64bit addition, so I don't see why you would need to restrict those.

the cases that i was trying to cover (added newly in the testcase longMAC.ll here) contained one of the operands to ADDE from either of those nodes, hence it was added, but i now see that its not necessary to restrict to those nodes only.

I was trying to cover cases where ISD::MUL, ISD::ADDC, ISD::ADDE can be combined to a S/UMLAL where in ISD::MUL operands are less 8-bit or 16-bits operands. In these cases result can stored in a in single 32-bit register. Hence in such cases, none of ADDE 's operand will store result from ISD:MUL. Instead, one of the ADDE's operand will hold 32:63 bits of 64-bit accumulate value, while the 0:31 bits of the accumulate value is stored in one of the operands of ADDC.

for example: in the assembly below,
R1 - ACC [ 32:63 ]
R0 - ACC [ 0:31 ]

foo:
        ldrb    r2, [r2]
        ldrh    r3, [r3]
        mul     r2, r3, r2
        adds    r0, r2, r0
        adc     r1, r1, #0
        bx      lr

I agree, overflow cases were not handled yet. Could you suggest how to detect overflow ?

Well you need to match patterns that are obviously fine and reject everything else.

Good things would probably be querying for the NoUnsignedWrap flag, or using SelectionDAG::computeKnownBits and test if both MUL operands have their upper 16bit known to be zero. One can probably think of more patterns so I'd factor it out into a function so people can add more patterns later.

mul r2, r1, r3
SelectionDAG::computeKnownBits does not return the bit info for this instruction ISD::MUL if the mul operands are of opcode ISD::CopyFromReg This is not handled in computeKnownBits. I tried to do that but, RegisterSDNode class only gives register number, not the value contained in the register. How could one access the value contained in a RegisterSDNode, so as to check that MUL operands have their upper 16bit known to be zero ?

I also tried,
MULOp.getOperand(0).getValueSizeInBits() > 16 = overflow
but MULOp.getOperand(0).getValueSizeInBits() always returns 32 irrespective of the value stored in MUL operands.

This is my code to check overflow

MULOp = (AddcOp0->getOpcode() == ISD::MUL) ? AddcOp0 : AddcOp1;
APInt KnownZero, KnownOne;
    APInt KnownZero1, KnownOne1;
    DCI.DAG.computeKnownBits(MULOp.getOperand(0), KnownZero, KnownOne);
    DCI.DAG.computeKnownBits(MULOp.getOperand(1), KnownZero1, KnownOne1);

  APInt Mask(APInt::getHighBitsSet(VT.getSizeInBits(), 16));
  if ((KnownZero & Mask).getBoolValue() && (KnownZero1 & Mask).getBoolValue())
      return SDValue();

This should do the trick, but KnownZero is always zero even when MUL operands contain 0xFFFFFFFF, 0XFFFFFFFF for reasons sited above.
I would highly appreciate your inputs on this.

mul r2, r1, r3
SelectionDAG::computeKnownBits does not return the bit info for this instruction ISD::MUL if the mul operands are of opcode ISD::CopyFromReg This is not handled in computeKnownBits. I tried to do that but, RegisterSDNode class only gives register number, not the value contained in the register. How could one access the value contained in a RegisterSDNode, so as to check that MUL operands have their upper 16bit known to be zero ?

It's possible that the computeKnownBits doesn't catch all cases yet - or maybe CopyFromReg can't guarantee that in general, not sure.

I also tried,
MULOp.getOperand(0).getValueSizeInBits() > 16 = overflow
but MULOp.getOperand(0).getValueSizeInBits() always returns 32 irrespective of the value stored in MUL operands.

This is my code to check overflow

MULOp = (AddcOp0->getOpcode() == ISD::MUL) ? AddcOp0 : AddcOp1;
APInt KnownZero, KnownOne;
    APInt KnownZero1, KnownOne1;
    DCI.DAG.computeKnownBits(MULOp.getOperand(0), KnownZero, KnownOne);
    DCI.DAG.computeKnownBits(MULOp.getOperand(1), KnownZero1, KnownOne1);

  APInt Mask(APInt::getHighBitsSet(VT.getSizeInBits(), 16));
  if ((KnownZero & Mask).getBoolValue() && (KnownZero1 & Mask).getBoolValue())
      return SDValue();

This should do the trick, but KnownZero is always zero even when MUL operands contain 0xFFFFFFFF, 0XFFFFFFFF for reasons sited above.
I would highly appreciate your inputs on this.

This needs to have || in the if condition, but apart from that looks okay to me.

I tried your original example again and ToT clang gets met this:

%0 = load i8, i8* %b, align 1, !tbaa !2
%conv = zext i8 %0 to i64
%1 = load i16, i16* %c, align 2, !tbaa !5
%conv1 = zext i16 %1 to i64
%mul = mul nuw nsw i64 %conv1, %conv
%add = add i64 %mul, %a
ret i64 %add

So the mul operation already has an NUW flag that should be preserved, you should be able to check it by casting to OverflowingBinaryOperator and checking hasNoUnsignedWrap(). Thinking about it some more that is probably the best solution anyway as the IR optimizations already do computeKnownBits tricks to determine that flag, so it's probably best to just check for that flag and not do anything more fancy.

This comment was removed by jyoti.allur.
MatzeB added a comment.Jun 3 2015, 9:26 AM

Signed and unsigned multipliciation produce the same result for the lower 32 bits. The results are only different for the upper 32 bit. So I think you can just always choose UMLAL (or always SMLAL) as your transformation only accepts cases where the multiplication doesn't produce any high bits anyway.

  • Matthias

Hi Matthias,
Thanks a lot for your reply.
What about in this case which I was trying to handle in addition to those mentioned previously in review.

long long
foo (long long a, int b, int c)
{

return a + b * c;

}
This is a signed multiplication resulting in a 64 bit value isn't ? How do we choose SMLAL or UMLAL
I was under the impression that if either of the mul operands is a signed valued and added to a long value then select SMLAL, if both operands of mul are unsigned then select UMLAL.
Is the UMLAL/SMLAL selected based on the result value after adding the long value ?

Regards,
Jyoti Allur

  • Original Message -------

Sender : Matthias Braun<matze@braunis.de>
Date : Jun 03, 2015 21:55 (GMT+05:30)
Title : Re: [PATCH] [PATCH][ARM] Fix issue with UMLAL (unsigned multiple
accumulate long) lowering

Signed and unsigned multipliciation produce the same result for the lower 32 bits. The results are only different for the upper 32 bit. So I think you can just always choose UMLAL (or always SMLAL) as your transformation only accepts cases where the multiplication doesn't produce any high bits anyway.

  • Matthias