This is an archive of the discontinued LLVM Phabricator instance.

Improved udivmodsi4 with support for ARMv4
ClosedPublic

Authored by joerg on Jan 22 2014, 1:53 PM.

Details

Summary

The attached patch is a slightly optimized version of NetBSD's unsigned divide. License is cleared with Matt Thomas. Unlike the existing code, it has correct compile time branches for clz and hwdiv support. The test-and-subtract loop is fully unrolled to make full use of conditional execution without explicit branching. The other functions would follow after review for this code.

Diff Detail

Event Timeline

Hi Joerg,

Apart form my question below, everything looks fine. How have you tested it? Have you benchmarked it against the previous implementation? I do believe you that it's faster, just wanted to know how much gain we get from it.

cheers,
--renato

udivmodsi4.S
62 ↗(On Diff #6582)

I'd have thought that the code would only get here if r0 >= r1, because of the BCC above.

If r1 > r0, this is a case for quotient0, no?

How have you tested it? Have you benchmarked it against the previous implementation?

Cortex-A9 is probably the most important core without hardware
division. Well, M-class ones too, but I imagine those developers
wouldn't want this completely unrolled implementation in the first
place, with code size being so important.

Tim.

joerg added a comment.Jan 23 2014, 4:43 AM

Tim:

Just calling them in a loop with incrementing numerator and constant denumerator, I get the following timing on a ARM1176JZ-S (2nd generation rpi).

denumeratorold codenew code (-march=armv6)new code (-march=armv4)
655349.433.534.06
169.583.524.06
1288.173.213.76
udivmodsi4.S
62 ↗(On Diff #6582)

quotient0 is used for r0 < r1, so at this point r1 >= r0. This means r3 >= ip and therefore r3-ip >= 0.

joerg added inline comments.Jan 23 2014, 4:57 AM
udivmodsi4.S
62 ↗(On Diff #6582)

Not enough tea. quotient0 is used for r0 < r1, so at this point r1 <= r0. This means clz(r1) >= clz(r0).

Joerg,

I'll test your patch locally and will let you know. I'm happy with the (expected) v4 and v6 performance gains, we shouldn't get much on A15 but good gain on A9.

Tim,

This implementation is ARM only, so M-class wouldn't use this. I think the only worry in code-size would be for v4 users, since nowadays, ARM11 systems have 256MB or more.

cheers,
--renato

I'll test your patch locally and will let you know. I'm happy with the (expected) v4 and v6 performance gains,

Yeah, those are some very nice benefits.

we shouldn't get much on A15 but good gain on A9.

I'd hope nothing on A15, it has hardware divide (as part of
thevirtualisation extensions!) doesn't it?

Cheers.

Tim.

So, I've been fighting this patch for a while, and here are some observations.

  1. __ARM_ARCH is not defined before GCC 4.8, so we can't rely on it. You have to do something like:

    #if defined(ARM_ARCH_2) || defined(ARM_ARCH_3) || defined(ARM_ARCH_4) || defined(ARM_ARCH_3M) || defined(ARM_ARCH_4T) #define ARM_ARCH_OLD #endif

Then...

#ifndef __ARM_ARCH_OLD__
clz ip, r0
clz r3, r1

or, reuse something that compiler-rt does for you. GLibC does this when GCC != 4.8:

/* The __ARM_ARCH define is provided by gcc 4.8.  Construct it otherwise.  */
#ifndef __ARM_ARCH
# ifdef __ARM_ARCH_2__
#  define __ARM_ARCH 2
# elif defined (__ARM_ARCH_3__) || defined (__ARM_ARCH_3M__)
#  define __ARM_ARCH 3
# elif defined (__ARM_ARCH_4__) || defined (__ARM_ARCH_4T__)
#  define __ARM_ARCH 4
# elif defined (__ARM_ARCH_5__) || defined (__ARM_ARCH_5E__) \
       || defined(__ARM_ARCH_5T__) || defined(__ARM_ARCH_5TE__) \
       || defined(__ARM_ARCH_5TEJ__)
#  define __ARM_ARCH 5
# elif defined (__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) \
       || defined (__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) \
       || defined (__ARM_ARCH_6K__) || defined(__ARM_ARCH_6T2__)
#  define __ARM_ARCH 6
# elif defined (__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) \
       || defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__) \
       || defined(__ARM_ARCH_7EM__)
#  define __ARM_ARCH 7
# else
#  error unknown arm architecture
# endif
#endif
  1. Even so, some of the macros are not right (inline comment).
  1. A15 works as before and, as expected, shows no difference.
  1. Because of the macros, A9 was following the ARMv4 path and breaking on many far too many cases, I'm looking into it. [ex: 8/4 = 1 (rem 4)]
  1. If I make it go down the CLZ path, it works and get some 20% performance.
udivmodsi4.S
58 ↗(On Diff #6582)

You mean __ARM_ARCH >= 5, right?

ARM_ARCH_5T is *not* set for any other ARCH > 5T, so that's not enough for all v6, v7 and other v5s.

Since CLZ and BX LR are both v5+, I think you can safely set an ARM_ARCH_OLD using my method above and check the same flag for both purposes.

Thanks for working on those benchmarks Renato.

or, reuse something that compiler-rt does for you. GLibC does this when GCC != 4.8:

  /* The __ARM_ARCH define is provided by gcc 4.8.  Construct it otherwise.  */
  #ifndef __ARM_ARCH
  # ifdef __ARM_ARCH_2__
  #  define __ARM_ARCH 2

I much prefer this one. We can even put it in a header and then nuke
it the second LLVM starts requiring GCC 4.8.

Cheers.

Tim.

joerg updated this revision to Unknown Object (????).Jan 23 2014, 1:28 PM

Fix a incorrect condition in the last step of shift computation for non-clz ARM.
Provide ARM_ARCH for compilers lacking it and move feature checks into assembly.h.
Implement
udivsi3 and __umodsi3 using the base version.

rengolin added inline comments.Jan 23 2014, 1:50 PM
lib/arm/umodsi3.S
150 ↗(On Diff #6612)

I remember this vaguely, but now it doesn't compile. Looks like a left over of something?

lib/assembly.h
60 ↗(On Diff #6612)

Missing an || here

With the || fix, the udivmod works well on A9 on both v5 and v6 (CLZ) versions, with v5 being as performing and v6 being 8% faster.

I need to check the other routines.

lib/arm/umodsi3.S
150 ↗(On Diff #6612)

My fault, out-of-date repo.

joerg updated this revision to Unknown Object (????).Jan 23 2014, 2:29 PM

Fixed ARMv5 conditional.

compnerd added inline comments.Jan 23 2014, 8:07 PM
lib/arm/udivmodsi4.S
22 ↗(On Diff #6617)

bx is supported as of ARMv4T, I take it that you want to support pre-T versions of ARMv4?

59 ↗(On Diff #6617)

I think adding a reminder that due to the previous comparison, r0 >= r1 will hold would be nice.

111 ↗(On Diff #6617)

Can you please rewrite the loop unrolling as:

.macro block shift
  cmp r0, r1, lsl # \shift
  addhs r3, r3, #(1 << \shift)
  subhs r0, r0, r1, lsl # \shift
.endm

.Lround = 31
.rept 31
  block .Lround
  .Lround = .Lround - 1
.endr
LOCAL_LABEL(div0block):
  block 0

The net effect is identical, but it is much simpler to maintain IMO.

lib/arm/udivsi3.S
1 ↗(On Diff #6617)

I think that this update is incorrect.

19 ↗(On Diff #6617)

Similar comments apply as previous file.

lib/arm/umodsi3.S
1 ↗(On Diff #6617)

Again, implementing a different routine here.

10 ↗(On Diff #6617)

The comment does not match the function again.

19 ↗(On Diff #6617)

Similar things as the previous two files.

joerg added inline comments.Jan 24 2014, 1:03 AM
lib/arm/udivmodsi4.S
22 ↗(On Diff #6617)

Correct.

59 ↗(On Diff #6617)

That's what line 61 is about.

111 ↗(On Diff #6617)

GAS macros should burn in hell, so I disagree on this.

lib/arm/udivsi3.S
1 ↗(On Diff #6617)

Copy editing error. Fixed locally as well as the same issue in umodsi3.S.

rengolin added inline comments.Jan 24 2014, 1:46 AM
lib/arm/udivmodsi4.S
111 ↗(On Diff #6617)

I agree with Joerg. Making the IAS support macros is one thing, actually using it is quite another.

rengolin accepted this revision.Jan 24 2014, 4:24 AM

Hi Joerg,

I've tested all new routines on an A9, both with the v5 and v6 encoding (CLZ) and here are the numbers:

  • divmod:
    • v5: on par
    • v6: 8% faster
  • div:
    • v5: 10% slower
    • v6: 5% faster
  • mod:
    • v5: 15% faster
    • v6: 4% faster

All results are statistically relevant (differences larger than the standard deviation). I'm not sure what the regression in div-v5 is, but as I told you, the A9 is very aggressive, and it could be anything. On average, though, and where it matters (v6+), it's consistently faster, so I'm happy with the results.

Fixing the typos, the patch looks good to go.

cheers,
--renato

joerg closed this revision.Jan 24 2014, 5:49 AM

Closed by commit rL200001 (authored by @joerg).