This is an archive of the discontinued LLVM Phabricator instance.

Improved udivmodsi4 with support for ARMv4
ClosedPublic

Authored by joerg on Jan 22 2014, 1:53 PM.


Details

Reviewers

rengolin
compnerd
t.p.northover

Commits

rL200001: Provide support for ARMv4, lacking bx and clz. Unroll the

Summary

The attached patch is a slightly optimized version of NetBSD's unsigned divide. License is cleared with Matt Thomas. Unlike the existing code, it has correct compile time branches for clz and hwdiv support. The test-and-subtract loop is fully unrolled to make full use of conditional execution without explicit branching. The other functions would follow after review for this code.
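To make the "fully unrolled test-and-subtract" idea concrete, here is a rough C model of it. This is only a sketch: `udivmod_sketch`, `top_bit`, `STEP` and `CASE` are made-up names, and the `switch` with fall-through stands in for the computed jump into the chain of blocks that the assembly performs.

```c
#include <assert.h>
#include <stdint.h>

/* One unrolled step, mirroring the cmp / addhs / subhs triple:
 * if (d << s) still fits into n, subtract it and set quotient bit s. */
#define STEP(s)                                   \
    if (n >= (d << (s))) {                        \
        q += (uint32_t)1 << (s);                  \
        n -= d << (s);                            \
    }
#define CASE(s) case s: STEP(s) /* falls through to the next step */

/* Hypothetical helper: index of the highest set bit of x (x != 0). */
static unsigned top_bit(uint32_t x) {
    unsigned b = 0;
    while (x >>= 1)
        b++;
    return b;
}

/* Returns n / d and stores n % d in *rem; the caller guarantees d != 0. */
static uint32_t udivmod_sketch(uint32_t n, uint32_t d, uint32_t *rem) {
    uint32_t q = 0;
    assert(d != 0);
    if (n < d) {            /* the quotient0 case */
        *rem = n;
        return 0;
    }
    /* Enter the chain at the shift that aligns the top bits, so that
     * (d << s) can never overflow in any step that is executed. */
    switch (top_bit(n) - top_bit(d)) {
    CASE(31) CASE(30) CASE(29) CASE(28)
    CASE(27) CASE(26) CASE(25) CASE(24)
    CASE(23) CASE(22) CASE(21) CASE(20)
    CASE(19) CASE(18) CASE(17) CASE(16)
    CASE(15) CASE(14) CASE(13) CASE(12)
    CASE(11) CASE(10) CASE(9)  CASE(8)
    CASE(7)  CASE(6)  CASE(5)  CASE(4)
    CASE(3)  CASE(2)  CASE(1)  CASE(0)
    }
    *rem = n;
    return q;
}
```

A C compiler will not necessarily turn this into the same branch-free code; the point is only to show which steps run for a given pair of operands.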

Diff Detail

Event Timeline

Hi Joerg,

Apart from my question below, everything looks fine. How have you tested it? Have you benchmarked it against the previous implementation? I do believe you that it's faster, just wanted to know how much gain we get from it.

cheers,
--renato

udivmodsi4.S
62 ↗	(On Diff #6582)	I'd have thought that the code would only get here if r0 >= r1, because of the BCC above. If r1 > r0, this is a case for quotient0, no?

How have you tested it? Have you benchmarked it against the previous implementation?

Cortex-A9 is probably the most important core without hardware
division. Well, M-class ones too, but I imagine those developers
wouldn't want this completely unrolled implementation in the first
place, with code size being so important.

Tim.

Tim:

Just calling them in a loop with an incrementing numerator and a constant denominator, I get the following timings on an ARM1176JZ-S (2nd generation rpi).

denominator	old code	new code (-march=armv6)	new code (-march=armv4)
65534	9.43	3.53	4.06
16	9.58	3.52	4.06
128	8.17	3.21	3.76
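For anyone wanting to repeat this kind of measurement, a loop along the following lines should do. It is a sketch, not the harness used for the numbers above: `div_loop` and `time_div_loop` are made-up names, and it assumes that on an ARM target without hardware divide the compiler lowers `n / d` to a call into the runtime division routine.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Divide an incrementing numerator by a fixed denominator `iters` times.
 * The returned checksum keeps the compiler from dropping the divisions. */
static uint32_t div_loop(uint32_t denom, uint32_t iters) {
    volatile uint32_t d = denom;  /* volatile: no constant-divisor tricks */
    uint32_t sum = 0;
    uint32_t n;
    assert(denom != 0);
    for (n = 0; n < iters; n++)
        sum += n / d;
    return sum;
}

/* Returns the CPU time of one run in seconds; the checksum goes to *out. */
static double time_div_loop(uint32_t denom, uint32_t iters, uint32_t *out) {
    clock_t start = clock();
    *out = div_loop(denom, iters);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_div_loop` once per denominator (65534, 16, 128) against the old and the new runtime would give a table of the same shape as the one above.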

udivmodsi4.S
62 ↗	(On Diff #6582)	quotient0 is used for r0 < r1, so at this point r1 >= r0. This means r3 >= ip and therefore r3-ip >= 0.

joerg added inline comments. Jan 23 2014, 4:57 AM

udivmodsi4.S
62 ↗	(On Diff #6582)	Not enough tea. quotient0 is used for r0 < r1, so at this point r1 <= r0. This means clz(r1) >= clz(r0).
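The clz argument can be checked with a small C model. This is a sketch under assumptions: `clz32` is a portable stand-in for the `clz` instruction, and `align_shift` and `entry_offset` are made-up names; the factor 12 is the size of one three-instruction block in ARM mode, as the comment in the diff states.

```c
#include <assert.h>
#include <stdint.h>

/* Portable count-leading-zeros for x != 0 (stand-in for the clz instruction). */
static unsigned clz32(uint32_t x) {
    unsigned n = 0;
    assert(x != 0);
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Shift that lines up the top bit of d with the top bit of n.
 * Requires n >= d > 0, which is what makes the difference non-negative. */
static unsigned align_shift(uint32_t n, uint32_t d) {
    assert(d != 0 && n >= d);
    return clz32(d) - clz32(n);
}

/* Distance in bytes from the entry point to the block(0) label. */
static unsigned entry_offset(uint32_t n, uint32_t d) {
    unsigned s = align_shift(n, d);
    return (s << 2) + (s << 3);  /* 4*s + 8*s = 12*s, like the two subs */
}
```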

Joerg,

I'll test your patch locally and will let you know. I'm happy with the (expected) v4 and v6 performance gains, we shouldn't get much on A15 but good gain on A9.

Tim,

This implementation is ARM only, so M-class wouldn't use this. I think the only worry in code-size would be for v4 users, since nowadays, ARM11 systems have 256MB or more.

cheers,
--renato

I'll test your patch locally and will let you know. I'm happy with the (expected) v4 and v6 performance gains,

Yeah, those are some very nice benefits.

we shouldn't get much on A15 but good gain on A9.

I'd hope nothing on A15, it has hardware divide (as part of
the virtualisation extensions!) doesn't it?

Cheers.

Tim.
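For reference, on cores with hardware divide the `__ARM_ARCH_EXT_IDIV__` branch of the diff is just a `udiv` followed by an `mls` to recover the remainder. A rough C equivalent, with the made-up name `udivmod_hw_sketch`, might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the hardware-divide path: one divide, then multiply-subtract. */
static uint32_t udivmod_hw_sketch(uint32_t n, uint32_t d, uint32_t *rem) {
    uint32_t q;
    assert(d != 0);      /* the assembly branches to its divide-by-zero label */
    q = n / d;           /* udiv r0, r3, r1 */
    *rem = n - q * d;    /* mls  r1, r0, r1, r3 */
    return q;
}
```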

So, I've been fighting this patch for a while, and here are some observations.

__ARM_ARCH is not defined before GCC 4.8, so we can't rely on it. You have to do something like:

#if defined(__ARM_ARCH_2__) || defined(__ARM_ARCH_3__) || defined(__ARM_ARCH_4__) || defined(__ARM_ARCH_3M__) || defined(__ARM_ARCH_4T__)
#define __ARM_ARCH_OLD__
#endif

Then...

#ifndef __ARM_ARCH_OLD__
clz ip, r0
clz r3, r1

or, reuse something that compiler-rt does for you. GLibC does this when GCC != 4.8:

/* The __ARM_ARCH define is provided by gcc 4.8.  Construct it otherwise.  */
#ifndef __ARM_ARCH
# ifdef __ARM_ARCH_2__
#  define __ARM_ARCH 2
# elif defined (__ARM_ARCH_3__) || defined (__ARM_ARCH_3M__)
#  define __ARM_ARCH 3
# elif defined (__ARM_ARCH_4__) || defined (__ARM_ARCH_4T__)
#  define __ARM_ARCH 4
# elif defined (__ARM_ARCH_5__) || defined (__ARM_ARCH_5E__) \
       || defined(__ARM_ARCH_5T__) || defined(__ARM_ARCH_5TE__) \
       || defined(__ARM_ARCH_5TEJ__)
#  define __ARM_ARCH 5
# elif defined (__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) \
       || defined (__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) \
       || defined (__ARM_ARCH_6K__) || defined(__ARM_ARCH_6T2__)
#  define __ARM_ARCH 6
# elif defined (__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) \
       || defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__) \
       || defined(__ARM_ARCH_7EM__)
#  define __ARM_ARCH 7
# else
#  error unknown arm architecture
# endif
#endif

Even so, some of the macros are not right (inline comment).

A15 works as before and, as expected, shows no difference.

Because of the macros, the A9 was following the ARMv4 path and breaking in far too many cases; I'm looking into it. [ex: 8/4 = 1 (rem 4)]

If I make it go down the CLZ path, it works and gets some 20% performance gain.

udivmodsi4.S
58 ↗	(On Diff #6582)	You mean __ARM_ARCH >= 5, right? __ARM_ARCH_5T__ is not set for any other ARCH > 5T, so that's not enough for all v6, v7 and other v5s. Since CLZ and BX LR are both v5+, I think you can safely set an __ARM_ARCH_OLD__ using my method above and check the same flag for both purposes.
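One way the feature checks could be centralised is sketched below. All `SKETCH_*` names are invented for illustration and are not what assembly.h uses. The sketch assumes that an ARM compiler defining none of the pre-v5 macros targets v5 or later, it ignores M-profile cores, and it falls back to 0 on non-ARM hosts so that the fragment compiles anywhere. It tests `bx` separately because, as noted later in the review, `bx` is available from ARMv4T on.

```c
#include <assert.h>

/* Derive a numeric architecture level when the compiler supplies none. */
#if defined(__ARM_ARCH)
# define SKETCH_ARM_ARCH __ARM_ARCH
#elif defined(__ARM_ARCH_2__)
# define SKETCH_ARM_ARCH 2
#elif defined(__ARM_ARCH_3__) || defined(__ARM_ARCH_3M__)
# define SKETCH_ARM_ARCH 3
#elif defined(__ARM_ARCH_4__) || defined(__ARM_ARCH_4T__)
# define SKETCH_ARM_ARCH 4
#elif defined(__arm__)
# define SKETCH_ARM_ARCH 5   /* assumption: anything not listed above is v5+ */
#else
# define SKETCH_ARM_ARCH 0   /* not an ARM target; lets the sketch build on a host */
#endif

/* clz is treated as available from ARMv5 on. */
#if SKETCH_ARM_ARCH >= 5
# define SKETCH_HAS_CLZ 1
#else
# define SKETCH_HAS_CLZ 0
#endif

/* bx already exists on ARMv4T, so it gets its own condition. */
#if SKETCH_ARM_ARCH >= 5 || defined(__ARM_ARCH_4T__)
# define SKETCH_HAS_BX 1
#else
# define SKETCH_HAS_BX 0
#endif

/* Sanity check: every configuration with clz also has bx. */
static void sketch_check_features(void) {
    assert(!SKETCH_HAS_CLZ || SKETCH_HAS_BX);
}
```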

Thanks for working on those benchmarks Renato.

or, reuse something that compiler-rt does for you. GLibC does this when GCC != 4.8:

  /* The __ARM_ARCH define is provided by gcc 4.8.  Construct it otherwise.  */
  #ifndef __ARM_ARCH
  # ifdef __ARM_ARCH_2__
  #  define __ARM_ARCH 2

I much prefer this one. We can even put it in a header and then nuke
it the second LLVM starts requiring GCC 4.8.

Cheers.

Tim.

Fix an incorrect condition in the last step of shift computation for non-clz ARM.
Provide __ARM_ARCH for compilers lacking it and move feature checks into assembly.h.
Implement __udivsi3 and __umodsi3 using the base version.
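The shift computation for cores without clz can be modelled in C as a binary search over 16, 8, 4, 2 and 1. This is a sketch with the made-up name `shift_without_clz`; where the assembly subtracts 12 bytes per step from the jump target in ip, the model simply counts steps. If the last step is skipped for 8/4, the result is shift 0, and a single block(0) then yields 1 remainder 4. That matches the failure reported earlier, although the thread does not spell out what the faulty condition was.

```c
#include <assert.h>
#include <stdint.h>

/* Largest s with (n >> s) >= d, found with halving steps 16, 8, 4, 2, 1.
 * Requires n >= d > 0. Models the non-clz branch of the diff. */
static unsigned shift_without_clz(uint32_t n, uint32_t d) {
    uint32_t t = n;   /* scratch copy of the numerator (r4 or r2) */
    unsigned s = 0;   /* the assembly tracks 12 * s in ip instead */
    assert(d != 0 && n >= d);
    if ((t >> 16) >= d) { t >>= 16; s += 16; }
    if ((t >> 8) >= d)  { t >>= 8;  s += 8; }
    if ((t >> 4) >= d)  { t >>= 4;  s += 4; }
    if ((t >> 2) >= d)  { t >>= 2;  s += 2; }
    if ((t >> 1) >= d)  { s += 1; }  /* last step: cmp r1, r4, lsr #1 / subls */
    return s;
}
```

The result can be one less than the clz difference (100 and 7 give 3 here, while aligning the top bits gives 4), which is harmless: the extra block on the clz path just fails its comparison.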

rengolin added inline comments. Jan 23 2014, 1:50 PM

lib/arm/umodsi3.S
150 ↗	(On Diff #6612)	I remember this vaguely, but now it doesn't compile. Looks like a leftover of something?
lib/assembly.h
60 ↗	(On Diff #6612)	Missing an || here

With the || fix, udivmod works well on A9 in both the v5 and v6 (CLZ) versions, with v5 performing on par and v6 being 8% faster.

I need to check the other routines.

lib/arm/umodsi3.S
150 ↗	(On Diff #6612)	My fault, out-of-date repo.

Fixed ARMv5 conditional.

compnerd added inline comments. Jan 23 2014, 8:07 PM

lib/arm/udivmodsi4.S
22 ↗	(On Diff #6617)	bx is supported as of ARMv4T, I take it that you want to support pre-T versions of ARMv4?
59 ↗	(On Diff #6617)	I think adding a reminder that due to the previous comparison, r0 >= r1 will hold would be nice.
111 ↗	(On Diff #6617)	Can you please rewrite the loop unrolling as:

.macro block shift
cmp r0, r1, lsl # \shift
addhs r3, r3, #(1 << \shift)
subhs r0, r0, r1, lsl # \shift
.endm

.Lround = 31
.rept 31
block .Lround
.Lround = .Lround - 1
.endr
LOCAL_LABEL(div0block):
block 0

The net effect is identical, but it is much simpler to maintain IMO.
lib/arm/udivsi3.S
1 ↗	(On Diff #6617)	I think that this update is incorrect.
19 ↗	(On Diff #6617)	Similar comments apply as previous file.
lib/arm/umodsi3.S
1 ↗	(On Diff #6617)	Again, implementing a different routine here.
10 ↗	(On Diff #6617)	The comment does not match the function again.
19 ↗	(On Diff #6617)	Similar things as the previous two files.

joerg added inline comments. Jan 24 2014, 1:03 AM

lib/arm/udivmodsi4.S
22 ↗	(On Diff #6617)	Correct.
59 ↗	(On Diff #6617)	That's what line 61 is about.
111 ↗	(On Diff #6617)	GAS macros should burn in hell, so I disagree on this.
lib/arm/udivsi3.S
1 ↗	(On Diff #6617)	Copy editing error. Fixed locally as well as the same issue in umodsi3.S.

rengolin added inline comments. Jan 24 2014, 1:46 AM

lib/arm/udivmodsi4.S
111 ↗	(On Diff #6617)	I agree with Joerg. Making the IAS support macros is one thing, actually using it is quite another.

Hi Joerg,

I've tested all new routines on an A9, both with the v5 and v6 encoding (CLZ) and here are the numbers:

divmod:
- v5: on par
- v6: 8% faster
div:
- v5: 10% slower
- v6: 5% faster
mod:
- v5: 15% faster
- v6: 4% faster

All results are statistically relevant (differences larger than the standard deviation). I'm not sure what the regression in div-v5 is, but as I told you, the A9 is very aggressive, and it could be anything. On average, though, and where it matters (v6+), it's consistently faster, so I'm happy with the results.

Fixing the typos, the patch looks good to go.

cheers,
--renato

Closed by commit rL200001 (authored by @joerg).

Revision Contents

Path	Size
compiler-rt/trunk/lib/arm/udivmodsi4.S	212 lines
compiler-rt/trunk/lib/arm/udivsi3.S	212 lines
compiler-rt/trunk/lib/arm/umodsi3.S	184 lines

Diff 6638

compiler-rt/trunk/lib/arm/udivmodsi4.S

	/*===-- udivmodsi4.S - 32-bit unsigned integer divide and modulus ---------===//			/*===-- udivmodsi4.S - 32-bit unsigned integer divide and modulus ---------===//
	*			*
	* The LLVM Compiler Infrastructure			* The LLVM Compiler Infrastructure
	*			*
	* This file is dual licensed under the MIT and the University of Illinois Open			* This file is dual licensed under the MIT and the University of Illinois Open
	* Source Licenses. See LICENSE.TXT for details.			* Source Licenses. See LICENSE.TXT for details.
	*			*
	*===----------------------------------------------------------------------===//			*===----------------------------------------------------------------------===//
	*			*
	* This file implements the __udivmodsi4 (32-bit unsigned integer divide and			* This file implements the __udivmodsi4 (32-bit unsigned integer divide and
	* modulus) function for the ARM architecture. A naive digit-by-digit			* modulus) function for the ARM 32-bit architecture.
	* computation is employed for simplicity.
	*			*
	===----------------------------------------------------------------------===/			===----------------------------------------------------------------------===/

	#include "../assembly.h"			#include "../assembly.h"

	#define ESTABLISH_FRAME \
	push {r4, r7, lr} ;\
	add r7, sp, #4
	#define CLEAR_FRAME_AND_RETURN \
	pop {r4, r7, pc}

	#define a r0
	#define b r1
	#define i r3
	#define r r4
	#define q ip
	#define one lr

	.syntax unified			.syntax unified
	.align 3
				#ifdef ARM_HAS_BX
				#define JMP(r) bx r
				#else
				#define JMP(r) mov pc, r
				#endif

				.text
				.arm
				.p2align 2
	DEFINE_COMPILERRT_FUNCTION(__udivmodsi4)			DEFINE_COMPILERRT_FUNCTION(__udivmodsi4)
	#if __ARM_ARCH_EXT_IDIV__			#if __ARM_ARCH_EXT_IDIV__
	tst r1, r1			tst r1, r1
	beq LOCAL_LABEL(divzero)			beq LOCAL_LABEL(divby0)
	mov r3, r0			mov r3, r0
	udiv r0, r3, r1			udiv r0, r3, r1
	mls r1, r0, r1, r3			mls r1, r0, r1, r3
	str r1, [r2]			str r1, [r2]
	bx lr			bx lr
	LOCAL_LABEL(divzero):			#else
				cmp r1, #1
				bcc LOCAL_LABEL(divby0)
				beq LOCAL_LABEL(divby1)
				cmp r0, r1
				bcc LOCAL_LABEL(quotient0)
				/*
				* Implement division using binary long division algorithm.
				*
				* r0 is the numerator, r1 the denominator.
				*
				* The code before JMP computes the correct shift I, so that
				* r0 and (r1 << I) have the highest bit set in the same position.
				* At the time of JMP, ip := .Ldiv0block - 12 * I.
				* This depends on the fixed instruction size of block.
				*
				* block(shift) implements the test-and-update-quotient core.
				* It assumes (r0 << shift) can be computed without overflow and
				* that (r0 << shift) < 2 * r1. The quotient is stored in r3.
				*/

				# ifdef __ARM_FEATURE_CLZ
				clz ip, r0
				clz r3, r1
				/* r0 >= r1 implies clz(r0) <= clz(r1), so ip <= r3. */
				sub r3, r3, ip
				adr ip, LOCAL_LABEL(div0block)
				sub ip, ip, r3, lsl #2
				sub ip, ip, r3, lsl #3
				mov r3, #0
				bx ip
				# else
				str r4, [sp, #-8]!

				mov r4, r0
				adr ip, LOCAL_LABEL(div0block)

				lsr r3, r4, #16
				cmp r3, r1
				movhs r4, r3
				subhs ip, ip, #(16 * 12)

				lsr r3, r4, #8
				cmp r3, r1
				movhs r4, r3
				subhs ip, ip, #(8 * 12)

				lsr r3, r4, #4
				cmp r3, r1
				movhs r4, r3
				subhs ip, #(4 * 12)

				lsr r3, r4, #2
				cmp r3, r1
				movhs r4, r3
				subhs ip, ip, #(2 * 12)

				/* Last block, no need to update r3 or r4. */
				cmp r1, r4, lsr #1
				subls ip, ip, #(1 * 12)

				ldr r4, [sp], #8 /* restore r4, we are done with it. */
				mov r3, #0

				JMP(ip)
				# endif

				#define IMM #

				#define block(shift) \
				cmp r0, r1, lsl IMM shift; \
				addhs r3, r3, IMM (1 << shift); \
				subhs r0, r0, r1, lsl IMM shift

				block(31)
				block(30)
				block(29)
				block(28)
				block(27)
				block(26)
				block(25)
				block(24)
				block(23)
				block(22)
				block(21)
				block(20)
				block(19)
				block(18)
				block(17)
				block(16)
				block(15)
				block(14)
				block(13)
				block(12)
				block(11)
				block(10)
				block(9)
				block(8)
				block(7)
				block(6)
				block(5)
				block(4)
				block(3)
				block(2)
				block(1)
				LOCAL_LABEL(div0block):
				block(0)

				str r0, [r2]
				mov r0, r3
				JMP(lr)

				LOCAL_LABEL(quotient0):
				str r0, [r2]
	mov r0, #0			mov r0, #0
	bx lr			JMP(lr)

				LOCAL_LABEL(divby1):
				mov r3, #0
				str r3, [r2]
				JMP(lr)
				#endif /* __ARM_ARCH_EXT_IDIV__ */

				LOCAL_LABEL(divby0):
				mov r0, #0
				#ifdef __ARM_EABI__
				b __aeabi_idiv0
	#else			#else
	// We use a simple digit by digit algorithm; before we get into the actual			JMP(lr)
	// divide loop, we must calculate the left-shift amount necessary to align
	// the MSB of the divisor with that of the dividend (If this shift is
	// negative, then the result is zero, and we early out). We also conjure a
	// bit mask of 1 to use in constructing the quotient, and initialize the
	// quotient to zero.
	ESTABLISH_FRAME
	clz r4, a
	tst b, b // detect divide-by-zero
	clz r3, b
	mov q, #0
	beq LOCAL_LABEL(return) // return 0 if b is zero.
	mov one, #1
	subs i, r3, r4
	blt LOCAL_LABEL(return) // return 0 if MSB(a) < MSB(b)

	LOCAL_LABEL(mainLoop):
	// This loop basically implements the following:
	//
	// do {
	// if (a >= b << i) {
	// a -= b << i;
	// q |= 1 << i;
	// if (a == 0) break;
	// }
	// } while (--i)
	//
	// Note that this does not perform the final iteration (i == 0); by doing it
	// this way, we can merge the two branches which is a substantial win for
	// such a tight loop on current ARM architectures.
	subs r, a, b, lsl i
	itt hs
	orrhs q, q,one, lsl i
	movhs a, r
	it ne
	subsne i, i, #1
	bhi LOCAL_LABEL(mainLoop)

	// Do the final test subtraction and update of quotient (i == 0), as it is
	// not performed in the main loop.
	subs r, a, b
	itt hs
	orrhs q, #1
	movhs a, r

	LOCAL_LABEL(return):
	// Store the remainder, and move the quotient to r0, then return.
	str a, [r2]
	mov r0, q
	CLEAR_FRAME_AND_RETURN
	#endif			#endif

				END_COMPILERRT_FUNCTION(__udivmodsi4)

compiler-rt/trunk/lib/arm/udivsi3.S

	/*===-- udivsi3.S - 32-bit unsigned integer divide ------------------------===//			/*===-- udivsi3.S - 32-bit unsigned integer divide ------------------------===//
	*			*
	* The LLVM Compiler Infrastructure			* The LLVM Compiler Infrastructure
	*			*
	* This file is dual licensed under the MIT and the University of Illinois Open			* This file is dual licensed under the MIT and the University of Illinois Open
	* Source Licenses. See LICENSE.TXT for details.			* Source Licenses. See LICENSE.TXT for details.
	*			*
	*===----------------------------------------------------------------------===//			*===----------------------------------------------------------------------===//
	*			*
	* This file implements the __udivsi3 (32-bit unsigned integer divide)			* This file implements the __udivsi3 (32-bit unsigned integer divide)
	* function for the ARM architecture. A naive digit-by-digit computation is			* function for the ARM 32-bit architecture.
	* employed for simplicity.
	*			*
	===----------------------------------------------------------------------===/			===----------------------------------------------------------------------===/

	#include "../assembly.h"			#include "../assembly.h"

	#define ESTABLISH_FRAME \
	push {r7, lr} ;\
	mov r7, sp
	#define CLEAR_FRAME_AND_RETURN \
	pop {r7, pc}

	#define a r0
	#define b r1
	#define r r2
	#define i r3
	#define q ip
	#define one lr

	.syntax unified			.syntax unified
	.align 3
	// Ok, APCS and AAPCS agree on 32 bit args, so it's safe to use the same routine.			#ifdef ARM_HAS_BX
				#define JMP(r) bx r
				#define JMPc(r,c) bx##c r
				#else
				#define JMP(r) mov pc, r
				#define JMPc(r,c) mov##c pc, r
				#endif

				.text
				.arm
				.p2align 2
	DEFINE_AEABI_FUNCTION_ALIAS(__aeabi_uidiv, __udivsi3)			DEFINE_AEABI_FUNCTION_ALIAS(__aeabi_uidiv, __udivsi3)
	DEFINE_COMPILERRT_FUNCTION(__udivsi3)			DEFINE_COMPILERRT_FUNCTION(__udivsi3)
	#if __ARM_ARCH_EXT_IDIV__			#if __ARM_ARCH_EXT_IDIV__
	tst r1,r1			tst r1, r1
	beq LOCAL_LABEL(divzero)			beq LOCAL_LABEL(divby0)
	udiv r0, r0, r1			mov r3, r0
				udiv r0, r3, r1
				mls r1, r0, r1, r3
	bx lr			bx lr
	LOCAL_LABEL(divzero):			#else
				cmp r1, #1
				bcc LOCAL_LABEL(divby0)
				JMPc(lr, eq)
				cmp r0, r1
				movcc r0, #0
				JMPc(lr, cc)
				/*
				* Implement division using binary long division algorithm.
				*
				* r0 is the numerator, r1 the denominator.
				*
				* The code before JMP computes the correct shift I, so that
				* r0 and (r1 << I) have the highest bit set in the same position.
				* At the time of JMP, ip := .Ldiv0block - 12 * I.
				* This depends on the fixed instruction size of block.
				*
				* block(shift) implements the test-and-update-quotient core.
				* It assumes (r0 << shift) can be computed without overflow and
				* that (r0 << shift) < 2 * r1. The quotient is stored in r3.
				*/

				# ifdef __ARM_FEATURE_CLZ
				clz ip, r0
				clz r3, r1
				/* r0 >= r1 implies clz(r0) <= clz(r1), so ip <= r3. */
				sub r3, r3, ip
				adr ip, LOCAL_LABEL(div0block)
				sub ip, ip, r3, lsl #2
				sub ip, ip, r3, lsl #3
				mov r3, #0
				bx ip
				# else
				mov r2, r0
				adr ip, LOCAL_LABEL(div0block)

				lsr r3, r2, #16
				cmp r3, r1
				movhs r2, r3
				subhs ip, ip, #(16 * 12)

				lsr r3, r2, #8
				cmp r3, r1
				movhs r2, r3
				subhs ip, ip, #(8 * 12)

				lsr r3, r2, #4
				cmp r3, r1
				movhs r2, r3
				subhs ip, #(4 * 12)

				lsr r3, r2, #2
				cmp r3, r1
				movhs r2, r3
				subhs ip, ip, #(2 * 12)

				/* Last block, no need to update r2 or r3. */
				cmp r1, r2, lsr #1
				subls ip, ip, #(1 * 12)

				mov r3, #0

				JMP(ip)
				# endif

				#define IMM #

				#define block(shift) \
				cmp r0, r1, lsl IMM shift; \
				addhs r3, r3, IMM (1 << shift); \
				subhs r0, r0, r1, lsl IMM shift

				block(31)
				block(30)
				block(29)
				block(28)
				block(27)
				block(26)
				block(25)
				block(24)
				block(23)
				block(22)
				block(21)
				block(20)
				block(19)
				block(18)
				block(17)
				block(16)
				block(15)
				block(14)
				block(13)
				block(12)
				block(11)
				block(10)
				block(9)
				block(8)
				block(7)
				block(6)
				block(5)
				block(4)
				block(3)
				block(2)
				block(1)
				LOCAL_LABEL(div0block):
				block(0)

				mov r0, r3
				JMP(lr)
				#endif /* __ARM_ARCH_EXT_IDIV__ */

				LOCAL_LABEL(divby0):
	mov r0,#0			mov r0, #0
	bx lr			#ifdef __ARM_EABI__
				b __aeabi_idiv0
	#else			#else
	// We use a simple digit by digit algorithm; before we get into the actual			JMP(lr)
	// divide loop, we must calculate the left-shift amount necessary to align
	// the MSB of the divisor with that of the dividend (If this shift is
	// negative, then the result is zero, and we early out). We also conjure a
	// bit mask of 1 to use in constructing the quotient, and initialize the
	// quotient to zero.
	ESTABLISH_FRAME
	clz r2, a
	tst b, b // detect divide-by-zero
	clz r3, b
	mov q, #0
	beq LOCAL_LABEL(return) // return 0 if b is zero.
	mov one, #1
	subs i, r3, r2
	blt LOCAL_LABEL(return) // return 0 if MSB(a) < MSB(b)

	LOCAL_LABEL(mainLoop):
	// This loop basically implements the following:
	//
	// do {
	// if (a >= b << i) {
	// a -= b << i;
	// q |= 1 << i;
	// if (a == 0) break;
	// }
	// } while (--i)
	//
	// Note that this does not perform the final iteration (i == 0); by doing it
	// this way, we can merge the two branches which is a substantial win for
	// such a tight loop on current ARM architectures.
	subs r, a, b, lsl i
	itt hs
	orrhs q, q,one, lsl i
	movhs a, r
	it ne
	subsne i, i, #1
	bhi LOCAL_LABEL(mainLoop)

	// Do the final test subtraction and update of quotient (i == 0), as it is
	// not performed in the main loop.
	subs r, a, b
	it hs
	orrhs q, #1

	LOCAL_LABEL(return):
	// Move the quotient to r0 and return.
	mov r0, q
	CLEAR_FRAME_AND_RETURN
	#endif			#endif

				END_COMPILERRT_FUNCTION(__udivsi3)

compiler-rt/trunk/lib/arm/umodsi3.S

	/*===-- umodsi3.S - 32-bit unsigned integer modulus -----------------------===//			/*===-- umodsi3.S - 32-bit unsigned integer modulus -----------------------===//
	*			*
	* The LLVM Compiler Infrastructure			* The LLVM Compiler Infrastructure
	*			*
	* This file is dual licensed under the MIT and the University of Illinois Open			* This file is dual licensed under the MIT and the University of Illinois Open
	* Source Licenses. See LICENSE.TXT for details.			* Source Licenses. See LICENSE.TXT for details.
	*			*
	*===----------------------------------------------------------------------===//			*===----------------------------------------------------------------------===//
	*			*
	* This file implements the __umodsi3 (32-bit unsigned integer modulus)			* This file implements the __umodsi3 (32-bit unsigned integer modulus)
	* function for the ARM architecture. A naive digit-by-digit computation is			* function for the ARM 32-bit architecture.
	* employed for simplicity.
	*			*
	===----------------------------------------------------------------------===/			===----------------------------------------------------------------------===/

	#include "../assembly.h"			#include "../assembly.h"

	#define a r0
	#define b r1
	#define r r2
	#define i r3

	.syntax unified			.syntax unified
	.align 3
				#ifdef ARM_HAS_BX
				#define JMP(r) bx r
				#define JMPc(r,c) bx##c r
				#else
				#define JMP(r) mov pc, r
				#define JMPc(r,c) mov##c pc, r
				#endif

				.text
				.arm
				.p2align 2
	DEFINE_COMPILERRT_FUNCTION(__umodsi3)			DEFINE_COMPILERRT_FUNCTION(__umodsi3)
	#if __ARM_ARCH_EXT_IDIV__			#if __ARM_ARCH_EXT_IDIV__
	tst r1, r1			tst r1, r1
	beq LOCAL_LABEL(divzero)			beq LOCAL_LABEL(divby0)
	udiv r2, r0, r1			mov r3, r0
	mls r0, r2, r1, r0			udiv r0, r3, r1
				mls r1, r0, r1, r3
				str r1, [r2]
	bx lr			bx lr
	LOCAL_LABEL(divzero):			#else
				cmp r1, #1
				bcc LOCAL_LABEL(divby0)
				moveq r0, #0
				JMPc(lr, eq)
				cmp r0, r1
				JMPc(lr, cc)
				/*
				* Implement division using binary long division algorithm.
				*
				* r0 is the numerator, r1 the denominator.
				*
				* The code before JMP computes the correct shift I, so that
				* r0 and (r1 << I) have the highest bit set in the same position.
				* At the time of JMP, ip := .Ldiv0block - 8 * I.
				* This depends on the fixed instruction size of block.
				*
				* block(shift) implements the test-and-update-quotient core.
				* It assumes (r0 << shift) can be computed without overflow and
				* that (r0 << shift) < 2 * r1. The quotient is stored in r3.
				*/

				# ifdef __ARM_FEATURE_CLZ
				clz ip, r0
				clz r3, r1
				/* r0 >= r1 implies clz(r0) <= clz(r1), so ip <= r3. */
				sub r3, r3, ip
				adr ip, LOCAL_LABEL(div0block)
				sub ip, ip, r3, lsl #3
				bx ip
				# else
				mov r2, r0
				adr ip, LOCAL_LABEL(div0block)

				lsr r3, r2, #16
				cmp r3, r1
				movhs r2, r3
				subhs ip, ip, #(16 * 8)

				lsr r3, r2, #8
				cmp r3, r1
				movhs r2, r3
				subhs ip, ip, #(8 * 8)

				lsr r3, r2, #4
				cmp r3, r1
				movhs r2, r3
				subhs ip, #(4 * 8)

				lsr r3, r2, #2
				cmp r3, r1
				movhs r2, r3
				subhs ip, ip, #(2 * 8)

				/* Last block, no need to update r2 or r3. */
				cmp r1, r2, lsr #1
				subls ip, ip, #(1 * 8)

				JMP(ip)
				# endif

				#define IMM #

				#define block(shift) \
				cmp r0, r1, lsl IMM shift; \
				subhs r0, r0, r1, lsl IMM shift

				block(31)
				block(30)
				block(29)
				block(28)
				block(27)
				block(26)
				block(25)
				block(24)
				block(23)
				block(22)
				block(21)
				block(20)
				block(19)
				block(18)
				block(17)
				block(16)
				block(15)
				block(14)
				block(13)
				block(12)
				block(11)
				block(10)
				block(9)
				block(8)
				block(7)
				block(6)
				block(5)
				block(4)
				block(3)
				block(2)
				block(1)
				LOCAL_LABEL(div0block):
				block(0)
				JMP(lr)
				#endif /* __ARM_ARCH_EXT_IDIV__ */

				LOCAL_LABEL(divby0):
	mov r0, #0			mov r0, #0
	bx lr			#ifdef __ARM_EABI__
				b __aeabi_idiv0
	#else			#else
	// We use a simple digit by digit algorithm; before we get into the actual			JMP(lr)
	// divide loop, we must calculate the left-shift amount necessary to align
	// the MSB of the divisor with that of the dividend.
	clz r2, a
	tst b, b // detect b == 0
	clz r3, b
	bxeq lr // return a if b == 0
	subs i, r3, r2
	bxlt lr // return a if MSB(a) < MSB(b)

	LOCAL_LABEL(mainLoop):
	// This loop basically implements the following:
	//
	// do {
	// if (a >= b << i) {
	// a -= b << i;
	// if (a == 0) break;
	// }
	// } while (--i)
	//
	// Note that this does not perform the final iteration (i == 0); by doing it
	// this way, we can merge the two branches which is a substantial win for
	// such a tight loop on current ARM architectures.
	subs r, a, b, lsl i
	it hs
	movhs a, r
	it ne
	subsne i, i, #1
	bhi LOCAL_LABEL(mainLoop)

	// Do the final test subtraction and update of remainder (i == 0), as it is
	// not performed in the main loop.
	subs r, a, b
	it hs
	movhs a, r
	bx lr
	#endif			#endif

				END_COMPILERRT_FUNCTION(__umodsi3)
