1 /* 32 and 64-bit millicode, original author Hewlett-Packard
2 adapted for gcc by Paul Bame <bame@debian.org>
3 and Alan Modra <alan@linuxcare.com.au>.
5 Copyright 2001, 2002, 2003 Free Software Foundation, Inc.
7 This file is part of GCC and is released under the terms of
8 of the GNU General Public License as published by the Free Software
9 Foundation; either version 2, or (at your option) any later version.
10 See the file COPYING in the top-level GCC source directory for a copy
33 . Divide by selected constants for single precision binary integers.
38 . sr0 == return space when called externally
45 OTHER REGISTERS AFFECTED:
49 . Causes a trap under the following conditions: NONE
50 . Changes memory at the following places: NONE
54 . Does not create a stack frame.
55 . Suitable for internal or external millicode.
56 . Assumes the special millicode register conventions.
59 . Calls other millicode routines using mrp: NONE
60 . Calls other millicode routines: NONE */
63 /* TRUNCATED DIVISION BY SMALL INTEGERS
65 We are interested in q(x) = floor(x/y), where x >= 0 and y > 0
68 Let a = floor(z/y), for some choice of z. Note that z will be
69 chosen so that division by z is cheap.
71 Let r be the remainder(z/y). In other words, r = z - ay.
73 Now, our method is to choose a value for b such that
75 q'(x) = floor((ax+b)/z)
77 is equal to q(x) over as large a range of x as possible. If the
78 two are equal over a sufficiently large range, and if it is easy to
79 form the product (ax), and it is easy to divide by z, then we can
80 perform the division much faster than the general division algorithm.
82 So, we want the following to be true:
84 . For x in the following range:
90 . k <= (ax+b)/z < (k+1)
92 We want to determine b such that this is true for all k in the
93 range {0..K} for some maximum K.
95 Since (ax+b) is an increasing function of x, we can take each
96 bound separately to determine the "best" value for b.
98 (ax+b)/z < (k+1) implies
100 (a((k+1)y-1)+b < (k+1)z implies
102 b < a + (k+1)(z-ay) implies
106 This needs to be true for all k in the range {0..K}. In
107 particular, it is true for k = 0 and this leads to a maximum
108 acceptable value for b.
110 b < a+r or b <= a+r-1
112 Taking the other bound, we have
114 k <= (ax+b)/z implies
116 k <= (aky+b)/z implies
122 Clearly, the largest range for k will be achieved by maximizing b,
123 when r is not zero. When r is zero, then the simplest choice for b
124 is 0. When r is not 0, set
128 Now, by construction, q'(x) = floor((ax+b)/z) = q(x) = floor(x/y)
129 for all x in the range:
133 We need to determine what K is. Of our two bounds,
135 . b < a+(k+1)r is satisfied for all k >= 0, by construction.
141 This is always true if r = 0. If r is not 0 (the usual case), then
142 K = floor((a+r-1)/r), is the maximum value for k.
144 Therefore, the formula q'(x) = floor((ax+b)/z) yields the correct
145 answer for q(x) = floor(x/y) when x is in the range
147 (0,(K+1)y-1) K = floor((a+r-1)/r)
149 To be most useful, we want (K+1)y-1 = (max x) >= 2**32-1 so that
150 the formula for q'(x) yields the correct value of q(x) for all x
151 representable by a single word in HPPA.
153 We are also constrained in that computing the product (ax), adding
154 b, and dividing by z must all be done quickly, otherwise we will be
155 better off going through the general algorithm using the DS
156 instruction, which uses approximately 70 cycles.
158 For each y, there is a choice of z which satisfies the constraints
159 for (K+1)y >= 2**32. We may not, however, be able to satisfy the
160 timing constraints for arbitrary y. It seems that z being equal to
161 a power of 2 or a power of 2 minus 1 is as good as we can do, since
162 it minimizes the time to do division by z. We want the choice of z
163 to also result in a value for (a) that minimizes the computation of
164 the product (ax). This is best achieved if (a) has a regular bit
165 pattern (so the multiplication can be done with shifts and adds).
166 The value of (a) also needs to be less than 2**32 so the product is
167 always guaranteed to fit in 2 words.
169 In actual practice, the following should be done:
171 1) For negative x, you should take the absolute value and remember
172 . the fact so that the result can be negated. This obviously does
173 . not apply in the unsigned case.
174 2) For even y, you should factor out the power of 2 that divides y
175 . and divide x by it. You can then proceed by dividing by the
178 Here is a table of some odd values of y, and corresponding choices
179 for z which are "good".
181 y z r a (hex) max x (hex)
183 3 2**32 1 55555555 100000001
184 5 2**32 1 33333333 100000003
185 7 2**24-1 0 249249 (infinite)
186 9 2**24-1 0 1c71c7 (infinite)
187 11 2**20-1 0 1745d (infinite)
188 13 2**24-1 0 13b13b (infinite)
189 15 2**32 1 11111111 10000000d
190 17 2**32 1 f0f0f0f 10000000f
192 If r is 1, then b = a+r-1 = a. This simplifies the computation
193 of (ax+b), since you can compute (x+1)(a) instead. If r is 0,
194 then b = 0 is ok to use which simplifies (ax+b).
196 The bit patterns for 55555555, 33333333, and 11111111 are obviously
197 very regular. The bit patterns for the other values of a above are:
201 7 249249 001001001001001001001001 << regular >>
202 9 1c71c7 000111000111000111000111 << regular >>
203 11 1745d 000000010111010001011101 << irregular >>
204 13 13b13b 000100111011000100111011 << irregular >>
206 The bit patterns for (a) corresponding to (y) of 11 and 13 may be
207 too irregular to warrant using this method.
209 When z is a power of 2 minus 1, then the division by z is slightly
210 more complicated, involving an iterative solution.
212 The code presented here solves division by 1 through 17, except for
213 11 and 13. There are algorithms for both signed and unsigned
218 divisor positive negative unsigned
233 Now, the algorithm for 7, 9, and 14 is an iterative one. That is,
234 a loop body is executed until the tentative quotient is 0. The
235 number of times the loop body is executed varies depending on the
236 dividend, but is never more than two times. If the dividend is
237 less than the divisor, then the loop body is not executed at all.
238 Each iteration adds 4 cycles to the timings.
240 divisor positive negative unsigned
242 . 7 19+4n 20+4n 20+4n n = number of iterations
243 . 9 21+4n 22+4n 21+4n
244 . 14 21+4n 22+4n 20+4n
246 To give an idea of how the number of iterations varies, here is a
247 table of dividend versus number of iterations when dividing by 7.
249 smallest largest required
250 dividend dividend iterations
254 0x1000006 0xffffffff 2
256 There is some overlap in the range of numbers requiring 1 and 2
260 RDEFINE(x2,arg0) /* r26 */
261 RDEFINE(t1,arg1) /* r25 */
262 RDEFINE(x1,ret1) /* r29 */
270 /* NONE of these routines require a stack frame
271 ALL of these routines are unwindable from millicode */
273 GSYM($$divide_by_constant)
274 .export $$divide_by_constant,millicode
275 /* Provides a "nice" label for the code covered by the unwind descriptor
276 for things like gprof. */
278 /* DIVISION BY 2 (shift by 1) */
280 .export $$divI_2,millicode
284 extrs arg0,30,31,ret1
287 /* DIVISION BY 4 (shift by 2) */
289 .export $$divI_4,millicode
293 extrs arg0,29,30,ret1
296 /* DIVISION BY 8 (shift by 3) */
298 .export $$divI_8,millicode
302 extrs arg0,28,29,ret1
304 /* DIVISION BY 16 (shift by 4) */
306 .export $$divI_16,millicode
310 extrs arg0,27,28,ret1
312 /****************************************************************************
314 * DIVISION BY DIVISORS OF FFFFFFFF, and powers of 2 times these
316 * includes 3,5,15,17 and also 6,10,12
318 ****************************************************************************/
320 /* DIVISION BY 3 (use z = 2**32; a = 55555555) */
323 .export $$divI_3,millicode
324 comb,<,N x2,0,LREF(neg3)
326 addi 1,x2,x2 /* this cannot overflow */
327 extru x2,1,2,x1 /* multiply by 5 to get started */
333 subi 1,x2,x2 /* this cannot overflow */
334 extru x2,1,2,x1 /* multiply by 5 to get started */
340 .export $$divU_3,millicode
341 addi 1,x2,x2 /* this CAN overflow */
343 shd x1,x2,30,t1 /* multiply by 5 to get started */
348 /* DIVISION BY 5 (use z = 2**32; a = 33333333) */
351 .export $$divI_5,millicode
352 comb,<,N x2,0,LREF(neg5)
354 addi 3,x2,t1 /* this cannot overflow */
355 sh1add x2,t1,x2 /* multiply by 3 to get started */
360 sub 0,x2,x2 /* negate x2 */
361 addi 1,x2,x2 /* this cannot overflow */
362 shd 0,x2,31,x1 /* get top bit (can be 1) */
363 sh1add x2,x2,x2 /* multiply by 3 to get started */
368 .export $$divU_5,millicode
369 addi 1,x2,x2 /* this CAN overflow */
371 shd x1,x2,31,t1 /* multiply by 3 to get started */
376 /* DIVISION BY 6 (shift to divide by 2 then divide by 3) */
378 .export $$divI_6,millicode
379 comb,<,N x2,0,LREF(neg6)
380 extru x2,30,31,x2 /* divide by 2 */
381 addi 5,x2,t1 /* compute 5*(x2+1) = 5*x2+5 */
382 sh2add x2,t1,x2 /* multiply by 5 to get started */
387 subi 2,x2,x2 /* negate, divide by 2, and add 1 */
388 /* negation and adding 1 are done */
389 /* at the same time by the SUBI */
392 sh2add x2,x2,x2 /* multiply by 5 to get started */
397 .export $$divU_6,millicode
398 extru x2,30,31,x2 /* divide by 2 */
399 addi 1,x2,x2 /* cannot carry */
400 shd 0,x2,30,x1 /* multiply by 5 to get started */
405 /* DIVISION BY 10 (shift to divide by 2 then divide by 5) */
407 .export $$divU_10,millicode
408 extru x2,30,31,x2 /* divide by 2 */
409 addi 3,x2,t1 /* compute 3*(x2+1) = (3*x2)+3 */
410 sh1add x2,t1,x2 /* multiply by 3 to get started */
413 shd x1,x2,28,t1 /* multiply by 0x11 */
418 shd x1,x2,24,t1 /* multiply by 0x101 */
423 shd x1,x2,16,t1 /* multiply by 0x10001 */
430 .export $$divI_10,millicode
431 comb,< x2,0,LREF(neg10)
433 extru x2,30,31,x2 /* divide by 2 */
434 addib,TR 1,x2,LREF(pos) /* add 1 (cannot overflow) */
435 sh1add x2,x2,x2 /* multiply by 3 to get started */
438 subi 2,x2,x2 /* negate, divide by 2, and add 1 */
439 /* negation and adding 1 are done */
440 /* at the same time by the SUBI */
442 sh1add x2,x2,x2 /* multiply by 3 to get started */
444 shd x1,x2,28,t1 /* multiply by 0x11 */
449 shd x1,x2,24,t1 /* multiply by 0x101 */
454 shd x1,x2,16,t1 /* multiply by 0x10001 */
461 /* DIVISION BY 12 (shift to divide by 4 then divide by 3) */
463 .export $$divI_12,millicode
464 comb,< x2,0,LREF(neg12)
466 extru x2,29,30,x2 /* divide by 4 */
467 addib,tr 1,x2,LREF(pos) /* compute 5*(x2+1) = 5*x2+5 */
468 sh2add x2,x2,x2 /* multiply by 5 to get started */
471 subi 4,x2,x2 /* negate, divide by 4, and add 1 */
472 /* negation and adding 1 are done */
473 /* at the same time by the SUBI */
476 sh2add x2,x2,x2 /* multiply by 5 to get started */
479 .export $$divU_12,millicode
480 extru x2,29,30,x2 /* divide by 4 */
481 addi 5,x2,t1 /* cannot carry */
482 sh2add x2,t1,x2 /* multiply by 5 to get started */
486 /* DIVISION BY 15 (use z = 2**32; a = 11111111) */
488 .export $$divI_15,millicode
489 comb,< x2,0,LREF(neg15)
491 addib,tr 1,x2,LREF(pos)+4
499 .export $$divU_15,millicode
500 addi 1,x2,x2 /* this CAN overflow */
504 /* DIVISION BY 17 (use z = 2**32; a = f0f0f0f) */
506 .export $$divI_17,millicode
507 comb,<,n x2,0,LREF(neg17)
508 addi 1,x2,x2 /* this cannot overflow */
509 shd 0,x2,28,t1 /* multiply by 0xf to get started */
516 subi 1,x2,x2 /* this cannot overflow */
517 shd 0,x2,28,t1 /* multiply by 0xf to get started */
524 .export $$divU_17,millicode
525 addi 1,x2,x2 /* this CAN overflow */
527 shd x1,x2,28,t1 /* multiply by 0xf to get started */
535 /* DIVISION BY DIVISORS OF FFFFFF, and powers of 2 times these
536 includes 7,9 and also 14
544 Also, in order to divide by z = 2**24-1, we approximate by dividing
545 by (z+1) = 2**24 (which is easy), and then correcting.
550 So to compute (ax)/z, compute q' = (ax)/(z+1) and r = (ax) mod (z+1)
551 Then the true remainder of (ax)/z is (q'+r). Repeat the process
552 with this new remainder, adding the tentative quotients together,
553 until a tentative quotient is 0 (and then we are done). There is
554 one last correction to be done. It is possible that (q'+r) = z.
555 If so, then (q'+r)/(z+1) = 0 and it looks like we are done. But,
556 in fact, we need to add 1 more to the quotient. Now, it turns
557 out that this happens if and only if the original value x is
558 an exact multiple of y. So, to avoid a three instruction test at
559 the end, instead use 1 instruction to add 1 to x at the beginning. */
561 /* DIVISION BY 7 (use z = 2**24-1; a = 249249) */
563 .export $$divI_7,millicode
564 comb,<,n x2,0,LREF(neg7)
566 addi 1,x2,x2 /* cannot overflow */
581 /* computed <t1,x2>. Now divide it by (2**24 - 1) */
584 shd,= t1,x2,24,t1 /* tentative quotient */
586 addb,tr t1,x1,LREF(2) /* add to previous quotient */
587 extru x2,31,24,x2 /* new remainder (unadjusted) */
592 addb,tr t1,x2,LREF(1) /* adjust remainder */
593 extru,= x2,7,8,t1 /* new quotient */
596 subi 1,x2,x2 /* negate x2 and add 1 */
613 /* computed <t1,x2>. Now divide it by (2**24 - 1) */
616 shd,= t1,x2,24,t1 /* tentative quotient */
618 addb,tr t1,x1,LREF(4) /* add to previous quotient */
619 extru x2,31,24,x2 /* new remainder (unadjusted) */
622 sub 0,x1,x1 /* negate result */
625 addb,tr t1,x2,LREF(3) /* adjust remainder */
626 extru,= x2,7,8,t1 /* new quotient */
629 .export $$divU_7,millicode
630 addi 1,x2,x2 /* can carry */
637 /* DIVISION BY 9 (use z = 2**24-1; a = 1c71c7) */
639 .export $$divI_9,millicode
640 comb,<,n x2,0,LREF(neg9)
641 addi 1,x2,x2 /* cannot overflow */
649 subi 1,x2,x2 /* negate and add 1 */
657 .export $$divU_9,millicode
658 addi 1,x2,x2 /* can carry */
666 /* DIVISION BY 14 (shift to divide by 2 then divide by 7) */
668 .export $$divI_14,millicode
669 comb,<,n x2,0,LREF(neg14)
671 .export $$divU_14,millicode
672 b LREF(7) /* go to 7 case */
673 extru x2,30,31,x2 /* divide by 2 */
676 subi 2,x2,x2 /* negate (and add 2) */
678 extru x2,30,31,x2 /* divide by 2 */