Thursday, May 16, 2019

Tuning the large arithmetic thresholds

After tuning sqrtRem, let's now tune the remaining thresholds for multiplication, squaring and division.

The first thing is to have settable threshold parameters, and that's the object of Kernel-nice.1234. The settings can be changed manually through the Preferences interface, but tuning by hand is tedious.



So the next thing is to have a strategy for auto-tuning. I have chosen this one:
  1. set the thresholds for Karatsuba and Toom-3 multiplication (Mul22 and Mul33) to very high values, so as to get unaccelerated arithmetic
  2. gradually increase the Mul22 threshold, starting at 16 bytes and doubling, until it beats the reference unaccelerated performance
  3. once found, try smaller thresholds by setting the next two bits (if the threshold is 512, test 320, 384 and 448)
  4. repeat the procedure for Mul33, starting at the Mul22 threshold
The performance is measured on two series of well-balanced integers, of byte lengths 1x and 1.125x the threshold. The score is the sum of the two median_time_to_run/(nbits*log(nbits)) values, one per series (the lower the score, the better).
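
For illustration, here is the shape of that search as a method sketch; scoreForMulThreshold: and unacceleratedScore are hypothetical helpers standing for "set the threshold, run the two series, answer the score" - they are not the actual package API:

    autotuneMul22
        "Sketch of the doubling-then-refining threshold search described above"
        | threshold best bestScore score reference |
        reference := self unacceleratedScore.
        threshold := 16.
        [ (self scoreForMulThreshold: threshold) < reference ]
            whileFalse: [ threshold := threshold * 2 ].
        best := threshold.
        bestScore := self scoreForMulThreshold: best.
        "refine: test the three intermediate values below the winning power of two"
        (threshold // 8 * 5 to: threshold // 8 * 7 by: threshold // 8) do: [ :t |
            score := self scoreForMulThreshold: t.
            score < bestScore ifTrue: [ best := t. bestScore := score ] ].
        ^ best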

I did the same for squaring, with the S1/2 and S1/4 thresholds, but multiplication tuning must be performed first, because asymmetrical squaring uses it.

For division it's a bit different: since recursive division did not procure an immediate gain at 1x the threshold, I decided to engage digitDivSplit: at twice the threshold, so as to benefit from at least two recursion levels. I must therefore start testing at twice the threshold, and finally used series of byte lengths 2x, 3x and 4x the threshold.

I have published this auto-tuning code in the STEM repository, in the LargeArithmeticBench package, version -nice.11 at the time of writing.
To perform the tuning, doIt:

    LargeArithmeticBench new autotuneAcceleratedArithmeticThresholds.

With this strategy, the tuning is relatively fast: about 30 seconds on my MBP. But we are not that patient nowadays, so I report the progress in the Transcript - a very basic interface, but it works.
The auto-tuned thresholds on 64-bit Spur are now as follows, though they sometimes vary a bit from one run to another:

best Mul22=448
best Mul33=1536
best S1/2=320
best S1/4=640
best D21=448

I also refactored benchReport to use constant ratios for M(n,2n) and D(2n,n), and reduced the number of runs for large bit lengths. Performance is not much changed versus the hardcoded thresholds, except for division in the 2kbit to 10kbit range.

Beware: D(10n,n) now exercises really large integers, and the primitive behind digitDiv:neg: crashes the unaccelerated image. This does not happen with accelerated arithmetic, because we split into small chunks (if 224 bytes counts as small). It's been a long time since I last put my fingers in VM code - here is an occasion...

I believe that it's a tiny bit better to recompose a longer least-significant chunk with a shorter most-significant chunk, and that I marginally improved performance by rounding the split up to the next multiple of 4 digits rather than down with bitClear: 2r11, but I have launched so many tests that this would need confirmation. I put those changes in the Squeak inbox as Kernel-nice.1235. If someone wants to try, it's better to re-tune the thresholds for each code version:

    LargeArithmeticBench new autotuneAcceleratedArithmeticThresholds; benchReport.

Here are the updated curves:




Sunday, May 12, 2019

Tuning the LargeArithmetic sqrtRem

After further analysis, it turns out that the sqrtRem algorithm is always faster than naive Newton-Raphson iterations for LargePositiveInteger in Squeak on the 64-bit OpenSmalltalk Spur VM.

So rather than tuning the threshold, I just removed it altogether, which is something we all like: less code to write, less code to test - simple, small and beautiful code 😀.

So here is the updated benchmark (timings in nanoseconds per operation):

n (bits)    SR(n)             SRa(n)          gain
100         4 800             1 042           78 %
200         9 200             3 644           60 %
500         13 490            8 742           35 %
1 000       22 120            11 994          46 %
2 000       49 130            16 288          67 %
5 000       233 170           27 678          88 %
10 000      895 920           67 182          93 %
20 000      3 691 610         181 219         95 %
50 000      25 134 780        645 797         97 %
100 000     107 222 200       1 843 856       98 %
200 000     456 219 390       5 275 917       99 %
500 000     3 130 480 810     19 623 314      99 %

That's pretty good, but the curve still has the same pot-bellied look at low bit counts, probably related to high division costs.

While at it, I totally revamped the various implementations of sqrt in Squeak. I won't give further details here - the commit message is already pretty long!

Friday, May 10, 2019

Benchmarking the arithmetic boost

General benchmarking facilities

After profiling, let's look into the benchmarking facilities, and apply them to measure how much our arithmetic has been accelerated.

In Squeak Smalltalk, the simplest way to measure is to send the message bench to a block of code. The principle of bench is timer-based: the block is repeatedly evaluated until a delay expires, then the execution rate count/delay (n times per second) is extracted, along with the runtime of a single execution, delay/count. The default delay is 5 seconds, but it can be changed by sending benchFor: instead of bench.

    [100 factorial] benchFor: 2 seconds.
    '57,900 per second. 17.3 microseconds per run.'

I have also played with the high resolution clock, as used by AndreasSystemProfiler, to see if it were feasible to bench a single execution instead of repeated ones, but no surprise, that's less accurate than bench: the timing is overestimated and subject to high jitter. One must run several times anyway to account for JIT, OS concurrency and other effects.

    [100 factorial] durationToRun. 
    0:00:00:00.000040395
    0:00:00:00.000024274


Specific arrangement for Large Arithmetic Bench

Finally, I opted for my own bench based on the median timing of 21 trials; the median is more robust than the average to outliers. Each timing is established by running about 100 operations, then averaging to get the time to run per operation. The bench can be found in the LargeArithmeticBench package at http://www.squeaksource.com/STEM.html.
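
In spirit, each measurement looks like this minimal sketch (not the actual package code), using the standard Time class clock; the 21 trials and 100 iterations match the figures above:

    | timings |
    timings := (1 to: 21) collect: [ :trial |
        | t0 |
        t0 := Time utcMicrosecondClock.
        100 timesRepeat: [ 1000 factorial ].
        Time utcMicrosecondClock - t0 / 100.0 "microseconds per operation" ].
    timings asSortedCollection at: 11. "the median of the 21 trials"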

I have also used the high resolution clock (via my recent Preferences addition).

I have compared an updated 64-bit trunk image with accelerated arithmetic vs the released Squeak 5.2, which only has naive schoolbook algorithms in primitives. To have comparable results, I have used the same Spur64 VM, and I've imported the latest timing routines from Squeak trunk (Chronology-Core-nice.44) into the 5.2 image. If you want to repeat this operation on your own, simply download LargeArithmeticBench, open a Transcript, and evaluate

    LargeArithmeticBench new benchReport.

My configuration is not really new (2.7GHz Core i5 MBP from 2015 - 8 GB RAM).


Raw results

Hereafter, I will use these notations:
  • M(n,p) is the cost of multiplying an n-bit Integer by a p-bit Integer (with *)
  • D(n,p) is the cost of dividing an n-bit Integer by a p-bit Integer (with //)
  • S(n) is the cost of squaring an n-bit Integer (with squared)
  • SR(n) is the cost of taking the truncated square root of an n-bit Integer (with sqrtFloor)
  • Ma(n,p), Da(n,p), Sa(n) and SRa(n) are the same costs for the accelerated version
First, here are the non-accelerated 5.2 results, in nanoseconds per operation:

n (bits)   M(n,n)        M(n,2n)      M(n,10n)    D(2n,n)      D(10n,n)    S(n)         SR(n)
100        228           220          330         700          1 520       230          4 800
200        238           310          640         970          2 860       240          9 200
500        495           720          2 870       1 660        9 230       490          13 490
1 000      1 219         2 220        10 710      3 710        27 540      1 270        22 120
2 000      4 409         10 680       41 370      16 600       94 620      4 440        49 130
5 000      25 980        51 720       256 970     59 670       523 460     26 340       233 170
10 000     103 066       205 220      1 024 470   223 720      1 996 920   103 970      895 920
20 000     409 161       1 021 830    4 090 630   1 309 900    8 079 780   409 330      3 691 610
50 000     2 553 019     5 106 960    25 698 210  5 460 440    48 989 550  2 569 370    25 134 780
100 000    10 207 342    20 721 210   N/A         21 446 150   N/A         10 262 050   107 222 200
200 000    40 838 771    101 966 090  N/A         127 811 680  N/A         41 023 400   456 219 390
500 000    255 788 190   N/A          N/A         N/A          N/A         256 136 390  3 130 480 810

And here are the accelerated trunk results as of Kernel-nice.1226:


n (bits)   Ma(n,n)     Ma(n,2n)    Ma(n,10n)   Da(2n,n)    Da(10n,n)   Sa(n)       SRa(n)
100        244         233         346         717         1 517       253         4 931
200        252         328         703         988         2 859       265         9 310
500        516         754         2 890       1 686       9 188       509         13 644
1 000      1 235       2 214       10 671      3 764       27 267      1 238       17 401
2 000      4 411       10 553      41 246      16 533      92 978      4 424       21 559
5 000      24 362      57 340      276 811     84 926      721 250     22 299      33 416
10 000     79 166      213 406     789 047     238 771     1 960 335   63 273      72 882
20 000     236 754     616 834     2 480 517   981 013     6 075 578   186 107     188 061
50 000     980 818     2 238 966   10 175 333  2 605 632   22 161 607  794 071     656 582
100 000    2 801 742   6 346 407   N/A         7 522 468   N/A         2 288 904   1 865 753
200 000    7 989 935   19 824 556  N/A         31 991 633  N/A         6 432 702   5 319 813
500 000    31 317 255  N/A         N/A         N/A         N/A         26 047 987  19 751 009
And here is a comparison in relative time-to-run percentage, (naive - accelerated)/naive, where parenthesized (red) values denote a loss of performance. Here 90 % means that we eliminated 90 % of the run time, IOW we gained a factor of 10x; 80 % is 5x, 75 % 4x, 67 % 3x, 50 % 2x, 33 % 1.5x, 25 % 1.33x:

n (bits)   M(n,n)  M(n,2n)  M(n,10n)  D(2n,n)  D(10n,n)  S(n)    SR(n)
100        (7 %)   (6 %)    (5 %)     (2 %)    0 %       (10 %)  (3 %)
200        (6 %)   (6 %)    (10 %)    (2 %)    0 %       (10 %)  (1 %)
500        (4 %)   (5 %)    (1 %)     (2 %)    0 %       (4 %)   (1 %)
1 000      (1 %)   0 %      0 %       (1 %)    1 %       3 %     21 %
2 000      (0 %)   1 %      0 %       0 %      2 %       0 %     56 %
5 000      6 %     (11 %)   (8 %)     (42 %)   (38 %)    15 %    86 %
10 000     23 %    (4 %)    23 %      (7 %)    2 %       39 %    92 %
20 000     42 %    40 %     39 %      25 %     25 %      55 %    95 %
50 000     62 %    56 %     60 %      52 %     55 %      69 %    97 %
100 000    73 %    69 %     N/A       65 %     N/A       78 %    98 %
200 000    80 %    81 %     N/A       75 %     N/A       84 %    99 %
500 000    88 %    N/A      N/A       N/A     N/A        90 %    99 %

Discussion

Multiplication M(n,p)

In the first table, we verify that the cost of the naive schoolbook multiplication algorithm used by primDigitMultiply:neg: is quadratic, M(n,n)=O(n^2). No surprise: multiplying two 500kbit integers is 100x the cost of multiplying two 50kbit integers, which is 100x the cost of multiplying two 5kbit integers.
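
We can check that directly on the measurements; for example, the log-log slope between the 50kbit and 500kbit M(n,n) timings above:

    (255788190 / 2553019) asFloat ln / 10 ln.
    "2.0008... - the expected quadratic slope"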

This is not true for small LargeIntegers, because we pay a toll for invoking <primitive 29>, which is restricted to 64-bit results (not a very big win on a 64-bit platform where a positive SmallInteger already has up to 60 bits), then another penalty for invoking the schoolbook primitive. So for small bit lengths, we roughly measure the overhead cost of this machinery (plus that of the benchmark machinery itself - iterating over the collection of tests has a marginal cost too).


For the accelerated version, we call the schoolbook primitive below 600 bytes (4800 bits), and pay a penalty of at most 7% for the additional multiplyByInteger: indirection. This constant-time penalty tends to vanish versus the O(n^2) cost of the primitive as n increases.

Then Karatsuba starts paying immediately, at 5kbits for well-balanced lengths, and 3-way Toom-Cook above 2000 bytes (16kbits). The asymptotic costs are about O(n^1.58) and O(n^1.46) respectively. But as we saw in the previous post about profiling, the overhead of +/-/shift, even though O(n), accounts for a non-negligible part, and n is not large enough to reach this limit. Plotted in log-log, we see a slope of 2 for naive M(n,n) and 1.52 for the Toom-3 part of Ma(n,n).
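
To see where the O(n^1.58) comes from (three recursive products of half size, plus O(n) additions and shifts), here is the Karatsuba scheme sketched with plain non-negative Integer operations and an arbitrary 64-byte threshold; the trunk code works directly on the digits of LargePositiveInteger instead:

    | karatsuba |
    karatsuba := nil.
    karatsuba := [ :x :y |
        | half x1 x0 y1 y0 z0 z1 z2 |
        (x digitLength min: y digitLength) <= 64
            ifTrue: [ x * y ] "schoolbook primitive below the threshold"
            ifFalse: [
                half := (x digitLength max: y digitLength) // 2 * 8. "split position in bits"
                x1 := x bitShift: half negated.
                x0 := x bitAnd: (1 bitShift: half) - 1.
                y1 := y bitShift: half negated.
                y0 := y bitAnd: (1 bitShift: half) - 1.
                z2 := karatsuba value: x1 value: y1.
                z0 := karatsuba value: x0 value: y0.
                z1 := (karatsuba value: x1 + x0 value: y1 + y0) - z2 - z0.
                (z2 bitShift: 2 * half) + (z1 bitShift: half) + z0 ] ].
    karatsuba value: (1 bitShift: 50000) - 1 value: (1 bitShift: 50000) + 1.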
 
The cost of not-well-balanced multiplications is also interesting. For the naive algorithm, we expect M(n,m)=O(n*m), and that's what we get: M(n,2n)=2 M(n,n) and M(n,10n)=10 M(n,n).
This part has not been specially optimized in the accelerated flavor, and the cost is Ma(n,2n) > 2.2 Ma(n,n). As for Ma(n,10n), it's between 10 Ma(n,n) and 11 Ma(n,n). For moderate lengths in the Karatsuba domain (5-15 kbits), the performance is a few % worse than the naive algorithm.



n (bits)   M(n,2n)/M(n,n)  Ma(n,2n)/Ma(n,n)  M(n,10n)/M(n,n)  Ma(n,10n)/Ma(n,n)
100        1.0             1.0               1.4              1.4
200        1.3             1.3               2.7              2.8
500        1.5             1.5               5.8              5.6
1 000      1.8             1.8               8.8              8.6
2 000      2.4             2.4               9.4              9.4
5 000      2.0             2.4               9.9              11.4
10 000     2.0             2.7               9.9              10.0
20 000     2.5             2.6               10.0             10.5
50 000     2.0             2.3               10.1             10.4
100 000    2.0             2.3               N/A              N/A
200 000    2.5             2.5               N/A              N/A
500 000    N/A             N/A               N/A              N/A

The strategy of splitting into 1-1.5 parts (or 2-3) is adverse for Ma(n,2n) in the Karatsuba domain, and not of great interest for Ma(n,10n) - at best it's neutral (negligible overhead, which is already something). It's probably better for Ma(2n,k*3n) with integer k, but we did not bench that case. This should be revisited, or we should adopt a different dispatching strategy based on the lengths of both operands, instead of falling back to the naive algorithm based on the shorter operand length alone.
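
As food for thought, here is a sketch of one such alternative (not what trunk does): slice the longer operand into pieces the size of the shorter one, so that every partial product is well balanced and benefits from Karatsuba or Toom-Cook:

    | mulUnbalanced |
    mulUnbalanced := [ :short :long |
        | bits mask sum i chunk |
        bits := short digitLength * 8.
        mask := (1 bitShift: bits) - 1.
        sum := 0.
        i := 0.
        [ (long bitShift: i * bits negated) = 0 ] whileFalse: [
            chunk := (long bitShift: i * bits negated) bitAnd: mask.
            sum := sum + (short * chunk bitShift: i * bits). "balanced sub-product"
            i := i + 1 ].
        sum ].
    mulUnbalanced value: (1 bitShift: 50000) value: (1 bitShift: 500000).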


Division

The division results are interesting. For naive division, we see that D(2n,n) tends toward M(n,2n) for large n, and D(10n,n) toward 2 M(n,10n): the cost is effectively driven by the cost of multiplication. For small lengths, the price of the setup in the primitive (shifting the operands, as in Knuth) makes the ratio closer to 3 M(n,2n).

n (bits)   D(2n,n)/M(n,2n)  Da(2n,n)/Ma(n,2n)  D(10n,n)/M(n,10n)  Da(10n,n)/Ma(n,10n)
100        3.2              3.1                4.6                4.4
200        3.1              3.0                4.5                4.1
500        2.3              2.2                3.2                3.2
1 000      1.7              1.7                2.6                2.6
2 000      1.6              1.6                2.3                2.3
5 000      1.2              1.5                2.0                2.6
10 000     1.1              1.1                1.9                2.5
20 000     1.3              1.6                2.0                2.4
50 000     1.1              1.2                1.9                2.2
100 000    1.0              1.2                N/A                N/A
200 000    1.3              1.6                N/A                N/A
500 000    N/A              N/A                N/A                N/A

For accelerated Burnikel-Ziegler division, we obtain about the same ratios, 1.2 Ma(n,2n) and 2.2 Ma(n,10n), which confirms that the division cost is dominated by the multiplication cost; but I would have expected a larger constant factor. Note that we also have some Ma(n,2.5n) due to the irregular 50 000/20 000 ratio (base 10 oblige) - this is what causes the hiccups in the log-log plot (I could have changed the x axis to sqrt(n1*n2), but those are details).



The threshold has been hardcoded at 256 bytes (engaging above 2kbits), and it seems not optimal in the Karatsuba domain, up to 15kbits. Above that, the speed-up is about 3x at 100kbits and 4x at 200kbits.

Squaring

What about squaring? In the naive case, S(n)=M(n,n), because we simply fall back to naive multiplication: squared is just ^self * self.

n (bits)   S(n)/M(n,n)  Sa(n)/Ma(n,n)
100        1.0          1.0
200        1.0          1.1
500        1.0          1.0
1 000      1.0          1.0
2 000      1.0          1.0
5 000      1.0          0.9
10 000     1.0          0.8
20 000     1.0          0.8
50 000     1.0          0.8
100 000    1.0          0.8
200 000    1.0          0.8
500 000    1.0          0.8

For accelerated arithmetic, squared is more interesting than *, with a 20% speed-up at 10kbits and above. That's why I slightly modified the exponentiation algorithm (raisedToInteger:) in trunk to use squared instead of * (see https://source.squeak.org/trunk/Kernel-nice.1220.diff).
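
The change matters because squarings vastly outnumber plain multiplications in binary exponentiation; here is a minimal square-and-multiply sketch (not the exact trunk code, which handles more cases) showing where squared sits on the hot path:

    | base exponent result |
    base := 3.
    exponent := 10000.
    result := 1.
    [ exponent > 0 ] whileTrue: [
        exponent odd ifTrue: [ result := result * base ].
        base := base squared. "the hot spot - executed at every step"
        exponent := exponent bitShift: -1 ].
    result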



The squaring thresholds are hardcoded at 400 bytes (3.2kbits) for asymmetrical Karatsuba and 800 bytes (6.4kbits) for asymmetrical 4-way Toom-Cook. There is a single point of measurement for Karatsuba (a 15% performance improvement at 5kbits), and the speed-up climbs to 10x at 500kbits, which is really worth the few additional lines of code.
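
Where does the squaring gain come from? Splitting x = x1*2^h + x0 gives x^2 = (x1^2 << 2h) + (x1*x0 << (h+1)) + x0^2, that is two recursive squarings plus a single well-balanced product. Here is that idea sketched with a simple symmetric split (the trunk code uses the asymmetrical 1/2 and 1/4 splits mentioned above):

    | squareRec |
    squareRec := nil.
    squareRec := [ :x |
        | h x1 x0 |
        x digitLength <= 64
            ifTrue: [ x * x ] "schoolbook below an arbitrary threshold"
            ifFalse: [
                h := x digitLength // 2 * 8.
                x1 := x bitShift: h negated.
                x0 := x bitAnd: (1 bitShift: h) - 1.
                ((squareRec value: x1) bitShift: 2 * h)
                    + (x1 * x0 bitShift: h + 1)
                    + (squareRec value: x0) ] ].
    squareRec value: (1 bitShift: 100000) - 1.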

Square Root

Last on our list, the square root: the naive algorithm is catastrophic. It uses a Newton-Raphson iteration, which has good properties (self-correcting, with quadratic convergence - that is, twice as many correct bits at each iteration). But the initial guess is so desperately bad (a single bit!) that we end up with a cost SR(n)=O(n^2 log2 n).
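
The naive scheme is essentially the following integer Newton iteration (a sketch for positive n; the initial guess carries only the order of magnitude, about one correct bit, so roughly log2(n) iterations are needed, each paying a full-size division):

    | sqrtFloorNaive |
    sqrtFloorNaive := [ :n |
        | x y |
        x := 1 bitShift: n highBit + 1 // 2. "guess >= sqrt(n), ~1 correct bit"
        [ y := x + (n // x) // 2.
          y < x ] whileTrue: [ x := y ].
        x ].
    sqrtFloorNaive value: 1000 factorial.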

n (bits)   SR(n)/S(n)  SRa(n)/Sa(n)
100        20.9        19.5
200        38.3        35.1
500        27.5        26.8
1 000      17.4        14.1
2 000      11.1        4.9
5 000      8.9         1.5
10 000     8.6         1.2
20 000     9.0         1.0
50 000     9.8         0.8
100 000    10.4        0.8
200 000    11.1        0.8
500 000    12.2        0.8

That’s not the case of accelerated algorithm. The Karatsuba-like version of Paul Zimmerman shines and does even perform better than squared itself ( -20% at 50kbits in the Toom-Cook domain!). Was that expected? Or does it tell that some thresholds were not correctly tuned? That amazed me!




As explained in the conclusion of Zimmermann's paper, the algorithm somehow looks like a Newton-Raphson iteration, but the biggest difference I see with our naive original is that the initial guess already has half of the bits correct, thanks to the divide-and-conquer strategy, and for a fraction of the cost!
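
Here is that recurrence sketched with plain Integers, following the paper (answering an Array {sqrt. remainder}); this is NOT the trunk implementation, which works on raw digits, but it shows the structure: one recursive call on the high half, then a single division by twice the partial root to get the low half of the root:

    | sqrtRemSketch |
    sqrtRemSketch := nil.
    sqrtRemSketch := [ :n |
        | h k b a0 a1 a2 a3 rec s1 r1 num q u s r |
        h := n highBit.
        h <= 8
            ifTrue: [ "tiny case: delegate to any direct method"
                s := n sqrtFloor. Array with: s with: n - (s * s) ]
            ifFalse: [ (h \\ 4 = 1 or: [ h \\ 4 = 2 ])
                ifTrue: [ "normalize with an even shift so the top quarter is >= b/4"
                    rec := sqrtRemSketch value: (n bitShift: 2).
                    s := rec first bitShift: -1.
                    Array with: s with: n - (s * s) ]
                ifFalse: [
                    k := h + 3 // 4. "quarter size in bits; b is the 'digit' base"
                    b := 1 bitShift: k.
                    a0 := n bitAnd: b - 1.
                    a1 := (n bitShift: k negated) bitAnd: b - 1.
                    a2 := (n bitShift: 2 * k negated) bitAnd: b - 1.
                    a3 := n bitShift: 3 * k negated.
                    rec := sqrtRemSketch value: (a3 bitShift: k) + a2. "root of high half"
                    s1 := rec first.
                    r1 := rec second.
                    num := (r1 bitShift: k) + a1.
                    q := num // (2 * s1). "one division gives the low half of the root"
                    u := num \\ (2 * s1).
                    s := (s1 bitShift: k) + q.
                    r := (u bitShift: k) + a0 - (q * q).
                    r < 0 ifTrue: [ r := r + (2 * s) - 1. s := s - 1 ]. "single correction"
                    Array with: s with: r ] ] ].
    sqrtRemSketch value: 1000 factorial.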

The gain is progressive, with a first regime presumably dominated by the not-fully-optimized division up to 10kbits, as we saw above. But it reaches 100x at 100kbits, because the original is so naive, and because we are fighting a plain Smalltalk implementation rather than a generated, statically optimized C primitive, which is much easier to beat.


Conclusions


Even if not everyone manipulates such numbers, it's not every day that we can offer speed-ups of 2x to 10x. We observe gains of 3x or more for Integer division above 100kbits, 5x for multiplication, 10x at 500kbits for squaring, and even 100x for the square root - which is already something great for not so many lines of pure Smalltalk code.

The pure Smalltalk accelerated version has to compete with generated C code for the naive primitives, so it's not amazing that the O(n^x) algorithms start paying at somewhat higher thresholds than in GMP; but we don't want to compete by writing C code (or Slang) and spoil the fun!

There is much more to say about these benchmarks; for example, the accelerated division has a direct impact on large integer printing (in base 10). I already revisited the printing algorithm to use a divide-and-conquer form some years ago (https://source.squeak.org/trunk/Kernel-nice.200.diff and http://bugs.squeak.org/view.php?id=6887), and now that we have the Burnikel-Ziegler digitDiv21:, it's a perfect fit: the 3x division speed-up offers almost a 2x printString speed-up at 100kbits!

    "How much cost printing of a 100kbit Integer?"
    | x |
    x := (1 << 1e5 - 1) atRandom.
    [x printString] bench.

    '50.4 per second. 19.8 milliseconds per run.' with naive division
    '84.5 per second. 11.8 milliseconds per run.' with fast recursive division
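
The divide-and-conquer printing idea itself fits in a few lines; a sketch (the trunk version caches the powers of ten and lets the fast division do the heavy lifting - printDigits is a local helper answering the decimal digits of n, zero-padded to nDigits characters, assuming n < 10^nDigits):

    | printDigits result |
    printDigits := nil.
    printDigits := [ :n :nDigits |
        | half s |
        nDigits <= 16
            ifTrue: [ "small enough: print directly, then left-pad with zeros"
                s := n printString.
                (String new: nDigits - s size withAll: $0), s ]
            ifFalse: [ "split the digits in two halves with a single division"
                half := nDigits // 2.
                (printDigits value: n // (10 raisedTo: half) value: nDigits - half),
                    (printDigits value: n \\ (10 raisedTo: half) value: half) ] ].
    result := printDigits value: 100 factorial value: 160.
    "strip the padding of the most significant half"
    result copyFrom: (result findFirst: [ :c | c ~= $0 ]) to: result size.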

We saw that some additional tuning is necessary for the heuristic thresholds that dispatch to the different variants, so tuning should be the next task.