General benchmarking facilities
After profiling, let's look into the benchmarking facilities and apply them to measure how much our arithmetic has been accelerated. In Squeak Smalltalk, the simplest way to measure is to send the message bench to a block of code. The principle of bench is timer-based: the block is repeatedly evaluated until a delay expires, then the execution rate count/delay (n times per second) is reported, along with the runtime of a single execution, delay/count. The default delay is 5 seconds, but it can be changed by sending benchFor: instead of bench.
[100 factorial] benchFor: 2 seconds.
'57,900 per second. 17.3 microseconds per run.'
I have also played with the high resolution clock used by AndreasSystemProfiler to see whether it would be feasible to bench a single execution instead of repeated ones but, no surprise, that's less accurate than bench: the timing is overestimated and subject to high jitter. One must run it several times to account for JIT warm-up, OS concurrency and other effects.
[100 factorial] durationToRun.
0:00:00:00.000040395
0:00:00:00.000024274
Specific arrangement for Large Arithmetic Bench
Finally, I opted for my own bench based on the median timing of 21 trials; the median is more robust to outliers than the average. Each timing is established by running about 100 operations, then averaging to get the time to run per operation. The bench can be found as the LargeArithmeticBench package at http://www.squeaksource.com/STEM.html. I have also used the high resolution clock (via my recent Preferences addition).
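The principle can be sketched in a few lines of plain Smalltalk (an illustration of the idea only, not the actual LargeArithmeticBench code; the trial and run counts are those described above):
| timings |
"21 trials; each trial averages the duration over about 100 runs"
timings := (1 to: 21) collect: [:trial |
    [100 timesRepeat: [100 factorial]] durationToRun / 100].
"answer the median trial, which is robust to outliers"
timings asSortedCollection asArray at: 11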
I have compared an updated 64-bit trunk image with accelerated arithmetic vs the released Squeak 5.2, which only has the naive schoolbook algorithms in primitives. To get comparable results, I have used the same Spur64 VM, and I've imported the latest timing routines from Squeak trunk (Chronology-Core-nice.44) into the 5.2 image. If you want to repeat this operation on your own, simply download LargeArithmeticBench, open a Transcript, and evaluate
LargeArithmeticBench new benchReport.
My configuration is not really new (2.7 GHz Core i5 MacBook Pro from 2015, 8 GB RAM).
Raw results
Hereafter, I will use these notations:
- M(n,p) is the cost of multiplying an n-bit Integer by a p-bit Integer (*)
- D(n,p) is the cost of dividing an n-bit Integer by a p-bit Integer (//)
- S(n) is the cost of squaring an n-bit Integer (squared)
- SR(n) is the cost of taking the truncated square root of an n-bit Integer (sqrtFloor)
- Ma(n,p), Da(n,p), Sa(n), SRa(n) are the same costs for the accelerated version
| n (bits) | 100 | 200 | 500 | 1 000 | 2 000 | 5 000 | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M(n,n) | 228 | 238 | 495 | 1 219 | 4 409 | 25 980 | 103 066 | 409 161 | 2 553 019 | 10 207 342 | 40 838 771 | 255 788 190 |
| M(n,2n) | 220 | 310 | 720 | 2 220 | 10 680 | 51 720 | 205 220 | 1 021 830 | 5 106 960 | 20 721 210 | 101 966 090 | N/A |
| M(n,10n) | 330 | 640 | 2 870 | 10 710 | 41 370 | 256 970 | 1 024 470 | 4 090 630 | 25 698 210 | N/A | N/A | N/A |
| D(2n,n) | 700 | 970 | 1 660 | 3 710 | 16 600 | 59 670 | 223 720 | 1 309 900 | 5 460 440 | 21 446 150 | 127 811 680 | N/A |
| D(10n,n) | 1 520 | 2 860 | 9 230 | 27 540 | 94 620 | 523 460 | 1 996 920 | 8 079 780 | 48 989 550 | N/A | N/A | N/A |
| S(n) | 230 | 240 | 490 | 1 270 | 4 440 | 26 340 | 103 970 | 409 330 | 2 569 370 | 10 262 050 | 41 023 400 | 256 136 390 |
| SR(n) | 4 800 | 9 200 | 13 490 | 22 120 | 49 130 | 233 170 | 895 920 | 3 691 610 | 25 134 780 | 107 222 200 | 456 219 390 | 3 130 480 810 |
And here are the accelerated trunk results as of Kernel-nice.1226:
| n (bits) | 100 | 200 | 500 | 1 000 | 2 000 | 5 000 | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ma(n,n) | 244 | 252 | 516 | 1 235 | 4 411 | 24 362 | 79 166 | 236 754 | 980 818 | 2 801 742 | 7 989 935 | 31 317 255 |
| Ma(n,2n) | 233 | 328 | 754 | 2 214 | 10 553 | 57 340 | 213 406 | 616 834 | 2 238 966 | 6 346 407 | 19 824 556 | N/A |
| Ma(n,10n) | 346 | 703 | 2 890 | 10 671 | 41 246 | 276 811 | 789 047 | 2 480 517 | 10 175 333 | N/A | N/A | N/A |
| Da(2n,n) | 717 | 988 | 1 686 | 3 764 | 16 533 | 84 926 | 238 771 | 981 013 | 2 605 632 | 7 522 468 | 31 991 633 | N/A |
| Da(10n,n) | 1 517 | 2 859 | 9 188 | 27 267 | 92 978 | 721 250 | 1 960 335 | 6 075 578 | 22 161 607 | N/A | N/A | N/A |
| Sa(n) | 253 | 265 | 509 | 1 238 | 4 424 | 22 299 | 63 273 | 186 107 | 794 071 | 2 288 904 | 6 432 702 | 26 047 987 |
| SRa(n) | 4 931 | 9 310 | 13 644 | 17 401 | 21 559 | 33 416 | 72 882 | 188 061 | 656 582 | 1 865 753 | 5 319 813 | 19 751 009 |
And here is a comparison of the relative time-to-run gain, (naive - accelerated)/naive, with parenthesized values for a loss of performance. Here 90 % means that we eliminated 90 % of the run time, in other words we gained a factor of 10x; 80 % is 5x, 75 % is 4x, 67 % is 3x, 50 % is 2x, 33 % is 1.5x, 25 % is 1.33x:
| n (bits) | 100 | 200 | 500 | 1 000 | 2 000 | 5 000 | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M(n,n) | (7 %) | (6 %) | (4 %) | (1 %) | (0 %) | 6 % | 23 % | 42 % | 62 % | 73 % | 80 % | 88 % |
| M(n,2n) | (6 %) | (6 %) | (5 %) | 0 % | 1 % | (11 %) | (4 %) | 40 % | 56 % | 69 % | 81 % | N/A |
| M(n,10n) | (5 %) | (10 %) | (1 %) | 0 % | 0 % | (8 %) | 23 % | 39 % | 60 % | N/A | N/A | N/A |
| D(2n,n) | (2 %) | (2 %) | (2 %) | (1 %) | 0 % | (42 %) | (7 %) | 25 % | 52 % | 65 % | 75 % | N/A |
| D(10n,n) | 0 % | 0 % | 0 % | 1 % | 2 % | (38 %) | 2 % | 25 % | 55 % | N/A | N/A | N/A |
| S(n) | (10 %) | (10 %) | (4 %) | 3 % | 0 % | 15 % | 39 % | 55 % | 69 % | 78 % | 84 % | 90 % |
| SR(n) | (3 %) | (1 %) | (1 %) | 21 % | 56 % | 86 % | 92 % | 95 % | 97 % | 98 % | 99 % | 99 % |
Discussion
Multiplication M(n,p)
In the first table, we verify that the cost of the naive schoolbook multiplication algorithm used by primDigitMultiply:neg: is quadratic, M(n,n) = O(n^2), no surprise: multiplying two 500-kbit Integers costs 100x as much as multiplying two 50-kbit Integers, which costs 100x as much as multiplying two 5-kbit Integers. This does not hold for small LargeIntegers, because we first pay a tribute for invoking <primitive 29>, which is restricted to 64-bit results (not a very big win on a 64-bit platform where a positive SmallInteger already holds up to 60 bits), then another penalty for invoking the schoolbook primitive. So for small bit lengths we mostly measure the overhead of this machinery (plus that of the benchmark machinery itself; iterating over the collection of tests has a marginal cost too).
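Schematically, the non-accelerated multiplication looks like this (quoted from memory and simplified, so take the exact wording with a grain of salt): the VM first attempts the fast primitive, and only falls back to the plugin-backed schoolbook code when the result does not fit.
* anInteger
    "Fast path: primitive 29 only succeeds when the result fits in 64 bits"
    <primitive: 29>
    "Fallback: schoolbook multiplication, itself backed by a plugin primitive"
    ^self digitMultiply: anInteger
        neg: self negative ~~ anInteger negative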
Then Karatsuba starts paying immediately, at 5 kbits for well balanced lengths, and 3-way Toom-Cook above 2000 bytes (16 kbits). Asymptotically, the costs are about O(n^1.58) and O(n^1.46) respectively. But as we saw in the previous post about profiling, the overhead of +/-/shift, even if O(n), accounts for a non negligible part, and n is not large enough to reach this asymptotic regime. Plotted in log-log, we see a slope of 2 for naive M(n,n) and 1.52 for the Toom-3 part of Ma(n,n).
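For reference, the Karatsuba recursion fits in a few lines of Smalltalk. This is only an illustration for nonnegative operands with a made-up threshold, not the trunk implementation:
| karatsuba |
karatsuba := [:x :y | | m xl xh yl yh low high mid |
    (x highBit min: y highBit) < 2000
        ifTrue: [x * y] "below the threshold, the schoolbook primitive wins"
        ifFalse: [m := (x highBit max: y highBit) // 2.
            xl := x bitAnd: (1 bitShift: m) - 1. xh := x bitShift: m negated.
            yl := y bitAnd: (1 bitShift: m) - 1. yh := y bitShift: m negated.
            low := karatsuba value: xl value: yl.
            high := karatsuba value: xh value: yh.
            "three recursive products instead of four, hence O(n^1.58)"
            mid := (karatsuba value: xl + xh value: yl + yh) - low - high.
            (high bitShift: 2 * m) + (mid bitShift: m) + low]].
karatsuba value: (1 bitShift: 10000) - 1 value: (1 bitShift: 10000) - 3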
The cost of not well balanced multiplications is also interesting. For the naive algorithm, we expect M(n,p) = O(n*p), and that's what we get: M(n,2n) = 2 M(n,n) and M(n,10n) = 10 M(n,n).
This part has not been specially optimized in the accelerated flavor, and the cost is Ma(n,2n) > 2.2 Ma(n,n). As for Ma(n,10n), it's between 10 Ma(n,n) and 11 Ma(n,n). For moderate lengths in the Karatsuba domain (5-15 kbits), the performance is even a few percent worse than the naive algorithm, as the following table shows.
| n (bits) | 100 | 200 | 500 | 1 000 | 2 000 | 5 000 | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M(n,2n)/M(n,n) | 1.0 | 1.3 | 1.5 | 1.8 | 2.4 | 2.0 | 2.0 | 2.5 | 2.0 | 2.0 | 2.5 | N/A |
| Ma(n,2n)/Ma(n,n) | 1.0 | 1.3 | 1.5 | 1.8 | 2.4 | 2.4 | 2.7 | 2.6 | 2.3 | 2.3 | 2.5 | N/A |
| M(n,10n)/M(n,n) | 1.4 | 2.7 | 5.8 | 8.8 | 9.4 | 9.9 | 9.9 | 10.0 | 10.1 | N/A | N/A | N/A |
| Ma(n,10n)/Ma(n,n) | 1.4 | 2.8 | 5.6 | 8.6 | 9.4 | 11.4 | 10.0 | 10.5 | 10.4 | N/A | N/A | N/A |
The strategy of splitting into 1-1.5 parts (or 2-3) is adverse for Ma(n,2n) in the Karatsuba domain, and not of great interest for Ma(n,10n): at best it's neutral (negligible overhead, which is already something). It's probably better for Ma(2n,k*3n) with an integer k, but we did not bench that case. This should be revisited, or we should adopt a different dispatching strategy based on the lengths of both operands, instead of using the shorter operand length to decide the fallback to the naive algorithm.
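One such alternative, sketched below with the same caveats (illustrative code for nonnegative operands, not the trunk strategy), is to slice the longer operand into pieces the size of the shorter one, so that every partial product is a well balanced multiplication:
| mulBySlices |
mulBySlices := [:short :long | | k acc shift slice |
    k := short highBit max: 1.
    acc := 0.
    shift := 0.
    [(long bitShift: shift negated) > 0] whileTrue: [
        "take the next k bits of the longer operand"
        slice := (long bitShift: shift negated) bitAnd: (1 bitShift: k) - 1.
        "each partial product is a balanced k-by-k multiplication"
        acc := acc + ((short * slice) bitShift: shift).
        shift := shift + k].
    acc].
mulBySlices value: (1 bitShift: 10000) - 1 value: (1 bitShift: 100000) - 1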
Division
Division is interesting too. For naive division, we see that D(2n,n) tends toward M(n,2n) for large n, and D(10n,n) toward 2 M(n,10n): the cost is effectively driven by the cost of multiplication. For small lengths, the price of the setup in the primitive (shifting operands as in Knuth) makes the ratio closer to 3 M(n,2n).
| n (bits) | 100 | 200 | 500 | 1 000 | 2 000 | 5 000 | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D(2n,n)/M(n,2n) | 3.2 | 3.1 | 2.3 | 1.7 | 1.6 | 1.2 | 1.1 | 1.3 | 1.1 | 1.0 | 1.3 | N/A |
| Da(2n,n)/Ma(n,2n) | 3.1 | 3.0 | 2.2 | 1.7 | 1.6 | 1.5 | 1.1 | 1.6 | 1.2 | 1.2 | 1.6 | N/A |
| D(10n,n)/M(n,10n) | 4.6 | 4.5 | 3.2 | 2.6 | 2.3 | 2.0 | 1.9 | 2.0 | 1.9 | N/A | N/A | N/A |
| Da(10n,n)/Ma(n,10n) | 4.4 | 4.1 | 3.2 | 2.6 | 2.3 | 2.6 | 2.5 | 2.4 | 2.2 | N/A | N/A | N/A |
For the accelerated Burnikel-Ziegler division, we obtain about the same ratios, 1.2 Ma(n,2n) and 2.2 Ma(n,10n), which confirms that the division cost is dominated by the multiplication cost; I would have expected a larger constant factor though. Note that we also get some Ma(n,2.5n) due to the irregular 50 000/20 000 ratio (an artifact of the base-10 progression of sizes); this is what causes the hiccups in the log-log plot (I could have used sqrt(n1*n2) for the x axis, but those are details).
The threshold has been hardcoded at 256 bytes (2 kbits), and it does not seem optimal in the Karatsuba domain up to 15 kbits. Above that, the speed-up is about 3x at 100 kbits and 4x at 200 kbits.
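If you want to check the D(2n,n) vs M(n,2n) relation on your own machine, a quick and dirty way is to reuse bench from the beginning of this post (the size below is just an example, and atRandom only gives roughly the requested bit length):
| n a b |
n := 100000.
a := (1 bitShift: 2 * n) atRandom. "a roughly 2n-bit dividend"
b := (1 bitShift: n) atRandom. "a roughly n-bit divisor"
Transcript cr; show: 'D(2n,n): ', [a // b] bench.
Transcript cr; show: 'M(n,2n): ', [b * a] bench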
Squaring
What about squaring? In the naive case, S(n) = M(n,n), because we simply fall back to naive multiplication: squared is just ^self * self.
| n (bits) | 100 | 200 | 500 | 1 000 | 2 000 | 5 000 | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S(n)/M(n,n) | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Sa(n)/Ma(n,n) | 1.0 | 1.1 | 1.0 | 1.0 | 1.0 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 |
For accelerated arithmetic, squared is more interesting than *, with a 20 % speed-up at 10 kbits and above. That's why I slightly modified the exponentiation algorithm (raisedToInteger:) in trunk to use squared instead of * (see https://source.squeak.org/trunk/Kernel-nice.1220.diff).
The thresholds are hardcoded at 400 bytes (3.2 kbits) for asymmetrical Karatsuba and 800 bytes (6.4 kbits) for asymmetrical 4-way Toom-Cook. There is a single measurement point in the Karatsuba domain (15 % performance improvement at 5 kbits), and the speed-up climbs to 10x at 500 kbits, which is really worth the few additional lines of code.
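To see why this matters for exponentiation, here is a sketch of the square-and-multiply idea behind raisedToInteger: (an illustration, not the actual trunk method): almost all of the work goes into the repeated squarings, so a 20 % gain on squared translates directly to raisedToInteger:.
| base exp result |
base := (1 bitShift: 10000) atRandom.
exp := 1000.
result := 1.
[exp > 0] whileTrue: [
    (exp bitAnd: 1) = 1 ifTrue: [result := result * base].
    "the dominant operation: use squared rather than base * base"
    base := base squared.
    exp := exp bitShift: -1].
result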
Square Root
Last in our list, the square root: the naive algorithm is catastrophic. It uses a Newton-Raphson iteration, which has good properties (it is self-correcting, with quadratic convergence, that is, twice as many correct bits at each iteration). But the initial guess is so desperately bad (a single bit!) that we end up with a cost SR(n) = O(n^2 log2 n).
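For reference, the naive scheme is essentially the classical integer Newton loop below (a sketch, with a one-bit initial guess like the original): about log2(n) iterations, each paying a full-size division, hence the O(n^2 log2 n).
| n x y |
n := 10 raisedTo: 100.
x := 1 bitShift: n highBit + 1 // 2. "one-bit guess of the right magnitude"
[y := x + (n // x) bitShift: -1. "Newton step: doubles the correct bits"
 y < x] whileTrue: [x := y].
x "the truncated square root, here exactly (10 raisedTo: 50)"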
| n (bits) | 100 | 200 | 500 | 1 000 | 2 000 | 5 000 | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SR(n)/S(n) | 20.9 | 38.3 | 27.5 | 17.4 | 11.1 | 8.9 | 8.6 | 9.0 | 9.8 | 10.4 | 11.1 | 12.2 |
| SRa(n)/Sa(n) | 19.5 | 35.1 | 26.8 | 14.1 | 4.9 | 1.5 | 1.2 | 1.0 | 0.8 | 0.8 | 0.8 | 0.8 |
That's not the case for the accelerated algorithm. The Karatsuba-like version of Paul Zimmermann shines and even performs better than squared itself (-20 % at 50 kbits in the Toom-Cook domain!). Was that expected? Or does it tell us that some thresholds were not correctly tuned? That amazed me!
As explained in the conclusion of Zimmermann's paper, the algorithm somehow looks like a Newton-Raphson iteration, but the biggest difference I see with our naive original is that the initial guess already has half of the bits correct, thanks to the divide-and-conquer strategy, for a fraction of the cost!
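That divide-and-conquer flavor can be sketched like this (a simplification that keeps the key idea of Zimmermann's algorithm; the real code also normalizes the operand and returns the remainder): the recursive call delivers half of the result bits, so a single Newton step plus a tiny adjustment finish the job.
| sqrtFast |
sqrtFast := [:n | | k s |
    n highBit <= 8
        ifTrue: [n sqrtFloor] "small case, negligible cost"
        ifFalse: [k := n highBit // 4.
            "recurse on the high bits: half of the result bits for a fraction of the cost"
            s := (sqrtFast value: (n bitShift: 2 * k negated)) bitShift: k.
            "one Newton step doubles the number of correct bits"
            s := s + (n // s) bitShift: -1.
            "the guess is now at most a few units above the true root"
            [s * s > n] whileTrue: [s := s - 1].
            s]].
sqrtFast value: (10 raisedTo: 100)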
The gain is progressive, with a first regime up to 10 kbits presumably dominated by the not fully optimized division, as we saw above. But it reaches 100x at 100 kbits, both because the original is so naive and because we are fighting a plain Smalltalk implementation rather than a generated and statically optimized C primitive, which is a much easier fight.
Conclusions
Even if not everyone manipulates such numbers, it's not every day that we can offer speed-ups of 2x to 10x. Above 100 kbits, we observe gains of 3x or more for Integer division, 5x for multiplication, 10x at 500 kbits for squaring, and even 100x for the square root, which is already something great for not so many lines of pure Smalltalk code.
The pure Smalltalk accelerated version has to compete with generated C code for the naive primitives, so it's no surprise that the O(n^x) algorithms start paying at somewhat higher thresholds than in GMP; but we don't want to compete by writing C code (or Slang) and spoil the fun!
There is much more to say about these benchmarks. For example, the accelerated division has a direct impact on large Integer printing (in base 10). I already revisited the printing algorithm to use a divide-and-conquer form some years ago (https://source.squeak.org/trunk/Kernel-nice.200.diff and http://bugs.squeak.org/view.php?id=6887), and now that we have the Burnikel-Ziegler digitDiv21:, it's a perfect fit: the 3x division speed-up offers almost a 2x printString speed-up at 100 kbits!
"How much cost printing of a 100kbit Integer?"
| x |
x := (1 << 1e5 - 1) atRandom.
[x printString] bench.
'50.4 per second. 19.8 milliseconds per run.' with naive division
'84.5 per second. 11.8 milliseconds per run.' with fast recursive division
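The divide-and-conquer printing idea itself fits in a few lines (a sketch for base 10 and nonnegative Integers; the trunk version caches the powers of the base rather than recomputing them at each recursive step):
| printRec |
printRec := [:n :nDigits | | half pow |
    nDigits <= 16
        ifTrue: [n printPaddedWith: $0 to: nDigits]
        ifFalse: [half := nDigits // 2.
            pow := 10 raisedTo: half.
            "one big division splits the problem into two half-size prints"
            (printRec value: n // pow value: nDigits - half),
                (printRec value: n \\ pow value: half)]].
(printRec value: 1000 factorial value: 2568) = 1000 factorial printString "true"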
We saw that some additional tuning is necessary for the heuristic thresholds that dispatch to the different variants, so tuning should be the next task.