Friday, May 10, 2019

Benchmarking the arithmetic boost

General benchmarking facilities

After profiling, let's look into the benchmarking facilities, and apply them to measure how much our arithmetic has been accelerated.

In Squeak Smalltalk, the simplest way to measure is to send the message bench to a block of code. The principle of bench is timer-based: the block is repeatedly evaluated until a delay expires, then the execution rate count/delay (n times per second) is reported, along with the runtime of a single execution, delay/count. The default delay is 5 seconds, but it can be changed by sending benchFor: instead of bench.

    [100 factorial] benchFor: 2 seconds.
    '57,900 per second. 17.3 microseconds per run.'

I have also played with the high resolution clock as used by AndreasSystemProfiler to see whether it was feasible to bench a single execution instead of repeated ones, but, no surprise, that's less accurate than bench: the timing is overestimated and subject to high jitter. One must run several times to account for JIT warm-up, OS concurrency and other effects.

    [100 factorial] durationToRun. 
    0:00:00:00.000040395
    0:00:00:00.000024274


Specific arrangement for Large Arithmetic Bench

Finally, I opted for my own bench based on the median timing of 21 trials; the median is more robust to outliers than the average. Each timing is established by running about 100 operations, then dividing to get the time per operation. The bench can be found as the LargeArithmeticBench package at http://www.squeaksource.com/STEM.html.
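
To illustrate the idea, here is a minimal sketch of a median-of-21-trials measurement; this is not the actual LargeArithmeticBench code, and the operands and sizes are just made up for the example:

    | a b timings |
    a := (1 << 10000) atRandom.
    b := (1 << 10000) atRandom.
    "time 21 trials of 100 multiplications each, in nanoseconds per operation"
    timings := (1 to: 21) collect: [:trial |
        [100 timesRepeat: [a * b]] durationToRun asNanoSeconds / 100].
    timings asSortedCollection at: 11  "the median of the 21 trials"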

I have also used the high resolution clock (via my recent Preferences addition).

I have compared an updated 64-bit trunk image with accelerated arithmetic vs the released Squeak 5.2, which only has naive schoolbook algorithms in primitives. To get comparable results, I have used the same Spur64 VM, and I've imported the latest timing routines from Squeak trunk (Chronology-Core-nice.44) into the 5.2 image. If you want to repeat this operation on your own, simply download LargeArithmeticBench, open a Transcript, and evaluate

    LargeArithmeticBench new benchReport.

My configuration is not really new (2.7 GHz Core i5 MBP from 2015, 8 GB RAM).


Raw results

Hereafter, I will use these notations:
  • M(n,p) is the cost of multiplying an n-bit Integer by a p-bit Integer (with *)
  • D(n,p) is the cost of dividing an n-bit Integer by a p-bit Integer (with //)
  • S(n) is the cost of squaring an n-bit Integer (with squared)
  • SR(n) is the cost of taking the truncated square root of an n-bit Integer (with sqrtFloor)
  • Ma(n,p), Da(n,p), Sa(n) and SRa(n) are the same costs for the accelerated version
First, here are the non-accelerated 5.2 results, in nanoseconds per operation:

n (bits)  | 100   | 200   | 500    | 1 000  | 2 000  | 5 000   | 10 000    | 20 000    | 50 000     | 100 000     | 200 000     | 500 000
M(n,n)    | 228   | 238   | 495    | 1 219  | 4 409  | 25 980  | 103 066   | 409 161   | 2 553 019  | 10 207 342  | 40 838 771  | 255 788 190
M(n,2n)   | 220   | 310   | 720    | 2 220  | 10 680 | 51 720  | 205 220   | 1 021 830 | 5 106 960  | 20 721 210  | 101 966 090 | N/A
M(n,10n)  | 330   | 640   | 2 870  | 10 710 | 41 370 | 256 970 | 1 024 470 | 4 090 630 | 25 698 210 | N/A         | N/A         | N/A
D(2n,n)   | 700   | 970   | 1 660  | 3 710  | 16 600 | 59 670  | 223 720   | 1 309 900 | 5 460 440  | 21 446 150  | 127 811 680 | N/A
D(10n,n)  | 1 520 | 2 860 | 9 230  | 27 540 | 94 620 | 523 460 | 1 996 920 | 8 079 780 | 48 989 550 | N/A         | N/A         | N/A
S(n)      | 230   | 240   | 490    | 1 270  | 4 440  | 26 340  | 103 970   | 409 330   | 2 569 370  | 10 262 050  | 41 023 400  | 256 136 390
SR(n)     | 4 800 | 9 200 | 13 490 | 22 120 | 49 130 | 233 170 | 895 920   | 3 691 610 | 25 134 780 | 107 222 200 | 456 219 390 | 3 130 480 810

And here are the accelerated trunk results as of Kernel-nice.1226 (N/A again marks combinations that were not benchmarked):


n (bits)  | 100   | 200   | 500    | 1 000  | 2 000  | 5 000   | 10 000    | 20 000    | 50 000     | 100 000   | 200 000    | 500 000
Ma(n,n)   | 244   | 252   | 516    | 1 235  | 4 411  | 24 362  | 79 166    | 236 754   | 980 818    | 2 801 742 | 7 989 935  | 31 317 255
Ma(n,2n)  | 233   | 328   | 754    | 2 214  | 10 553 | 57 340  | 213 406   | 616 834   | 2 238 966  | 6 346 407 | 19 824 556 | N/A
Ma(n,10n) | 346   | 703   | 2 890  | 10 671 | 41 246 | 276 811 | 789 047   | 2 480 517 | 10 175 333 | N/A       | N/A        | N/A
Da(2n,n)  | 717   | 988   | 1 686  | 3 764  | 16 533 | 84 926  | 238 771   | 981 013   | 2 605 632  | 7 522 468 | 31 991 633 | N/A
Da(10n,n) | 1 517 | 2 859 | 9 188  | 27 267 | 92 978 | 721 250 | 1 960 335 | 6 075 578 | 22 161 607 | N/A       | N/A        | N/A
Sa(n)     | 253   | 265   | 509    | 1 238  | 4 424  | 22 299  | 63 273    | 186 107   | 794 071    | 2 288 904 | 6 432 702  | 26 047 987
SRa(n)    | 4 931 | 9 310 | 13 644 | 17 401 | 21 559 | 33 416  | 72 882    | 188 061   | 656 582    | 1 865 753 | 5 319 813  | 19 751 009
  
And here is a comparison in relative time-to-run percentage, (naive - accelerated)/naive; values in parentheses are negative, that is a loss of performance. Here 90% means that we eliminated 90% of the run time, in other words we gained a factor of 10x; 80% is 5x, 75% is 4x, 67% is 3x, 50% is 2x, 33% is 1.5x, 25% is 1.33x. For example, M(n,n) at 500kbits gives (255 788 190 - 31 317 255) / 255 788 190 ≈ 88%:

n (bits) | 100    | 200    | 500   | 1 000 | 2 000 | 5 000  | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000
M(n,n)   | (7 %)  | (6 %)  | (4 %) | (1 %) | (0 %) | 6 %    | 23 %   | 42 %   | 62 %   | 73 %    | 80 %    | 88 %
M(n,2n)  | (6 %)  | (6 %)  | (5 %) | 0 %   | 1 %   | (11 %) | (4 %)  | 40 %   | 56 %   | 69 %    | 81 %    | N/A
M(n,10n) | (5 %)  | (10 %) | (1 %) | 0 %   | 0 %   | (8 %)  | 23 %   | 39 %   | 60 %   | N/A     | N/A     | N/A
D(2n,n)  | (2 %)  | (2 %)  | (2 %) | (1 %) | 0 %   | (42 %) | (7 %)  | 25 %   | 52 %   | 65 %    | 75 %    | N/A
D(10n,n) | 0 %    | 0 %    | 0 %   | 1 %   | 2 %   | (38 %) | 2 %    | 25 %   | 55 %   | N/A     | N/A     | N/A
S(n)     | (10 %) | (10 %) | (4 %) | 3 %   | 0 %   | 15 %   | 39 %   | 55 %   | 69 %   | 78 %    | 84 %    | 90 %
SR(n)    | (3 %)  | (1 %)  | (1 %) | 21 %  | 56 %  | 86 %   | 92 %   | 95 %   | 97 %   | 98 %    | 99 %    | 99 %

Discussion

Multiplication M(n,p)

In the first table, we verify that the cost of the naive schoolbook multiplication algorithm used by primDigitMultiply:neg: is quadratic, M(n,n)=O(n^2). No surprise: multiplying two 500kbit Integers costs 100x as much as multiplying two 50kbit Integers (255 788 190 / 2 553 019 ≈ 100), which itself costs 100x as much as multiplying two 5kbit Integers.

This is not true for small LargeIntegers, because we pay a toll for invoking <primitive 29>, which is restricted to 64-bit results (not a very big win on a 64-bit platform where a positive SmallInteger already has up to 60 bits), then another penalty for invoking the schoolbook primitive. So for small bit lengths, we mostly measure the overhead cost of this machinery (plus that of the benchmark machinery itself; iterating over the collection of tests has a marginal cost too).
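
For reference, these limits can be checked directly in a workspace; on 64-bit Spur I expect:

    SmallInteger maxVal.       "1152921504606846975, that is (2 raisedTo: 60) - 1"
    (1 bitShift: 59) class.    "SmallInteger"
    (1 bitShift: 60) class.    "LargePositiveInteger"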


For the accelerated version, we call the schoolbook primitive under 600 bytes (4800 bits), and pay a penalty of at most 7% for the additional multiplyByInteger: indirection. This penalty (constant time) tends to vanish versus the O(n^2) cost of the primitive as n increases.

Then Karatsuba starts paying off immediately at 5kbits for well balanced lengths, and 3-way Toom-Cook above 2000 bytes (16kbits). Asymptotic costs are about O(n^1.58) and O(n^1.46) respectively. But as we saw in the previous post about profiling, the overhead of +/-/shift, even though it is O(n), accounts for a non negligible part, and n is not large enough to reach this limit. Plotted in log-log, we see a slope of 2 for naive M(n,n) and 1.52 for the Toom-3 part of Ma(n,n).
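
For readers unfamiliar with the recurrence, here is a minimal Karatsuba sketch for non-negative Integers; this is not the trunk implementation (which works on byte digits and handles signs), the 4800-bit cutoff below just mirrors the threshold quoted above:

    | karatsuba |
    karatsuba := nil.
    karatsuba := [:x :y | | bits half mask x1 x0 y1 y0 z2 z0 z1 |
        bits := x highBit max: y highBit.
        bits <= 4800
            ifTrue: [x * y]  "below the threshold, the schoolbook primitive wins"
            ifFalse: [
                half := bits // 2.
                mask := (1 bitShift: half) - 1.
                x1 := x bitShift: half negated.  x0 := x bitAnd: mask.
                y1 := y bitShift: half negated.  y0 := y bitAnd: mask.
                z2 := karatsuba value: x1 value: y1.
                z0 := karatsuba value: x0 value: y0.
                "three half-size products instead of four"
                z1 := (karatsuba value: x1 + x0 value: y1 + y0) - z2 - z0.
                (z2 bitShift: 2 * half) + (z1 bitShift: half) + z0]].
    karatsuba value: (1 << 20000) atRandom value: (1 << 20000) atRandom.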
 
The cost of unbalanced multiplications is also interesting. For the naive algorithm, we expect M(n,p)=O(n*p), and that's what we get: M(n,2n)≈2 M(n,n) and M(n,10n)≈10 M(n,n).
This part has not been specially optimized in the accelerated flavor, and the cost is Ma(n,2n) > 2.2 Ma(n,n); as for Ma(n,10n), it's between 10 Ma(n,n) and 11 Ma(n,n). For moderate lengths in the Karatsuba domain (5-15 kbits), the performance is a few % worse than with the naive algorithm.



n (bits)          | 100 | 200 | 500 | 1 000 | 2 000 | 5 000 | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000
M(n,2n)/M(n,n)    | 1.0 | 1.3 | 1.5 | 1.8   | 2.4   | 2.0   | 2.0    | 2.5    | 2.0    | 2.0     | 2.5     | N/A
Ma(n,2n)/Ma(n,n)  | 1.0 | 1.3 | 1.5 | 1.8   | 2.4   | 2.4   | 2.7    | 2.6    | 2.3    | 2.3     | 2.5     | N/A
M(n,10n)/M(n,n)   | 1.4 | 2.7 | 5.8 | 8.8   | 9.4   | 9.9   | 9.9    | 10.0   | 10.1   | N/A     | N/A     | N/A
Ma(n,10n)/Ma(n,n) | 1.4 | 2.8 | 5.6 | 8.6   | 9.4   | 11.4  | 10.0   | 10.5   | 10.4   | N/A     | N/A     | N/A

The strategy of splitting into 1-1.5 parts (or 2-3) is adverse for Ma(n,2n) in the Karatsuba domain, and not of great interest for Ma(n,10n): at best it's neutral (negligible overhead, which is already something). It's probably better for Ma(2n,k*3n) with an integer k, but we did not bench that case. This should be revisited, or we should adopt a different dispatching strategy based on the lengths of both operands, instead of using only the shorter operand's length for falling back to the naive algorithm.


Division

The division is interesting. For naive division, we see that D(2n,n) tends toward M(n,2n) for large n, and D(10n,n) toward 2 M(n,10n): the cost is effectively driven by the cost of multiplication. For small lengths, the price of the setup in the primitive (shifting operands as in Knuth) makes the ratio closer to 3 M(n,2n).

n (bits)            | 100 | 200 | 500 | 1 000 | 2 000 | 5 000 | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000
D(2n,n)/M(n,2n)     | 3.2 | 3.1 | 2.3 | 1.7   | 1.6   | 1.2   | 1.1    | 1.3    | 1.1    | 1.0     | 1.3     | N/A
Da(2n,n)/Ma(n,2n)   | 3.1 | 3.0 | 2.2 | 1.7   | 1.6   | 1.5   | 1.1    | 1.6    | 1.2    | 1.2     | 1.6     | N/A
D(10n,n)/M(n,10n)   | 4.6 | 4.5 | 3.2 | 2.6   | 2.3   | 2.0   | 1.9    | 2.0    | 1.9    | N/A     | N/A     | N/A
Da(10n,n)/Ma(n,10n) | 4.4 | 4.1 | 3.2 | 2.6   | 2.3   | 2.6   | 2.5    | 2.4    | 2.2    | N/A     | N/A     | N/A

For the accelerated Burnikel-Ziegler division, we obtain about the same ratios, 1.2 Ma(n,2n) and 2.2 Ma(n,10n), and this confirms that the division cost is dominated by the multiplication cost, though I would have expected a larger constant factor. Note that we also have some Ma(n,2.5n) due to the non regular 50 000/20 000 ratio (base 10 obliges); this is causing the hiccups in the log-log plot (I could have changed the x axis to sqrt(n1*n2), but that's a detail).



The threshold has been hardcoded at 256 bytes (above 2kbits), and it seems not optimal in the Karatsuba domain up to 15kbits. Above that, the speed-up is about 3x at 100kbits, 4x at 200kbits.

Squaring

What about squaring? In the naive case S(n)=M(n,n), because we simply fall back to naive multiplication: squared is just ^self * self.

n (bits)     | 100 | 200 | 500 | 1 000 | 2 000 | 5 000 | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000
S(n)/M(n,n)  | 1.0 | 1.0 | 1.0 | 1.0   | 1.0   | 1.0   | 1.0    | 1.0    | 1.0    | 1.0     | 1.0     | 1.0
Sa(n)/Ma(n,n)| 1.0 | 1.1 | 1.0 | 1.0   | 1.0   | 0.9   | 0.8    | 0.8    | 0.8    | 0.8     | 0.8     | 0.8

For accelerated arithmetic, squared is more interesting than *, with a 20% speed-up at 10kbits and above. That's why I slightly modified the exponentiation algorithm (raisedToInteger:) in trunk to use squared instead of * (see https://source.squeak.org/trunk/Kernel-nice.1220.diff).
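
The point is simply that in binary (square-and-multiply) exponentiation, most of the work is squaring. A minimal sketch of that classic scheme, not the actual trunk raisedToInteger: code:

    | power |
    power := [:base :exponent | | result b e |
        result := 1. b := base. e := exponent.
        [e > 0] whileTrue: [
            (e bitAnd: 1) = 1 ifTrue: [result := result * b].
            e := e bitShift: -1.
            e > 0 ifTrue: [b := b squared]].  "most of the cost is spent here"
        result].
    (power value: 3 value: 1000) = (3 raisedTo: 1000).  "true"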



The thresholds are hardcoded at 400 bytes (3.2kbits) for asymmetrical Karatsuba and 800 bytes (6.4kbits) for asymmetrical 4-way Toom-Cook. There is a single point of measurement for Karatsuba (15% performance improvement at 5kbits), and the speed-up climbs to 10x at 500kbits, which is really worth the few additional lines of code.
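
For intuition, here is a minimal balanced Karatsuba-like squaring sketch (the trunk version uses asymmetrical splits and byte digits; the 3200-bit cutoff just mirrors the figure above). It replaces four half-size products with two half-size squarings and one multiplication:

    | square |
    square := nil.
    square := [:x | | bits half x1 x0 |
        bits := x highBit.
        bits <= 3200
            ifTrue: [x * x]  "schoolbook below the threshold"
            ifFalse: [
                half := bits // 2.
                x1 := x bitShift: half negated.
                x0 := x bitAnd: (1 bitShift: half) - 1.
                "x squared = x1^2 * 2^(2*half) + 2*x1*x0 * 2^half + x0^2"
                ((square value: x1) bitShift: 2 * half)
                    + ((x1 * x0) bitShift: half + 1)
                    + (square value: x0)]].
    square value: (1 << 20000) atRandom.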

Square Root

Last in our list, the square root: the naive algorithm is catastrophic. It uses a Newton-Raphson iteration, which has good properties (it is self-correcting, with quadratic convergence, that is twice as many correct bits at each iteration). But the initial guess is so desperately bad (a single bit!) that we end up with a cost SR(n)=O(n^2 log2 n).
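
A minimal sketch of such a Newton iteration for the truncated square root (my reading of the general shape, not the exact Squeak 5.2 code). The initial guess is a single set bit, so each of the roughly log2(n) iterations costs a full-length division, hence the O(n^2 log2 n) total:

    | isqrt |
    isqrt := [:n | | x y |
        "single-bit initial guess, just above sqrt(n)"
        x := 1 bitShift: (n highBit + 1) // 2.
        [y := x + (n // x) // 2. y < x] whileTrue: [x := y].
        x].
    isqrt value: 1000000.  "1000"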

n (bits)      | 100  | 200  | 500  | 1 000 | 2 000 | 5 000 | 10 000 | 20 000 | 50 000 | 100 000 | 200 000 | 500 000
SR(n)/S(n)    | 20.9 | 38.3 | 27.5 | 17.4  | 11.1  | 8.9   | 8.6    | 9.0    | 9.8    | 10.4    | 11.1    | 12.2
SRa(n)/Sa(n)  | 19.5 | 35.1 | 26.8 | 14.1  | 4.9   | 1.5   | 1.2    | 1.0    | 0.8    | 0.8     | 0.8     | 0.8

That's not the case for the accelerated algorithm. The Karatsuba-like version of Paul Zimmermann shines, and even performs better than squared itself (-20% at 50kbits, in the Toom-Cook domain!). Was that expected? Or does it tell us that some thresholds were not correctly tuned? That amazed me!




As explained in the conclusion of Zimmermann's paper, the algorithm somehow looks like a Newton-Raphson iteration, but the biggest difference I see with our naive original is that the initial guess already has half of the bits correct, thanks to the divide and conquer strategy, and for a fraction of the cost!
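
To make that concrete, here is a minimal sketch of the divide and conquer square root with remainder as I understand it (my own simplified reading for non-negative Integers, not the trunk code; the 65536 base-case threshold is arbitrary):

    | sqrtRem |
    sqrtRem := nil.
    "answer {s. r} such that n = s*s + r and 0 <= r <= 2*s"
    sqrtRem := [:n | | b a1 a0 pair s1 r1 num q u s r |
        n < 65536
            ifTrue: [ | s0 | s0 := n sqrtFloor. Array with: s0 with: n - (s0 * s0)]
            ifFalse: [
                b := n highBit // 4.  "quarter size, in bits"
                a1 := (n bitShift: b negated) bitAnd: (1 bitShift: b) - 1.
                a0 := n bitAnd: (1 bitShift: b) - 1.
                "recurse on the top half: its root already gives half of the bits"
                pair := sqrtRem value: (n bitShift: -2 * b).
                s1 := pair first. r1 := pair last.
                num := (r1 bitShift: b) + a1.
                q := num // (2 * s1).
                u := num \\ (2 * s1).
                s := (s1 bitShift: b) + q.
                r := (u bitShift: b) + a0 - (q * q).
                "at most a small correction is needed"
                [r < 0] whileTrue: [r := r + (2 * s) - 1. s := s - 1].
                Array with: s with: r]].
    (sqrtRem value: 1000000000000) first.  "1000000"

The top-half recursion supplies the initial guess with half of the bits correct, and a single division plus a cheap correction supplies the remaining half.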

The gain is progressive, with a first regime presumably dominated by the not fully optimized division up to 10kbits, as we saw above. But it reaches 100x at 100kbits, both because the original is far too naive, and because we are fighting a plain Smalltalk implementation rather than a generated and statically optimized C primitive, which is much easier.


Conclusions


Even if not everyone manipulates such numbers, it's not every day that we can offer speed-ups of 2x to 10x. Above 100kbits, we observe gains of 3x or more for Integer division, 5x for multiplication, 10x at 500kbits for squaring, and even 100x for the square root, which is already something great for not so many lines of pure Smalltalk code.

The pure Smalltalk accelerated version has to compete with generated C code for the naive primitives, so it's no wonder that the O(n^x) algorithms start paying off at somewhat higher thresholds than in GMP; but we don't want to compete by writing C code (or Slang) and spoil the fun!

There is much more to say about these benchmarks; for example, the accelerated division has a direct impact on large integer printing (in base 10). I already revisited the printing algorithm to use a divide and conquer form some years ago (https://source.squeak.org/trunk/Kernel-nice.200.diff and http://bugs.squeak.org/view.php?id=6887), and now that we have the Burnikel-Ziegler digitDiv21:, it's a perfect fit: the 3x division speed-up offers almost a 2x printString speed-up at 100kbits!

    "How much cost printing of a 100kbit Integer?"
    | x |
    x := (1 << 1e5 - 1) atRandom.
    [x printString] bench.

    '50.4 per second. 19.8 milliseconds per run.' with naive division
    '84.5 per second. 11.8 milliseconds per run.' with fast recursive division
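
The divide and conquer idea is simple: split the number into high and low decimal halves with one big division, and recurse. A minimal sketch for non-negative Integers (not the trunk printString code; a real version would cache the powers of 10 rather than recompute them):

    | printDec |
    printDec := nil.
    printDec := [:n | | k q r low |
        n < 1000000
            ifTrue: [n printString]
            ifFalse: [
                "split at about half the decimal digits; 1233/4096 approximates log10(2)"
                k := n highBit * 1233 // 4096 // 2 max: 1.
                q := n // (10 raisedTo: k).
                r := n \\ (10 raisedTo: k).
                low := printDec value: r.
                (printDec value: q), (String new: k - low size withAll: $0), low]].
    printDec value: (1 << 1000) atRandom.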

We saw that some additional tuning is necessary for the heuristic thresholds that dispatch to the different variants. So tuning should be the next task.
