Saturday, May 4, 2019

Accelerating Large Integer arithmetic in Squeak Smalltalk

I recently introduced some changes in Kernel package of Squeak development trunk http://source.squeak.org/trunk.html so as to accelerate Large Integer arithmetic involving huge numbers.

The operations boosted by this set of changes are:
  • multiplications *
  • divisions / // \\ quo: and rem:
  • squaring squared
  • square root sqrt sqrtFloor
All these changes can easily be made optionals. There is only one mandatory change in the Kernel package, the introduction of an indirection multiplyByInteger: and divideByInteger: in Integer so that subclass can define their own hook. Default behaviour of the hook is to call the schoolbook primitives primDigitMultiplyNegative via digitMultiply:neg: and primDigitDivNegative via digitDiv:neg:. I believe that Pharo and Cuis Smalltalk should adopt these basic changes, as the penalty is really neglectable, and make the accelerated arithmetic a package on its own. I've also inquired about Visualworks, but the changes required to the base classes are a bit more invasive.

For accelerating those operations, I have used these classical divide and conquer algorithms:
  • Karatsuba multiplication in digitMul22:
  • Toom-Cook multiplication in digitMul23: and digitMul33:
  • the Fast Recursive Division - MPI-I-98-1-22 from Christoph Burnikel & Joachim Ziegler in digitDiv21: digitDiv32: and driver digitDivSplit:
  •  and beautiful Karatsuba Square Root from Paul-Zimmerman - Rapport de recherche n° 3805
    lk*jFiGhFg6=f42/eHeW>4@/eH+4YW􏰓4dc`IRWGCba2aFJ`
Here, the 22, 23, 33, 21, and 32 in selector names tells in how many parts the operands (receiver and argument) are divided, 22 means two parts each (Karatsuba).

The multiplyByInteger: method is refined in LargePositiveInteger and serves as a dispatcher to the available variants, based on heuristics on the operands digitLengh. Remind that those digits are 1-byte long (8 bits) in traditional Smalltalk, though the LargeIntegersPlugin of opensmalltalk-vm uses 32-bits digits under the hood - that is why the split is made at four bytes boundary (notice the bitClear: 2r11 used in most digit-splitting methods).

When the operands sizes are unbalanced (the larger is twice longer than the smaller or more), a digitMulSplit: is used with a strategy of splitting the larger in chunks of about 1.5 times the smaller. I have tried other combinations, 1-1 and 1-2, and measured that 1-1.5 was faster, thus the addition of a digitMul23: variant of Toom-Cook. It's interesting to notice how I reconstruct the result from collected parts in O(N) by in-place modification. Interlacing is used so as to be sure to get non-overlapping digits (see inplaceAddNonOverlapping:digitShiftBy:). My first iteration did use a O(N log N) of the form (a3 * x + a2)*x^2 + (a1 * x + a0), where multiplication by x correspond to bitShift: of the appropriate digit length, but each operation means allocating memory, moving bytes, and ends up in non neglectable cost noticeable in profiling...

The 3-way Toom-Cook implemented in digitMul33: is a variant from Marco Bodrato & Alberto Zanoni, and it is interesting to read What About Toom-Cook Matrices Optimality? on this subject.

Squaring also use Karatsuba and Toom-Cook, and these are asymetrical versions which outperform the symetrical ones. Specifically, squaredByFourth use the 4-way Toom-Cook variant of  Jaewook Chung and M. Anwar Hasan - Asymmetric Squaring Formulae   https://www.lirmm.fr/arith18/papers/Chung-Squaring.pdf.

The new square root algorithms introduces a new message sqrtRem which answers an Array with truncated square root and remainder for almost the same price
self sqrtRem first squared + self sqrtRem last = self
 
This is just a starting activity, there is more to do: I have played with exponentation (raisedToInteger:), cubing (cubed) and cube root (cbrtRem as a generalization of sqrtRem) and maybe I'll blog about them, or publish if worth. There is also a need to auto-adapt the thresholds hardcoded here and there, but this is quite involved to get it right, benchmarking is an art.

Why bother with those classical algorithms without much innovating, when the state of the art GNU Multi Precision library (GMP) is far more advanced, optimized, fast and ready to use? Because I have an interest in supporting ArbitraryPrecisionFloat, and more than all, because it's very easy and fun to experiment in Squeak. It's fascinating how straight forward it is to transcribe those algorithms in Smalltalk and I believe that researchers should prototype new algorithms in such a language before paying the ultimate tribute to optimization. If you find that extending GMP is equally fun, which means programming in C, dealing with memory optimization, edit-compile-run-segfault cycles, then OK, you're even more a weirdo than I am!

 

No comments:

Post a Comment