## Friday, March 5, 2021

### Autopsy of an accuracy problem of log2 in Squeak

Some time ago, I was annoyed by self log: 2, which was performing (self ln / 2 ln).
The problem is that it is inexact for some trivial cases. Whatever libm the VM uses, we will always find an integer n such that (1.0 timesTwoPower: n) ln / 2 ln ~= n.

Rather than hacking the VM and adding a primitive, I implemented the log2 function in Squeak like this, which is much simpler:

^self significand ln / Ln2 + self exponent

In Squeak, every Float (but nan as usual) verifies:

self = (self significand timesTwoPower: self exponent).

Also, for every finite Float (but 0.0 and -0.0) - even the denormals (which means that the exponent is more involved than just extracting the exponent bits):

1 <= self significand and: [self significand < 2].
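For readers without a Squeak image at hand, the same decomposition can be sketched in Python with math.frexp (this is my own illustrative analogue, not Squeak code; frexp normalizes into [0.5, 1), so it must be rescaled to get the Squeak-style significand in [1, 2)):

```python
import math

def significand_and_exponent(x):
    # math.frexp answers (m, e) with 0.5 <= abs(m) < 1 and x = m * 2**e;
    # rescale so that the significand lands in [1, 2) as in Squeak
    m, e = math.frexp(x)
    return m * 2.0, e - 1

def log2_naive(x):
    # the post's one-liner: self significand ln / Ln2 + self exponent
    s, e = significand_and_exponent(x)
    return math.log(s) / math.log(2.0) + e
```

Since the significand of a power of two is exactly 1.0 and log(1.0) = 0.0, log2_naive is exact on every power of two, including denormals (frexp normalizes them too).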

Beware when porting, this may differ in other dialects, for example in Visualworks:

self class emin - 1 <= self exponent and: [self exponent <= (self class emax + 1)].

Where emin - 1 is for denormals, and emax + 1 is for special values (infinities and NaNs).

Back to our formulation, the nice thing is that self significand = 1.0 if and only if the Float is a power of two. It's so simple that it's the new definition of Float>>isPowerOfTwo in Squeak from now on.

And another nice thing is that 1.0 ln = 0.0 whatever the libm implementation. The result is that we have an exact log2 for every exact power of two. That's goodness, goal achieved!

From the accuracy side, we have an inexact ln function, an inexact representation of Ln2 (2 ln), an inexact division, and a last summation which can be inexact too. Even with the best libm (a correctly rounded one), that's up to 2 ulp of error. That's not too bad for such a simple one-liner, and just half an ulp more than the previous implementation!

Err! except in case of catastrophic cancellation!
That's exactly what happens in the neighbourhood of 1.0.
Let's take one example:

x := 1.0 predecessor.
x significand = 2.0 predecessor.
x exponent = -1.
2.0 predecessor ln/2 ln = 1.0 predecessor.
1.0 predecessor - 1  = 0.5 ulp negated.

x significand ln / 2 ln + x exponent  = 0.5 ulp negated.
-> -1.1102230246251565e-16

Compared to old implementation:

1.0 predecessor ln / 2 ln.
-> -1.6017132519074588e-16

Which we can verify with doubled precision ArbitraryPrecisionFloat:

(1.0 predecessor asArbitraryPrecisionFloatNumBits: Float precision * 2) log2 asFloat
-> -1.6017132519074588e-16

That's about 50% of error for the new implementation, more than 10^14 ulp, abysmally unacceptable for a math library!
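The whole autopsy can be replayed outside Smalltalk; here is an illustrative Python transcript (requires Python 3.9+ for math.nextafter; exact libm results may vary by an ulp, but the cancellation is just as abysmal):

```python
import math

x = math.nextafter(1.0, 0.0)       # 1.0 predecessor
m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= m < 1
significand, exponent = 2.0 * m, e - 1

naive = math.log(significand) / math.log(2.0) + exponent
exact = math.log2(x)               # well behaved near 1 in any decent libm

# around -1.11e-16 instead of about -1.6017e-16: massive relative error
assert abs(naive - exact) / abs(exact) > 0.1
```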

Where is the problem?
The catastrophic cancellation happens when evaluating the logarithm of the significand (close to 1), then subtracting 1 (adding the exponent).
In other words, ln(1 - x) is easy to evaluate accurately for small x, but ln(2 - 2x) - ln(2) is not.
Bummer, that's precisely the operation that significand brings in.

We must scale down (divide by 2) if we're getting a significand near 2, but must not spoil the accuracy of 1.0 successor log2... That's easy:

| s |
s := self significand.
^s > 1.3333333333333333
ifTrue: [(0.5 * s) ln / Ln2 + (1 + self exponent)]
ifFalse: [s ln / Ln2 + self exponent]

With this formulation, s will remain in the interval [2/3, 4/3]. It remains close to 1 for a receiver close to 1. Our pathological case is now cured:

1.0 predecessor log2.
-> -1.6017132519074588e-16
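The cured version translates almost line for line to Python (again my own sketch, using frexp for the significand/exponent split; the constant 4/3 plays the same role as 1.3333333333333333 above):

```python
import math

LN2 = math.log(2.0)

def log2_robust(x):
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= m < 1
    s, exp = 2.0 * m, e - 1         # significand in [1, 2), exponent
    if s > 4.0 / 3.0:               # same constant as 1.3333333333333333
        # scale the significand down to stay clear of the cancellation at 2
        return math.log(0.5 * s) / LN2 + (exp + 1)
    return math.log(s) / LN2 + exp
```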

Floats can be tricky; they are not always one-liner friendly...
The alternative is to profit from the already existing wisdom and implementations of log2 in the external world; there must be a good one, somewhere...
That means adding more and more primitives or FFI calls. But dealing with library paths and all the other complexity brought by the external world is neither fun, nor a one-liner! I've been there before. Until someone finds another flaw,
I prefer my 5-liner: that's managed entropy.

## Wednesday, March 3, 2021

### A hidden feature of Interval

This short post to advertise a hidden feature of Interval which I documented in Cincom store SYSBUG-FloatIntervalTest some time ago and rediscovered recently.

This feature is shared by the vast majority of Smalltalk dialects: Dolphin, Pharo, Squeak, Visualworks, ...

It is certainly more an accidental feature than a deliberate one:

((10 to: 12) at: 1.5) = 10.5.

Funny, isn't it?
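For the curious, here is a hypothetical Python rendition of what most dialects effectively compute (assuming at: simply interpolates first + (index - 1) * step with no integrality check on the index):

```python
# hypothetical rendition of the accident: Interval>>at: computes
# first + (index - 1) * step without checking that the index is an integer
def interval_at(first, step, index):
    return first + (index - 1) * step
```

So ((10 to: 12) at: 1.5) answers 10 + 0.5 * 1, i.e. 10.5.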

## Friday, February 19, 2021

### Cross compiling for Windows target with cmake

This is not directly related to Smalltalk, but it is a common source of trouble: how to compile a third-party package for the Windows platform, especially when it comes with cmake files?

My latest example: recompile an upgraded version of lapack from netlib. Lapack sources are now conveniently located on github.

I do not want to use mingw directly (MinGW stands for Minimalist GNU for Windows), because I spent too much time installing the various tools on mingw in the past.

Instead, I either use the cygwin cross-compilation toolchain for the mingw target, or its linux equivalent thru wsl (Windows Subsystem for Linux). The main advantage of cygwin or linux distributions is the large number of packages ready to use and easy to install (FORTRAN compilers and other niceties upgraded from the past).

Usually, we find many half-working solutions on the web, like setting a bunch of environment variables. For example, I first tried this little shell script after cloning the git source in /home/nice/git/lapack.

#!/bin/bash
#
# my own compilation of LAPACK dll for mingw target via cygwin cross-compilation
#
[ -d build_win64 ] || mkdir build_win64
cd build_win64
CC=x86_64-w64-mingw32-gcc \
FC=x86_64-w64-mingw32-gfortran \
AR=x86_64-w64-mingw32-ar \
NM=x86_64-w64-mingw32-nm \
DLLTOOL=x86_64-w64-mingw32-dlltool \
DLLWRAP=x86_64-w64-mingw32-dllwrap \
OBJCOPY=x86_64-w64-mingw32-objcopy \
STRIP=x86_64-w64-mingw32-strip \
cmake .. -DBUILD_SHARED_LIBS=ON -DBUILD_INDEX64=ON
make

The worst thing is that it somehow worked...

Except that the generated dll had an ugly and incorrect cyg prefix. And except that when I tried to add the C interface with the additional options -DCBLAS=ON -DLAPACKE=ON, it miserably failed with this message:

CMake Error at /usr/share/cmake-3.17.3/Modules/FortranCInterface.cmake:383 (message):
The Fortran compiler:

/usr/bin/x86_64-w64-mingw32-gfortran.exe

and the C compiler:

/usr/bin/cc

failed to compile a simple test project using both languages.  The output
was:

Change Dir: /home/nice/git/lapack/build_win64/CMakeFiles/FortranCInterface/VerifyC

Ouch, this sticky /usr/bin/cc is annoying, that's not what I specified! I tried to make the environment variables stickier through export CC=... but that failed - possibly because I did not want to reconstruct the whole thing from scratch (it takes hours).

I knew that I should have instructed cmake about the toolchain instead. But how?

The wrong answer: spend some time in the terrific, horribilific, unhelpful cmake documentation. I'm not exaggerating. Tell me how much your skills increased, on a scale of 10, after reading this toolchain reference. For me, the answer is very close to zero! I've never seen such poor documentation in my already long life!

The good answer: it's on the web of course, somewhere... In order to increase the probability of hitting one, I decided to duplicate the already existing answers and share my findings.

Here is the little toolchain file /home/nice/toolchain_x86_64-w64-mingw32.cmake that I assembled after an hour or more of browsing:

# For cross compiling for a windows target
# using cygwin or linux gcc toolchain
#
# The target System
set(CMAKE_SYSTEM_NAME Windows)

# Factor out the prefix which is common to our set of tools
set(TOOLCHAIN_PREFIX x86_64-w64-mingw32)

# The various compilers to use
set(CMAKE_C_COMPILER ${TOOLCHAIN_PREFIX}-gcc)
set(CMAKE_Fortran_COMPILER ${TOOLCHAIN_PREFIX}-gfortran)
set(CMAKE_RC_COMPILER ${TOOLCHAIN_PREFIX}-windres)

# and where to find includes etc...
SET(CMAKE_FIND_ROOT_PATH /usr/${TOOLCHAIN_PREFIX})

Then I just have to change my script like this:

#!/bin/bash
#
# my own compilation of LAPACK dll for mingw target via cygwin cross-compilation
#
[ -d build_win64 ] || mkdir build_win64
cd build_win64
cmake .. -DBUILD_SHARED_LIBS=ON -DBUILD_INDEX64=ON -DCBLAS=ON -DLAPACKE=ON \
-DCMAKE_TOOLCHAIN_FILE=../toolchain_x86_64-w64-mingw32.cmake
make

Et voilà... This snapshot is somehow a proof of concept:

Magic is exactly the main grief I have toward cmake. As I always say, cmake is great, as long as someone else writes (or guesses) the files for you...

There are other griefs, like recompiling more than is strictly necessary, but that's the IT trend: waste cycles of our sophisticated machines repeating the same compilation again and again... Bad for the planet!

If this post can help someone, that will already be something. At the very least, I'm pretty sure that it will help myself in the near future, because with age, I have developed a new capability: forgetting the unimportant things (we can delegate that to google, no matter the cost).

## Tuesday, October 1, 2019

### Simple piSin piCos mathematical function in Squeak

Working with radians is great, because the derivative of the sine function is the cosine function, and the derivative of cosine is the sine negated. I always ask engineers to work with radians, which avoids many mistakes!

But when it comes to handling the periodicity modulo 2π electronically, radians are not so great because π is irrational and hardly representable as a Float value!

With Attitude and Heading Reference Systems delivering vehicle heading in degrees or other fractions of π, conversion to radians somehow introduces unnecessary numerical noise.

To work around this, many libm provide a sinpi(x) which computes sin(π x), and there is a similar sin_pi in boost library.

I was seeking the simplest implementation that could possibly work, so I decided to give it a try in Squeak Smalltalk. These mathematical functions are called like this:

1 piCos = -1.
(1/2) piSin = 1.

And the tests

It's a very simple implementation: reduce the receiver into the interval [-1/4, 1/4], that is, reduce the angle argument into [-π/4, π/4], then fall back to a Float approximation of sin(π x) or cos(π x).

The nice thing is that we can afford to answer an exact value in every case where the sine or cosine takes a rational value. A rational value could otherwise only occur for irrational receivers that we cannot represent anyway (so we don't care), and there remains just a handful of rational fractions of π once we have performed argument (receiver in our case) reduction. That's Niven's theorem - see http://mathworld.wolfram.com/NivensTheorem.html - and I would not have imagined that it could be so cheap before implementing it!
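For the curious, here is a rough Python sketch of the idea (my own simplification, not the Squeak code: it folds into [-1/2, 1/2] instead of the [-1/4, 1/4] sin/cos split, and only handles the sine case; the exact values listed are the ones Niven's theorem allows):

```python
from fractions import Fraction
import math

def pi_sin(x):
    """sin(pi * x) for rational x, exact whenever the result is rational."""
    x = Fraction(x) % 2                      # sin(pi x) has period 2
    if x > Fraction(3, 2):
        x -= 2                               # fold into (-1/2, 3/2]
    elif x > Fraction(1, 2):
        x = 1 - x                            # sin(pi (1 - x)) = sin(pi x)
    # by Niven's theorem, these are the only rational x in [-1/2, 1/2]
    # with a rational sin(pi x)
    exact = {Fraction(0): Fraction(0),
             Fraction(1, 2): Fraction(1), Fraction(-1, 2): Fraction(-1),
             Fraction(1, 6): Fraction(1, 2), Fraction(-1, 6): Fraction(-1, 2)}
    if x in exact:
        return exact[x]
    return math.sin(math.pi * float(x))      # inexact fallback
```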

The only thing that bothers me is that this exactness cost me the weak monotonicity property locally. For example, we may expect that:

(1/6 + (1<<100) reciprocal) piSin >= (1/6) piSin.

But it's not the case... It happens that:

(1/6 + (1<<100) reciprocal) asFloat < (1/6) and:
[ (Float pi) < (1 asArbitraryPrecisionFloatNumBits: 100) pi ].

With such accumulated rounding down, the floating point approximation falls short:

((1/6+(1<<100) reciprocal) * Float pi) sin = 0.5 predecessor.

This is where a bit more precision before the float conversion would help:

((1 asArbitraryPrecisionFloatNumBits: 100) pi / 6) asFloat sin = 0.5.

Weak monotonicity is preserved for Float receivers, which is already something, but I see no easy solution for the case of exact arithmetic. If I try a longer approximation of π, then weak monotonicity may break for mixed arithmetic; I'll have to come back to it... Fortunately, π/6 is the only case to handle (all reductions should be exact).

In Squeak there are already degreeCos and degreeSin messages that I tweaked some years ago, but those could further benefit from these new piSin and piCos functions (err, messages!). This little experiment might end up in the Squeak trunk if there is some interest.

## Saturday, September 7, 2019

### How to compare exact value of SmallInteger and Float in Smalltalk

Long ago, mixed arithmetic was very well thought out in Scheme, with the notion of exact arithmetic vs inexact arithmetic (the resulting value of an operation might be an approximation of the mathematically exact value).

The R5RS specification requires that comparison operations be transitive in §6.2.5 Numerical operations. Doing so preserves important mathematical properties, like equivalence relations or total ordering. These properties are more than welcome when building higher level abstractions.

No matter that the inexact operand value may already carry its own error bounds: as an agnostic library, we have no license to waste ULPs; that would only worsen the task of application writers.

A consequence of the transitivity requirement is that mixed arithmetic comparisons between integer and float had better answer the result of comparing the exact values of the operands, rather than convert the int to float and then compare the potentially inexact (rounded) values. I spent enough time convincing the Smalltalk community that we should adopt such a superior model too. Typically, early versions of Squeak did not fulfill the transitivity expectations (but neither did most Lisp implementations, nor any Smalltalk dialect I knew):

| int1 int2 float |
int1 := 16r100000000000000 "1 bitShift: 56".
int2 := 16r100000000000001 "(1 bitShift: 56) + 1".

float := 16r100000000000000.0 "1.0 timesTwoPower: 56".
int1 = float & (float = int2) ==> (int1 = int2).
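Incidentally, Python is one of the languages that adopted the exact comparison model, so the broken expectation above can be demonstrated passing there directly:

```python
int1 = 1 << 56          # 16r100000000000000
int2 = (1 << 56) + 1    # 16r100000000000001
f = float(int1)         # 16r100000000000000.0

assert int1 == f                 # exact comparison
assert f != int2                 # no rounding of int2 before comparing
assert float(int2) == f          # even though the conversion itself rounds
```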

Nowadays, a majority of Smalltalk dialects will do the right thing, including the Squeak brand, the Cuis/Pharo derivatives, GNU Smalltalk and Dolphin Smalltalk. Visualworks does not abide, but I published image side patches for the 32-bit version. I did not look at VisualAge for ages; the temporary licenses from Instantiations are not liberal enough to play with. Gemstone is sadly missing from my check list, and has liberal enough licenses, but I have my own limitations and need to delegate... I'll take care of Smalltalk/X below.

Mathematics are one thing, but how do we technically achieve that feature at a reasonable price? Particularly in the statistically most common case of mixed SmallInteger op: Float.

With 32-bit Smalltalk, and IEEE 754 double precision floats providing a 53-bit significand, there is no problem: every SmallInteger to Float conversion will be exact, and we can thus use naive conversion followed by float comparison. This is mostly done inside the VM, in primitives or JIT.

At image side in Squeak, for LargeInteger and Fraction, the strategy is a bit more elaborate, because conversions can cost a lot! The idea is to test whether the operand can be converted exactly (see isAnExactFloat in the Squeak family); if so, perform the conversion to Float and compare floats (that's the fastest path). Else, exact arithmetic shall be used to perform the comparison (unless the Float is infinite or nan), and thus the Float is converted to its exact value (a.k.a. asTrueFraction in Squeak or asExactFraction in gst).

With the 64-bit OpenSmalltalk VM, which provides 61-bit signed SmallIntegers, we now have to take care not to exceed the float precision. We had to rewrite the naive primitives and forbid an inexact int64 to double conversion, like what is done at image side for LargeInteger. A naive and simpler version of the isAnExactFloat check was programmed first, to eventually make the primitive fail: we just shift right by Float precision and check for excessive high bits, rather than check the bit span highBit - lowBit + 1. False negatives do not matter; they can be handled at image side, where it is simpler to write more elaborate code. But those precautions were forgotten in the JIT; that's Issue 417.
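Both flavors of the check can be sketched in Python (my own illustration; Python's unbounded integers stand in for SmallInteger/LargeInteger):

```python
FLOAT_PRECISION = 53    # IEEE 754 double precision significand bits

def is_exact_float_primitive(n):
    # naive primitive-style check: give up as soon as any bit sits at or
    # above the precision; false negatives are fine, the image handles them
    return (n >> FLOAT_PRECISION) in (0, -1)

def is_exact_float(n):
    # image-side check: the span highBit - lowBit + 1 must fit in 53 bits
    if n == 0:
        return True
    m = abs(n)
    span = m.bit_length() - ((m & -m).bit_length() - 1)
    return span <= FLOAT_PRECISION
```

For example, 1 << 60 converts exactly (a single bit), but the naive check rejects it: a harmless false negative.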

In fact, all this started after observing that Smalltalk/X had the very same problem: the primitives were doing naive inexact conversions, like in 64-bit Visualworks, instead of failing. But in Smalltalk/X, these are not really primitives, but rather inlined C code that gets compiled dynamically and can be changed easily (see for example https://swing.fit.cvut.cz/projects/stx-jv/browser/stx.libbasic/Float.st#L1500). I thus started to think about how I would resolve it in standard C so as to preserve speed. This had to be simple enough to be a viable evening hack (a few lines without too many compiler-specific builtins).

So let's backtrack: when we convert 16r100000000000001, it might be rounded up or down, and will lose the last 4 bits (in OpenSmalltalk that's at most SmallInteger maxVal highBit - Float precision = 8). This specific integer loses only the last bit and is transformed into 16r100000000000000.0, but is such tiny inexactness really going to be a problem if we compare to, say, -1000.0, 2.5, 1.0e100 or Float infinity? No, of course, the last few rounded bits won't change the comparison result... Unless the Float operand is sufficiently near the SmallInteger. How near?

Let's inquire into this. In Squeak, with simpler notations:
• a is the SmallInteger
• f is the Float whose exact value is b (either an Integer or a Fraction).
f asTrueFraction = b.
b asFloat = f.
b asFloat = b.

The above propositions are obvious, since in Smalltalk:

aFloat isAnExactFloat.

This was not true of C implementations using the x87 80-bit extended precision float registers, but it is in C99 with appropriate flags, and easy when using SSE2 64-bit registers.

And since they share same value:

aFloat asTrueFraction isAnExactFloat.

More interestingly, the conversion asFloat has this property of monotonicity by virtue of the IEEE 754 standard (because we round the exact value to the nearest inexact value):
a < b ==> (a asFloat <= b asFloat).

Even if a and b are different, they might be converted to the same floating point value; the order can be blurred, but it can never be reversed! We can deduce from this property that:

a asFloat < b asFloat ==> (a < b).

Similarly:

a asFloat > b asFloat ==> (a > b).

The only case when we are in doubt is:

a asFloat = b asFloat.

From which one cannot deduce anything about ordering of a and b...

We've got the basis of a new algorithm: naively convert the SmallInteger a asFloat, then compare to the Float f value. If both float values are equal, there is an ambiguity to be resolved; else we perform the comparison in floating point and we're done - even if the asFloat conversion was inexact, we know that it was innocuous (the unordered case f = Float nan, as well as the case f = Float infinity, included).

If there is ambiguity, we have to fall back to the asTrueFraction conversion... But we have additional properties:

a asFloat asTrueFraction isInteger.

That's a property of the floating point representation: for any SmallInteger a, asFloat is either exact, and thus preserves integerness, or will remove some precision (zero the last bits). In no way can we obtain a fractionPart, because that would mean adding more precision.
Only LargeInteger will lose this property, when starting to overflow to Float infinity.

So:

a asFloat = f ==> f asTrueFraction isInteger.

Great! We thus do not need to deal with Fraction in case of ambiguity:

f asTrueFraction isInteger ==> (f asTrueFraction = f asInteger).

With this knowledge, we can sketch the whole algorithm in Smalltalk:

SmallInteger>>compareOpFloat: aFloat
^(self asFloat = aFloat)
ifTrue: [self compareOp: aFloat asInteger]
ifFalse: [self asFloat compareOp: aFloat]
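The sketch above can be rendered in executable form; here is a hypothetical Python version (Python integers are unbounded, so I assume a SmallInteger-sized a to rule out float overflow):

```python
import math

def compare_int_float(a, f):
    """Answer -1, 0 or 1 comparing exact values, or None if f is NaN.
    Assumes a is SmallInteger-sized, so float(a) cannot overflow."""
    if math.isnan(f):
        return None
    fa = float(a)              # possibly rounded, but order-preserving
    if fa < f:
        return -1
    if fa > f:
        return 1
    # ambiguity: float(a) = f, hence f is finite with an integer value
    b = int(f)                 # exact, per f asTrueFraction isInteger
    return (a > b) - (a < b)
```

The ambiguous branch is exactly where a naive implementation would wrongly answer "equal": compare_int_float((1 << 56) + 1, float(1 << 56)) answers 1, not 0.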

So exactness costs only 1 additional comparison versus inexactness, and 1 additional conversion in the case of unlucky ambiguity.
Note that aFloat asInteger could degenerate to a LargeInteger in Smalltalk due to upper rounding... For example, in 64-bit Squeak images:

SmallInteger maxVal asFloat asInteger isLarge.

Similarly in C, for an int64 a, there would be some edge cases of signed integer overflow when performing (int64) f, and such situations would deserve uint64 handling in order to avoid undefined behavior... No less complex in assembler for the JIT.
I may have a solution for it, but YAGNI: our OpenSmalltalk SmallIntegers are only 61 bits, so we can write it naively in primitives and avoid the image side case of LargeInteger! Rather than testing inexactness and failing in edge cases, the primitive can always answer the result of exact comparison at a very low price, both in terms of speed and simplicity. There's no reason not to do it.

Or there might be some good reasons... After all, I hate the fact that Visualworks has hardcoded naive inexact conversion in a primitive!
If we want to change the behavior, there is no other solution than checking the preconditions in slower Smalltalk code (either via double dispatching or type testing) BEFORE calling fast primitives, which ruins the benchmarks of number crunching applications... It's far more efficient to call the fast primitive first, and handle failure cases in fallback code.
Even if I'm convinced that exact comparison is a better thing than inexact, it's still questionable to hardwire it in the VM. Juan Vuletich, the author of Cuis, gave a resounding echo to the little voice inside me telling me not to cut off too much of the world of possibilities...

I have thus prototyped a very simple behavior with a single boolean VM parameter: either the primitives and jitter perform mixed arithmetic, and then they do comparisons exactly, or they don't perform any mixed arithmetic at all and always fall back to the image code, whatever the operation (arithmetic or comparison). This way, the unmixed arithmetic Float op: Float and SmallInteger op: SmallInteger are still very fast.
With the help of Eliot Miranda, the master of OpenSmalltalk VM, I hope that we will consolidate these features soon, and have our cake and eat it too. That's what every serious Smalltalk implementation should do. If some vendor decides to hardcode comparison of inexact (rounded) values, then please provide a way to deactivate it! If choosing exactness, everyone now knows how to do it without wasting too many CPU cycles nor too much engineering; all the material is licensed MIT, and unless I re-invented some well known prior work, I hope patent-free, thanks to this publication.

## Friday, September 6, 2019

### How to JIT Float comparison on X64 in opensmalltalk-vm

I wanted to share what I recently learned while hacking the OpenSmalltalk VM, which holds code generated from the Smalltalk source code repository VMMaker.

Comparing Floats is tricky in general, and particularly tricky in IA32/X64 assembler.

Cog, the machine code generator (JIT) of the OpenSmalltalk VM, uses the Intel SSE2 instruction UCOMISD for comparing the double precision float registers Xmm0 and Xmm1.

The IEEE 754 standard mandates that NaN compare false to everything (except for NOT EQUAL, which shall always be true), and we want to abide by this standard.

The problem is the specific set of status flags chosen by Intel to report the result of comparison, and particularly the case of comparison with NaN (which is unordered): there are 3 flags set
• Zero Flag (ZF)
• Parity Flag (PF)
• Carry Flag (CF)
The comparison of the two registers is like a subtraction operand2 - operand1, and thus we very classically get ZF set when equal, or CF set when operand2 < operand1. What is very special is that all 3 flags are set to 1 in the Unordered case (when at least one operand is NaN).

The possible test-and-branch operations after the comparison operation are
• JAE (jump above or equal), equivalent to JNC (jump if CF=0).
In Cog, we use that as JumpFPGreaterOrEqual RTL instruction
• JA (jump above), equivalent to (jump if both CF=0 and ZF=0)
In Cog, this is used as JumpFPGreater
• JBE (jump below or equal), equivalent to (jump if either CF=1 or ZF=1)
In Cog, this is used as JumpFPLessOrEqual
• JB (jump below), equivalent to JC (jump if CF=1)
In Cog, this is used as JumpFPLess
• JE (jump equal), equivalent to JZ (jump if ZF=1)
In Cog, used as JumpFPEqual
• JNE (jump not equal), equivalent to JNZ (jump if ZF=0)
In Cog, used as JumpFPNotEqual
With this arrangement, we see that the first 2 operations will work as expected with NaN operands: >= and > will always return false (because CF=1).
But the next 4 operations <= < = and ~= won't behave correctly with NaN operands!
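To make the flag business concrete, here is a toy Python model of the documented flag settings and jump conditions (my own sketch built from the description above, not VMMaker code):

```python
import math

def ucomisd(op1, op2):
    # (ZF, PF, CF) as UCOMISD reports them when comparing op1 against op2
    if math.isnan(op1) or math.isnan(op2):
        return (1, 1, 1)                   # unordered: all three flags set
    if op1 == op2:
        return (1, 0, 0)
    return (0, 0, 1) if op1 < op2 else (0, 0, 0)

def ja(zf, pf, cf):  return cf == 0 and zf == 0   # JumpFPGreater
def jae(zf, pf, cf): return cf == 0               # JumpFPGreaterOrEqual
def jb(zf, pf, cf):  return cf == 1               # naive JumpFPLess
def je(zf, pf, cf):  return zf == 1               # naive JumpFPEqual

nan = float('nan')
assert not ja(*ucomisd(nan, 1.0))     # > correctly false on NaN
assert not jae(*ucomisd(nan, 1.0))    # >= correctly false (CF = 1)
assert jb(*ucomisd(nan, 1.0))         # but naive < wrongly jumps!
assert je(*ucomisd(nan, 1.0))         # and naive = wrongly jumps too
```

Guarding the faulty conditions with a PF test first (JP, i.e. JumpFPUnordered) restores the IEEE behavior, which is exactly the fix described below.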

It is indeed necessary to first test for the parity flag and use a JP (linked to Cog RTL JumpFPUnordered) appropriately. What we should do is
• Let genJumpFPLessOrEqual: jump if both PF=0 and (either CF=1 or ZF=1)
• Let genJumpFPLess: jump if both PF=0 and CF=1
• Let genJumpFPEqual: jump if both PF=0 and ZF=1
• Let genJumpFPNotEqual: jump if either PF=1 or ZF=0
Cog does the appropriate thing for the last two operations by using an additional JumpFPUnordered, but fails to do so for genJumpFPLessOrEqual: and genJumpFPLess:. That's a problem: we cannot use them blindly, or they won't produce the expected output for NaN operands!

Instead, as a workaround, it's the client code that carefully, subtly and cleverly avoids using JumpFPLessOrEqual and JumpFPLess, and instead transforms the operations:
• (a < b) <==> (b > a)
• (a <= b) <==> (b >= a)
All the code for generating float comparison primitives thus has a mysterious invert: keyword and parameter. Mysterious, because we don't see the point of complicating client code when we have all the necessary RTL opcodes for < and <=, which would enable a more straightforward implementation.

Though this is indeed clever, because instead of two test-and-branch instructions (JP and JB/JBE) we only need one JA/JAE.
But clever hacks are also error-prone, especially when having to change 6 or 12 implementations of comparison in a row: at code reading/writing time, uniformity and simplicity pay more than cleverness. A mechanical change is a different thing from a change where one must carefully review the transformation for 1 out of 3. My first two attempts succeeded, but the third not unexpectedly failed (see the commit comment).

When I decided to do the simplest thing that could possibly work and use mechanical uniformity in VMMaker.oscog-nice.2549, I broke the SUnit tests for NaN comparisons, so I had to undo my wild refactoring.

I could have read the huge Intel reference manuals (the right thing to do), but found the hint on stackoverflow sooooo much more rapidly!

SO saves your day (my nights in this case)!

We shall correct genJumpFPLess:/genJumpFPLessOrEqual: ASAP in CogIA32Compiler and CogX64Compiler, but I preferred to write this post first; it's technical enough and deserves a longer explanation.

I know, this Smalltalk blog is degenerating: I was already talking of C or C++ from time to time, and now assembler, pouah? Personally, I'm not completely reluctant: I wrote my first disassembler/assembler in BASIC for a 6502 on an ATARI 800 in the early 80s, then I rewrote the assembler in assembler to gain speed. The fun stopped when the low quality tape refused to restore my code; redoing such low level stuff would not give me much greater fun anyway, so I decided to move on and thought that I would never return...

But don't forget that our assembler is written in Smalltalk, and that's fabulous!
Extending the RTL opcodes with a ClzRR (count leading zeros in dst/src registers) is easy: it takes roughly 1 day to understand the code base, and 1 day to write and experiment with it - see - maybe the longest part is learning the conventions used in those assembler manuals.

## Thursday, May 16, 2019

### Tuning the large arithmetic thresholds

After the sqrtRem, let's now tune the other thresholds for multiplication, squaring and division.

The first thing is to have settable threshold parameters, and that's the object of Kernel-nice.1234. The settings can be changed manually through the Preferences interface, but it is tedious to tune by hand.

So the next thing is to have a strategy for auto-tuning. I have chosen this one:
1. set the threshold for Karatsuba and Toom-3 multiplication Mul22 and Mul33 at very high values so as to have unaccelerated arithmetic
2. gradually increase the Mul22 threshold starting at 16 bytes, and doubling until it beats reference un-accelerated performance
3. once found, try smaller threshold by setting the next two bits (if threshold is 512, test for 320, 384 and 448)
4. repeat the procedure for Mul33, starting at Mul22 threshold
The performance is measured on 2 series of well-balanced integers, of byte lengths 1x threshold and 1.125x threshold. The score is the sum of the 2 median_time_to_run/nbits/log(nbits) for each series (the lower the score, the better).
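The search strategy above can be sketched in Python (a hypothetical beats predicate stands in for the real accelerated-vs-reference benchmark):

```python
def autotune_threshold(beats, start=16, limit=1 << 24):
    # double the threshold until the accelerated code beats the reference
    t = start
    while t < limit and not beats(t):
        t *= 2
    if t >= limit:
        return None
    # then probe the next two bits below, e.g. 320, 384 and 448 for t = 512
    half = t // 2
    for candidate in (half + t // 8, half + t // 4, half + t // 4 + t // 8):
        if beats(candidate):
            return candidate
    return t

# with a hypothetical benchmark that starts paying off at 400 bytes:
assert autotune_threshold(lambda t: t >= 400) == 448
```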

I did the same for squaring with S1/2 and S1/4, but multiplication tuning must be performed first, because asymmetrical squaring uses it.

For division, it's a bit different: since the recursive division did not procure an immediate gain at 1x threshold, I decided to engage digitDivSplit: at twice the threshold so as to benefit from at least 2 recursions. So I must start testing at twice the threshold, and finally used series of byte lengths 2x, 3x and 4x threshold.

I have published this auto-tuning code in the STEM repository, in the LargeArithmeticBench package, in its version -nice.11 at the time of writing.
To perform the tuning, doIt:

LargeArithmeticBench new autotuneAcceleratedArithmeticThresholds.

With this strategy, the tuning is relatively fast, about 30 seconds on my MBP. But we are not that patient nowadays, so I reported the progress in the Transcript - which is a very basic interface, but works.
The autotuned thresholds are now these ones on 64-bit Spur, but they sometimes vary a bit from 1 run to another:

best Mul22=448
best Mul33=1536
best S1/2=320
best S1/4=640
best D21=448

I also refactored benchReport to have constant ratios M(n,2n) and D(n,2n), and reduced the number of runs for large bit lengths. Performance is not much changed versus hardcoded thresholds, except for division in range 2kbits to 10kbits.

Beware, the D(10n,n) now tries the very large integers, and the primitive for digitDiv:neg: crashes the un-accelerated image. This does not happen with accelerated arithmetic, because we split in small chunks (if 224 bytes is small). It's been a long time since I last put my fingers on VM code; that's an occasion...

I believe that it's a tiny bit better to recompose a longer least significant chunk with a shorter most significant chunk, and that I marginally improved the performance by rounding up to the next multiple of 4 digits, rather than down with bitClear: 2r11, but I launched so many tests that this would need confirmation. I put those changes in Squeak inbox/Kernel-nice.1235. If someone wants to try, it's better to re-tune the thresholds for each code version.

LargeArithmeticBench new autotuneAcceleratedArithmeticThresholds; benchReport.

Here are the updated curves: