Tuesday, October 1, 2019

Simple piSin and piCos mathematical functions in Squeak

Working with radians is great, because the derivative of the sine function is the cosine function, and the derivative of cosine is the negated sine. I always ask engineers to work with radians, which avoids many mistakes!

But when it comes to handling the periodicity modulo 2π on a computer, radians are not so great, because π is irrational and not exactly representable as a Float value!

With an Attitude and Heading Reference System delivering vehicle heading in degrees or other fractions of π, the conversion to radians introduces unnecessary numerical noise.

To work around this, many libm implementations provide a sinpi(x) which computes sin(π x), and there is a similar sin_pi in the Boost library.

I was seeking the simplest implementation that could possibly work, so I decided to give it a try in Squeak Smalltalk. These mathematical functions are called like this:

     1 piCos = -1.
     (1/2) piSin = 1.

See http://www.squeaksource.com/NumberExtensions/Math-Number-Extensions-nice.6.diff.
And the tests http://www.squeaksource.com/NumberExtensions/Math-Number-ExtensionsTests-nice.1.diff.

It's a very simple implementation: reduce the receiver to the interval [-1/4, 1/4], that is, reduce the angle argument to [-π/4, π/4], then fall back to the Float approximation of sin(π x) or cos(π x).
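
Here is a minimal sketch of that reduction (piSinSketch is a hypothetical selector for illustration, not the actual code from the links above; it omits the exact rational cases discussed below):

    Number>>piSinSketch
        "Sketch only: reduce the receiver exactly, then approximate with Float"
        | x |
        x := self - ((self + 1 / 2) floor * 2). "exact reduction to [-1, 1)"
        x > (1/2) ifTrue: [x := 1 - x]. "sin(pi x) = sin(pi (1 - x))"
        x < (-1/2) ifTrue: [x := -1 - x]. "sin(pi x) = sin(pi (-1 - x))"
        x abs <= (1/4) ifTrue: [^(x asFloat * Float pi) sin].
        ^x sign * ((1/2 - x abs) asFloat * Float pi) cos "sin(pi x) = sign(x) cos(pi (1/2 - abs(x)))"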

The nice thing is that we can afford to answer an exact value in every case where the sine or cosine takes a rational value. Such cases can only occur for irrational receivers that we cannot represent anyway (so we don't care), and for a handful of rational receivers once argument (receiver in our case) reduction is performed. That's Niven's theorem - see http://mathworld.wolfram.com/NivensTheorem.html - and I would not have imagined that it could be so cheap before implementing it!
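
For illustration, here are some of the exact answers one gets (with the piSin/piCos selectors from the package linked above; every other rational receiver answers a Float approximation):

    0 piSin. "0 exactly"
    (1/6) piSin. "1/2 exactly, a Fraction"
    (1/2) piSin. "1 exactly"
    (1/3) piCos. "1/2 exactly"
    (1/5) piSin isFloat. "true: the value is irrational, hence a Float approximation"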

The only thing that bothers me is that this exactness cost me the weak monotonicity property locally. For example, we may expect that:

    (1/6 + (1<<100) reciprocal) piSin >= (1/6) piSin.

But it's not the case... It happens that:

    (1/6 + (1<<100) reciprocal) asFloat < (1/6) and:
        [ (Float pi) < (1 asArbitraryPrecisionFloatNumBits: 100) pi ].

With such cumulated roundings, the default floating point approximation falls short:

    ((1/6+(1<<100) reciprocal) * Float pi) sin = 0.5 predecessor.

This is where a bit more precision before the float conversion would help:

    ((1 asArbitraryPrecisionFloatNumBits: 100) pi / 6) asFloat sin = 0.5.

Weak monotonicity is preserved for a Float receiver, which is already something, but I see no easy solution for the case of exact arithmetic. If I try a longer approximation of π, then weak monotonicity may break for mixed arithmetic; I'll have to come back to it... Fortunately, π/6 is the only case to handle (all the reductions should be exact).

In Squeak there are already degreeCos and degreeSin messages that I tweaked some years ago, but those could further benefit from these new piSin and piCos functions (err, messages!). This little experiment might end up in Squeak trunk if there is some interest.

Saturday, September 7, 2019

How to compare exact values of SmallInteger and Float in Smalltalk

A long time ago, mixed arithmetic was very well thought out in Scheme, with the notion of exact vs. inexact arithmetic (the resulting value of an operation might be an approximation of the mathematically exact value).

The R5RS specification requires, in §6.2.5 Numerical operations, that comparison operations be transitive. Doing so preserves important mathematical properties, like equality being an equivalence relation, or ordering being total. These properties are more than welcome when building higher level abstractions.

No matter that the inexact operand may already carry its own error bounds: as an agnostic library, we have no license to waste ULPs; that would only make the task of application writers harder.

A consequence of the transitivity requirement is that mixed comparisons between integer and float had better answer the result of comparing the exact values of the operands, rather than convert the integer to float and compare the potentially inexact (rounded) values. I spent enough time convincing the Smalltalk community that we should adopt this superior model too. Typically, early versions of Squeak did not fulfill the transitivity expectation (but neither did most Lisp implementations, nor any Smalltalk dialect I knew of):

    | int1 int2 float |
    int1 := 16r100000000000000 "1 bitShift: 56".
    int2 := 16r100000000000001 "(1 bitShift: 56) + 1".

    float := 16r100000000000000.0 "1.0 timesTwoPower: 56".
    (int1 = float) & (float = int2) ==> (int1 = int2).

Nowadays, a majority of Smalltalk dialects do the right thing, including the Squeak brand, the Cuis/Pharo derivatives, GNU Smalltalk and Dolphin Smalltalk. VisualWorks does not abide, but I published image-side patches for the 32-bit version. I have not looked at VisualAge for ages; the temporary licenses from Instantiations are not liberal enough to play with. Gemstone is sorely missing from my checklist and has liberal enough licenses, but I have my own limitations and need to delegate... I'll take care of Smalltalk/X below.

Mathematics is one thing, but how do we technically achieve that feature at a reasonable price? Particularly in the statistically most common case of mixed SmallInteger op: Float.

With 32-bit Smalltalk, and IEEE 754 double precision floats providing a 53-bit significand, there is no problem: every SmallInteger → Float conversion is exact, and we can thus use naive conversion followed by float comparison. This is mostly done inside the VM, in primitives or in the JIT.
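
For instance, this can be checked with (true in a 32-bit image; false on 64-bit Spur, where SmallInteger exceeds the Float precision):

    SmallInteger maxVal asFloat asTrueFraction = SmallInteger maxVal.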

At image side in Squeak, for LargeInteger and Fraction, the strategy is a bit more elaborate, because conversions can cost a lot! The idea is to test whether the operand can be converted exactly (see isAnExactFloat in the Squeak family); if so, perform the conversion to Float, and compare floats (that's the fastest path). Else, exact arithmetic shall be used to perform the comparison (unless the Float is infinite or nan), and thus the Float is converted to its exact value (a.k.a. asTrueFraction in Squeak or asExactFraction in gst).
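
Sketched for <, that strategy could read as follows (lessThanFloat: is a hypothetical selector for illustration; the actual Squeak code is factored differently):

    LargePositiveInteger>>lessThanFloat: aFloat
        "Answer whether self < aFloat, comparing exact values"
        aFloat isFinite ifFalse: [^aFloat > 0]. "true only for +infinity, false for -infinity and nan"
        ^self isAnExactFloat
            ifTrue: [self asFloat < aFloat] "fast path: exact conversion, then float comparison"
            ifFalse: [self < aFloat asTrueFraction] "slow path: exact rational comparison"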

With the 64-bit OpenSmalltalk VM, which provides 61-bit signed SmallIntegers, we now have to take care not to exceed the float precision. We had to rewrite the naive primitives and forbid an inexact int64 → double conversion, like what is done at image side for LargeInteger. A naive and simpler version of the isAnExactFloat check was programmed first, making the primitive fail when in doubt: we just shift right by Float precision and check for excessive high bits, rather than check the bit span highBit - lowBit + 1. False negatives do not matter; they can be handled at image side, where it is simpler to write more elaborate code. But those precautions were forgotten in the JIT; that's Issue 417.
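
In Smalltalk terms, the naive check of the rewritten primitive amounts to this (a sketch; the real code lives in the VM sources):

    | quickExact |
    quickExact := [:a | (a abs bitShift: Float precision negated) = 0]. "surely exact when abs(a) < 2^53"
    quickExact value: 42. "true: the conversion is certainly exact"
    quickExact value: (1 bitShift: 55). "false negative: exactly representable, yet rejected; the fallback handles it"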

In fact, all this started after observing that Smalltalk/X had the very same problem: the primitives were doing naive inexact conversions, like 64-bit VisualWorks, instead of failing. But in Smalltalk/X these are not really primitives, rather inlined C code that gets compiled dynamically and can be changed easily (see for example https://swing.fit.cvut.cz/projects/stx-jv/browser/stx.libbasic/Float.st#L1500). I thus started to think about how I would resolve it in standard C so as to preserve speed. It had to be simple enough to be a viable evening hack (a few lines without too many compiler-specific builtins).

So let's backtrack: when we convert 16r100000000000001, it might be rounded up or down, losing the last 4 bits (in OpenSmalltalk that's at most SmallInteger maxVal highBit - Float precision = 8 bits lost). This specific integer loses only its last bit and is transformed to 16r100000000000000.0; but is such a tiny inexactness really going to be a problem if we compare to, say, -1000.0, 2.5, 1.0e100 or Float infinity? No, of course not, the last few rounded bits won't change the comparison result... unless the Float operand is sufficiently near the SmallInteger. How near?

Let's investigate this. In Squeak, with simpler notations:
  • a is the SmallInteger
  • f is the Float, whose exact value is b (either an Integer or a Fraction).

    f asTrueFraction = b.
    b asFloat = f.
    b asFloat = b.

The above propositions are obvious, since in Smalltalk:

    aFloat isAnExactFloat.

This was not true of C implementations using the x87 80-bit extended precision float registers, but it is true in C99 with appropriate flags, and easy when using the 64-bit SSE2 registers.

And since they share the same value:

    aFloat asTrueFraction isAnExactFloat.

More interestingly, the conversion asFloat has a monotonicity property by virtue of the IEEE 754 standard (because we round the exact value to the nearest inexact one):
    a < b ==> (a asFloat <= b asFloat).

Even if different, a and b might be converted to the same floating point value: the order can be blurred, but it can never be reversed! We can deduce from this property that:

    a asFloat < b asFloat ==> (a < b).

Similarly:

    a asFloat > b asFloat ==> (a > b).

The only case when we are in doubt is:

   a asFloat = b asFloat.

From which one cannot deduce anything about the ordering of a and b...
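
For instance, on 64-bit Spur (an illustration of the ambiguous case):

    | a f |
    a := (1 bitShift: 56) + 1.
    f := 1.0 timesTwoPower: 56.
    a asFloat = f. "true, although a is exactly greater than f"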

We have the basis of a new algorithm: naively convert the SmallInteger a with asFloat, then compare to the Float f. If both float values are equal, there is an ambiguity to resolve; otherwise, we perform the comparison in floating point and we're done. Even if the asFloat conversion was inexact, we know it was innocuous (the unordered case f = Float nan, as well as the case f = Float infinity, included).

If there is ambiguity, we have to fall back to the asTrueFraction conversion... But we have additional properties:

    a asFloat asTrueFraction isInteger.

That's a property of the floating point representation: for any SmallInteger a, asFloat is either exact and thus preserves integerness, or removes some precision (zeroes the last bits). In no way can we obtain a fractionPart, because that would mean adding more precision.
Only LargeInteger loses this property, when starting to overflow to Float infinity.

So:

    a asFloat = f ==> f asTrueFraction isInteger.

Great! We thus do not need to deal with Fraction in case of ambiguity:

    f asTrueFraction isInteger ==> (f asTrueFraction = f asInteger).

With this knowledge, we can sketch the whole algorithm in Smalltalk:

    SmallInteger>>compareOpFloat: aFloat
        "compareOp stands for any comparison selector among < <= = ~= >= >"
        ^(self asFloat = aFloat)
            ifTrue: [self compareOp: aFloat asInteger] "ambiguity: compare exact values"
            ifFalse: [self asFloat compareOp: aFloat] "conclusive float comparison"

So exactness costs only 1 additional comparison versus inexactness, and 1 additional conversion in case of unlucky ambiguity.
Note that aFloat asInteger could degenerate to a LargeInteger in Smalltalk, due to upward rounding... For example, in 64-bit Squeak images:

    SmallInteger maxVal asFloat asInteger isLarge.

Similarly in C, for an int64 a, there would be an edge case of signed integer overflow when performing (int64) f, and such a situation would deserve uint64 handling in order to avoid undefined behavior... No less complex in assembler for the JIT.
I may have a solution for it, but YAGNI: our OpenSmalltalk SmallIntegers are only 61 bits, so we can write it naively in primitives and avoid the image-side case of LargeInteger! Rather than testing inexactness and failing in the edge case, the primitive can always answer the result of the exact comparison at a very low price, both in terms of speed and simplicity. There's no reason not to do it.

Or there might be some good reasons... After all, I hate the fact that VisualWorks has hardcoded naive inexact conversion in the primitive!
If we want to change the behavior, there is no other solution than checking the preconditions in slower Smalltalk code (either via double dispatching or type testing) BEFORE calling the fast primitives, which ruins the benchmarks of number crunching applications... It's far more efficient to call the fast primitive first, and handle failure cases in fallback code.
Even if I'm convinced that exact comparison is better than inexact comparison, it's still questionable to hardwire it in the VM. Juan Vuletich, the author of Cuis, gave a sounding echo to the little voice inside me telling me not to cut off too many possibilities...

I have thus prototyped a very simple behavior controlled by a single boolean VM parameter: either the primitives and jitter perform mixed arithmetic, and if so they do comparisons exactly, or they don't perform any mixed arithmetic at all and always fall back to the image code, whatever the operation (arithmetic or comparison). This way, the unmixed Float op: Float and SmallInteger op: SmallInteger operations are still very fast.
With the help of Eliot Miranda, the master of the OpenSmalltalk VM, I hope that we consolidate these features soon, and have our cake and eat it too. That's what every serious Smalltalk implementation should do. If some vendor decides to hardcode comparison of inexact (rounded) values, then please provide a way to deactivate it! If choosing exactness, everyone now knows how to do it without wasting too many CPU cycles nor engineering; all the material is licensed MIT, and unless I re-invented some well known prior work, I hope patent-free, thanks to this publication.

Friday, September 6, 2019

How to JIT Float comparison on X64 in opensmalltalk-vm

I wanted to share what I recently learned while hacking the OpenSmalltalk VM, which holds code generated from the Smalltalk source code repository VMMaker.

Comparing Floats is tricky in general, and particularly tricky in IA32/X64 assembler.

Cog, the machine code generator (JIT) of the OpenSmalltalk VM, uses the Intel SSE2 instruction UCOMISD to compare the double precision float registers Xmm0 and Xmm1.

The IEEE 754 standard mandates that NaN compare false to everything (except for NOT EQUAL, which shall always be true), and we want to abide by this standard.
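
At image level, this is the behavior that the generated code must preserve (a quick Workspace check):

    | nan |
    nan := Float nan.
    (nan < 1.0) | (nan <= 1.0) | (nan = 1.0) | (nan >= 1.0) | (nan > 1.0). "false"
    nan ~= 1.0. "true: the only comparison answering true"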

The problem is the specific set of status flags chosen by Intel to report the result of the comparison, and particularly the case of comparison with NaN (which is unordered). Three flags are involved:
  • Zero Flag (ZF)
  • Parity Flag (PF)
  • Carry Flag (CF)
The comparison of the two registers is like a subtraction operand2 - operand1, so we very classically get ZF set when equal, or CF set when operand2 < operand1. What is very special is that all 3 flags are set to 1 in the Unordered case (when at least one operand is NaN).

The possible test-and-branch operations after the comparison operation are
  • JAE (jump above or equal), equivalent to JNC (jump if CF=0).
    In Cog, we use that as the JumpFPGreaterOrEqual RTL instruction
  • JA (jump above), equivalent to (jump if both CF=0 and ZF=0)
    In Cog, this is used as JumpFPGreater
  • JBE (jump below or equal), equivalent to (jump if either CF=1 or ZF=1)
    In Cog, this is used as JumpFPLessOrEqual
  • JB (jump below), equivalent to JC (jump if CF=1)
    In Cog, this is used as JumpFPLess
  • JE (jump equal), equivalent to JZ (jump if ZF=1)
    In Cog, used as JumpFPEqual
  • JNE (jump not equal), equivalent to JNZ (jump if ZF=0)
    In Cog, used as JumpFPNotEqual
With this arrangement, we see that the first 2 operations work as expected with NaN operands: >= and > always return false (because CF=1).
But the next 4 operations - <=, <, = and ~= - won't behave correctly with NaN operands!

It is indeed necessary to first test the parity flag and use a JP (linked to the Cog RTL JumpFPUnordered) appropriately. What we should do is:
  • Let genJumpFPLessOrEqual: jump if both PF=0 and (either CF=1 or ZF=1)
  • Let genJumpFPLess: jump if both PF=0 and CF=1
  • Let genJumpFPEqual: jump if both PF=0 and ZF=1
  • Let genJumpFPNotEqual: jump if either PF=1 or ZF=0
Cog does the appropriate thing for the last two operations by using an additional JumpFPUnordered, but fails to do so for genJumpFPLessOrEqual: and genJumpFPLess:. That's a problem: we cannot use them blindly, or they won't produce the expected output for NaN operands!

Instead, as a workaround, it's the client code that carefully, subtly and cleverly avoids JumpFPLessOrEqual and JumpFPLess, and instead transforms the operations:
  • (a < b) <==> (b > a)
  • (a <= b) <==> (b >= a)
All the code for generating float comparison primitives thus has a mysterious invert: keyword and parameter. Mysterious, because we don't see the point of complexifying client code when we have all the necessary RTL opcodes for < and <=, which would enable a more straightforward implementation.
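
The swap is sound even with NaN operands, since both sides of each equivalence are then false; a quick sanity check:

    | a b |
    a := Float nan.
    b := 1.0.
    (a < b) = (b > a). "true: both sides are false"
    (a <= b) = (b >= a). "true: both sides are false"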

Though the hack is indeed clever: instead of two test-and-branch instructions (JP and JB/JBE), we only need one JA/JAE.
But clever hacks are also error-prone, especially when having to change 6 or 12 implementations of comparison in a row: at code reading/writing time, uniformity and simplicity pay more than cleverness. A mechanical change is a different thing than a change where one must carefully review the transformation for 1 out of 3. My first two attempts succeeded, but the third, not unexpectedly, failed - see the commit comment of VMMaker.oscog-nice.2546.

When I decided to do the simplest thing that could possibly work and use mechanical uniformity in VMMaker.oscog-nice.2549, I broke the SUnit tests for NaN comparisons, and had to undo my wild refactoring in VMMaker.oscog-nice.2554!

I could have read the huge Intel reference manuals (the right thing to do), but I found the hint on Stack Overflow so much faster!
See intel-x86-64-assembly-compare-signed-double-precision-floats
Stack Overflow saves your day (my nights in this case)!

We shall correct genJumpFPLess:/genJumpFPLessOrEqual: ASAP in CogIA32Compiler and CogX64Compiler, but I preferred to write this post first; it's technical enough and deserves a longer explanation.

I know, this Smalltalk blog is degenerating: I was already talking of C or C++ from time to time, and now assembler, pouah? Personally, I'm not completely reluctant to it: I wrote my first disassembler/assembler in BASIC for a 6502 on an ATARI 800 in the early 80s, then rewrote the assembler in assembler to gain speed. The fun stopped when the low quality tape refused to restore my code; redoing such low level stuff would not give me much greater fun anyway, so I decided to move on, thinking I would never return...

But don't forget that our assembler is written in Smalltalk, and that's fabulous!
Extending the RTL opcodes with a ClzRR (count leading zeros in dst/src registers) is easy: it takes roughly 1 day to understand the code base, and 1 day to write and experiment with it - see VMMaker.oscog-nice.2542 - maybe the longest part is learning the conventions used in those assembler manuals.

Thursday, May 16, 2019

Tuning the large arithmetic thresholds

After the sqrtRem, let's now tune the other thresholds for multiplication, squaring and division.

The first thing is to have settable threshold parameters, and that's the object of Kernel-nice.1234. The settings can be changed manually through the Preferences interface, but it's tedious to tune by hand that way.



So the next thing is to have a strategy for auto-tuning. I have chosen this one:
  1. set the threshold for Karatsuba and Toom-3 multiplication Mul22 and Mul33 at very high values so as to have unaccelerated arithmetic
  2. gradually increase the Mul22 threshold starting at 16 bytes, and doubling until it beats reference un-accelerated performance
  3. once found, try smaller thresholds by setting the next two bits (if the doubling stopped at 512, test 320, 384 and 448)
  4. repeat the procedure for Mul33, starting at Mul22 threshold
The performance is measured on 2 series of well-balanced integers, of byte lengths 1× and 1.125× the threshold. The score is the sum over the 2 series of median_time_to_run/nbits/log(nbits) (the lower the score, the better).
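
In sketch form, steps 2 and 3 look like this (scoreWithMul22Threshold: and unacceleratedScore are hypothetical selectors; the real code lives in the LargeArithmeticBench package below):

    | bench threshold |
    bench := LargeArithmeticBench new.
    threshold := 16.
    "step 2: double until the accelerated score beats the un-accelerated one"
    [(bench scoreWithMul22Threshold: threshold) < bench unacceleratedScore]
        whileFalse: [threshold := threshold * 2].
    "step 3: refine among 5/8, 6/8 and 7/8 of the found value,
    e.g. 320, 384 and 448 when the doubling stopped at 512"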

I did the same for squaring with S1/2 and S1/4, but multiplication tuning must be performed first, because asymmetrical squaring uses it.

For division, it's a bit different: since recursive division did not procure an immediate gain at 1× the threshold, I decided to engage digitDivSplit: at twice the threshold, so as to benefit from at least 2 recursions. I must therefore start testing at twice the threshold, and finally used series of byte lengths 2×, 3× and 4× the threshold.

I have published this auto-tuning code in the STEM repository, in the LargeArithmeticBench package, in its version -nice.11 at the time of writing.
To perform the tuning, doIt:

    LargeArithmeticBench new autotuneAcceleratedArithmeticThresholds.

With this strategy, the tuning is relatively fast, about 30 seconds on my MBP. But we are not that patient nowadays, so I reported progress in the Transcript - a very basic interface, but it works.
The autotuned thresholds on 64-bit Spur are now as follows, though they sometimes vary a bit from one run to another:

best Mul22=448
best Mul33=1536
best S1/2=320
best S1/4=640
best D21=448

I also refactored benchReport to use constant ratios M(n,2n) and D(n,2n), and reduced the number of runs for large bit lengths. Performance is not much changed versus the hardcoded thresholds, except for division in the range 2 kbits to 10 kbits.

Beware: D(10n,n) now tries very large integers, and the primitive for digitDiv:neg: crashes the un-accelerated image. This does not happen with accelerated arithmetic, because we split into small chunks (if 224 bytes is small). It's been a long time since I last put my fingers in VM code; that's an occasion...

I believe it's a tiny bit better to recompose a longer least significant chunk with a shorter most significant chunk, and that I marginally improved performance by rounding up to the next multiple of 4 digits, rather than down with bitClear: 2r11, but I launched so many tests that this would need confirmation. I put those changes in Squeak inbox/Kernel-nice.1235. If someone wants to try, it's better to re-tune the thresholds for each code version:

    LargeArithmeticBench new autotuneAcceleratedArithmeticThresholds; benchReport.
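
As an aside, the two digit-length roundings mentioned above differ only in the bitClear: idiom:

    | n |
    n := 225.
    n bitClear: 2r11. "224: round down to a multiple of 4"
    n + 3 bitClear: 2r11. "228: round up to a multiple of 4"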

Here are the updated curves:




Sunday, May 12, 2019

Tuning the LargeArithmetic sqrtRem

After further analysis, it turns out that the sqrtRem algorithm is always faster than naive Newton-Raphson iterations for LargePositiveInteger in Squeak with the 64-bit OpenSmalltalk Spur VM.

So rather than tuning the threshold, I just removed it, which is something we all like: less code to write, less code to test; simple, small and beautiful code 😀.

So here is the updated benchmark:

    n (bits)          SR(n)          SRa(n)    gain
    100               4 800           1 042    78 %
    200               9 200           3 644    60 %
    500              13 490           8 742    35 %
    1 000            22 120          11 994    46 %
    2 000            49 130          16 288    67 %
    5 000           233 170          27 678    88 %
    10 000          895 920          67 182    93 %
    20 000        3 691 610         181 219    95 %
    50 000       25 134 780         645 797    97 %
    100 000     107 222 200       1 843 856    98 %
    200 000     456 219 390       5 275 917    99 %
    500 000   3 130 480 810      19 623 314    99 %

That's pretty good, but the curve still has the same pot-bellied look at low bit counts, probably related to high division costs.

While at it, I totally revamped the various implementations of sqrt in Squeak. I won't give further details here; the commit message is already pretty long!