Saturday, September 7, 2019

How to compare exact value of SmallInteger and Float in Smalltalk

Long time ago, mixed arithmetic has been very well thought in Scheme with the notion of exact arithmetic vs inexact arithmetic (the resulting value of an operation might be an approximation of the mathematically exact value).

The R5RS specification require that compare operations be transitive in §6.2.5 Numerical operations. Doing so preserve important mathematical properties, like equivalence relation or total ordering. These properties are more than welcome when building higher level abstractions.

No matter that the inexact operand value may already carry it's own error bounds: as an agnostic library, we have no license to waste ULPs, that would only worsen the task of application writers.

A consequence of transitivity requirement is that mixed arithmetic comparisons between integer and float shall better answer the result of comparing the exact value of operands, rather than convert the int to float, then compare the potentially inexact (rounded) values. I spent enough time to convince Smalltalk community that we should adopt such superior model too. Typically, early versions of Squeak did not fulfill the transitivity expectations (but neither did most Lisp implementations, nor any Smalltalk dialect I knew):

    | int1 int2 float |
    int1 := 16r100000000000000 "1 bitShift: 56".
    int2 := 16r100000000000001 "(1 bitShift: 56) + 1".

    float := 16r100000000000000.0 "1.0 timesTwoPower: 56".
    int1 = float & (float = int2) ==> (int1 = int2).

Nowadays, a majority of Smalltalk dialects will do the right thing, including the Squeak brand, the Cuis/Pharo derivatives, GNU Smalltalk and Dolphin Smalltalk. Visualworks does not abide, but I published Image side patches for 32bits version. I did not look at VisualAge for ages, the temporary licenses from Instantiations are not liberal enough to play with. Gemstone is deadly missing on my check list and has liberal enough licenses, but I have my own limitations and need to delegate... I'll take case of Smalltalk/X below.

Mathematics are one thing, but how do we technically achieve that feature at a reasonable price? Particularly in the statistically most common case of mixed SmallInteger op: Float.

With 32bits Smalltalk, and IEEE 754 double precision float providing a 53 bits precision significand, there is no problem, every conversion SmallInteger Float will be exact, and we can thus use naive conversion followed by float comparison. This is mostly done inside the VM in primitives or JIT.

At image side in Squeak, for LargeInteger and Fraction, the strategy is a bit more elaborated, because conversions can cost a lot! The idea is to test if the operand can be converted exactly (see isAnExactFloat in Squeak family), and if so, perform the conversion to Float, and compare floats (that's the fastest path). Else, exact arithmetic shall be used to perform the comparison (unless the Float is infinite or nan) and thus the Float is converted to it's exact value (a.k.a. asTrueFraction in Squeak or asExactFraction in gst).

With 64bits OpenSmalltalk VM which provides 61 bits signed SmallInteger, we now have to take care to not exceed the float precision. We had to rewrite the naive primitives and forbid an inexact conversion int64 double like what is done at image side for LargeInteger. A naive and simpler version of isAnExactFloat check has first been programmed, to eventually make the primitive fail: we just shift right by Float precision and check case of excessive high bits, rather than check the bit span highBit - lowBit + 1. False negative do not matter, they can be handled at image side where it is simpler to write more elaborated code. But those precautions were forgotten in the JIT, that's Issue 417.

In fact, all this started after observing that Smalltalk/X had the very same problem: the primitives were doing naive inexact conversions like in 64bits Visualworks instead of failing. But in Smalltalk/X, these are not really primitives, but rather inlined C-code that gets compiled dynamically and can be changed easily (see for example https://swing.fit.cvut.cz/projects/stx-jv/browser/stx.libbasic/Float.st#L1500). I thus started to think how I would resolve it in standard C so as to preserve speed. This had to be simple enough to be a viable evening hack (a few lines without too many compiler specific builtins).

So let's backtrack: when we convert 16r100000000000001 it might be converted upper or lower, and will loose last 4 bits (in OpenSmalltalk that's at most SmallInteger maxVal highBit - Float precision = 8). This specific integer loose only last bit and is transformed to 16r100000000000000.0 but is such tiny inexactness really going to be a problem if we compare to say -1000.0, 2.5, 1.0e100 or Float infinity? No, of course, the last few rounded bits won't change the comparison result... Unless the Float operand is sufficiently near the SmallInteger. How near?

Let's inquire this. In Squeak with simpler notations:
  • a is the SmallInteger
  • f is the Float whose exact value is b (either an Integer or a Fraction).
    f asTrueFraction = b.
    b asFloat = f.
    b asFloat = b.

Above propositions are obvious, since in Smalltalk:

    aFloat isAnExactFloat.

This was not true of those C using x87 extended 80bits extended precision float registers, but it is in C99 with appropriate flags, and easy when using SSE2 64bits registers.

And since they share same value:

    aFloat asTrueFraction isAnExactFloat.

More interestingly, the conversion asFloat has this property of monotonicity by virtue of IEEE754 standard (because we round the exact to nearest inexact value):
    a < b ==> (a asFloat <= b asFloat).

Even if different, a and b might be converted to same floating point value, the order can be blurred, but it can never be reversed! We can deduce from this property that:

    a asFloat < b asFloat ==> (a < b).

Similarly:

    a asFloat > b asFloat ==> (a > b).

The only case when we are in doubt is:

   a asFloat = b asFloat.

From which one cannot deduce anything about ordering of a and b...

We got the bases of a new algorithm: naively convert the SmallInteger a asFloat, then compare to the Float f value: if both float values are equal, there is an ambiguity to be resolved, else we perform comparison in floating point and we're done, even if conversion asFloat were inexact, we know that it was innocuous (unordered case of f = Float nan, as well as case of f = Float infinity included).

If there is ambiguity, we have to fallback to asTrueFraction conversion... But we have additional properties:

    a asFloat asTrueFraction isInteger.

That's a property of floating point representation, for any SmallInteger a, asFloat is either exact and thus preserve integerness, or will remove some precision (zero the last bits). In no way can we obtain a fractionPart, because that would mean adding more precision.
Only LargeInteger will loose this property when starting to overflow to Float infinity.

So:

    a asFloat = f ==> f asTrueFraction isInteger.

Great! We thus do not need to deal with Fraction in case of ambiguity:

    f asTrueFraction isInteger ==> (f asTrueFraction = f asInteger).

With this knowledge, we can sketch the whole algorithm in Smalltalk:

    SmallInteger>>compareOpFloat: aFloat
        ^(self asFloat = aFloat)
            ifTrue: [self compareOp: aFloat asInteger]
            ifFalse: [self asFloat compareOp: aFloat]

So exactness costs only 1 additional comparison versus inexactness, and 1 additional conversion in case of unlucky ambiguity.
Note that aFloat asInteger could degenerate to a LargeInteger in Smalltalk due to upper rounding... For example, in 64 bits Squeak Images:

    SmallInteger maxVal asFloat asInteger isLarge.

Similarly in C, for int64 a, there would be some edge case of signed integer overflow when performing (int64) f , and such situation would deserve a uint64 handling in order to avoid undefined behavior... Not less complex in assembler for the JIT.
I may have a solution for it, but YAGNI, our OpenSmalltalk SmallInteger are only on 61 bits, so we can write it naivly in primitives and avoid the image side case of LargeInteger! Rather than testing inexactness and failing in edge case, the primitive can always answer the result of exact comparison at a very low price, both in term of speed and simplicity. There's no reason to not do it.

Or there might be some good reasons... After all, I hate the fact that Visualworks has hardcoded naive inexact conversion in primitive!
If we want to change the behavior, there is no other solution than checking the preconditions in slower Smalltalk code (either via double dispatching or type testing) BEFORE calling fast primitives, which is ruining the benchmarks of number crunching applications... It's far more efficient to call the fast primitive first, and handle failure cases in fallback code.
Even if I'm convinced that exact comparison is a better thing than inexact, it's stil questionable to hardwire it in the VM. Juan Vuletich, the author of Cuis gave a sounding echo to the little voice inside me telling to not cut too much the world of possibilities...

I have thus prototyped a very simple behavior with a single boolean VM parameter: either the primitives and jitter do perform mixed arithmetic, and if so they do it exactly for comparisons, or they don't perform any mixed arithmetic at all and always fallback to the image code, whatever the operation (arithmetic or comparison). This way, the unmixed arithmetic Float op: Float and SmallInteger op: SmallInteger are still very fast.
With the help of Eliot Miranda, the master of OpenSmalltalk VM, I hope that we consolidate these features soon, have our cake and eat it too. That's what every serious Smalltalk implementation should do. If some vendor decide to hardcode comparison of inexact (rounded) values, then please provide a way to desactivate it! If choosing exactness, everyone now knows how to do it without wasting too many CPU cycles nor engineering, all the material is licensed MIT, and unless I re-invented some well known prior work, I hope patent-free, thanks to this publication.

Friday, September 6, 2019

How to JIT Float comparison on X64 in opensmalltalk-vm

I wanted to share what I recently learned while hacking OpenSmalltalk VM which hold code generated from the Smalltalk source code repository VMMaker.

Comparing Floats is tricky in general, and particularly tricky in IA32/X64 assembler.

Cog, the machine code generator (JIT) of OpenSmalltalk VM is using Intel SSE2 instruction UCOMISD for comparing double precision float registers in Xmm0 and Xmm1.

IEEE 754 standard mandates that NaN compare false to everything (except for NOT EQUAL which shall be always true), and we want to abide to this standard.

The problem is the specific set of status flags chosen by Intel to report the result of comparison, and particularly the case of comparison with NaN (which is unordered): there are 3 flags set
  • Zero Flag (ZF)
  • Parity Flag (PF)
  • Carry Flag (CF)
The comparison of the two register is like a subtraction operand2 - operand1, and thus we very classically get the ZF set when equal, or the CF set when operand2 < operand1. What is very special is that the 3 flags are set to 1 in case of Unordered (when at least one operand is NaN).

The possible test-and-branch operations after the comparison operation are
  • JAE (jump above or equal), equivalent to JNC (jump if CF=0).
    In Cog, we use that as JumpFPGreaterOrEqual RTL instruction
  • JA (jump above), equivalent to (jump if both CF=0 and ZF=0)
    In Cog, this is used as JumpFPGreater
  • JBE (jump below or equal), equivalent to (jump if either CF=1 or ZF=1)
    In Cog, this is used as JumpFPLessOrEqual
  • JB (jump below), equivalent to JC (jump if CF=1)
    In Cog, this is used as JumpFPLess
  • JE (jump equal), equivalent to JZ (jump if ZF=1)
    In Cog, used as JumpFPEqual
  • JNE (jump not equal), equivalent to JNZ (jump if ZF=0)
    In Cog, used as JumpFPNotEqual
With this arrangement, we see that first 2 operations will work as expected with nan operands: >=  and > will always return false (because CF=1).
But the next 4 operations <= < = and ~= won't behave correctly with nan operands!

It is indeed necessary to first test for the parity flag and use a JP (linked to Cog RTL JumpFPUnordered) appropriately. What we should do is
  • Let genJumpFPLessOrEqual: jump if both PF=0 and (either CF=1 or ZF=1)
  • Let genJumpFPLess: jump if both PF=0 and CF=1
  • Let genJumpFPEqual: jump if both PF=0 and ZF=1
  • Let genJumpFPNotEqual: jump if either PF=1 or ZF=0
Cog is doing the appropriate thing for the last two operations by using an additional JumpFPUnordered, but fails to do so for genJumpFPLessOrEqual: and genJumpFPLess:. That's a problem, we cannot use them blindly, or they won't produce the expected output for NaN operands!

Instead, as a workaround, it's the client code that carefully, subtlely and cleverly avoid to use JumpFPLessOrEqual and JumpFPLess, and instead transform the operations:
  • (a < b) <==> (b > a)
  • (a <= b) <==> (b >= a)
All the code for generating float comparison primitives thus has a mysterious invert: keyword and parameter. Mysterious, because we don't see the point of complexifying client code when we have all the necessary RTL opcodes for < and <= which would enable a more straight forward implementation.

Though this is indeed clever, because instead of two test-and-branch (JP and JB/JBE) we only need one JA/JAE.
But clever hacks are also error-prone: especially when having to change 6 or 12 implementations of comparison in a row: at code reading/writing time, uniformity and simplicity pay more than cleverness. A mechanical change is a different thing than a change where one must carefully review the transformation for 1 out of 3. My first two attempts succeeded, but the third not unexpectedly failed see commit comment VMMaker.oscog-nice.2546.

When I decided to do the most simple thing that could possibly work and use mechanical uniformity VMMaker.oscog-nice.2549, I broke the SUnit tests for NaN comparisons, so had to undo my wild refactoring in VMMaker.oscog-nice.2554!

I could have read huge Intel reference manuals (the right thing), but found the hint on stackoverflow sooooo much rapidly!
See intel-x86-64-assembly-compare-signed-double-precision-floats
SO save your day (my nights in this case)!

We shall correct genJumpFPLess:/genJumpFPLessOrEqual: ASAP in CogIA32Compiler and CogX64Compiler, but I preferred to write this post first, it's technical enough and deserve a longer explanation.

I know, this Smalltalk blog is degenerating, I was already talking of C or C++ from time to time and now assembler, pouah? Personally, I'm not completely reluctant to it, I wrote my first disassembler/assembler in basic for a 6502 on an ATARI 800 in the early 80s, then I rewrote the assembler in assembler to gain speed. The fun stopped when the low quality tape refused to restore my code,  redoing such low level stuff would not give me much greater fun anyway so I decided to move on and thougt that I would never return...

But don't forget that our assembler is written in Smalltalk, and that's fabulous!
Extending the RTL opcodes with a ClzRR (count leading zeros in dst/src registers) is easy, it takes roughly, 1 day to understand the code base, and 1 day to write and experiment it - see VMMaker.oscog-nice.2542 - maybe the longer part is to learn the conventions used in those assembler manuals.