Smallissimo: How to JIT Float comparison on X64 in opensmalltalk-vm

I wanted to share what I recently learned while hacking on the OpenSmalltalk VM, which holds code generated from the Smalltalk source code repository VMMaker.

Comparing Floats is tricky in general, and particularly tricky in IA32/X64 assembler.

Cog, the machine code generator (JIT) of the OpenSmalltalk VM, uses the Intel SSE2 instruction UCOMISD to compare the double precision floats held in registers Xmm0 and Xmm1.

The IEEE 754 standard mandates that NaN compares false to everything (except for NOT EQUAL, which shall always be true), and we want to abide by this standard.

The problem is the specific set of status flags chosen by Intel to report the result of the comparison, and particularly the case of a comparison with NaN (which is unordered). UCOMISD sets 3 flags:

Zero Flag (ZF)
Parity Flag (PF)
Carry Flag (CF)

The comparison of the two registers is like a subtraction operand2 - operand1, and thus we very classically get the ZF set when they are equal, or the CF set when operand2 < operand1. What is very special is that all 3 flags are set to 1 in the Unordered case (when at least one operand is NaN).
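The flag outcomes described above can be modelled in a few lines of Python. This is only a sketch: `ucomisd_flags` is a hypothetical helper, not part of Cog, and it assumes `a` is the left-hand side of the comparison (which machine operand that is depends on the assembler syntax in use).

```python
import math

def ucomisd_flags(a: float, b: float) -> dict:
    """Model of ZF/PF/CF after comparing a with b (a is the left-hand side)."""
    if math.isnan(a) or math.isnan(b):
        return {"ZF": 1, "PF": 1, "CF": 1}  # unordered: all three flags set
    if a == b:
        return {"ZF": 1, "PF": 0, "CF": 0}  # equal
    if a < b:
        return {"ZF": 0, "PF": 0, "CF": 1}  # less than, like a borrow in a - b
    return {"ZF": 0, "PF": 0, "CF": 0}      # greater than

for pair in [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0), (float("nan"), 2.0)]:
    print(pair, ucomisd_flags(*pair))
```

Note that PF is only ever 1 in the unordered case, which is what makes it usable as a NaN detector.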

The possible test-and-branch operations after the comparison operation are

JAE (jump above or equal), equivalent to JNC (jump if CF=0).
In Cog, we use that as JumpFPGreaterOrEqual RTL instruction
JA (jump above), equivalent to (jump if both CF=0 and ZF=0)
In Cog, this is used as JumpFPGreater
JBE (jump below or equal), equivalent to (jump if either CF=1 or ZF=1)
In Cog, this is used as JumpFPLessOrEqual
JB (jump below), equivalent to JC (jump if CF=1)
In Cog, this is used as JumpFPLess
JE (jump equal), equivalent to JZ (jump if ZF=1)
In Cog, used as JumpFPEqual
JNE (jump not equal), equivalent to JNZ (jump if ZF=0)
In Cog, used as JumpFPNotEqual

With this arrangement, we see that the first 2 operations will work as expected with NaN operands: >= and > will always return false (because CF=1).
But the next 4 operations <= < = and ~= won't behave correctly with NaN operands!
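To check this claim, here is a small Python simulation of the six single-jump tests. It is a sketch under assumptions: `flags`, `NAIVE_JUMPS` and `mismatches` are hypothetical names, the flag model is the one described above, and Python's own float comparisons stand in for the IEEE 754 reference behaviour.

```python
import math
import operator

NAN = float("nan")

def flags(a, b):
    """Return (ZF, PF, CF) as modelled for comparing a with b."""
    if math.isnan(a) or math.isnan(b):
        return (1, 1, 1)
    if a == b:
        return (1, 0, 0)
    return (0, 0, 1) if a < b else (0, 0, 0)

# Each entry tells whether the single conditional jump would be taken.
NAIVE_JUMPS = {
    ">=": lambda zf, pf, cf: cf == 0,              # JAE
    ">":  lambda zf, pf, cf: cf == 0 and zf == 0,  # JA
    "<=": lambda zf, pf, cf: cf == 1 or zf == 1,   # JBE
    "<":  lambda zf, pf, cf: cf == 1,              # JB
    "=":  lambda zf, pf, cf: zf == 1,              # JE
    "~=": lambda zf, pf, cf: zf == 0,              # JNE
}

IEEE = {">=": operator.ge, ">": operator.gt, "<=": operator.le,
        "<": operator.lt, "=": operator.eq, "~=": operator.ne}

def mismatches(a, b):
    """Operators whose naive jump disagrees with the IEEE 754 answer."""
    f = flags(a, b)
    return [op for op, taken in NAIVE_JUMPS.items() if taken(*f) != IEEE[op](a, b)]

print(mismatches(1.0, 2.0))  # ordered operands
print(mismatches(NAN, 1.0))  # unordered operands
```

With ordered operands nothing disagrees; with a NaN operand exactly the four operations <= < = and ~= give the wrong answer.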

It is indeed necessary to first test for the parity flag and use a JP (linked to Cog RTL JumpFPUnordered) appropriately. What we should do is

Let genJumpFPLessOrEqual: jump if both PF=0 and (either CF=1 or ZF=1)
Let genJumpFPLess: jump if both PF=0 and CF=1
Let genJumpFPEqual: jump if both PF=0 and ZF=1
Let genJumpFPNotEqual: jump if either PF=1 or ZF=0
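The four PF-aware conditions listed above can be written down and checked against NaN operands. This is a sketch only: the Python function names merely echo the Cog selectors and are hypothetical, and `flags` is an assumed model of the (ZF, PF, CF) triple, not VM code.

```python
import math

NAN = float("nan")

def flags(a, b):
    """Return (ZF, PF, CF) as modelled for comparing a with b."""
    if math.isnan(a) or math.isnan(b):
        return (1, 1, 1)
    if a == b:
        return (1, 0, 0)
    return (0, 0, 1) if a < b else (0, 0, 0)

def jump_fp_less_or_equal(zf, pf, cf):
    return pf == 0 and (cf == 1 or zf == 1)

def jump_fp_less(zf, pf, cf):
    return pf == 0 and cf == 1

def jump_fp_equal(zf, pf, cf):
    return pf == 0 and zf == 1

def jump_fp_not_equal(zf, pf, cf):
    return pf == 1 or zf == 0

# Every pairing of an ordered value with NaN, plus the ordered cases.
for a, b in [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0), (NAN, 2.0), (2.0, NAN)]:
    f = flags(a, b)
    assert jump_fp_less_or_equal(*f) == (a <= b)
    assert jump_fp_less(*f) == (a < b)
    assert jump_fp_equal(*f) == (a == b)
    assert jump_fp_not_equal(*f) == (a != b)
```

In machine code each of these costs two conditional jumps (a JP plus the original one) instead of a single jump.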

Cog is doing the appropriate thing for the last two operations by using an additional JumpFPUnordered, but fails to do so for genJumpFPLessOrEqual: and genJumpFPLess:. That's a problem: we cannot use them blindly, or they won't produce the expected output for NaN operands!

Instead, as a workaround, it's the client code that carefully, subtly and cleverly avoids using JumpFPLessOrEqual and JumpFPLess, and instead transforms the operations:

(a < b) <==> (b > a)
(a <= b) <==> (b >= a)
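The effect of this operand swap can be simulated as follows. It is a sketch under the same assumed flag model as the text; `less_via_swap` and `less_or_equal_via_swap` are hypothetical names, not the Cog primitives.

```python
import math

NAN = float("nan")

def flags(a, b):
    """Return (ZF, PF, CF) as modelled for comparing a with b."""
    if math.isnan(a) or math.isnan(b):
        return (1, 1, 1)
    if a == b:
        return (1, 0, 0)
    return (0, 0, 1) if a < b else (0, 0, 0)

def less_via_swap(a, b):
    """a < b computed as b > a: compare b with a, then a single JA."""
    zf, pf, cf = flags(b, a)
    return cf == 0 and zf == 0

def less_or_equal_via_swap(a, b):
    """a <= b computed as b >= a: compare b with a, then a single JAE."""
    zf, pf, cf = flags(b, a)
    return cf == 0

for a, b in [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0), (NAN, 2.0), (2.0, NAN)]:
    assert less_via_swap(a, b) == (a < b)
    assert less_or_equal_via_swap(a, b) == (a <= b)
```

Because the unordered case sets CF=1, JA and JAE are never taken with a NaN operand, so no separate test of PF is needed.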

All the code for generating float comparison primitives thus has a mysterious invert: keyword and parameter. Mysterious, because we don't see the point of complicating client code when we have all the necessary RTL opcodes for < and <=, which would enable a more straightforward implementation.

This is indeed clever though, because instead of two test-and-branch instructions (JP and JB/JBE) we only need one (JA/JAE).
But clever hacks are also error-prone, especially when one has to change 6 or 12 implementations of comparison in a row: at code reading/writing time, uniformity and simplicity pay more than cleverness. A mechanical change is a different thing from a change where one must carefully review the transformation in 1 case out of 3. My first two attempts succeeded, but the third, not unexpectedly, failed: see the commit comment of VMMaker.oscog-nice.2546.

When I decided to do the simplest thing that could possibly work and use mechanical uniformity in VMMaker.oscog-nice.2549, I broke the SUnit tests for NaN comparisons, so I had to undo my wild refactoring in VMMaker.oscog-nice.2554!

I could have read the huge Intel reference manuals (the right thing), but I found the hint on Stack Overflow sooooo much more rapidly!
See intel-x86-64-assembly-compare-signed-double-precision-floats
SO saves your day (my nights in this case)!

We shall correct genJumpFPLess:/genJumpFPLessOrEqual: ASAP in CogIA32Compiler and CogX64Compiler, but I preferred to write this post first: it's technical enough and deserves a longer explanation.

I know, this Smalltalk blog is degenerating: I was already talking about C or C++ from time to time, and now assembler, pouah? Personally, I'm not completely reluctant to it: I wrote my first disassembler/assembler in BASIC for a 6502 on an ATARI 800 in the early 80s, then rewrote the assembler in assembler to gain speed. The fun stopped when the low quality tape refused to restore my code. Redoing such low level stuff would not have given me much greater fun anyway, so I decided to move on and thought that I would never return...

But don't forget that our assembler is written in Smalltalk, and that's fabulous!
Extending the RTL opcodes with a ClzRR (count leading zeros in dst/src registers) is easy: it takes roughly 1 day to understand the code base and 1 day to write and experiment with it (see VMMaker.oscog-nice.2542). Maybe the longest part is learning the conventions used in those assembler manuals.

Smallissimo

Friday, September 6, 2019
