Sunday, September 25, 2011

Reviewing fraction asFloat

After changing Integer>>asFloat, let's inspect Fraction>>asFloat. The spirit is quite simple: perform an Euclidean division after shifting the numerator or the denominator enough to obtain a certain precision in the quotient. The quotient will be the mantissa of the Float and the remainder are the truncated bits. Then come a dissertation on the conditions when to round upper or not. Let's see the code:

    "Answer a Float that closely approximates the value of the receiver.
    This implementation will answer the closest floating point number to
    the receiver.
    It uses the IEEE 754 round to nearest even mode
    (can happen in case denominator is a power of two)"
    | a b q r exponent floatExponent n ha hb hq q1 |
    a := numerator abs.
    b := denominator abs.
    ha := a highBit.
    hb := b highBit.
    "If both numerator and denominator are represented exactly in floating point number,
    then fastest thing to do is to use hardwired float division"
    (ha < 54 and: [hb < 54]) ifTrue: [^numerator asFloat / denominator asFloat].
    "Try and obtain a mantissa with 54 bits.
    First guess is rough, we might get one more bit or one less"
    exponent := ha - hb - 54.
    exponent > 0
        ifTrue: [b := b bitShift: exponent]
        ifFalse: [a := a bitShift: exponent negated].
    q := a quo: b.
    r := a - (q * b).
    hq := q highBit.
    "check for gradual underflow, in which case we should use less bits"
    floatExponent := exponent + hq - 1.
    n := floatExponent > -1023
        ifTrue: [54]
        ifFalse: [54 + floatExponent + 1022].
    hq > n
        ifTrue: [exponent := exponent + hq - n.
            r := (q bitAnd: (1 bitShift: hq - n) - 1) * b + r.
            q := q bitShift: n - hq].
    hq < n
        ifTrue: [exponent := exponent + hq - n.
            q1 := (r bitShift: n - hq) quo: b.
            q := (q bitShift: n - hq) bitAnd: q1.
            r := (r bitShift: n - hq) - (q1 * b)].
    "check if we should round upward.
    The case of exact half (q bitAnd: 1) isZero not & (r isZero)
    will be handled by Integer>>asFloat"
    ((q bitAnd: 1) isZero or: [r isZero])
        ifFalse: [q := q + 1].
    ^ (self positive
        ifTrue: [q asFloat]
        ifFalse: [q = 0
            ifTrue: [Float negativeZero]
            ifFalse: [q asFloat negated]])
        timesTwoPower: exponent

Let's review the three parts in yellow : first the comment (can happen in case denominator is a power of two) is obscure. It is supposed to explain that in case of tie, the algorithm will round to nearest even. The case of tie can only happen if the denominator is a power of two indeed, any other factor in the denominator will lead to an infinite series of digits and thus cannot produce a tie. But this information is not essential enough to be placed in the header. After all it is not essential at all.

The second part in yellow tells that the quotient could be one bit less. Let's see how false it is.
Let's write that b have hb bits, a ha bits and q hq bits, and the conditions of Euclidean division:
  • (1<< (ha-1)) <= a, a < (1 << ha)
  • b < (1 << hb)
  • q < (1 << hq)
  • a=(b*q+r)
  • r < b
From the last two relations we have a < (b*q+b), that is a < (b * (q+1)).
We still have, q + 1 <= (1 << hq), thus b * (q+1) < ((1 < hb) * (1 << hq)), that is a < (1 < (hb+hq)).
But we also have (1 << (ha - 1)) <= a, thus ha - 1 < (hb + hq).
This can be written hq > (ha - hb - 1), or hq >= (ha - hb). So the quotient will have at least ha - hb bits, maybe more, but never less.

The third part in yellow is trying to perform an addition with bitAnd: - ouch ! It could eventually be bitOr:, but why not simply writing q := (q bitShift: n - q) + q1...
Fortunately, this is the case of one bit less which can never happen.

There are other things to be enhanced. The numbers 54 and 1022 are kind of magical and use implicit knowledge that 54 = (Float precision + 1) and -1022 = Float emin, the minimum possible exponent of a Float before gradual underflow occurs.

So we should better re-write this code and:
  • change the heading comment;
  • remove the unreachable branch;
  • use the new floating point inquiry messages precision and emin;
  • name the mantissa mantissa rather than q;
  • avoid to compute the remainder r, but just check if it is zero or hasTruncatedBits;
  • let the floating point hardware handle the case of negative zero by itself
    (we just have to keep at least one bit in the mantissa).
    "Answer a Float that closely approximates the value of the receiver.
    This implementation will answer the closest floating point number to the receiver.
    In case of a tie, it will use the IEEE 754 round to nearest even mode.
    In case of overflow, it will answer +/- Float infinity."

    | a b mantissa exponent hasTruncatedBits lostBit n ha hb hm |
    a := numerator abs.
    b := denominator.    "denominator is always positive"
    ha := a highBit.
    hb := b highBit.
    "Number of bits to keep in mantissa plus one to handle rounding."
    n := 1 + Float precision.

    "If both numerator and denominator are represented exactly in floating point number,
    then fastest thing to do is to use hardwired float division."
    (ha < n and: [hb < n]) ifTrue: [^numerator asFloat / denominator asFloat].

    "Shift the fraction by a power of two exponent so as to obtain a mantissa with n bits.
    First guess is rough, the mantissa might have n+1 bits."
    exponent := ha - hb - n.
    exponent >= 0
        ifTrue: [b := b bitShift: exponent]
        ifFalse: [a := a bitShift: exponent negated].
    mantissa := a quo: b.
    hasTruncatedBits := a > (mantissa * b).
    hm := mantissa highBit.
    "Check for gradual underflow, in which case the mantissa will loose bits.
    Keep at least 1 bit to let the underflow preserve the sign of zero."
    lostBit := Float emin - (exponent + hm - 1).
    lostBit > 0 ifTrue: [n := n - lostBit max: 1].

    "Remove excess bits in the mantissa."
    hm > n
            [exponent := exponent + hm - n.
            hasTruncatedBits := hasTruncatedBits or: [mantissa anyBitOfMagnitudeFrom: 1 to: hm - n].
            mantissa := mantissa bitShift: n - hm].

    "Check if mantissa must be rounded upward.
    The case of tie (mantissa odd & hasTruncatedBits not)
    will be handled by Integer>>asFloat."
    (hasTruncatedBits and: [mantissa odd])
        ifTrue: [mantissa := mantissa + 1].

    ^ (self positive
            ifTrue: [mantissa asFloat]
            ifFalse: [mantissa asFloat negated])
        timesTwoPower: exponent

That's it. No revolution, but I hope it's a bit clearer, and for the same price, a tiny bit more efficient.
I could also explain in this blog why it is better to use floating point division if both numerator and denominator can be converted asFloat exactly... Or why the mantissa loose bits in case of underflow... But, oh, I'm so lazy, ain't that boring enough?


  1. Funny, I'm working on the same stuff right now... I ended up with a similar approach, but it's slightly different because it lets the FPU do some of the rounding. Also, when I need the numerator to be smaller, I just make the denominator bigger. I suspect that throwing numerator bits away will lead to some edge cases in which the floating point answer is not the closest to the fraction.

    For integer conversions to single precision floats, I did some tricks to avoid having to examine less significant bits by looking at the top small integer (the extra bits in a float will go away with FPU rounding)... basically you convert the integer plus e.g.: 4 bits as a fraction, if you need to round to even by going up you simply add 1 like you do, and then you let the FPU throw away what doesn't fit the mantissa. In most cases you don't have to look at more than a small integer, which makes the process faster.

    For conversions to double precision floats, you can play similar tricks. Also, you can arrange to do the conversion in small integer chunks and provided you have SmallInteger>>asFloat (or in VW, asDouble) implemented as a primitive, then it's much faster.

    You can also do the trick of keeping a few more bits than fit in the mantissa for fractions... it just requires a bit more care because the division may give you more bits than you asked for, so unlike with integers you cannot inline the masking constants.

    Also, I'm not done yet, so I can't quite share code right now.

  2. Sure, keeping several excess bits in the mantissa instead of one should fast things up by reducing probability to have to look down further truncated bits. But, at the end, you still have to test and handle the case of an even mantissa with tie when the excess+1 trailing bits are 2r0100...000. If you come up with a faster test, that will be interesting.
    The naive digit-loop splitting strategy I proposed for re-composing the Float was in the original code. It will always work with 1 excess bit, but must not be used blindly in case of more excess bits. For example, two excess bits in case of a precision of 8*n+7 bits, would lead to a first round off error at byte n+1 and a second at byte n+2, and we can afford only one. Splitting the mantissa bits into an upper part and a lower part which neither have more bits than target float precision should do the trick. Anyway, I'm convinced that bytewise handling of LargeInteger is not the fastest thing to do on modern CPU, but that's what I get handy and portable in Squeak VM...

  3. Andrés, I've posted two new versions for LargePositiveInteger>>asFloat in Squeak inbox. By using 7 excess bits, the test of even mantissa with tie is fast because it falls on a byte boundary. And the total bit count is 53+7=60=2*30, two positive SmallInteger... But COG JIT VM happens to be faster with naive digit loops than with bitShift: required to split the mantissa into two SmallIntegers... It may differ in VW.