Saturday, April 15, 2017

Astonishing C++ streams

Let's talk about about C++, as it deserves it from time to time.
Especially because I've lost a couple of days on a not-enough-advertised-feature:

    std::ifstream in;
    in.open("somefile",std::ifstream::out);

https://www.wired.com/wp-content/uploads/2012/01/watduck1.jpg
 (https://www.destroyallsoftware.com/talks/wat)

Opening an input file stream for output is not at all convoluted.
It's both a crystal clear code and a brilliant idea! 
Good enough to transcribe it in the standard library:

http://www.cplusplus.com/reference/fstream/ifstream/open/


What difference does it make with fstream? Or should I rather open an ofstream for input? It leaves plenty of room for speculation. if someone knows, he shall write a blog post immediately!
But if the standard offers some weird path, by virtue of Murphy's law, be sure that someone will follow it, whether accidentally or not.

Murphy's law has more to offer: Microsoft implementation changed somewhere between .NET 2003 and .NET 2010, so that opening an input file stream for output on a write-protected file does now fail, what it didn't previously...
Of course, the file has to be read only at deployment site, not in developer's configuration where we have debuggers, otherwise things would be much too trivial.

Recompiling a legacy application while not exerting enough code review is a dangerous thing, so let's not blame C++ for our own mistakes. Except that C++ did not especially help here.

My colleagues said: "tu-mourras-moins-bete" (auf deutsch). Not so sure: I feel like this kind of information is not going to reduce the entropy in my brain.
 
I'm not going to change C++. I can't. But I'm itching to simplify our own Squeak Stream hierarchy with such reduction of entropy in mind. Don't push me!

Friday, April 14, 2017

Alignment strikes back

Previous post was about alignment problems in Smallapack matrix inversion. Let us look at another one that crippled the jpeg plugin for 64 bits windows flavour of opensmalltalk Virtual Machine (issue 119).

The problem originate at win64 requirement for jmp_buf type used in setjmp/longjmp: it must be 16-bytes aligned. I couldn't find a reference to this requirement, but there is a definition in some header that ensure such alignment. Appropriate cygwin grep in /usr/x86_64-w64-mingw32/sys-root/mingw/include will reveal:

setjmp.h:  typedef _JBTYPE jmp_buf[_JBLEN];
setjmp.h:  typedef SETJMP_FLOAT128 _JBTYPE;
setjmp.h:  typedef _CRT_ALIGN(16) struct _SETJMP_FLOAT128 {

With such definitions, the C compiler will manage to properly align the data, so we don't have to worry... Or do we?

We use setjmp/longjmp for the purpose of error handling (and properly quiting the primitive). But we had the brilliant idea to put such jmp_buf member in a structure (see JPEGReadWriter2Plugin/Error.c).

The layout of the structure cannot vary for each instance, so if we want one member to be aligned on 16bytes boundary, the sole solution is to align the whole structure itself to 16bytes boundary, and fill enough gaps between members. Alas, both gcc and clang fail to do so. I don't know if I should report a bug for that.

Imposing that requirement to our own structure, in a portable and future-proof way is less than obvious. So the workaround was rather to use a pointer on a jmp_buf - see pull request #120.

This kind of bug is pretty tricky, but if you have to implement some VM parts in C, what do you expect?

Wednesday, March 8, 2017

Chasing Smallapack inversion BUG

Smallapack is the Smalltalk interface to LAPACK, the famous Linear Algebra Package (see github).

Despite the decent level of stability achieved by Smallapack these last years, interfacing a huge library like this can't be that simple, it's like asking for trouble. The last bug reported was about non reproducibility of matrix inversion. Indeed, it is easily reproducible on Mac OSX with this snippet:

(10 to: 100) detect: [:k |
    a := LapackSGEMatrix rows: k columns: k.
    a fillRandUniform.
    x := ((1 to: 100) collect: [:i | a reciprocal]) asSet asArray.

    x size > 1].


Above code does invariably return 32 on Mac OSX Squeak/Pharo, whatever VM, 32 or 64 bits, Spur or Cog.

On Squeak, a short path for trying above code is to first file this in:

((Installer ss project: 'Smallapack') package: 'ConfigurationOfSmallapack') latest install.!
ConfigurationOfSmallapack loadLatestVersion.!


I've long been clueless about this one, debugging 32x32 matrices is no fun (prepare to scan thousands of digits), especially when the differences between matrices are small, and especially when results differ from one shot to another... Inspection of code and step by step debugging did not reveal anything. Anyway, why does it happen only in single precision and not double? And why not in Linux?

I tried to reproduce above behavior with a C program without success for a while. Until I discovered that the results of C did differ in 32 and 64 bits.  That's been the enlightment: Apple internally use different optimized paths for the Accelerate.framework. Especially when the alignment of data varies, the proof in C below:

//
//  main.c
//  test_sgetri
//
//  Created by Nicolas Cellier on 30/01/2017.
//

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <Accelerate/Accelerate.h>

typedef struct {
    int nr;
    int nc;
    float *data;
    float *alloc;
} sge;

unsigned int align = 0;

sge sgealloc( int nr , int nc )
{
    sge new;
    new.nr = nr;
    new.nc = nc;
    new.alloc = (float *)malloc( sizeof(float) * (nr * nc + align) );
    new.data = new.alloc + align;
    return new;
}

void sgefree( sge const * m )
{
    free(m->alloc);
}

sge sgedup( sge const * m)
{
    sge new = sgealloc( m->nr , m->nc );
    memcpy( new.data , m->data , sizeof(float) * m->nr * m->nc);
    return new;
}

sge sgerand(int nr,int nc)
{
    sge m = sgealloc(nr,nc);
    int isUniform = 1;
    int seed[4] = {1 , 11 , 111 , 1111 };
    int size = nr * nc;
    slarnv_(
            & isUniform,
            seed,
            & size,
            m.data);
    return m;
}

sge sgeinv( sge const * m )
{
    sge a = sgedup( m );
    int *ipiv = malloc( sizeof(long) * m->nr);
    int info;
    sgetrf_(
        & a.nr,
        & a.nc,
        a.data,
        & a.nr,
        ipiv,
        &info
    );
    if( info ) {
        fprintf(stderr,"sgetrf failed info=%d\n",info);
        abort();
    }

    float w;
    int lwork=-1;
    sgetri_(
            & a.nr,
            a.data,
            & a.nr,
            ipiv,
            & w,
            & lwork,
            & info
    );
    lwork = (info==0)
        ? (int) w
        : a.nr * 10;
    float *work = malloc( sizeof(float) * lwork );
    sgetri_(
            & a.nr,
            a.data,
            & a.nr,
            ipiv,
            work,
            & lwork,
            & info
            );
    if( info ) {
        fprintf(stderr,"sgetri failed info=%d\n",info);
        abort();
    }
    free(work);
    free(ipiv);
    return a;
}

int main(int argc, const char * argv[]) {
    sge m = sgerand(32,32);
    sge ref = sgeinv( &m );
    for( unsigned int i=0;i < 16;i++) {
        align = i % 8;
        sge tmp = sgeinv( &m );
        if( memcmp( ref.data , tmp.data , sizeof(float) * m.nr * m.nc) ) {
            fprintf(stderr,"reciprocal differs from reference at step i=%u\n",i);
            abort();
        }
        sgefree( &tmp );
    }
    sgefree( &ref );
    sgefree( &m );
    return 0;
}

 


Conclusion, every problem I encounter in Smallapack can be solved by the same swiss knife solution: just check the alignment. Objects are rellocatable by garbage collector in Smalltalk, so there is no easy fix. Maybe I'll have to insist on using well aligned heap instead of Smalltalk memory, I thought I was relieved of such heaviness since the advent of 8 byte alignment in Spur. Alas 8 is not enough... Who knows if a future SSE5 won't require a 2^5 alignment. Tsss!

Monday, August 24, 2015

Smallapack moved to github

After ArbitraryPrecisionFloat, I also moved the Smallapack project from https://code.google.com/p/smallapack to https://github.com/nicolas-cellier-aka-nice/smallapack.

Like ArbitraryPrecisionFloat, there is no much code on the repository, except a bit outdated Dolphin verison. The Visualworks and Squeak versions are handled with Store and Monticello in traditional dialect repositories...

Maybe it's time to update a bit the project and open a Pharo 5 branch based on Opal Compiler, but we'll see that...

Tuesday, August 11, 2015

Moving ArbitraryPrecisionFloat from google-code to github

I've tested the export to github button for https://code.google.com/p/arbitrary-precision-float then completed the wiki export with https://github.com/morgant/finishGoogleCodeGitHubWikiMigration.

It mostly worked except a few links that I manually restored, and I can say that I'm satisfied with the result.

Note that the code is still not hosted on github, but rather on http://www.squeaksource.com/ArbitraryPrecisionFl.html or cincom public store...

There is also a fork in SciSmalltalk but for now I prefer to keep a standalone project because it has some value per se, and also to maintain code for more dialects. This will imply some extra sync burden, but it should be manageable.


Sunday, August 9, 2015

Compiling Cog.Spur on MacOSX Yosemite

The MacOSX target of Squeak VM currently uses some old Mac Carbon API.

This might be a problem with a recent version of Xcode, none of which is distributed with such backward compatibility support.

However, there is at least one solution: using the shell script provided at https://github.com/devernay/xcodelegacy worked like a charm for me (on OSX 10.10.4, with Xcode 6.4).

I can even compile my on brand of squeak.cog.spur (VMMaker.oscog-nice.1428 from http://smalltalkhub.com/mc/nice/NiceVMExperiments/main) with Apple LLVM 6.1.0 (clang-602.0.53) and get no regression against Eliot's r3410 when running the trunk tests.

That's allways good to know.

Sunday, April 5, 2015

Removing UB in bytecodePrimMultiply


In last post, I said that LLVM did not seem to be a problem w.r.t. testing overflow in post-condition in bytecodePrimMultiply.

But I was fooled by the Xcode assembly output tool, and when I compile a VM with Xcode 42, LLVM 3.0, I get failures like this one:

self assert: (Duration days: 1 hours: 2 minutes: 3 seconds: 4 nanoSeconds: 5 ) asMilliSeconds positive.

It should be 93784000, but I get -906.

The conversion asMilliSeconds is going through a 47 bits intermediate integer holding the number of nanoseconds in the delay:
     ^ ((seconds * NanosInSecond) + nanos) // (10 raisedToInteger: 6)

With an undetected overflow it would be equivalent to something like:
    ^((seconds * NanosInSecond) + nanos bitAnd: 16rFFFFFFFF) - (1<<32) // (10 raisedToInteger: 6)

...exactly -906 BINGO! Typically the overflow we were supposed to detect in post-condition!

OK, that's not a new thing, the bug was already known when I published this http://lists.squeakfoundation.org/pipermail/vm-dev/2012-August/011161.html  (Stefan Marr reported the bug somewhere in the thread). I provided a fix at the google-code issue tracker, but this was closed by Pharo team, so now the fix is roting at Pharo issue tracker du jour (unfortunately with restricted access): https://pharo.fogbugz.com/f/cases/11364/Don-t-test-signed-SmallInteger-multiply-overflow-in-post-condition

Anyway, with 64bits Spur, the fix must be revised, so let's go back to Smalltalk code. We can easily implement the overflow test in precondition as prescribed in the link of last post:
bytecodePrimMultiply
    | rcvr arg result overflow oop |
    <var: #result type: 'sqLong'>
    rcvr := self internalStackValue: 1.
    arg := self internalStackValue: 0.
    (objectMemory areIntegers: rcvr and: arg)
        ifTrue: [rcvr := objectMemory integerValueOf: rcvr.
                arg := objectMemory integerValueOf: arg.
                result := rcvr.
                (self sizeof: rcvr) + (self sizeof: arg) <= 8
                    ifTrue:
                        [overflow := false.]
                    ifFalse:
                        [overflow := rcvr > 0
                            ifTrue:  [arg > 0
                                ifTrue: [rcvr > (16r7FFFFFFFFFFFFFFF / arg)]
                                ifFalse: [arg < (-16r8000000000000000 / rcvr)]]
                            ifFalse: [arg > 0
                                ifTrue: [rcvr < (-16r8000000000000000 / arg)]
                                ifFalse: [(rcvr < 0) and: [arg < (16r7FFFFFFFFFFFFFFF / rcvr)]]]].
                overflow
                    ifFalse:
                        [result := result * arg.
                        oop := self signed64BitIntegerFor: result.
                          self internalPop: 2 thenPush: oop.
                        ^self fetchNextBytecode "success"]]
        ifFalse: [self initPrimCall.
                self externalizeIPandSP.
                self primitiveFloatMultiply: rcvr byArg: arg.
                self internalizeIPandSP.
                self successful ifTrue: [^ self fetchNextBytecode "success"]].

    messageSelector := self specialSelector: 8.
    argumentCount := 1.
    self normalSend


If we are on a 32 bits machine, then we never generate overflow because the product is evaluated on 64 bits, so we have a very fast solution using Large Integers.

If SmallInteger can be longer than 32bits (which is the case of Spur64), then the overflow might happen - unless sqLong is 128 bits but we don't even bother to test it, since we don't have a portable signed128BitIntegerFor: available anyway. 

All is well, except that... bad luck, the code generated is incorrect on 64bits for other reasons: the literals are generated as unsigned long long!!! Comparing signed and unsigned of same length will promote to unsigned, and the overflow tests will miserably fail. Writing correct C code is hard enough, but generating correct C code in all cases is a nightmare, especially when whe hope to do it with clever hacks...

After some iterations, I came to this implementation in CCodeGenerator:
cLiteralFor: anObject
    "Return a string representing the C literal value for the given object."
    anObject isNumber
        ifTrue:
            [anObject isInteger ifTrue:
                [| printString |
                 printString := (anObject > 0
                                and: [(anObject >> anObject lowBit + 1) isPowerOfTwo
                                and: [(anObject highBit = anObject lowBit and: [anObject > 65536])
                                      or: [anObject highBit - anObject lowBit >= 4]]])
                                    ifTrue: ['0x', (anObject printStringBase: 16)]
                                    ifFalse: [anObject printString].
                ^anObject > 16r7FFFFFFFFFFFFFFF
                        ifTrue: [printString, self unsignedLong64Suffix]
                        ifFalse: [anObject > 16rFFFFFFFF
                            ifTrue: [printString,self signedLong64Suffix]
                                ifFalse: [anObject > 16r7FFFFFFF
                                ifTrue: [printString,self unsignedLong32Suffix]
                                ifFalse: [anObject < -16r80000000
                                    ifTrue: [anObject = -16r8000000000000000
                                            ifTrue: ['(-0x7FFFFFFFFFFFFFFF',self signedLong64Suffix,'-1)']
                                            ifFalse: [printString,self signedLong64Suffix]]
                                    ifFalse: [anObject = -16r80000000
                                            ifTrue: ['(-0x7FFFFFFF-1)']
                                            ifFalse: [printString]]]]]].
            anObject isFloat ifTrue:
                [^anObject printString]]
        ifFalse:
            [anObject isSymbol ifTrue:
                [^self cFunctionNameFor: anObject].
            anObject isString ifTrue:
                [^'"', (anObject copyReplaceAll: (String with: Character cr) with: '\n') , '"'].
            anObject == nil ifTrue: [^ 'null' ].
            anObject == true ifTrue: [^ '1' ].
            anObject == false ifTrue: [^ '0' ].
            anObject isCharacter ifTrue:
                [^anObject == $'
                    ifTrue: ['''\'''''] "i.e. '\''"
                    ifFalse: [anObject asString printString]]].
    self error: 'Warning: A Smalltalk literal could not be translated into a C constant: ', anObject printString.
    ^'"XXX UNTRANSLATABLE CONSTANT XXX"'


Note that the generation is all but uniform, it goes through signed, unsigned, signed, unsigned when the magnitude grows... These hacks are those reducing the number of compiler warnings and currently lead to mostly correct code (I can't remember if it was necessary to force in slang code an asUnsignedInteger here or there, or the contrary). Note that I also implemented unsignedLong32Suffix, signedLong64Suffix and unsignedLong64Suffix in the CCodeGenerator, like this:
unsignedLong64Suffix
    "Answer the suffix that should be appended to a 64bits literal in generated code."

    ^(self sizeOfIntegralCType: #'unsigned long') = 4
        ifTrue: ['ULL']
        ifFalse: ['UL']


The method cLiteralFor: has a sister cLiteralFor:name: which should be modified too.

So, for compiling cog and spur 32bits with LLVM 3.0 and without exotic (heretic) compiler options, one should at least fix bytecodePrimMultiply, and for 64bits spur, at least cLiteralFor: and co.

I can tell that my Spur32 and cog32 VM are working fine with these changes. Not yet the Spur64, I still can't compile it for other reasons (MacOSX SDK problems...).

As usual, all the material is published on my VMMaker branch at smalltalkhub.com NiceVMExperiments VMMaker.oscog-nice.1157, but mixed with other changes.