Wednesday, March 8, 2017

Chasing Smallapack inversion BUG

Smallapack is the Smalltalk interface to LAPACK, the famous Linear Algebra Package (see github).

Despite the decent level of stability achieved by Smallapack these last years, interfacing a huge library like this can't be that simple, it's like asking for trouble. The last bug reported was about non reproducibility of matrix inversion. Indeed, it is easily reproducible on Mac OSX with this snippet:

(10 to: 100) detect: [:k |
    a := LapackSGEMatrix rows: k columns: k.
    a fillRandUniform.
    x := ((1 to: 100) collect: [:i | a reciprocal]) asSet asArray.

    x size > 1].


Above code does invariably return 32 on Mac OSX Squeak/Pharo, whatever VM, 32 or 64 bits, Spur or Cog.

On Squeak, a short path for trying above code is to first file this in:

((Installer ss project: 'Smallapack') package: 'ConfigurationOfSmallapack') latest install.!
ConfigurationOfSmallapack loadLatestVersion.!


I've long been clueless about this one, debugging 32x32 matrices is no fun (prepare to scan thousands of digits), especially when the differences between matrices are small, and especially when results differ from one shot to another... Inspection of code and step by step debugging did not reveal anything. Anyway, why does it happen only in single precision and not double? And why not in Linux?

I tried to reproduce above behavior with a C program without success for a while. Until I discovered that the results of C did differ in 32 and 64 bits.  That's been the enlightment: Apple internally use different optimized paths for the Accelerate.framework. Especially when the alignment of data varies, the proof in C below:

//
//  main.c
//  test_sgetri
//
//  Created by Nicolas Cellier on 30/01/2017.
//

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <Accelerate/Accelerate.h>

typedef struct {
    int nr;
    int nc;
    float *data;
    float *alloc;
} sge;

unsigned int align = 0;

sge sgealloc( int nr , int nc )
{
    sge new;
    new.nr = nr;
    new.nc = nc;
    new.alloc = (float *)malloc( sizeof(float) * (nr * nc + align) );
    new.data = new.alloc + align;
    return new;
}

void sgefree( sge const * m )
{
    free(m->alloc);
}

sge sgedup( sge const * m)
{
    sge new = sgealloc( m->nr , m->nc );
    memcpy( new.data , m->data , sizeof(float) * m->nr * m->nc);
    return new;
}

sge sgerand(int nr,int nc)
{
    sge m = sgealloc(nr,nc);
    int isUniform = 1;
    int seed[4] = {1 , 11 , 111 , 1111 };
    int size = nr * nc;
    slarnv_(
            & isUniform,
            seed,
            & size,
            m.data);
    return m;
}

sge sgeinv( sge const * m )
{
    sge a = sgedup( m );
    int *ipiv = malloc( sizeof(long) * m->nr);
    int info;
    sgetrf_(
        & a.nr,
        & a.nc,
        a.data,
        & a.nr,
        ipiv,
        &info
    );
    if( info ) {
        fprintf(stderr,"sgetrf failed info=%d\n",info);
        abort();
    }

    float w;
    int lwork=-1;
    sgetri_(
            & a.nr,
            a.data,
            & a.nr,
            ipiv,
            & w,
            & lwork,
            & info
    );
    lwork = (info==0)
        ? (int) w
        : a.nr * 10;
    float *work = malloc( sizeof(float) * lwork );
    sgetri_(
            & a.nr,
            a.data,
            & a.nr,
            ipiv,
            work,
            & lwork,
            & info
            );
    if( info ) {
        fprintf(stderr,"sgetri failed info=%d\n",info);
        abort();
    }
    free(work);
    free(ipiv);
    return a;
}

int main(int argc, const char * argv[]) {
    sge m = sgerand(32,32);
    sge ref = sgeinv( &m );
    for( unsigned int i=0;i < 16;i++) {
        align = i % 8;
        sge tmp = sgeinv( &m );
        if( memcmp( ref.data , tmp.data , sizeof(float) * m.nr * m.nc) ) {
            fprintf(stderr,"reciprocal differs from reference at step i=%u\n",i);
            abort();
        }
        sgefree( &tmp );
    }
    sgefree( &ref );
    sgefree( &m );
    return 0;
}

 


Conclusion, every problem I encounter in Smallapack can be solved by the same swiss knife solution: just check the alignment. Objects are rellocatable by garbage collector in Smalltalk, so there is no easy fix. Maybe I'll have to insist on using well aligned heap instead of Smalltalk memory, I thought I was relieved of such heaviness since the advent of 8 byte alignment in Spur. Alas 8 is not enough... Who knows if a future SSE5 won't require a 2^5 alignment. Tsss!