Smallapack is the Smalltalk interface to LAPACK, the famous Linear Algebra Package (see github).
Despite the decent level of stability achieved by Smallapack these last years, interfacing a huge library like this can't be that simple, it's like asking for trouble. The last bug reported was about non reproducibility of matrix inversion. Indeed, it is easily reproducible on Mac OSX with this snippet:
(10 to: 100) detect: [:k |
a := LapackSGEMatrix rows: k columns: k.
a fillRandUniform.
x := ((1 to: 100) collect: [:i | a reciprocal]) asSet asArray.
x size > 1].
Above code does invariably return 32 on Mac OSX Squeak/Pharo, whatever VM, 32 or 64 bits, Spur or Cog.
On Squeak, a short path for trying above code is to first file this in:
((Installer ss project: 'Smallapack') package: 'ConfigurationOfSmallapack') latest install.!
ConfigurationOfSmallapack loadLatestVersion.!
I've long been clueless about this one, debugging 32x32 matrices is no fun (prepare to scan thousands of digits), especially when the differences between matrices are small, and especially when results differ from one shot to another... Inspection of code and step by step debugging did not reveal anything. Anyway, why does it happen only in single precision and not double? And why not in Linux?
I tried to reproduce above behavior with a C program without success for a while. Until I discovered that the results of C did differ in 32 and 64 bits. That's been the enlightment: Apple internally use different optimized paths for the Accelerate.framework. Especially when the alignment of data varies, the proof in C below:
//
// main.c
// test_sgetri
//
// Created by Nicolas Cellier on 30/01/2017.
//
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <Accelerate/Accelerate.h>
typedef struct {
int nr;
int nc;
float *data;
float *alloc;
} sge;
unsigned int align = 0;
sge sgealloc( int nr , int nc )
{
sge new;
new.nr = nr;
new.nc = nc;
new.alloc = (float *)malloc( sizeof(float) * (nr * nc + align) );
new.data = new.alloc + align;
return new;
}
void sgefree( sge const * m )
{
free(m->alloc);
}
sge sgedup( sge const * m)
{
sge new = sgealloc( m->nr , m->nc );
memcpy( new.data , m->data , sizeof(float) * m->nr * m->nc);
return new;
}
sge sgerand(int nr,int nc)
{
sge m = sgealloc(nr,nc);
int isUniform = 1;
int seed[4] = {1 , 11 , 111 , 1111 };
int size = nr * nc;
slarnv_(
& isUniform,
seed,
& size,
m.data);
return m;
}
sge sgeinv( sge const * m )
{
sge a = sgedup( m );
int *ipiv = malloc( sizeof(long) * m->nr);
int info;
sgetrf_(
& a.nr,
& a.nc,
a.data,
& a.nr,
ipiv,
&info
);
if( info ) {
fprintf(stderr,"sgetrf failed info=%d\n",info);
abort();
}
float w;
int lwork=-1;
sgetri_(
& a.nr,
a.data,
& a.nr,
ipiv,
& w,
& lwork,
& info
);
lwork = (info==0)
? (int) w
: a.nr * 10;
float *work = malloc( sizeof(float) * lwork );
sgetri_(
& a.nr,
a.data,
& a.nr,
ipiv,
work,
& lwork,
& info
);
if( info ) {
fprintf(stderr,"sgetri failed info=%d\n",info);
abort();
}
free(work);
free(ipiv);
return a;
}
int main(int argc, const char * argv[]) {
sge m = sgerand(32,32);
sge ref = sgeinv( &m );
for( unsigned int i=0;i < 16;i++) {
align = i % 8;
sge tmp = sgeinv( &m );
if( memcmp( ref.data , tmp.data , sizeof(float) * m.nr * m.nc) ) {
fprintf(stderr,"reciprocal differs from reference at step i=%u\n",i);
abort();
}
sgefree( &tmp );
}
sgefree( &ref );
sgefree( &m );
return 0;
}
Conclusion, every problem I encounter in Smallapack can be solved by the same swiss knife solution: just check the alignment. Objects are rellocatable by garbage collector in Smalltalk, so there is no easy fix. Maybe I'll have to insist on using well aligned heap instead of Smalltalk memory, I thought I was relieved of such heaviness since the advent of 8 byte alignment in Spur. Alas 8 is not enough... Who knows if a future SSE5 won't require a 2^5 alignment. Tsss!
Not knowing the answer, how hard would it be to have Spur ask the class/instance what it's alignment should be when relocated, with the fallback being 8 byte alignment? Are there spare flags around to allow for that mechanism to be quick?
ReplyDelete-cbc
Currently the raw bytes are stored into a generic and agnostic ByteArray (with the meaning of VW UninterpretedBytes). Maybe if using something like a 128-bit QuadWordArray (Eliot just added the 64-bits DoubleWordArray)? Ah, it seems that the object format of Spur has no room for it...
ReplyDelete