Monday, March 26, 2018

Interfacing Smalltalk with HDF5

I am going to speak about connecting Smalltalk with external (scientific) data again.

This time, it's not with Matlab but with HDF5, the Hierarchical Data Format maintained by the HDF Group (http://www.hdfgroup.org).

Every language targeting the science/engineering niche must have an interface to HDF5. That is the case for Python and Matlab, to cite only two. And you know that I'd like to promote the usage of Smalltalk in this area too. So let's do it: here comes the HDF5 bundle in the Cincom public store.

If interfacing with the Matlab mat-file format was a piece of cake, HDF5 is much more involved. First, because HDF5 is like a file system. There are named Groups of data that nest recursively, like directories. And groups are not necessarily arranged as a tree, but can form arbitrary graphs (including cycles) thanks to links. Second, because HDF5 comes with a type system. It can hold arbitrary types, whether atomic (integers, floating point, bit fields) or composite (structures, arrays). The types can even be references to other objects, which means that it is sufficiently general to describe heterogeneous collections of dynamically typed data.

There are two ways to store data. The first is the Dataset: Datasets are named entries in Groups (a bit like files in directories). A Dataset is a rectilinear multi-dimensional array (like the MultipleDimensionArray that I recently promoted in the VisualWorks public store and Squeak STEM). The second is the Attribute. All named entries (Groups, Datasets and named types) can have attributes, which are a kind of string-key, arbitrary-value pair. Attributes generally hold metadata. Unlike Datasets, they cannot be read or written by sub-region. Attributes can be loosely compared to the property list of Morphic.

For additional complexity from the interfacing point of view, arrays can be of variable length. Thus the buffer for holding complex data cannot be preallocated from within Smalltalk before the data transfer, but has to be allocated on the fly by the HDF5 library. That potentially means either memory leaks or dangling references to freed memory, or both if the programmer worked too late at night.

HDF5 comes with a lot of documentation, including a reference manual, a user guide, tutorials and examples, if you want to learn more.

Among the many features of HDF5, let's focus on some of the most essential:
  • the ability to perform type transformations during transfer operations. This enables language interoperability, since the type system is flexible enough to accommodate many languages (if not all, through the C interface).
  • the ability to read/write sub-regions of a dataset. This enables handling of huge data, since the whole dataset does not have to reside in memory.
  • it scales well (or at least it can if used adequately).
That being said, this comes with a price: HDF5 is complex, and like the manipulated data, the API is bigger than big too.
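To make the sub-region idea from the list above concrete, here is a minimal workspace sketch in plain Smalltalk. It does not call the HDF5 bundle at all: it only computes which elements of a row-major 4 x 6 array a rectangular selection covers. The names start and count follow the terms HDF5 uses for hyperslab selections; dims and offsets are illustrative names chosen for this sketch.

```smalltalk
| dims start count offsets |
dims := #(4 6).      "extent of the whole dataset: 4 rows, 6 columns"
start := #(1 2).     "zero-based corner of the sub-region"
count := #(2 3).     "extent of the sub-region: 2 rows, 3 columns"
offsets := OrderedCollection new.
"Row-major layout: linear offset = row * numberOfColumns + column"
0 to: (count at: 1) - 1 do: [:i |
    0 to: (count at: 2) - 1 do: [:j |
        offsets add: ((start at: 1) + i) * (dims at: 2) + ((start at: 2) + j)]].
offsets asArray    "zero-based offsets 8 9 10 14 15 16"
```

Only those six elements need to be transferred; the remaining eighteen never have to be loaded into the image.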

It's clear that the target languages that the HDF5 creators had in mind were statically typed. So the mapping to dynamically typed objects is neither straightforward nor optimized, especially for the compound type (structure). The transfer necessarily involves two steps: the transfer of raw data, followed by a transformation to Smalltalk data for a read, and vice versa for a write. For handling all cases, including references, a visitor pattern will be necessary (references can be cyclic too).
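Because links and references can be cyclic, any such visit has to remember what it has already seen. The following workspace sketch shows only that bookkeeping, in plain Smalltalk and without the HDF5 bundle. Dictionaries stand in for groups, and every name here (root, sub, visited, leaves, walk) is made up for the illustration; this is not how the published proxies are implemented.

```smalltalk
| root sub visited leaves walk |
"Stand-in object graph: each Dictionary plays the role of a group,
 mapping a link name to a child group or to a leaf value."
root := Dictionary new.
sub := Dictionary new.
root at: 'float1' put: 1.3e0.
root at: 'sub' put: sub.
sub at: 'int3' put: -357.
sub at: 'back' put: root.    "a link back to the root makes the graph cyclic"

visited := IdentitySet new.
leaves := OrderedCollection new.
walk := nil.
walk := [:node |
    "Visit each group once, so the cyclic link cannot cause endless recursion"
    (visited includes: node) ifFalse: [
        visited add: node.
        node keysAndValuesDo: [:name :child |
            (child isKindOf: Dictionary)
                ifTrue: [walk value: child]
                ifFalse: [leaves add: name -> child]]]].
walk value: root.
```

The IdentitySet compares groups by identity rather than by content, which is what matters when the same object is reachable through several paths.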

For huge data, the way to go is to use proxies to HDF5 objects. That means giving minimal behavior to the proxies, at least for storing/retrieving whole data or sub-regions. Application-specific behavior should rather live in application-specific objects, and those objects would use a generic HDF5 proxy. Since the data structure is hierarchical (recursive), an application-specific visitor should construct the application-specific object graph. The early implementation that I just published is not there yet. It does not even use a visitor, but hardcodes an arbitrary visit in the HDF5 proxies. It's more of a minimal proof of concept at this stage, but a usable one.

So, how do we use it? For creating an HDF5 file, first do this:

out := H5File create: 'foo.h5'.
H5Dataset createPath: 'float1' parent: out rootGroup value: 1.3e0.
H5Dataset createPath: 'double2' parent: out rootGroup value: Double pi.
H5Dataset createPath: 'int3' parent: out rootGroup value: -357.
out close.


The close operation is not strictly necessary, because I implemented a registry of open HDF5 objects with an auto-close facility when the entries are reclaimed. But forcing a close flushes the file; otherwise the close will be delayed until all open HDF5 entities have been reclaimed by the garbage collector.
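If you prefer the flush to happen at a known point even when a write fails half way, the standard ensure: idiom can wrap the writes. This is only a sketch: it assumes the HDF5 bundle is loaded, and it uses nothing beyond the selectors already shown in this post.

```smalltalk
| out |
"Assumption: H5File and H5Dataset come from the HDF5 bundle described in this post"
out := H5File create: 'foo.h5'.
[H5Dataset createPath: 'float1' parent: out rootGroup value: 1.3e0.
 H5Dataset createPath: 'int3' parent: out rootGroup value: -357]
    ensure: [out close].    "runs whether or not the writes raised an error"
```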

For reading, it's like this:

in := H5File readOnly: 'foo.h5'.
float1 := (in / 'float1') value.
double2 := (in / 'double2') value.
int3 := (in / 'int3') value.
in close.


This example is a bit thin. We can also create/query groups and attributes, and handle more complex data, thanks to MultipleDimensionArray, RawArray for atomic data types, and the CArrayAccessor of Smallapack for arbitrary data.

For writing Smalltalk objects on HDF5 files, one must define these 3 essential selectors:
  • h5mem returns a buffer containing the data to transfer and usable by the HDF5 API (must be compatible with the DLL/C Connect interface)
  • h5type gives an HDF5 type description of the buffer contents
  • h5space gives an HDF5 description of the buffer layout (dimensions of the dataset)
Last note: I've used HDF5 version 1.8.10 for the interface. Unfortunately, HDF5 uses macros to provide backward compatibility, and given the way DLL/C Connect works, we will have to subclass H5Interface in order to support multiple versions.

That's all for this post; I may add more implementation details later on.
