ZSpy - HyperSpy’s Zarr Specification

Like the hspy format, the .zspy format guarantees that no information is lost in the writing process and supports saving data of arbitrary dimensions. It is based on the Zarr project, which is intended as a drop-in replacement for HDF5 that addresses some of the speed and scaling issues of the HDF5 format, making it suitable for saving big data. Example using HyperSpy:

>>> import hyperspy.api as hs
>>> s = hs.signals.BaseSignal([0])
>>> s.save('test.zspy') # will save in nested directory
>>> hs.load('test.zspy') # loads the directory

When saving to zspy, all supported objects in the signal’s metadata are stored. This includes lists, tuples and signals. Please note that, to increase saving efficiency and speed, the innermost structures are converted to numpy arrays where possible when saved. This procedure homogenizes the types of the objects inside, most notably casting numbers to strings if any strings are present.
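The type homogenization described above can be illustrated with plain NumPy. This is only a sketch of the general NumPy behaviour, not the conversion code used by the zspy writer, and the list `mixed` is a made-up example value.

```python
import numpy as np

# A list mixing numbers and a string, as might appear in metadata.
mixed = [1, 2.5, "a"]

# NumPy picks a single dtype for the whole array; with a string present,
# every element is stored as a string.
as_array = np.array(mixed)

print(as_array.dtype.kind)  # "U" denotes a unicode string dtype
print(as_array.tolist())
```

If the numeric types matter after loading, it may be safer to keep numbers and strings in separate metadata entries.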

By default, a zarr.storage.NestedDirectoryStore is used, but other Zarr stores can be used by passing a zarr.storage store instance instead of a filename to the save() or load() function. If a .zspy file has been saved with a different store, it needs to be loaded by passing a store of the same type:

>>> import zarr
>>> filename = 'test.zspy'
>>> store = zarr.LMDBStore(filename)
>>> s.save(store) # saved to LMDB

To load this file again:

>>> import zarr
>>> filename = 'test.zspy'
>>> store = zarr.LMDBStore(filename)
>>> s = hs.load(store) # load from LMDB

API functions

rsciio.zspy.file_reader(filename, lazy=False, **kwds)

Read data from zspy files saved with the HyperSpy zarr format specification.

Parameters:
  • filename (str, pathlib.Path) – Filename of the file to read or corresponding pathlib.Path.

  • lazy (bool, default=False) – Whether to open the file lazily or not.

  • **kwds (optional) – Pass keyword arguments to the zarr.open() function.

Returns:

List of dictionaries containing the following fields:

  • ‘data’ – multidimensional numpy array

  • ‘axes’ – list of dictionaries describing the axes, each containing the fields ‘name’, ‘units’, ‘index_in_array’, and either ‘size’, ‘offset’, and ‘scale’ or a numpy array ‘axis’ containing the full axes vector

  • ‘metadata’ – dictionary containing the parsed metadata

  • ‘original_metadata’ – dictionary containing the full metadata tree from the input file

Return type:

list of dicts
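The two documented forms of an axis dictionary can be handled as in the sketch below. `axis_values` is a hypothetical helper, not part of rsciio, and the sample dictionaries are invented for illustration. It assumes that a uniform axis follows value_i = offset + i * scale.

```python
def axis_values(axis):
    """Return the coordinate values of one axis dictionary (hypothetical helper)."""
    if "axis" in axis:
        # Non-uniform axis: the full vector is stored explicitly.
        return list(axis["axis"])
    # Uniform axis: assumed to follow value_i = offset + i * scale.
    return [axis["offset"] + i * axis["scale"] for i in range(axis["size"])]


# Made-up axis dictionaries in the two documented forms.
uniform = {"name": "x", "units": "nm", "index_in_array": 0,
           "size": 4, "offset": 10.0, "scale": 0.5}
explicit = {"name": "t", "units": "s", "index_in_array": 1,
            "axis": [0.0, 1.0, 4.0, 9.0]}

print(axis_values(uniform))
print(axis_values(explicit))
```

Checking for the ‘axis’ key first keeps the helper usable for both forms without requiring the caller to know which one was saved.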

rsciio.zspy.file_writer(filename, signal, close_file=True, **kwds)

Writes data to HyperSpy’s zarr format.

Parameters:
  • filename (str, pathlib.Path) – Filename of the file to write to or corresponding pathlib.Path.

  • signal (dict) –

    Dictionary containing the signal object. Should contain the following fields:

    • ‘data’ – multidimensional numpy array

    • ‘axes’ – list of dictionaries describing the axes, each containing the fields ‘name’, ‘units’, ‘index_in_array’, and either ‘size’, ‘offset’, and ‘scale’ or a numpy array ‘axis’ containing the full axes vector

    • ‘metadata’ – dictionary containing the metadata tree

  • close_file (bool, default=True) – Close the file after writing. Only relevant for Zarr stores that must be closed to flush data to disk (for example zarr.storage.ZipStore and zarr.storage.DBMStore). If False, the file is left open after writing, which is required if the data needs to be accessed lazily after saving.

  • chunks (tuple of int or None, default=None) – Define the chunking used for saving the dataset. If None, chunks are calculated for the signal, preferably with at least one chunk per signal space.

  • compressor (numcodecs compression, optional) – Compressor used to compress the data when saving, see Numcodecs codec. The default is to use a Blosc compressor.

  • write_dataset (bool, default=True) – If False, doesn’t write the dataset when writing the file. This can be useful to overwrite signal attributes only (for example axes_manager) without having to write the whole dataset, which can take time.

  • **kwds – The keyword arguments are passed to the zarr.hierarchy.Group.require_dataset() function.

Examples

>>> from numcodecs import Blosc
>>> from rsciio.zspy import file_writer
>>> compressor = Blosc(cname='zstd', clevel=1, shuffle=Blosc.SHUFFLE) # used by default
>>> # s is a signal dictionary with the fields described above
>>> file_writer('test.zspy', s, compressor=compressor) # will save with Blosc compression

Note

Lazy operations are often I/O-bound: reading and writing the data creates a bottleneck because of the slow read/write speed of many hard disks. In these cases, compressing your data often speeds up some operations, because there is less to read and write, at the cost of slightly more computational work on the CPU.
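The trade-off in this note can be made concrete with a small sketch. It uses zlib from the Python standard library purely as a stand-in for Blosc, and the all-zero buffer is an artificial best case, so the sizes it prints are not representative of real data.

```python
import zlib

# Highly repetitive bytes, standing in for a sparse or smooth dataset.
raw = bytes(1_000_000)

# zlib at level 1 is a stand-in here; the .zspy writer defaults to Blosc.
compressed = zlib.compress(raw, 1)

# Fewer bytes would have to be written to and read from disk.
print(len(raw), len(compressed))

# Decompression restores the original bytes exactly.
restored = zlib.decompress(compressed)
```

Real signals usually compress far less than this buffer, so whether compression helps depends on the data and on how slow the storage is relative to the CPU.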