SpatialData objects

Sparrow stores the input data in a SpatialData object. SpatialData is a standard format for storing spatial omics data in Python. A SpatialData object consists of several layers, each holding a different type of data:

  • images: H&E, DAPI, polyT, membrane stainings
  • labels: annotations of pixels, regions of interest or cell segmentation masks (which pixels belong to which cell)
  • points: transcript or spot locations
  • shapes: polygons, typically the outlines of cells (these take less memory than storing cell segmentation masks as labels)
  • tables: gene or protein expression

The SpatialData ecosystem provides reader functions for the major vendors (through the spatialdata-io package) and stores the data on disk in chunks, so you do not have to load the full dataset into memory.

Images, points and labels are lazy: their data is not loaded into memory when you read the object from the Zarr store.

What is Dask?

Dask is a Python package for parallel computing. For instance, it provides Dask DataFrames: collections of multiple pandas DataFrames that can be processed in parallel. This speeds up the analysis and avoids the need to hold the full dataset in your computer's memory. In a similar way, it offers Dask Arrays: collections of multiple numpy arrays.

This makes Dask well suited to large datasets such as spatial omics data, which often do not fit in memory.
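
To make the idea of chunked, lazy computation concrete, here is a minimal sketch using a small random array as a stand-in for an image. The array size and chunk size are arbitrary choices for illustration, not values used by Sparrow.

```python
import numpy as np
import dask.array as da

# A regular numpy array: fully held in memory.
rng = np.random.default_rng(0)
image = rng.random((2000, 2000))

# Wrap it as a Dask Array split into 500 x 500 chunks (4 x 4 = 16 blocks).
lazy = da.from_array(image, chunks=(500, 500))
print(lazy.numblocks)

# Operations on a Dask Array only build a task graph; nothing is computed yet.
lazy_mean = lazy.mean()

# compute() runs the graph, chunk by chunk and in parallel where possible.
result = lazy_mean.compute()
print(result)
```

With data read from disk instead of an in-memory array, only the chunks needed for a computation are loaded at any one time.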

Layers in the SpatialData objects: Images

Computers don’t see images, but values that are attributed to pixels. As a result, computers can interpret images differently than we do.

In spatial omics analysis, an image is an xarray.DataArray, which stores pixel intensities along with coordinates and transformations. Xarray wraps an underlying array, here a Dask Array, so images are essentially stored as Dask Arrays. Calling compute() converts a Dask Array into a regular numpy array. Both array types share many attributes and methods.
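
The relation between xarray, Dask and numpy can be sketched as follows. The tiny two-channel image, the channel names and the chunk sizes are made up for illustration; a real image would come from the SpatialData object.

```python
import numpy as np
import dask.array as da
import xarray as xr

# Toy image with 2 channels, 4 rows (y) and 6 columns (x).
data = np.arange(2 * 4 * 6, dtype=np.uint16).reshape(2, 4, 6)

# A DataArray backed by a Dask Array, with named dimensions and channel labels.
image = xr.DataArray(
    da.from_array(data, chunks=(1, 2, 3)),
    dims=("c", "y", "x"),
    coords={"c": ["DAPI", "polyT"]},
)

print(type(image.data))          # the underlying Dask Array
print(image.shape, image.dtype)  # attributes shared with numpy arrays

# Selecting a channel by name is still lazy.
dapi = image.sel(c="DAPI")

# compute() on the DataArray returns a DataArray backed by numpy;
# compute() on the underlying Dask Array returns a plain numpy array.
in_memory = dapi.compute()
as_numpy = dapi.data.compute()
```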

Layers in the SpatialData objects: Labels

Labels represent cell segmentation masks, i.e. the areas of the cells, generated by a segmentation algorithm such as Cellpose. Each cell area is enclosed by a cell boundary; these boundaries are stored in the shapes layer of the SpatialData object.
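
As an illustration, a segmentation mask can be thought of as an integer array in which 0 is background and every other value is a cell ID. The small array below is a hand-made toy example, not the output of a segmentation algorithm.

```python
import numpy as np

# Toy segmentation mask: 0 = background, 1..3 = cell IDs.
labels = np.array(
    [
        [0, 0, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [2, 2, 0, 0, 0],
        [2, 2, 2, 0, 3],
    ],
    dtype=np.uint32,
)

# Which cells are present, and how many pixels does each one cover?
cell_ids, areas = np.unique(labels[labels > 0], return_counts=True)
for cell_id, area in zip(cell_ids, areas):
    print(f"cell {cell_id}: {area} pixels")

# Boolean mask of the pixels that belong to cell 1.
cell_1 = labels == 1
```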

Layers in the SpatialData objects: Shapes

Shapes are GeoDataFrames, manipulated via the geopandas package. Each row contains a cell ID and a geometry. Shapes are generated by the sparrow.im.segment() or the harpy.sh.vectorize() function. Each shape represents the boundary of a cell: via the cell ID, each boundary is linked to one segmentation mask in the labels layer.
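
The following sketch builds a small GeoDataFrame by hand to show what a shapes layer looks like. The rectangular "cells" and the index name cell_ID are assumptions for illustration; the shapes produced by Sparrow or Harpy may use different names.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Two rectangular toy "cells", each identified by a cell ID.
shapes = gpd.GeoDataFrame(
    {"cell_ID": [1, 2]},
    geometry=[
        Polygon([(0, 0), (4, 0), (4, 3), (0, 3)]),
        Polygon([(5, 5), (7, 5), (7, 9), (5, 9)]),
    ],
).set_index("cell_ID")

# geopandas exposes geometric properties per row.
print(shapes.geometry.area)
print(shapes.geometry.centroid)

# Look up the boundary of a single cell by its ID.
boundary_of_cell_2 = shapes.loc[2, "geometry"]
```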

Layers in the SpatialData objects: Points

Points are Dask DataFrames (very similar to pandas DataFrames, but lazy). They represent the locations of the transcripts.

Layers in the SpatialData objects: Tables

Tables are AnnData objects (annotated data matrices). They contain the following attributes:

  • X: gene expression matrix, rows are cells, columns are genes
  • obs: pandas DataFrame with cell metadata: cluster, cell type, batch information…
  • var: gene metadata: gene name…
  • uns: dictionaries with additional unstructured data
  • obsm: numpy arrays with centroid coordinates of cells

Region and instance keys

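AnnData objects generated by Sparrow are annotated by the labels layer (segmentation masks). The cell ID forms the link between the segmentation masks, the shapes and the AnnData object. The columns that link the different layers in the SpatialData object are called keys: the region key names the column that says which layer a row of the table annotates, and the instance key names the column that holds the cell ID within that layer.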
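
The linking role of these columns can be sketched with plain pandas and numpy. The column names cell_ID and fov_labels and the layer name segmentation_mask are hypothetical; check the table of your own SpatialData object for the names actually used.

```python
import numpy as np
import pandas as pd

# Toy labels layer: 0 = background, 1 and 2 = cell IDs.
labels = np.array(
    [
        [1, 1, 0],
        [0, 2, 2],
        [0, 2, 2],
    ]
)

# Toy cell metadata, as it could appear in the obs of a table.
obs = pd.DataFrame(
    {
        "cell_ID": [1, 2],                         # instance key column (hypothetical name)
        "fov_labels": ["segmentation_mask"] * 2,   # region key column (hypothetical name)
        "cell_type": ["T cell", "hepatocyte"],
    }
)

# Use the cell ID to fetch information from the labels layer for every table row.
areas = {int(cell_id): int((labels == cell_id).sum()) for cell_id in obs["cell_ID"]}
obs["area_px"] = obs["cell_ID"].map(areas)
print(obs)
```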

Vectorization and rasterization

To convert between labels and shapes, you can use the following functions:

  • harpy.sh.vectorize() to convert labels (segmentation masks) into shapes (cell boundaries)
  • harpy.im.rasterize() to convert shapes into labels
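
To show what the two conversions mean conceptually, the sketch below implements a naive version of both with numpy and shapely. The functions vectorize() and rasterize() defined here are toy stand-ins written for this example; they are not the Harpy functions and do not reflect their signatures or implementation.

```python
import numpy as np
from shapely.geometry import Point, box
from shapely.ops import unary_union

labels = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 2],
        [0, 0, 2, 2],
    ],
    dtype=np.int32,
)


def vectorize(labels):
    """Toy vectorization: merge the unit squares of each cell's pixels into one shape."""
    shapes = {}
    for cell_id in np.unique(labels):
        if cell_id == 0:  # skip background
            continue
        rows, cols = np.nonzero(labels == cell_id)
        pixels = [box(c, r, c + 1, r + 1) for r, c in zip(rows.tolist(), cols.tolist())]
        shapes[int(cell_id)] = unary_union(pixels)
    return shapes


def rasterize(shapes, out_shape):
    """Toy rasterization: a pixel gets a cell ID if its center lies inside that shape."""
    out = np.zeros(out_shape, dtype=np.int32)
    for cell_id, polygon in shapes.items():
        for r in range(out_shape[0]):
            for c in range(out_shape[1]):
                if polygon.contains(Point(c + 0.5, r + 0.5)):
                    out[r, c] = cell_id
    return out


shapes = vectorize(labels)
round_trip = rasterize(shapes, labels.shape)
print(np.array_equal(round_trip, labels))
```

The pixel-by-pixel loop is only practical for tiny arrays; it is meant to make the relation between the two representations explicit, not to be efficient.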

Coordinate systems

Elements in a SpatialData object can belong to different coordinate systems, e.g. one coordinate system per sample when the object holds multiple samples.
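
The idea can be illustrated without the SpatialData API: each element keeps its own pixel coordinates, and a transformation maps them into a given coordinate system. In this sketch the sample names, the shared "global" system and the 1000-pixel offset are all hypothetical, and the transformations are written as plain affine matrices.

```python
import numpy as np


def affine(scale, translate_x, translate_y):
    """3 x 3 affine matrix that scales and then translates (x, y) coordinates."""
    return np.array(
        [
            [scale, 0.0, translate_x],
            [0.0, scale, translate_y],
            [0.0, 0.0, 1.0],
        ]
    )


# Hypothetical transformations from each sample's pixel coordinates to a shared system.
to_global = {
    "sample_A": affine(1.0, 0.0, 0.0),     # identity: sample A defines the origin
    "sample_B": affine(1.0, 1000.0, 0.0),  # sample B is placed 1000 pixels to the right
}


def transform(points_xy, matrix):
    """Apply an affine matrix to an (n, 2) array of (x, y) coordinates."""
    homogeneous = np.column_stack([points_xy, np.ones(len(points_xy))])
    return (homogeneous @ matrix.T)[:, :2]


# The same local coordinate ends up at different positions in the shared system.
centroid = np.array([[10.0, 20.0]])
in_a = transform(centroid, to_global["sample_A"])
in_b = transform(centroid, to_global["sample_B"])
print(in_a, in_b)
```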

More info