Reading and writing files

One of Xarray’s most widely used features is its ability to read from and write to a variety of data formats. For example, Xarray can read NetCDF and Zarr, both covered below, using open_dataset/open_mfdataset.

Support for additional formats is possible using external packages, such as rioxarray for geospatial raster files (see the last section).

NetCDF#

The recommended way to store xarray data structures is NetCDF, which is a binary file format for self-described datasets that originated in the geosciences. Xarray is based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects.

Xarray reads NetCDF files with the open_dataset / open_dataarray functions and writes them with the to_netcdf method.

Let’s first create some datasets and write them to disk using to_netcdf, which takes the path we want to write to:

import numpy as np
import xarray as xr

# Ensure random arrays are the same each time
np.random.seed(0)

The constructor of Dataset takes three parameters:

  • data_vars: dict-like mapping names to values. Values are either DataArray objects or tuples consisting of dimension names and arrays.

  • coords: same as for DataArray

  • attrs: same as for DataArray
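
As a small sketch of the three parameters listed above, the snippet below mixes both forms of data_vars and sets attrs. The variable names (temperature, pressure, example) and the values are made up for illustration and do not appear elsewhere in this tutorial.

```python
import numpy as np
import xarray as xr

# A DataArray value brings its own dimension names and coordinates.
# "temperature" is a hypothetical variable used only in this sketch.
temperature = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20], "y": [0, 1, 2]},
)

example = xr.Dataset(
    data_vars={
        "temperature": temperature,  # DataArray form
        "pressure": (("x", "y"), np.ones((2, 3))),  # (dimension names, array) tuple form
    },
    attrs={"title": "illustrative dataset"},  # free-form metadata
)

print(example)
```

The x and y coordinates are taken from the DataArray here, so coords does not need to be passed again.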

ds1 = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.random.randn(4, 2)),
        "b": (("z", "x"), np.random.randn(6, 4)),
    },
    coords={
        "x": np.arange(4),
        "y": np.arange(-2, 0),
        "z": np.arange(-3, 3),
    },
)
ds2 = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.random.randn(7, 3)),
        "b": (("z", "x"), np.random.randn(2, 7)),
    },
    coords={
        "x": np.arange(6, 13),
        "y": np.arange(3),
        "z": np.arange(3, 5),
    },
)

# write datasets
ds1.to_netcdf("ds1.nc")
ds2.to_netcdf("ds2.nc")

# write dataarray
ds1.a.to_netcdf("da1.nc")

Reading those files is just as simple:

xr.open_dataset("ds1.nc")
<xarray.Dataset> Size: 352B
Dimensions:  (x: 4, y: 2, z: 6)
Coordinates:
  * x        (x) int64 32B 0 1 2 3
  * y        (y) int64 16B -2 -1
  * z        (z) int64 48B -3 -2 -1 0 1 2
Data variables:
    a        (x, y) float64 64B ...
    b        (z, x) float64 192B ...
xr.open_dataarray("da1.nc")
<xarray.DataArray 'a' (x: 4, y: 2)> Size: 64B
[8 values with dtype=float64]
Coordinates:
  * x        (x) int64 32B 0 1 2 3
  * y        (y) int64 16B -2 -1
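
The "..." in the output above shows that the values have not been read yet: Xarray loads data from NetCDF files lazily. The following is a minimal sketch of a round trip that loads the values explicitly and checks them against the original. It assumes a NetCDF backend (for example netCDF4, h5netcdf, or scipy) is installed. The dataset and the file name roundtrip.nc are hypothetical and only serve this example.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Hypothetical dataset, used only for this sketch.
original = xr.Dataset(
    data_vars={"a": (("x", "y"), np.arange(8.0).reshape(4, 2))},
    coords={"x": np.arange(4), "y": np.arange(-2, 0)},
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "roundtrip.nc")  # hypothetical file name
    original.to_netcdf(path)

    # The context manager closes the file when the block ends;
    # .load() reads the values into memory before that happens.
    with xr.open_dataset(path) as on_disk:
        loaded = on_disk.load()

# Raises an AssertionError if values, dimensions, or coordinates differ.
xr.testing.assert_equal(original, loaded)
```

Opening the file in a with block is one way to make sure the file handle is released once the data is in memory.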

Zarr#

Zarr is a Python package and data format providing an implementation of chunked, compressed, N-dimensional arrays. Zarr has the ability to store arrays in a range of ways, including in memory, in files, and in cloud-based object storage such as Amazon S3 and Google Cloud Storage. Xarray’s Zarr backend allows xarray to leverage these capabilities.

Zarr files can be written with:

ds1.to_zarr("ds1.zarr", mode="w")
<xarray.backends.zarr.ZarrStore at 0x7f151ecd5240>

We can then read the created file with:

xr.open_zarr("ds1.zarr", chunks=None)
<xarray.Dataset> Size: 352B
Dimensions:  (x: 4, y: 2, z: 6)
Coordinates:
  * x        (x) int64 32B 0 1 2 3
  * y        (y) int64 16B -2 -1
  * z        (z) int64 48B -3 -2 -1 0 1 2
Data variables:
    a        (x, y) float64 64B ...
    b        (z, x) float64 192B ...

Setting the chunks parameter to None avoids dask (more on that in a later session). By default, open_zarr returns dask-backed arrays when dask is installed.

Tip: you can write to any dictionary-like (MutableMapping) interface:

mystore = {}

ds1.to_zarr(store=mystore)
<xarray.backends.zarr.ZarrStore at 0x7f151e9b3140>

Raster files using rioxarray#

rioxarray is an Xarray extension that allows reading and writing a wide variety of geospatial image formats compatible with Geographic Information Systems (GIS), for example GeoTIFF.

If rioxarray is installed in your environment, it will be automatically detected and give you access to the .rio accessor (an explicit import rioxarray also registers the accessor):

da = xr.DataArray(
    data=ds1.a.data,
    dims=("y", "x"),  # name the dimensions explicitly instead of inferring them from coords
    coords={
        "y": np.linspace(47.5, 47.8, 4),
        "x": np.linspace(-122.9, -122.7, 2),
    },
)

# Add Geospatial Coordinate Reference https://epsg.io/4326
# this is stored as a 'spatial_ref' coordinate
da.rio.write_crs("epsg:4326", inplace=True)
da
<xarray.DataArray (y: 4, x: 2)> Size: 64B
array([[ 1.76405235,  0.40015721],
       [ 0.97873798,  2.2408932 ],
       [ 1.86755799, -0.97727788],
       [ 0.95008842, -0.15135721]])
Coordinates:
  * y            (y) float64 32B 47.5 47.6 47.7 47.8
  * x            (x) float64 16B -122.9 -122.7
    spatial_ref  int64 8B 0
da.rio.to_raster('ds1_a.tiff')

Note: you can now load this file into GIS tools like QGIS, or open it back into Xarray:

DA = xr.open_dataarray('ds1_a.tiff', engine='rasterio')
DA.rio.crs
CRS.from_epsg(4326)