Xarray in 45 minutes

Xarray in 45 minutes#

In this lesson, we cover the basics of Xarray data structures. By the end of the lesson, we will be able to:

Understand the basic data structures in Xarray.
Inspect DataArray and Dataset objects.
Read and write netCDF files using Xarray.
Understand that many packages build on top of Xarray.

We’ll start by reviewing the various components of the Xarray data model, represented here visually:

https://docs.xarray.dev/en/stable/_images/dataset-diagram.png
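
To make the pieces of the data model concrete, here is a minimal sketch that builds a DataArray and a Dataset by hand. The values, the coordinate labels, and the names `temperature`, `air`, and `toy_ds` are made up for illustration; they are not the tutorial dataset.

```python
import numpy as np
import xarray as xr

# Hypothetical toy values: 2 time steps on a 3 x 4 latitude/longitude grid.
temperature = 280.0 + np.arange(24, dtype="float64").reshape(2, 3, 4)

air = xr.DataArray(
    temperature,                      # underlying data (a NumPy array)
    dims=("time", "lat", "lon"),      # named dimensions
    coords={                          # coordinate variables label each axis
        "time": np.array(["2013-01-01", "2013-01-02"], dtype="datetime64[ns]"),
        "lat": [75.0, 72.5, 70.0],
        "lon": [200.0, 202.5, 205.0, 207.5],
    },
    attrs={"units": "degK"},          # arbitrary attributes (metadata)
    name="air",                       # optional name
)

# A Dataset is a dict-like container of DataArrays that share coordinates.
toy_ds = xr.Dataset({"air": air}, attrs={"title": "toy example"})

print(toy_ds.air.dims)    # ('time', 'lat', 'lon')
print(toy_ds.air.shape)   # (2, 3, 4)
```

Each keyword argument above corresponds to one part of the diagram: data, dimensions, coordinates, attributes, and name.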

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

xr.set_options(keep_attrs=True, display_expand_data=False)
np.set_printoptions(threshold=10, edgeitems=2)

%xmode minimal
%matplotlib inline
%config InlineBackend.figure_format='retina'

Exception reporting mode: Minimal

Xarray has a few small real-world tutorial datasets hosted in the xarray-data GitHub repository.

xarray.tutorial.load_dataset is a convenience function that downloads and opens one of these datasets by name (the names are listed in that repository).

Here we’ll use air temperature from the National Center for Environmental Prediction. Xarray objects have convenient HTML representations to give an overview of what we’re working with:

ds = xr.tutorial.load_dataset("air_temperature")
ds

Note that behind the scenes, tutorial.open_dataset downloads a file. It then uses the xarray.open_dataset function to open that file (which for this dataset is a netCDF file).

A few things are done automatically upon opening, but they are controlled by keyword arguments. For example, try passing the keyword argument mask_and_scale=False. What happens?
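
The effect of mask_and_scale can be sketched without downloading anything. The example below is an assumption-laden stand-in: it builds a small in-memory dataset with made-up packed values and CF-style attributes, then uses xr.decode_cf, which accepts the same mask_and_scale keyword, to mimic what happens when a file is opened.

```python
import numpy as np
import xarray as xr

# Hypothetical packed variable, as it might be stored on disk: int16 values
# plus CF attributes describing how to unpack them.
raw = xr.Dataset(
    {
        "air": (
            ("x",),
            np.array([100, 200, -999], dtype="int16"),
            {"scale_factor": 0.01, "add_offset": 250.0, "_FillValue": np.int16(-999)},
        )
    }
)

# Default behaviour: mask the fill value and apply scale_factor / add_offset.
decoded = xr.decode_cf(raw)

# mask_and_scale=False: values and attributes are left exactly as stored.
untouched = xr.decode_cf(raw, mask_and_scale=False)

print(decoded.air.values)
print(untouched.air.values)
```

With decoding on, the fill value becomes NaN and the integers become physical values; with it off, you see the raw integers and have to apply the attributes yourself.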

Why Xarray?#

Metadata provides context and makes code more legible. This reduces the likelihood of errors from typos and makes analysis more intuitive and fun!

Analysis without xarray: `X(`#

# plot the first timestep
lat = ds.air.lat.data  # numpy array
lon = ds.air.lon.data  # numpy array
temp = ds.air.data  # numpy array

plt.figure()
plt.pcolormesh(lon, lat, temp[0, :, :]);

[Figure: pcolormesh plot of air temperature at the first time step]

temp.mean(axis=1)  ## what did I just do? I can't tell by looking at this line.

array([[279.398 , 279.6664, ..., 280.3152, 280.6624],
       [279.0572, 279.538 , ..., 280.27  , 280.7976],
       ...,
       [279.398 , 279.666 , ..., 280.342 , 280.834 ],
       [279.27  , 279.354 , ..., 279.97  , 280.482 ]], shape=(2920, 53))

Analysis with xarray `=)`#

How readable is this code?

ds.air.isel(time=0).plot(x="lon");

[Figure: map of air temperature at the first time step, with lon on the x-axis]

Use dimension names instead of axis numbers

ds.air.mean(dim="time").plot(x="lon")

<matplotlib.collections.QuadMesh at 0x7fe62f55a420>

[Figure: map of time-mean air temperature, with lon on the x-axis]
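
The difference between axis numbers and dimension names can be checked on a small example. This is a sketch on made-up random data (the names `values`, `da`, `by_axis`, and `by_name` are illustrative), arranged with the same dimension order as ds.air.

```python
import numpy as np
import xarray as xr

# Hypothetical random data with the same dimension order as ds.air.
rng = np.random.default_rng(0)
values = rng.normal(size=(4, 3, 5))
da = xr.DataArray(values, dims=("time", "lat", "lon"))

by_axis = values.mean(axis=1)    # NumPy: which axis is 1? You have to remember.
by_name = da.mean(dim="lat")     # Xarray: the intent is in the code.

# Same numbers, but the Xarray result still knows its dimension names.
print(np.allclose(by_axis, by_name.values))
print(by_name.dims)

# Naming the dimension also keeps working if the axis order changes.
transposed = da.transpose("lon", "lat", "time")
print(transposed.mean(dim="lat").dims)
```

The NumPy call silently averages over a different quantity if the array is transposed, while the named version does not depend on axis order.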

Visualization#

(.plot)

We have seen very simple plots earlier. Xarray also lets you easily visualize 3D and 4D datasets by presenting multiple facets (or panels or subplots) showing variations across rows and/or columns.
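
The plots that follow use an object called seasonal_mean, which is computed in a part of the lesson not reproduced here. As a hedged sketch, one way such an object could be built is a groupby over the season of each time step. The dataset below is a made-up stand-in (`toy_ds`, random values), not the tutorial data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for the tutorial dataset: one year of daily values.
time = pd.date_range("2013-01-01", periods=365, freq="D")
rng = np.random.default_rng(0)
toy_ds = xr.Dataset(
    {"air": (("time", "lat", "lon"), 280.0 + rng.normal(size=(365, 3, 4)))},
    coords={"time": time, "lat": [75.0, 72.5, 70.0], "lon": [200.0, 202.5, 205.0, 207.5]},
)

# Group time steps by meteorological season, then average within each group.
seasonal_mean = toy_ds.groupby("time.season").mean()

print(seasonal_mean.air.dims)
print(seasonal_mean.season.values)
```

The result has a season dimension in place of time, which is what the col="season" and hue="season" arguments refer to.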

# facet the seasonal_mean
seasonal_mean.air.plot(col="season", col_wrap=2);

[Figure: seasonal mean air temperature, one panel per season in a 2 x 2 grid]

# contours
seasonal_mean.air.plot.contour(col="season", levels=20, add_colorbar=True);

[Figure: contour plots of seasonal mean air temperature, one panel per season]

# line plots too? wut
seasonal_mean.air.mean("lon").plot.line(hue="season", y="lat");

[Figure: zonal-mean air temperature against latitude, one line per season]

For more, see the user guide, the gallery, and the tutorial material.

The scientific python ecosystem#

Xarray ties into the larger scientific Python ecosystem, and in turn many packages build on top of Xarray. A long list of such packages is here: https://docs.xarray.dev/en/stable/user-guide/ecosystem.html.

Now we will demonstrate some cool features.

Pandas: tabular data structures#

You can easily convert between xarray and pandas structures. This allows you to conveniently use the extensive pandas ecosystem of packages (like seaborn) for your work.

# convert to pandas dataframe
df = ds.isel(time=slice(10)).to_dataframe()
df

                                   air
lat  time                lon
75.0 2013-01-01 00:00:00 200.0  241.20
                         202.5  242.50
                         205.0  243.50
                         207.5  244.00
                         210.0  244.10
...  ...                 ...       ...
15.0 2013-01-03 06:00:00 320.0  297.00
                         322.5  297.29
                         325.0  296.90
                         327.5  296.79
                         330.0  297.10

13250 rows × 1 columns
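
The conversion also works in the other direction. The sketch below uses a small made-up DataArray (not ds.air) to show how dimensions map to MultiIndex levels and back again.

```python
import numpy as np
import xarray as xr

# Hypothetical small DataArray standing in for ds.air.
da = xr.DataArray(
    np.arange(6, dtype="float64").reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [72.5, 75.0], "lon": [200.0, 202.5, 205.0]},
    name="air",
)

# Each dimension becomes a level of the DataFrame's MultiIndex,
# so there is one row per (lat, lon) combination.
df = da.to_dataframe()
print(df.shape)   # (6, 1)

# Going back: the MultiIndex levels become dimensions again.
roundtrip = df.to_xarray()
print(roundtrip.air.equals(da))
```

Note that to_xarray on a DataFrame returns a Dataset with one variable per column, which is why the example selects roundtrip.air before comparing.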

Alternative array types#

This notebook has focused on NumPy arrays. Xarray can wrap other array types! For example:

Dask : distributed parallel arrays & Xarray user guide on Dask
pydata/sparse : sparse arrays
CuPy : GPU arrays & cupy-xarray
pint : unit-aware arrays & pint-xarray

Dask#

Dask cuts up NumPy arrays into blocks and parallelizes your analysis code across these blocks.

# demonstrate dask dataset
dasky = xr.tutorial.open_dataset(
    "air_temperature",
    chunks={"time": 10},  # 10 time steps in each block
)

dasky.air

<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
dask.array<chunksize=(10, 25, 53), meta=np.ndarray>
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

All computations with dask-backed xarray objects are lazy, allowing you to build up a complicated chain of analysis steps quickly.

# demonstrate lazy mean
dasky.air.mean("lat")

<xarray.DataArray 'air' (time: 2920, lon: 53)> Size: 1MB
dask.array<chunksize=(10, 53), meta=np.ndarray>
Coordinates:
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

To get concrete values, call .compute() or .load().
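
As a minimal sketch of the lazy workflow, the example below chunks a small made-up array in memory instead of opening the tutorial file. It assumes the optional dask package is installed; the array contents and variable names are illustrative only.

```python
import numpy as np
import xarray as xr

# Hypothetical in-memory data; .chunk() turns it into a dask-backed array
# (this needs the optional dask package to be installed).
da = xr.DataArray(
    np.arange(40 * 3 * 4, dtype="float64").reshape(40, 3, 4),
    dims=("time", "lat", "lon"),
    name="air",
)
lazy = da.chunk({"time": 10})          # 4 blocks of 10 time steps each

lazy_mean = lazy.mean("lat")           # builds a task graph, computes nothing yet
print(lazy_mean.chunks is not None)    # still dask-backed

result = lazy_mean.compute()           # runs the graph, returns a NumPy-backed copy
print(result.chunks is None)

lazy_mean.load()                       # same computation, but updates lazy_mean in place
print(lazy_mean.chunks is None)
```

The practical difference is that .compute() leaves the original object lazy and returns a new one, while .load() replaces the data of the object it is called on.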

HoloViz#

Quickly generate interactive plots from your data!

The hvplot package attaches itself to all xarray objects under the .hvplot namespace, so instead of using .plot, use .hvplot:

import hvplot.xarray

ds.air.hvplot(groupby="time", clim=(270, 300), widget_location='bottom')

Note

The time slider will only work if you’re executing the notebook, rather than viewing the website.

Other cool packages#

xgcm : grid-aware operations with xarray objects
xrft : Fourier transforms with xarray
xclim : calculating climate indices with xarray objects
intake-xarray : forget about file paths
rioxarray : raster files and xarray
xesmf : regrid using ESMF
MetPy : tools for working with weather data

Check the Xarray Ecosystem page and this tutorial for even more packages and demonstrations.

Next#

Read the tutorial material and user guide
See the description of common terms used in the xarray documentation.
Answers to common questions on “how to do X” with Xarray are here
Ryan Abernathey has a book on data analysis with a chapter on Xarray
Project Pythia has foundational and more advanced material on Xarray. Pythia also aggregates other Python learning resources.
The Xarray GitHub Discussions and Pangeo Discourse are good places to ask questions.
Tell your friends! Tweet!

Welcome!#

Xarray is an open-source project and gladly welcomes all kinds of contributions. This could include reporting bugs, discussing new enhancements, contributing code, helping answer user questions, contributing documentation (even small edits like fixing spelling mistakes or rewording to make the text clearer). Welcome!
