Computing with Multiple Objects#

Learning goals:

  • Perform operations across multiple datasets

  • Understand two important concepts: broadcasting and alignment.

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

%config InlineBackend.figure_format='retina'

plt.style.use("bmh")

np.random.seed(0)

Here is a motivating calculation where we subtract two DataArrays with data available at different locations in the (space, time) plane.

arr1 = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("space", "time"),
    coords={"space": ["a", "b", "c"], "time": [0, 1, 2, 3]},
)
arr1
<xarray.DataArray (space: 3, time: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * space    (space) <U1 'a' 'b' 'c'
  * time     (time) int64 0 1 2 3
arr2 = xr.DataArray(
    [0, 1],
    dims="space",
    coords={"space": ["b", "d"]},
)
arr2
<xarray.DataArray (space: 2)>
array([0, 1])
Coordinates:
  * space    (space) <U1 'b' 'd'

Note that arr1 is 2D; while arr2 is 1D along space and has values at two locations only.

Now subtract the two.

arr1 - arr2
<xarray.DataArray (space: 1, time: 4)>
array([[4, 5, 6, 7]])
Coordinates:
  * space    (space) <U1 'b'
  * time     (time) int64 0 1 2 3

To understand this output, we must understand two fundamental concepts underlying computation with Xarray objects

  1. Broadcasting: The objects need to have compatible shapes.

  2. Alignment: The objects need to have values at the same coordinate labels

Broadcasting: adjusting arrays to the same shape#

Broadcasting allows an operator or a function to act on two or more arrays to operate even if these arrays do not have the same shape. That said, not all the dimensions can be subjected to broadcasting; they must meet certain rules. The image below illustrates how an operation on arrays with different coordinates will result in automatic broadcasting

Credit: Stephan Hoyer – xarray ECMWF Python workshop

Numpy’s broadcasting rules, based on array shape, can sometimes be difficult to understand and remember. Xarray does broadcasting by dimension name, rather than array shape. This is a huge convenience.

Here are two 1D arrays

array1 = xr.DataArray(
    np.arange(3),
    dims="space",
    coords={"space": ["a", "b", "c"]},
    name="array1",
)
array2 = xr.DataArray(
    np.arange(4),
    dims="time",
    coords={"time": [0, 1, 2, 3]},
    name="array2",
)
display(array1)
display(array2)
<xarray.DataArray 'array1' (space: 3)>
array([0, 1, 2])
Coordinates:
  * space    (space) <U1 'a' 'b' 'c'
<xarray.DataArray 'array2' (time: 4)>
array([0, 1, 2, 3])
Coordinates:
  * time     (time) int64 0 1 2 3

Let’s subtract the two:

array1 - array2
<xarray.DataArray (space: 3, time: 4)>
array([[ 0, -1, -2, -3],
       [ 1,  0, -1, -2],
       [ 2,  1,  0, -1]])
Coordinates:
  * space    (space) <U1 'a' 'b' 'c'
  * time     (time) int64 0 1 2 3

We see that the result is a 2D array.

When subtracting, Xarray first realizes that array1 is missing the dimension time and array2 is missing the dimension space. Xarray then broadcasts or “expands” both arrays to 2D with dimensions space, time. Here is an illustration:

While this detail is hidden, we can explicitly broadcast any number of arrays against each other using xr.broadcast

array1_broadcasted, array2_broadcasted = xr.broadcast(array1, array2)
display(array1_broadcasted.dims)
display(array2_broadcasted.dims)
('space', 'time')
('space', 'time')

To get the final anomaly, Xarray calculates

# identical to array1 - array2
array1_broadcasted - array2_broadcasted
<xarray.DataArray (space: 3, time: 4)>
array([[ 0, -1, -2, -3],
       [ 1,  0, -1, -2],
       [ 2,  1,  0, -1]])
Coordinates:
  * space    (space) <U1 'a' 'b' 'c'
  * time     (time) int64 0 1 2 3

Broadcasting in numpy#

For contrast let us examine the pure numpy version of this calculation. We use .data to extract the underlying numpy array object.

array1.data - array2.data
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 array1.data - array2.data

ValueError: operands could not be broadcast together with shapes (3,) (4,) 

To get this calculation to work, we need to insert new axes manually using np.newaxis.

array1.data[:, np.newaxis] - array2.data[np.newaxis, :]
array([[ 0, -1, -2, -3],
       [ 1,  0, -1, -2],
       [ 2,  1,  0, -1]])

Because xarray knows about dimension names we avoid having to create unnecessary size-1 dimensions using np.newaxis or .reshape. This is yet another example where the metadata (dimension names) reduces the mental overhead associated with coding a calculation

For more, see the Xarray documentation and the numpy documentation on broadcasting.

Exercise 5

Consider the following 2D array. What are the dimensions of array - array.mean("time")?

array = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("space", "time"),
    coords={"space": ["a", "b", "c"], "time": [0, 1, 2, 3]},
    name="array",
)

Alignment: putting data on the same grid#

When combining two input arrays using an arithmetic operation, both arrays must first be converted to the same coordinate system. This is “alignment”.

Here are two 2D DataArrays with different shapes.

arr1 = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("space", "time"),
    coords={"space": ["a", "b", "c"], "time": [0, 1, 2, 3]},
)
arr1
<xarray.DataArray (space: 3, time: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
  * space    (space) <U1 'a' 'b' 'c'
  * time     (time) int64 0 1 2 3
arr2 = xr.DataArray(
    np.arange(14).reshape(2, 7),
    dims=("space", "time"),
    coords={"space": ["b", "d"], "time": [-2, -1, 0, 1, 2, 3, 4]},
)
arr2
<xarray.DataArray (space: 2, time: 7)>
array([[ 0,  1,  2,  3,  4,  5,  6],
       [ 7,  8,  9, 10, 11, 12, 13]])
Coordinates:
  * space    (space) <U1 'b' 'd'
  * time     (time) int64 -2 -1 0 1 2 3 4

arr1 and arr2 have the same dimensions (space, time) but have values at different locations in the (space, time) plane with some locations in common.

Note

xarray assumes coordinate labels are in the same coordinate system such that space=’b’ in arr1 is the same as space=’b’ in arr2. For more sophisticated handling of coordinate systems see rioxarray

Hide code cell source
def visualize_mesh(array, *args, ax=None, **kwargs):
    """Visualizes array's mesh ((points at which values are present)."""

    # Use broadcast to generate 2D x_, y_ arrays from the 1D x,y arrays
    space_, time_ = xr.broadcast(array.space, array.time)
    if ax is None:
        ax = plt.gca()
    kwargs.setdefault("fillstyle", "none")
    kwargs.setdefault("markersize", 10)
    ax.plot(space_.data.ravel(), time_.data.ravel(), *args, **kwargs)
    ax.set_xlabel("space")
    ax.set_ylabel("time")


visualize_mesh(arr1, "<")
visualize_mesh(arr2, ">")
plt.ylim([-3, 6])
plt.legend(["arr1", "arr2"]);
../_images/f778d83eca14dce437c8dccbab7d9943edb7acc15f1be8396b1fc4a69d8743a9.png

We see that both arrays only have values in common at x="b" and y=[0, 1, 2, 3]. Before applying an arithmetic operation we must first modify each DataArray so that they have values at the same points. This is “alignment”.

Controlling alignment#

We can explicitly align objects using xr.align. The key decision to make is how to decide which points must be kept. The other way to think of alignment is that objects must be converted to a common grid prior to any operation combining multiiple objects. This decision is controlled by the "join" keyword argument. Xarray provides 5 ways to convert the coordinate labels of multiple Datasets to a common grid. This terminology originates in the database community.

  1. join="inner" or reindex to the “intersection set” of coordinate labels

  2. join="outer" or reindex to the “union set” of coordinate labels

  3. join="left" or reindex to the coordinate labels of the leftmost object

  4. join="right" or reindex to the coordinate labels of the rightmost object

  5. join="exact" checks for exact equality of coordinate labels before the operation.

First lets try an inner join. This is the default for arithmetic operations in Xarray. We see that the result has values for locations that arr1 and arr2 have in common: x="b" and y=[0, 1, 2, 3]. Here is an illustration

a1_aligned, a2_aligned = xr.align(arr1, arr2, join="inner")
a1_aligned
<xarray.DataArray (space: 1, time: 4)>
array([[4, 5, 6, 7]])
Coordinates:
  * space    (space) <U1 'b'
  * time     (time) int64 0 1 2 3
a2_aligned
<xarray.DataArray (space: 1, time: 4)>
array([[2, 3, 4, 5]])
Coordinates:
  * space    (space) <U1 'b'
  * time     (time) int64 0 1 2 3

Here’s a visual depiction of all the join options

Hide code cell source
def visualize_join(a1, a2, join, ax=None):
    a1_aligned, a2_aligned = xr.align(arr1, arr2, join=join)

    visualize_mesh(a1, "<", ax=ax)
    visualize_mesh(a2, ">", ax=ax)
    visualize_mesh(a1_aligned, ".", markersize=32, color="C3", ax=ax)

    ax.set_ylim([-3, 6])
    ax.set_title(f"join={join!r}")


f, ax = plt.subplots(1, 4, sharex=True, sharey=True)
for axx, join in zip(ax, ["inner", "outer", "left", "right"]):
    visualize_join(arr1, arr2, join, ax=axx)
ax[-1].legend(["arr1", "arr2", "after align"], bbox_to_anchor=(1, 1))
f.set_size_inches(10, 4);
../_images/1e62a726e9e9438125fd2ca88f31abd3c58e90c15c59964b8ca0e0f213d79273.png

Exercise 6

Consider the following two arrays. Write down the x and y coordinate locations for da1 - da2

da1 = xr.DataArray(
    np.arange(12).reshape(3, 4),
    dims=("space", "time"),
    coords={"space": ["a", "b", "c"], "time": [0, 1, 2, 3]},
)
da2 = xr.DataArray(
    [0, 1],
    dims="space",
    coords={"space": ["b", "d"]},
)

Further control over alignment#

Controlling the fill value#

For all join options other than "inner" Xarray will insert a fill_value at locations not present in the original dataset. By default this is NaN

arr1_aligned, _ = xr.align(arr1, arr2, join="outer")
arr1_aligned
<xarray.DataArray (space: 4, time: 7)>
array([[nan, nan,  0.,  1.,  2.,  3., nan],
       [nan, nan,  4.,  5.,  6.,  7., nan],
       [nan, nan,  8.,  9., 10., 11., nan],
       [nan, nan, nan, nan, nan, nan, nan]])
Coordinates:
  * space    (space) <U1 'a' 'b' 'c' 'd'
  * time     (time) int64 -2 -1 0 1 2 3 4

Control the “fill value” by specifying the fill_value keyword argument

arr1_aligned, _ = xr.align(arr1, arr2, join="outer", fill_value=0)
arr1_aligned
<xarray.DataArray (space: 4, time: 7)>
array([[ 0,  0,  0,  1,  2,  3,  0],
       [ 0,  0,  4,  5,  6,  7,  0],
       [ 0,  0,  8,  9, 10, 11,  0],
       [ 0,  0,  0,  0,  0,  0,  0]])
Coordinates:
  * space    (space) <U1 'a' 'b' 'c' 'd'
  * time     (time) int64 -2 -1 0 1 2 3 4

Checking that objects are aligned#

join="exact" is special in that it checks to make sure that the objects are aligned.

For arr1 and arr2 this will raise an error since arr1.x is not identical to arr2.x (and similarly for y)

xr.align(arr1, arr2, join="exact")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 xr.align(arr1, arr2, join="exact")

File ~/micromamba/envs/xarray-tutorial/lib/python3.11/site-packages/xarray/core/alignment.py:888, in align(join, copy, indexes, exclude, fill_value, *objects)
    692 """
    693 Given any number of Dataset and/or DataArray objects, returns new
    694 objects with aligned indexes and dimension sizes.
   (...)
    878 
    879 """
    880 aligner = Aligner(
    881     objects,
    882     join=join,
   (...)
    886     fill_value=fill_value,
    887 )
--> 888 aligner.align()
    889 return aligner.results

File ~/micromamba/envs/xarray-tutorial/lib/python3.11/site-packages/xarray/core/alignment.py:574, in Aligner.align(self)
    572 self.find_matching_unindexed_dims()
    573 self.assert_no_index_conflict()
--> 574 self.align_indexes()
    575 self.assert_unindexed_dim_sizes_equal()
    577 if self.join == "override":

File ~/micromamba/envs/xarray-tutorial/lib/python3.11/site-packages/xarray/core/alignment.py:421, in Aligner.align_indexes(self)
    419 if need_reindex:
    420     if self.join == "exact":
--> 421         raise ValueError(
    422             "cannot align objects with join='exact' where "
    423             "index/labels/sizes are not equal along "
    424             "these coordinates (dimensions): "
    425             + ", ".join(f"{name!r} {dims!r}" for name, dims in key[0])
    426         )
    427     joiner = self._get_index_joiner(index_cls)
    428     joined_index = joiner(matching_indexes)

ValueError: cannot align objects with join='exact' where index/labels/sizes are not equal along these coordinates (dimensions): 'space' ('space',)

Controlling automatic alignment#

Xarray’s default for arithmetic operations is join="inner". This is controllable using the xr.set_options context manager

with xr.set_options(arithmetic_join="outer"):
    result = arr1 - arr2
result
<xarray.DataArray (space: 4, time: 7)>
array([[nan, nan, nan, nan, nan, nan, nan],
       [nan, nan,  2.,  2.,  2.,  2., nan],
       [nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan]])
Coordinates:
  * space    (space) <U1 'a' 'b' 'c' 'd'
  * time     (time) int64 -2 -1 0 1 2 3 4