Core dimensions

Core dimensions#

Previously we learned to use apply_ufunc on simple functions that acted element by element.

Here we move on to slightly more complex functions like np.mean that can act along a subset of an input array’s dimensions.

Such operations involve the concept of “core dimensions”.

Our learning goals are:

  • Learn how to identify “core dimensions” for the function you’re applying.

  • Learn that “core dimensions” are automatically moved or transposed to the end of the array.

Introduction#

For using more complex operations that consider some array values collectively, it’s important to understand the idea of core dimensions. Usually, they correspond to the fundamental dimensions over which an operation is defined, e.g., the summed axis in np.sum. One way to think about core dimensions is to consider the smallest dimensionality of data that the function acts on.

Important

A good clue that core dimensions are needed is the presence of an axis argument on the corresponding NumPy function.

Setup#

%xmode minimal

import numpy as np
import xarray as xr

# limit the amount of information printed to screen
xr.set_options(display_expand_data=False)
np.set_printoptions(threshold=10, edgeitems=2)
Exception reporting mode: Minimal

Let’s load a dataset

ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Reducing with np.mean#

Let’s write a function that computes the mean along time for a provided xarray object.

This function requires one core dimension time. For ds.air note that time is the 0th axis.

ds.air.dims
('time', 'lat', 'lon')

get_axis_num is a useful method.

ds.air.get_axis_num("time")
0
np.mean(ds.air, axis=ds.air.get_axis_num("time"))
<xarray.DataArray 'air' (lat: 25, lon: 53)>
260.4 260.2 259.9 259.5 259.0 258.6 ... 298.0 297.9 297.8 297.3 297.3 297.3
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
np.mean(ds.air.data, axis=0)
array([[260.37564, 260.1826 , ..., 251.93733, 253.43741],
       [262.7337 , 262.7936 , ..., 251.5852 , 254.35849],
       ...,
       [298.1287 , 297.93646, ..., 296.77686, 296.44348],
       [298.36594, 298.38593, ..., 297.28104, 297.30502]], dtype=float32)

Let’s try to use apply_ufunc to replicate np.mean(ds.air.data, axis=0)

xr.apply_ufunc(
    # function to apply
    np.mean,
    # object with data to pass to function
    ds,
    # keyword arguments to pass to np.mean
    kwargs={"axis": 0},
)
ValueError: applied function returned data with an unexpected number of dimensions. Received 2 dimension(s) but expected 3 dimensions with names ('time', 'lat', 'lon'), from:

array([[260.37564, 260.1826 , 259.88593, ..., 250.81511, 251.93733, 253.43741],
       [262.7337 , 262.7936 , 262.7489 , ..., 249.75496, 251.5852 , 254.35849],
       [264.7681 , 264.3271 , 264.0614 , ..., 250.60707, 253.58247, 257.71475],
       ...,
       [297.64932, 296.95294, 296.62912, ..., 296.81033, 296.28793, 295.81622],
       [298.1287 , 297.93646, 297.47006, ..., 296.8591 , 296.77686, 296.44348],
       [298.36594, 298.38593, 298.11386, ..., 297.33777, 297.28104, 297.30502]],
      dtype=float32)

The error here

applied function returned data with unexpected number of dimensions. 
Received 2 dimension(s) but expected 3 dimensions with names: ('time', 'lat', 'lon')

means that while np.mean did indeed reduce one dimension, we did not tell apply_ufunc that this would happen. That is, we need to specify the core dimensions on the input.

Do that by passing a list of dimension names for each input object. For this function we have one input : ds and with a single core dimension "time" so we have input_core_dims=[["time"]]

xr.apply_ufunc(
    np.mean,
    ds,
    # specify core dimensions as a list of lists
    # here 'time' is the core dimension on `ds`
    input_core_dims=[
        ["time"],  # core dimension for ds
    ],
    kwargs={"axis": 0},
)
ValueError: size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 53. Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size. The data returned was:

array([[279.39798, 279.05722, 279.0104 , ..., 279.63   , 279.398  , 279.27   ],
       [279.6664 , 279.538  , 279.2808 , ..., 279.934  , 279.66602, 279.354  ],
       [279.66122, 279.7296 , 279.5508 , ..., 280.534  , 280.31796, 279.88202],
       ...,
       [279.9508 , 279.77563, 279.682  , ..., 279.802  , 279.766  , 279.42596],
       [280.31522, 280.27002, 280.19763, ..., 280.346  , 280.34198, 279.96997],
       [280.6624 , 280.79764, 280.81403, ..., 280.77798, 280.834  , 280.48196]],
      dtype=float32)

This next error is a little confusing.

size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 53. 
Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size.

A good trick here is to pass a little wrapper function to apply_ufunc instead and inspect the shapes of data received by the wrapper.

def wrapper(array, **kwargs):
    print(f"received {type(array)} shape: {array.shape}, kwargs: {kwargs}")
    result = np.mean(array, **kwargs)
    print(f"result.shape: {result.shape}")
    return result


xr.apply_ufunc(
    wrapper,
    ds,
    # specify core dimensions as a list of lists
    # here 'time' is the core dimension on `ds`
    input_core_dims=[["time"]],
    kwargs={"axis": 0},
)
received <class 'numpy.ndarray'> shape: (25, 53, 2920), kwargs: {'axis': 0}
result.shape: (53, 2920)
ValueError: size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 53. Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size. The data returned was:

array([[279.39798, 279.05722, 279.0104 , ..., 279.63   , 279.398  , 279.27   ],
       [279.6664 , 279.538  , 279.2808 , ..., 279.934  , 279.66602, 279.354  ],
       [279.66122, 279.7296 , 279.5508 , ..., 280.534  , 280.31796, 279.88202],
       ...,
       [279.9508 , 279.77563, 279.682  , ..., 279.802  , 279.766  , 279.42596],
       [280.31522, 280.27002, 280.19763, ..., 280.346  , 280.34198, 279.96997],
       [280.6624 , 280.79764, 280.81403, ..., 280.77798, 280.834  , 280.48196]],
      dtype=float32)

Now we see the issue:

received <class 'numpy.ndarray'> shape: (25, 53, 2920), kwargs: {'axis': 0}
result.shape: (53, 2920)

The time dimension is of size 2920 and is now the last axis of the array but was initially the first axis

ds.air.get_axis_num("time")
0

Important

This illustrates an important concept. Arrays are transposed so that core dimensions are at the end.

With apply_ufunc, core dimensions are recognized by name, and then moved to the last dimension of any input arguments before applying the given function. This means that for functions that accept an axis argument, you usually need to set axis=-1

Such behaviour means that our functions (like wrapper or np.mean) do not need to know the exact order of dimensions. They can rely on the core dimensions being at the end allowing us to write very general code!

We can fix our apply_ufunc call by specifying axis=-1 instead.

def wrapper(array, **kwargs):
    print(f"received {type(array)} shape: {array.shape}, kwargs: {kwargs}")
    result = np.mean(array, **kwargs)
    print(f"result.shape: {result.shape}")
    return result


xr.apply_ufunc(
    wrapper,
    ds,
    input_core_dims=[["time"]],
    kwargs={"axis": -1},
)
received <class 'numpy.ndarray'> shape: (25, 53, 2920), kwargs: {'axis': -1}
result.shape: (25, 53)
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Data variables:
    air      (lat, lon) float32 260.4 260.2 259.9 259.5 ... 297.3 297.3 297.3

Exercise 24

Use apply_ufunc to apply scipy.integrate.trapz along the time axis.