Core dimensions#
Previously we learned to use apply_ufunc
on simple functions that acted element by element.
Here we move on to slightly more complex functions like np.mean
that can act along a subset of an input array’s dimensions.
Such operations involve the concept of “core dimensions”.
Our learning goals are:
Learn how to identify “core dimensions” for the function you’re applying.
Learn that “core dimensions” are automatically moved or transposed to the end of the array.
Introduction#
For using more complex operations that consider some array values collectively,
it’s important to understand the idea of core dimensions.
Usually, they correspond to the fundamental dimensions over
which an operation is defined, e.g., the summed axis in np.sum
. One way to think about core dimensions
is to consider the smallest dimensionality of data that the function acts on.
Important
A good clue that core dimensions are needed is the presence of an axis
argument on the
corresponding NumPy function.
Setup#
%xmode minimal
import numpy as np
import xarray as xr
# limit the amount of information printed to screen
xr.set_options(display_expand_data=False)
np.set_printoptions(threshold=10, edgeitems=2)
Exception reporting mode: Minimal
Let’s load a dataset
ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset> Size: 31MB Dimensions: (lat: 25, time: 2920, lon: 53) Coordinates: * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
Reducing with np.mean
#
Let’s write a function that computes the mean along time
for a provided xarray object.
This function requires one core dimension time
. For ds.air
note that time
is the 0th axis.
ds.air.dims
('time', 'lat', 'lon')
get_axis_num
is a useful method.
ds.air.get_axis_num("time")
0
np.mean(ds.air, axis=ds.air.get_axis_num("time"))
<xarray.DataArray 'air' (lat: 25, lon: 53)> Size: 11kB 260.4 260.2 259.9 259.5 259.0 258.6 ... 298.0 297.9 297.8 297.3 297.3 297.3 Coordinates: * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
np.mean(ds.air.data, axis=0)
array([[260.37644178, 260.18305137, ..., 251.93811644, 253.43804795],
[262.73439384, 262.79397603, ..., 251.58575685, 254.35926027],
...,
[298.12920205, 297.93700685, ..., 296.7770274 , 296.44383562],
[298.36615068, 298.38573973, ..., 297.28144521, 297.30510274]])
Let’s try to use apply_ufunc
to replicate np.mean(ds.air.data, axis=0)
xr.apply_ufunc(
# function to apply
np.mean,
# object with data to pass to function
ds,
# keyword arguments to pass to np.mean
kwargs={"axis": 0},
)
ValueError: applied function returned data with an unexpected number of dimensions. Received 2 dimension(s) but expected 3 dimensions with names ('time', 'lat', 'lon'), from:
array([[260.376442, 260.183051, 259.886627, ..., 250.815901, 251.938116,
253.438048],
[262.734394, 262.793976, 262.749339, ..., 249.755904, 251.585757,
254.35926 ],
[264.768764, 264.327308, 264.061695, ..., 250.60789 , 253.58351 ,
257.715599],
...,
[297.649863, 296.953332, 296.629315, ..., 296.810925, 296.287962,
295.816455],
[298.129202, 297.937007, 297.470394, ..., 296.859548, 296.777027,
296.443836],
[298.366151, 298.38574 , 298.114144, ..., 297.338205, 297.281445,
297.305103]])
The error here
applied function returned data with unexpected number of dimensions.
Received 2 dimension(s) but expected 3 dimensions with names: ('time', 'lat', 'lon')
means that while np.mean
did indeed reduce one dimension, we did not tell apply_ufunc
that this would happen. That is, we need to specify the core dimensions on the input.
Do that by passing a list of dimension names for each input object. For this function we have one input : ds
and with a single core dimension "time"
so we have input_core_dims=[["time"]]
xr.apply_ufunc(
np.mean,
ds,
# specify core dimensions as a list of lists
# here 'time' is the core dimension on `ds`
input_core_dims=[
["time"], # core dimension for ds
],
kwargs={"axis": 0},
)
ValueError: size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 53. Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size. The data returned was:
array([[279.398 , 279.0572, 279.0104, ..., 279.63 , 279.398 , 279.27 ],
[279.6664, 279.538 , 279.2808, ..., 279.934 , 279.666 , 279.354 ],
[279.6612, 279.7296, 279.5508, ..., 280.534 , 280.318 , 279.882 ],
...,
[279.9508, 279.7756, 279.682 , ..., 279.802 , 279.766 , 279.426 ],
[280.3152, 280.27 , 280.1976, ..., 280.346 , 280.342 , 279.97 ],
[280.6624, 280.7976, 280.814 , ..., 280.778 , 280.834 , 280.482 ]])
This next error is a little confusing.
size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 53.
Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size.
A good trick here is to pass a little wrapper function to apply_ufunc
instead and inspect the shapes of data received by the wrapper.
def wrapper(array, **kwargs):
print(f"received {type(array)} shape: {array.shape}, kwargs: {kwargs}")
result = np.mean(array, **kwargs)
print(f"result.shape: {result.shape}")
return result
xr.apply_ufunc(
wrapper,
ds,
# specify core dimensions as a list of lists
# here 'time' is the core dimension on `ds`
input_core_dims=[["time"]],
kwargs={"axis": 0},
)
received <class 'numpy.ndarray'> shape: (25, 53, 2920), kwargs: {'axis': 0}
result.shape: (53, 2920)
ValueError: size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 53. Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size. The data returned was:
array([[279.398 , 279.0572, 279.0104, ..., 279.63 , 279.398 , 279.27 ],
[279.6664, 279.538 , 279.2808, ..., 279.934 , 279.666 , 279.354 ],
[279.6612, 279.7296, 279.5508, ..., 280.534 , 280.318 , 279.882 ],
...,
[279.9508, 279.7756, 279.682 , ..., 279.802 , 279.766 , 279.426 ],
[280.3152, 280.27 , 280.1976, ..., 280.346 , 280.342 , 279.97 ],
[280.6624, 280.7976, 280.814 , ..., 280.778 , 280.834 , 280.482 ]])
Now we see the issue:
received <class 'numpy.ndarray'> shape: (25, 53, 2920), kwargs: {'axis': 0}
result.shape: (53, 2920)
The time
dimension is of size 2920
and is now the last axis of the array but was initially the first axis
ds.air.get_axis_num("time")
0
Important
This illustrates an important concept. Arrays are transposed so that core dimensions are at the end.
With apply_ufunc
, core dimensions are recognized by name, and then moved to
the last dimension of any input arguments before applying the given function.
This means that for functions that accept an axis
argument, you usually need
to set axis=-1
Such behaviour means that our functions (like wrapper
or np.mean
) do not need to know the exact order of dimensions. They can rely on the core dimensions being at the end allowing us to write very general code!
We can fix our apply_ufunc
call by specifying axis=-1
instead.
received <class 'numpy.ndarray'> shape: (25, 53, 2920), kwargs: {'axis': -1}
result.shape: (25, 53)
<xarray.Dataset> Size: 11kB Dimensions: (lat: 25, lon: 53) Coordinates: * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 Data variables: air (lat, lon) float64 11kB 260.4 260.2 259.9 ... 297.3 297.3 297.3
Exercise
Use apply_ufunc
to apply scipy.integrate.trapezoid
along the time
axis.
Solution
import scipy as sp
import scipy.integrate
xr.apply_ufunc(scipy.integrate.trapezoid, ds, input_core_dims=[["time"]], kwargs={"axis": -1})