Automatic Vectorization

Automatic Vectorization#

Previously we looked at applying functions on numpy arrays, and the concept of core dimensions. We learned that functions commonly support specifying “core dimensions” through the axis keyword argument.

However many functions exist, that implicitly have core dimensions, but do not provide an axis keyword argument. Applying such functions to a nD array usually involves one or multiple loops over the other dimensions — termed “loop dimensions” or “broadcast dimensions”.

A good example is numpy’s 1D interpolate function numpy.interp:

    Signature: np.interp(x, xp, fp, left=None, right=None, period=None)
    Docstring:
        One-dimensional linear interpolation.

    Returns the one-dimensional piecewise linear interpolant to a function
    with given discrete data points (`xp`, `fp`), evaluated at `x`.

This function expects 1D arrays as input, so there is one core dimension and we cannot easily apply it to a nD array since there is no axis keyword argument.

Our goal here is to

Understand the difference between core dimensions and loop dimensions
Understand vectorization
Learn how to apply such functions without loops using apply_ufunc by providing the vectorize keyword argument.

Core dimensions and looping#

Let’s say we want to interpolate an array with two dimensions (space, time) over the time dimension, we might

loop over the space dimension,
subset the array to a 1D array at that space location,
Interpolate the 1D arrays to the new time vector, and
Assign that new interpolated 1D array to the appropriate location of a 2D output array

In pseudo-code this might look like

for index in range(size_of_space_axis):
    out[index, :] = np.interp(..., array[index, :], ...)

Exercise

Consider the example problem of interpolating a 2D array with dimensions space and time along the time dimension. Which dimension is the core dimension, and which is the “loop dimension”?

Solution

time is the core dimension, and space is the loop dimension.

Vectorization#

The pattern of looping over any number of “loop dimensions” and applying a function along “core dimensions” is so common that numpy provides wrappers that automate these steps:

apply_ufunc provides an easy interface to numpy.vectorize through the keyword argument vectorize. Here we see how to use that to automatically apply np.interp along a single axis of a nD array

Review#

We’ll work with the apply_ufunc call from the section on handling dimensions that change size. See the “Handling Complex Output” section for how to get here.

This version only works with 1D vectors. We will expand that to work with inputs of any number of dimensions.

newlat = np.linspace(15, 75, 100)

xr.apply_ufunc(
    np.interp,  # first the function
    newlat,
    air.lat,
    air.isel(lon=0, time=0),  # this version only works with 1D vectors
    input_core_dims=[["lat"], ["lat"], ["lat"]],
    output_core_dims=[["lat"]],
    exclude_dims={"lat"},
)

<xarray.DataArray (lat: 100)> Size: 800B
296.3 296.2 296.1 296.0 295.9 296.0 ... 245.1 243.7 243.1 242.5 241.8 241.2
Coordinates:
    lon      float32 4B 200.0
    time     datetime64[ns] 8B 2013-01-01
Dimensions without coordinates: lat

Try nD input#

Our goal is to interpolate latitude at every longitude and time, such that we go from a dataset with dimensions (time: 4, lat: 25, lon: 3) to (time: 4, lat: 100, lon: 3).

If we blindly try passing air (a 3D DataArray), we get a hard-to-understand error

newlat = np.linspace(15, 75, 100)

xr.apply_ufunc(
    np.interp,  # first the function
    newlat,
    air.lat,
    air,
    input_core_dims=[["lat"], ["lat"], ["lat"]],
    output_core_dims=[["lat"]],
    exclude_dims={"lat"},
)

ValueError: object too deep for desired array

We will use a “wrapper” function debug_interp to examine what gets passed to numpy.interp.

Tip

Such wrapper functions are a great way to understand and debug apply_ufunc use cases.

def debug_interp(xi, x, data):
    print(f"data: {data.shape} | x: {x.shape} | xi: {xi.shape}")
    return np.interp(xi, x, data)


interped = xr.apply_ufunc(
    debug_interp,  # first the function
    newlat,
    air.lat,
    air,
    input_core_dims=[["lat"], ["lat"], ["lat"]],
    output_core_dims=[["lat"]],
    exclude_dims={"lat"},  # dimensions allowed to change size. Must be set!
)

data: (4, 3, 25) | x: (25,) | xi: (100,)

ValueError: object too deep for desired array

That’s a hard-to-interpret error from NumPy but our print call helpfully printed the shapes of the input data:

data: (4, 3, 25) | x: (25,) | xi: (100,)

We see that apply_ufunc passes the full 3D array to interp1d_np which in turn passes that on to numpy.interp. But numpy.interp requires a 1D input, and thus the error.

Instead of passing the full 3D array we want loop over all combinations of lon and time; and apply our function to each corresponding vector of data along lat.

Conclusion#

This is why apply_ufunc is so convenient; it takes care of a lot of code necessary to apply functions that consume and produce numpy arrays to xarray objects.

The vectorize keyword argument, when set to True, will use numpy.vectorize to apply the function by looping over the “loop dimensions” — dimensions that are not the core dimensions for the applied function.

Automatic Vectorization

Contents

Automatic Vectorization#

Core dimensions and looping#

Vectorization#

Load data#

Review#

Try nD input#

Vectorization with `np.vectorize`#

Conclusion#

Automatic Vectorization

Contents

Automatic Vectorization#

Core dimensions and looping#

Vectorization#

Load data#

Review#

Try nD input#

Vectorization with np.vectorize#

Conclusion#

Vectorization with `np.vectorize`#