High-level computational patterns#

Motivation / Learning goals#

From https://toolz.readthedocs.io/en/latest/control.html

The Toolz library contains dozens of patterns like map and groupby. Learning a core set (maybe a dozen) covers the vast majority of common programming tasks often done by hand. A rich vocabulary of core control functions conveys the following benefits:

  • You identify new patterns

  • You make fewer errors in rote coding

  • You can depend on well tested and benchmarked implementations

The same is true for xarray

Xarray’s high-level patterns#

Xarray allows you to leverage dataset metadata to write more readable analysis code. The metadata is stored with the data; not in your head.

  1. Dimension names: dim="latitude" instead of axis=0

  2. Coordinate “labels”: or axis tick labels. data.sel(latitude=45) instead of data[10]

Xarray also provides high-level computational patterns that cover many data analysis tasks.

  1. rolling : Operate on rolling windows of your data e.g. running mean

  2. coarsen : Downsample your data

  3. groupby : Bin data in to groups and reduce

  4. groupby_bins: GroupBy after discretizing a numeric variable.

  5. resample : Groupby specialized for time axes. Either downsample or upsample your data.

  6. weighted : Weight your data before reducing

Load example dataset#

import numpy as np
import xarray as xr

xr.set_options(keep_attrs=True, display_expand_data=False)

da = xr.tutorial.load_dataset("air_temperature", engine="netcdf4").air
monthly = da.resample(time="M").mean()
data = da.isel(time=0)
data.plot()
<matplotlib.collections.QuadMesh at 0x7f02f8ea1720>
../_images/01-high-level-computation-patterns_4_1.png

Concept: “index space” vs “label space”#

These are windowed operations with a window of a fixed size.

  • rolling: sliding window operations e.g. running mean

  • coarsen: decimating; reshaping

data
<xarray.DataArray 'air' (lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 298.0 297.8 297.6 296.9 296.8 296.6
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    time     datetime64[ns] 2013-01-01
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
# index space
data[10, :]  # 10th element along the first axis; ¯\_(ツ)_/¯
<xarray.DataArray 'air' (lon: 53)>
277.3 277.4 277.8 278.6 279.5 280.1 ... 280.5 282.9 284.7 286.1 286.9 286.6
Coordinates:
    lat      float32 50.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    time     datetime64[ns] 2013-01-01
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
# slightly better index space
data.isel(lat=10)  # slightly better, 10th element in latitude
<xarray.DataArray 'air' (lon: 53)>
277.3 277.4 277.8 278.6 279.5 280.1 ... 280.5 282.9 284.7 286.1 286.9 286.6
Coordinates:
    lat      float32 50.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    time     datetime64[ns] 2013-01-01
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
# "label" space
data.sel(lat=50)  # much better! lat=50°N
<xarray.DataArray 'air' (lon: 53)>
277.3 277.4 277.8 278.6 279.5 280.1 ... 280.5 282.9 284.7 286.1 286.9 286.6
Coordinates:
    lat      float32 50.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    time     datetime64[ns] 2013-01-01
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
# What I wanted to do
data.sel(lat=50)

# What I had to do (if I wasn't using xarray)
data[10, :]
<xarray.DataArray 'air' (lon: 53)>
277.3 277.4 277.8 278.6 279.5 280.1 ... 280.5 282.9 284.7 286.1 286.9 286.6
Coordinates:
    lat      float32 50.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    time     datetime64[ns] 2013-01-01
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

Xarray provides high-level patterns in both “index space” and “label space”#

Index space#

  1. rolling : Operate on rolling windows of your data e.g. running mean

  2. coarsen : Downsample your data

Label space#

  1. groupby : Bin data in to groups and reduce

  2. groupby_bins: GroupBy after discretizing a numeric variable.

  3. resample : Groupby specialized for time axes. Either downsample or upsample your data.


Index space: windows of fixed width#

Sliding windows of fixed length: rolling#

  • returns object of same shape as input

  • pads with NaNs to make this happen

  • supports multiple dimensions

Here’s the dataset

data.plot()
<matplotlib.collections.QuadMesh at 0x7f02f0e4fc70>
../_images/01-high-level-computation-patterns_13_1.png

And now smoothed 5 point running mean in lat and lon

data.rolling(lat=5, lon=5, center=True).mean().plot()
<matplotlib.collections.QuadMesh at 0x7f02f0cb2440>
../_images/01-high-level-computation-patterns_15_1.png

Apply an existing numpy-only function with reduce#

Tip: The reduce method expects a function that can receive and return plain arrays (e.g. numpy). The map method expects a function that can receive and return Xarray objects.

Here’s an example function: np.mean

Exercise Calculate the rolling mean in 5 point bins along both latitude and longitude using rolling(...).reduce

Answer:

# exactly equivalent to data.rolling(...).mean()
data.rolling(lat=5, lon=5, center=True).reduce(np.mean).plot()
<matplotlib.collections.QuadMesh at 0x7f02f0c548e0>
../_images/01-high-level-computation-patterns_19_1.png

For more complicated analysis, construct a new array with a new dimension.#

Allows things like short-time fourier transform, spectrogram, windowed rolling etc.

simple = xr.DataArray(np.arange(10), dims="time", coords={"time": np.arange(10)})
simple
<xarray.DataArray (time: 10)>
0 1 2 3 4 5 6 7 8 9
Coordinates:
  * time     (time) int64 0 1 2 3 4 5 6 7 8 9
# adds a new dimension "window"
simple.rolling(time=5, center=True).construct("window")
<xarray.DataArray (time: 10, window: 5)>
nan nan 0.0 1.0 2.0 nan 0.0 1.0 2.0 3.0 ... 7.0 8.0 9.0 nan 7.0 8.0 9.0 nan nan
Coordinates:
  * time     (time) int64 0 1 2 3 4 5 6 7 8 9
Dimensions without coordinates: window

Exercise Calculate the 5 point running mean in time using rolling.construct

Answer

(simple.rolling(time=5, center=True).construct("window").mean("window"))
<xarray.DataArray (time: 10)>
1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.0 7.5 8.0
Coordinates:
  * time     (time) int64 0 1 2 3 4 5 6 7 8 9

construct is clever.

  1. It constructs a view of the original array, so it is memory-efficient. but you didn’t have to know that.

  2. It does something sensible for dask arrays (though generally you want big chunksizes for the dimension you’re sliding along).

  3. It also works with rolling along multiple dimensions!

Advanced: Another construct example#

This is a 2D rolling example; we need to provide two new dimension names

(data.rolling(lat=5, lon=5, center=True).construct(lat="lat_roll", lon="lon_roll"))
<xarray.DataArray 'air' (lat: 25, lon: 53, lat_roll: 5, lon_roll: 5)>
nan nan nan nan nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    time     datetime64[ns] 2013-01-01
Dimensions without coordinates: lat_roll, lon_roll
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

Block windows of fixed length: coarsen#

For non-overlapping windows or “blocks” use coarsen. The syntax is very similar to rolling. You will need to specify boundary if the length of the dimension is not a multiple of the block size

data
<xarray.DataArray 'air' (lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 298.0 297.8 297.6 296.9 296.8 296.6
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    time     datetime64[ns] 2013-01-01
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
data.plot()
<matplotlib.collections.QuadMesh at 0x7f02f0b327a0>
../_images/01-high-level-computation-patterns_31_1.png
data.coarsen(lat=5, lon=5, boundary="trim").std()
<xarray.DataArray 'air' (lat: 5, lon: 10)>
12.73 12.19 11.7 6.48 2.827 2.738 2.491 ... 4.97 1.925 1.98 1.824 1.17 0.9438
Coordinates:
  * lat      (lat) float32 70.0 57.5 45.0 32.5 20.0
  * lon      (lon) float32 205.0 217.5 230.0 242.5 ... 280.0 292.5 305.0 317.5
    time     datetime64[ns] 2013-01-01
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
(data.coarsen(lat=5, lon=5, boundary="trim").mean().plot())
<matplotlib.collections.QuadMesh at 0x7f02f0a0bd60>
../_images/01-high-level-computation-patterns_33_1.png

coarsen supports reduce for custom reductions#

Exercise Use coarsen.reduce to apply np.mean in 5x5 (latxlon) point blocks of data

Answer

(data.coarsen(lat=5, lon=5, boundary="trim").reduce(np.mean).plot())
<matplotlib.collections.QuadMesh at 0x7f02f0901b70>
../_images/01-high-level-computation-patterns_36_1.png

coarsen supports construct for block reshaping#

This is usually a good alternative to np.reshape

A simple example splits a 2-year long monthly 1D time series into a 2D array shaped (year x month)

months = xr.DataArray(
    np.tile(np.arange(1, 13), reps=2),
    dims="time",
    coords={"time": np.arange(1, 25)},
)
months
<xarray.DataArray (time: 24)>
1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
Coordinates:
  * time     (time) int64 1 2 3 4 5 6 7 8 9 10 ... 15 16 17 18 19 20 21 22 23 24
# break "time" into two new dimensions: "year", "month"
months.coarsen(time=12).construct(time=("year", "month"))
<xarray.DataArray (year: 2, month: 12)>
1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
Coordinates:
    time     (year, month) int64 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24
Dimensions without coordinates: year, month

Note two things:

  1. The time dimension was also reshaped.

  2. The new dimensions year and month don’t have any coordinate labels associated with them.

What if the data had say 23 instead of 24 values? In that case we specify a different boundary , here we pad to 24 values.

months.isel(time=slice(1, None)).coarsen(time=12, boundary="pad").construct(time=("year", "month"))
<xarray.DataArray (year: 2, month: 12)>
2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ... 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 nan
Coordinates:
    time     (year, month) float64 2.0 3.0 4.0 5.0 6.0 ... 22.0 23.0 24.0 nan
Dimensions without coordinates: year, month

This ends up adding values at the end of the array, not so sensible for this problem. We have some control of the padding through the side kwarg to coarsen. For side="right" we get more sensible output.

months.isel(time=slice(1, None)).coarsen(time=12, boundary="pad", side="right").construct(
    time=("year", "month")
)
<xarray.DataArray (year: 2, month: 12)>
nan 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 ... 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
Coordinates:
    time     (year, month) float64 nan 2.0 3.0 4.0 5.0 ... 21.0 22.0 23.0 24.0
Dimensions without coordinates: year, month

Note that coarsen pads with NaNs. For more control over paddnig, use DataArray.pad explicitly.

(
    months.isel(time=slice(1, None))
    .pad(time=(1, 0), constant_values=-1)
    .coarsen(time=12)
    .construct(time=("year", "month"))
)
<xarray.DataArray (year: 2, month: 12)>
-1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
Coordinates:
    time     (year, month) float64 nan 2.0 3.0 4.0 5.0 ... 21.0 22.0 23.0 24.0
Dimensions without coordinates: year, month

Exercise Reshape the time dimension of the DataArray monthly to year x month and visualize the seasonal cycle for two years at 250°E

Answer

# splits time dimension into year x month
year_month = monthly.coarsen(time=12).construct(time=("year", "month"))

# assign a nice coordinate value for month
year_month["month"] = [
    "jan",
    "feb",
    "mar",
    "apr",
    "may",
    "jun",
    "jul",
    "aug",
    "sep",
    "oct",
    "nov",
    "dec",
]

# assign a nice coordinate value for year
year_month["year"] = [2013, 2014]

# seasonal cycle for two years
year_month.sel(lon=250).plot.contourf(col="year", x="month", y="lat")
<xarray.plot.facetgrid.FacetGrid at 0x7f02f0794b50>
../_images/01-high-level-computation-patterns_48_1.png

This exercise came up during the live lecture.

Exercise Calculate the rolling 4 month average, averaged across years.

Answer

  1. We first reshape using coarsen.construct to add year as a new dimension.

  2. Then rolling on the month dimension.

  3. It turns out that roll.mean(["year", "month"]) doesn’t work. So we use roll.construct to get a DataArray with a new dimension window and then take the mean over window and year

reshaped = months.coarsen(time=12).construct(time=("year", "month"))
roll = reshaped.rolling(month=4, center=True)
roll.construct("window").mean(["window", "year"])
<xarray.DataArray (month: 12)>
1.5 2.0 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.0
Dimensions without coordinates: month

Index space summary#

Use rolling and coarsen for fixed size windowing operations.

  1. rolling for overlapping windows

  2. coarsen for non-overlapping windows.

Both provide the usual reductions as methods (.mean() and friends), and also reduce and construct for custom operations.


Label space “windows” or bins : GroupBy#

Generalization of coarsen: sometimes the windows you want are not regular.

  • groupby: e.g. climatologies, composites; works when “groups” are exact: e.g. characters or integers; not floats

  • groupby_bins: binning operations e.g. histograms

  • resample: groupby but specialized for time grouping (so far)

tip Both groupby_bins and resample are implemented as GroupBy with a specific way of constructing group labels.

Deconstructing GroupBy#

Commonly called “split-apply-combine”.

  1. “split” : break dataset into groups

  2. “apply” : apply an operation, usually a reduction like mean

  3. “combine” : concatenate results from apply step along new “group” dimension

But really there is a first step: “identifying groups” also called “factorization” (or “binning”). Usually this is the hard part.

So “identify groups” → “split into groups” → “apply function” → “combine results”.

da.groupby("time.month")
DataArrayGroupBy, grouped over 'month'
12 groups with labels 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
da.groupby("time.month").mean()
<xarray.DataArray 'air' (month: 12, lat: 25, lon: 53)>
246.3 246.4 246.2 245.8 245.2 244.6 ... 298.1 298.0 298.0 297.6 297.6 297.5
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

This is how xarray identifies “groups” for the monthly climatology computation

da.time.dt.month.plot()
[<matplotlib.lines.Line2D at 0x7f02f0a73d60>]
../_images/01-high-level-computation-patterns_56_1.png

Similarly for binning,

data.groupby_bins("lat", bins=[20, 35, 40, 45, 50])
DataArrayGroupBy, grouped over 'lat_bins'
4 groups with labels (45.0,, 50.0], ..., (20.0,, 35.0].

and resampling…

da.resample(time="M")
DataArrayResample, grouped over '__resample_dim__'
24 groups with labels 2013-01-31, ..., 2014-12-31.

Constructing group labels#

Xarray uses pandas.factorize for groupby and pandas.cut for groupby_bins.

If the automatic group detection doesn’t work for your problem then these functions are useful for constructing “group labels” in many cases

  1. numpy.digitize (binning)

  2. numpy.searchsorted supports many other data types

  3. pandas.factorize supports characters, strings etc.

  4. pandas.cut for binning

  5. DataArray.isin

More commonly useful are “datetime components”#

See a full list here

Accessed using DataArray.dt.*

da.time
<xarray.DataArray 'time' (time: 2920)>
2013-01-01 2013-01-01T06:00:00 ... 2014-12-31T12:00:00 2014-12-31T18:00:00
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    standard_name:  time
    long_name:      Time
da.time.dt.day
<xarray.DataArray 'day' (time: 2920)>
1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 ... 28 28 28 29 29 29 29 30 30 30 30 31 31 31 31
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
da["time.day"]
<xarray.DataArray 'day' (time: 2920)>
1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 ... 28 28 28 29 29 29 29 30 30 30 30 31 31 31 31
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
da.time.dt.season
<xarray.DataArray 'season' (time: 2920)>
'DJF' 'DJF' 'DJF' 'DJF' 'DJF' 'DJF' ... 'DJF' 'DJF' 'DJF' 'DJF' 'DJF' 'DJF'
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

Demo Grouping over a custom definition of seasons using numpy.isin.

We want to group over 4 seasons: DJF, MAM, JJAS, ON - this makes physical sense in the Indian Ocean basin

Start by extracting months.

month = da.time.dt.month.data
month
array([ 1,  1,  1, ..., 12, 12, 12])

Create a new empty array

season = np.full(month.shape, "    ")
season
array(['    ', '    ', '    ', ..., '    ', '    ', '    '], dtype='<U4')

Use isin to assign custom seasons,

season[np.isin(month, [12, 1, 2])] = "DJF"
season[np.isin(month, [3, 4, 5])] = "MAM"
season[np.isin(month, [6, 7, 8, 9])] = "JJAS"
season[np.isin(month, [10, 11])] = "ON"
season = da.time.copy(data=season)
season
<xarray.DataArray 'time' (time: 2920)>
'DJF' 'DJF' 'DJF' 'DJF' 'DJF' 'DJF' ... 'DJF' 'DJF' 'DJF' 'DJF' 'DJF' 'DJF'
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    standard_name:  time
    long_name:      Time
(
    # Calculate climatology
    da.groupby(season)
    .mean()
    # reindex to get seasons in logical order (not alphabetical order)
    .reindex(time=["DJF", "MAM", "JJAS", "ON"])
    .plot(col="time")
)
<xarray.plot.facetgrid.FacetGrid at 0x7f02f05fff70>
../_images/01-high-level-computation-patterns_73_1.png

floor, ceil and round time#

Basically “resampling”

da.time
<xarray.DataArray 'time' (time: 2920)>
2013-01-01 2013-01-01T06:00:00 ... 2014-12-31T12:00:00 2014-12-31T18:00:00
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    standard_name:  time
    long_name:      Time
# remove roundoff error in timestamps
# floor to daily frequency
da.time.dt.floor("D")
<xarray.DataArray 'floor' (time: 2920)>
2013-01-01 2013-01-01 2013-01-01 2013-01-01 ... 2014-12-31 2014-12-31 2014-12-31
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

strftime can be extremely useful#

So useful and so unintuitive that it has its own website: https://strftime.org/

This example avoids merging “Feb-29” and “Mar-01” for a daily climatology

da.time.dt.strftime("%b-%d")
<xarray.DataArray 'strftime' (time: 2920)>
'Jan-01' 'Jan-01' 'Jan-01' 'Jan-01' ... 'Dec-31' 'Dec-31' 'Dec-31' 'Dec-31'
Coordinates:
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00

groupby supports reduce for custom reductions#

This applies to groupby_bins and resample

(da.groupby("time.month").reduce(np.mean).plot(col="month", col_wrap=4))
<xarray.plot.facetgrid.FacetGrid at 0x7f02f052fee0>
../_images/01-high-level-computation-patterns_80_1.png

tip map is for functions that expect and return xarray objects (see also Dataset.map). reduce is for functions that expect and return plain arrays (like numpy or scipy functions)

GroupBy does not provide construct#

All the groups need not be the same “length” (e.g. months can have 28, 29, 30, or 31 days)

Instead looping over groupby objects is possible#

Maybe you want to plot data in each group separately?

for label, group in da.groupby("time.month"):
    print(label)
1
2
3
4
5
6
7
8
9
10
11
12

This is a DataArray contain data for all December days

group
<xarray.DataArray 'air' (time: 248, lat: 25, lon: 53)>
268.8 266.6 263.8 260.7 257.6 254.8 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-12-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

Maybe you want a histogram of December temperatures?

group.plot.hist()
(array([  218.,  6763., 32167., 34735., 26251., 29243., 40491., 41906.,
        72286., 44540.]),
 array([221.        , 229.33900452, 237.67799377, 246.01699829,
        254.35598755, 262.69500732, 271.03399658, 279.37298584,
        287.7119751 , 296.05099487, 304.38998413]),
 <BarContainer object of 10 artists>)
../_images/01-high-level-computation-patterns_87_1.png

In most cases, avoid a for loop using map#

Apply functions that expect xarray Datasets or DataArrays.

Avoid having to manually combine results using concat

def iqr(da, dim):
    """Calculates interquartile range"""
    return (da.quantile(q=0.75, dim=dim) - da.quantile(q=0.25, dim=dim)).rename("iqr")


da.groupby("time.month").map(iqr, dim="time")
<xarray.DataArray 'iqr' (month: 12, lat: 25, lon: 53)>
7.528 7.425 7.025 6.65 6.152 6.135 6.127 ... 0.9 0.9 0.855 1.0 1.3 1.432 1.5
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

Summary#

Xarray provides methods for high-level analysis patterns:

  1. rolling : Operate on rolling windows of your data e.g. running mean

  2. coarsen : Downsample your data

  3. groupby : Bin data in to groups and reduce

  4. groupby_bins: GroupBy after discretizing a numeric variable.

  5. resample : Groupby specialized for time axes. Either downsample or upsample your data.

  6. weighted : Weight your data before reducing

More resources#

  1. More tutorials here: https://tutorial.xarray.dev/

  2. Answers to common questions on “how to do X” are here: https://docs.xarray.dev/en/stable/howdoi.html