Working with labeled data

Learning goals:

  • Use different forms of indexing to select data based on position and coordinates

  • Select datetime ranges

Scientific data is inherently labeled. For example, time series data includes timestamps that label individual periods or points in time, spatial data has coordinates (e.g. longitude, latitude, elevation), and model or laboratory experiments are often identified by unique identifiers. In this notebook we’ll see that labeled dimensions make code much easier to understand!

import numpy as np
import pandas as pd
import xarray as xr

We’ll start by comparing common indexing operations on a numpy array and an equivalent xarray DataArray:

# axis0: x, axis1: y
np_array = np.arange(10).reshape(2, 5)
np_array
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
da = xr.DataArray(np_array, dims=("x", "y"))
da
<xarray.DataArray (x: 2, y: 5)>
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
Dimensions without coordinates: x, y

Position-based indexing

Indexing

Recall that indexing selects a single value from an array based on its position:

np_array[0, 3]
3
da.isel(x=0, y=3)  # or da[{"x": 0, "y": 3}]
<xarray.DataArray ()>
array(3)
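
One practical consequence of naming dimensions is that isel does not depend on the order of the axes. The following is a small sketch of that idea, not part of the original notebook; the variable names value_numpy, value_original and value_transposed are made up for illustration.

```python
import numpy as np
import xarray as xr

np_array = np.arange(10).reshape(2, 5)
da = xr.DataArray(np_array, dims=("x", "y"))

# After a transpose, the positional numpy index has to be swapped by hand.
np_transposed = np_array.T
value_numpy = np_transposed[3, 0]

# With named dimensions, the same isel call works for either axis order.
da_transposed = da.transpose("y", "x")
value_original = da.isel(x=0, y=3).item()
value_transposed = da_transposed.isel(x=0, y=3).item()

print(value_numpy, value_original, value_transposed)
```

All three lookups refer to the same element; only the numpy version needed to know how the array was laid out.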

Slicing

Slicing retrieves a range of values:

np_array[:2, 1:]
array([[1, 2, 3, 4],
       [6, 7, 8, 9]])
da.isel(x=slice(None, 2), y=slice(1, None))
<xarray.DataArray (x: 2, y: 4)>
array([[1, 2, 3, 4],
       [6, 7, 8, 9]])
Dimensions without coordinates: x, y
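
Dimensions that are not passed to isel are kept in full, so there is no need for the ":" placeholders that numpy requires. A minimal sketch, reusing the same small array (the names sliced, same_as_numpy and every_other are illustrative only):

```python
import numpy as np
import xarray as xr

np_array = np.arange(10).reshape(2, 5)
da = xr.DataArray(np_array, dims=("x", "y"))

# Only y is sliced; x is not mentioned and is therefore kept whole.
# The numpy equivalent needs an explicit ":" for axis 0.
sliced = da.isel(y=slice(1, None))
same_as_numpy = np.array_equal(sliced.values, np_array[:, 1:])

# A step works as in numpy: every second column, starting at column 0.
every_other = da.isel(y=slice(None, None, 2))

print(same_as_numpy)
print(every_other.values)
```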

Label-based indexing

Remembering the axis order can be challenging even with 2D arrays (is np_array[0,3] the first row and fourth column, or the first column and fourth row? Did I store these samples by row or by column when I saved the data?). The difficulty is compounded with added dimensions. Xarray objects eliminate much of the mental overhead by adding coordinate labels:

arr = xr.DataArray(
    data=np.arange(48).reshape(4, 2, 6),
    dims=("u", "v", "time"),
    coords={
        "u": [-3.2, 2.1, 5.3, 6.5],
        "v": [-1, 2.6],
        "time": pd.date_range("2009-01-05", periods=6, freq="M"),
    },
)
arr
<xarray.DataArray (u: 4, v: 2, time: 6)>
array([[[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11]],

       [[12, 13, 14, 15, 16, 17],
        [18, 19, 20, 21, 22, 23]],

       [[24, 25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34, 35]],

       [[36, 37, 38, 39, 40, 41],
        [42, 43, 44, 45, 46, 47]]])
Coordinates:
  * u        (u) float64 -3.2 2.1 5.3 6.5
  * v        (v) float64 -1.0 2.6
  * time     (time) datetime64[ns] 2009-01-31 2009-02-28 ... 2009-06-30

To select data by coordinate labels instead of integer indices, we can use the same syntax with sel in place of isel:

arr.sel(u=5.3, time="2009-04-30")  # or arr.loc[{"u": 5.3, "time": "2009-04-30"}]
<xarray.DataArray (v: 2)>
array([27, 33])
Coordinates:
    u        float64 5.3
  * v        (v) float64 -1.0 2.6
    time     datetime64[ns] 2009-04-30

This requires us to specify exact coordinate values. If we don’t have those, we can use the method parameter (see Dataset.sel for documentation):

arr.sel(u=5, time="2009-04-28", method="nearest")
<xarray.DataArray (v: 2)>
array([27, 33])
Coordinates:
    u        float64 5.3
  * v        (v) float64 -1.0 2.6
    time     datetime64[ns] 2009-04-30
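
Nearest-neighbor lookup always returns something, even when the closest label is far away. The sketch below shows how the tolerance argument of sel can guard against that. It is an illustration rather than part of the original notebook: the time coordinate is spelled out as explicit month-end dates (so the result does not depend on the pandas frequency alias), only u is looked up, and the flag names are made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

arr = xr.DataArray(
    data=np.arange(48).reshape(4, 2, 6),
    dims=("u", "v", "time"),
    coords={
        "u": [-3.2, 2.1, 5.3, 6.5],
        "v": [-1, 2.6],
        # the six month-end dates of the example array, written out explicitly
        "time": pd.to_datetime(
            ["2009-01-31", "2009-02-28", "2009-03-31",
             "2009-04-30", "2009-05-31", "2009-06-30"]
        ),
    },
)

# Without method=, a label that is not in the index raises KeyError.
try:
    arr.sel(u=5)
    exact_match_failed = False
except KeyError:
    exact_match_failed = True

# u=5 is 0.3 away from the nearest label (5.3), which is within tolerance=0.5.
close_enough = arr.sel(u=5, method="nearest", tolerance=0.5)

# With tolerance=0.1 no label qualifies, so the lookup fails instead of
# silently returning a distant point.
try:
    arr.sel(u=5, method="nearest", tolerance=0.1)
    too_far_failed = False
except KeyError:
    too_far_failed = True

print(exact_match_failed, close_enough.u.item(), too_far_failed)
```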

We can also select multiple values:

arr.sel(u=[-3.2, 6.5], time=slice("2009-02-28", "2009-05-31"))
<xarray.DataArray (u: 2, v: 2, time: 4)>
array([[[ 1,  2,  3,  4],
        [ 7,  8,  9, 10]],

       [[37, 38, 39, 40],
        [43, 44, 45, 46]]])
Coordinates:
  * u        (u) float64 -3.2 6.5
  * v        (v) float64 -1.0 2.6
  * time     (time) datetime64[ns] 2009-02-28 2009-03-31 2009-04-30 2009-05-31
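
Note that, unlike positional slices, label-based slices include both end labels. Datetime coordinates also accept partial date strings, which is a convenient way to select datetime ranges. The sketch below illustrates both points under the same assumptions as the example array (explicit month-end dates; the variable names are illustrative only).

```python
import numpy as np
import pandas as pd
import xarray as xr

arr = xr.DataArray(
    data=np.arange(48).reshape(4, 2, 6),
    dims=("u", "v", "time"),
    coords={
        "u": [-3.2, 2.1, 5.3, 6.5],
        "v": [-1, 2.6],
        # the six month-end dates of the example array, written out explicitly
        "time": pd.to_datetime(
            ["2009-01-31", "2009-02-28", "2009-03-31",
             "2009-04-30", "2009-05-31", "2009-06-30"]
        ),
    },
)

# Label-based slice: both "2009-02-28" and "2009-05-31" are included.
by_label = arr.sel(time=slice("2009-02-28", "2009-05-31"))

# Positional slice: the stop index 4 is excluded, as in numpy.
by_position = arr.isel(time=slice(1, 4))

# Partial date strings cover whole periods: everything from the start of
# March through the end of May.
spring = arr.sel(time=slice("2009-03", "2009-05"))

# A single partial string selects every timestamp inside that month and
# keeps the time dimension.
april = arr.sel(time="2009-04")

print(by_label.sizes["time"], by_position.sizes["time"])
print(spring.sizes["time"], april.sizes["time"])
```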

If instead of selecting data we want to drop it, we can use drop_sel:

arr.drop_sel(u=[-3.2, 6.5])
<xarray.DataArray (u: 2, v: 2, time: 6)>
array([[[12, 13, 14, 15, 16, 17],
        [18, 19, 20, 21, 22, 23]],

       [[24, 25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34, 35]]])
Coordinates:
  * u        (u) float64 2.1 5.3
  * v        (v) float64 -1.0 2.6
  * time     (time) datetime64[ns] 2009-01-31 2009-02-28 ... 2009-06-30
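
By default drop_sel raises an error if any of the labels is missing. The sketch below shows the errors="ignore" option and the positional counterpart drop_isel; it reuses the example array with explicit month-end dates, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

arr = xr.DataArray(
    data=np.arange(48).reshape(4, 2, 6),
    dims=("u", "v", "time"),
    coords={
        "u": [-3.2, 2.1, 5.3, 6.5],
        "v": [-1, 2.6],
        # the six month-end dates of the example array, written out explicitly
        "time": pd.to_datetime(
            ["2009-01-31", "2009-02-28", "2009-03-31",
             "2009-04-30", "2009-05-31", "2009-06-30"]
        ),
    },
)

# 99.0 is not a label of u, so the default behaviour is to raise KeyError.
try:
    arr.drop_sel(u=[-3.2, 99.0])
    missing_label_failed = False
except KeyError:
    missing_label_failed = True

# errors="ignore" drops the labels that exist and skips the rest.
lenient = arr.drop_sel(u=[-3.2, 99.0], errors="ignore")

# drop_isel removes entries by integer position instead of by label.
by_position = arr.drop_isel(u=[0, 3])

print(missing_label_failed)
print(lenient.u.values, by_position.u.values)
```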

Exercises

Practice the syntax you’ve learned with the xarray tutorial dataset!

ds = xr.tutorial.open_dataset("air_temperature")
ds
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
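
  1. Select the first 30 entries of latitude and the 21st to 40th entries of longitude (positions 20 to 39). Because lat has only 25 entries, the slice is clipped and all of them are returned.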

ds.isel(lat=slice(None, 30), lon=slice(20, 40))
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 20)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 250.0 252.5 255.0 257.5 ... 290.0 292.5 295.0 297.5
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

  2. Select all data at 75 degrees north between Jan 1, 2013 and Oct 15, 2013

ds.sel(lat=75, time=slice("2013-01-01", "2013-10-15"))
<xarray.Dataset>
Dimensions:  (time: 1152, lon: 53)
Coordinates:
    lat      float32 75.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2013-10-15T18:00:00
Data variables:
    air      (time, lon) float32 241.2 242.5 243.5 244.0 ... 259.3 259.3 259.2
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

  3. Remove all entries at 260 and 270 degrees of longitude

ds.drop_sel(lon=[260, 270])
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 51)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...