Working with labeled data
Learning goals:
Use different forms of indexing to select data based on position and coordinates
Select datetime ranges
Scientific data is inherently labeled. For example, time series data includes timestamps that label individual periods or points in time, spatial data has coordinates (e.g. longitude, latitude, elevation), and model or laboratory experiments are often identified by unique identifiers. In this notebook we’ll see that labeled dimensions make code much easier to understand!
import numpy as np
import pandas as pd
import xarray as xr
We’ll start by comparing common indexing operations on a NumPy array and the equivalent xarray DataArray:
# axis0: x, axis1: y
np_array = np.arange(10).reshape(2, 5)
np_array
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
da = xr.DataArray(np_array, dims=("x", "y"))
da
<xarray.DataArray (x: 2, y: 5)> array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]) Dimensions without coordinates: x, y
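Because the dimensions now carry names, operations can refer to them directly instead of relying on axis numbers. The following is a minimal sketch that rebuilds the same array; the variable names `row_sums_np`, `row_sums_xr` and `y_axis` are only illustrative.

```python
import numpy as np
import xarray as xr

np_array = np.arange(10).reshape(2, 5)
da = xr.DataArray(np_array, dims=("x", "y"))

# NumPy: we have to remember that axis 1 corresponds to "y"
row_sums_np = np_array.sum(axis=1)

# xarray: name the dimension to reduce over
row_sums_xr = da.sum(dim="y")

# The name-to-axis mapping is still available when it is needed
y_axis = da.get_axis_num("y")

print(row_sums_np)         # [10 35]
print(row_sums_xr.values)  # [10 35]
print(y_axis)              # 1
```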
Position-based indexing
Indexing
Recall that indexing selects a single value from an array based on its position:
np_array[0, 3]
3
da.isel(x=0, y=3) # or da[{"x": 0, "y": 3}]
<xarray.DataArray ()> array(3)
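Since `isel` addresses dimensions by name, the order of the keyword arguments does not matter. A small sketch of this, using the same array as above (the names `a`, `b`, `c` and `last` are illustrative):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(10).reshape(2, 5), dims=("x", "y"))

a = da.isel(x=0, y=3)
b = da.isel(y=3, x=0)      # same selection, keywords reordered
c = da[{"x": 0, "y": 3}]   # dictionary form

# Each result is a 0-dimensional DataArray; .item() extracts the Python scalar
print(a.item(), b.item(), c.item())  # 3 3 3

# Negative positions count from the end, as in NumPy
last = da.isel(x=-1, y=-1).item()
print(last)  # 9
```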
Slicing
Slicing retrieves a range of values:
np_array[:2, 1:]
array([[1, 2, 3, 4],
[6, 7, 8, 9]])
da.isel(x=slice(None, 2), y=slice(1, None))
<xarray.DataArray (x: 2, y: 4)> array([[1, 2, 3, 4], [6, 7, 8, 9]]) Dimensions without coordinates: x, y
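Dimensions that are not mentioned in `isel` are kept whole, and a `slice` object can carry a step just like NumPy's `start:stop:step` syntax. A brief sketch under the same setup (the name `every_other` is illustrative):

```python
import numpy as np
import xarray as xr

np_array = np.arange(10).reshape(2, 5)
da = xr.DataArray(np_array, dims=("x", "y"))

# Take every second entry along "y"; "x" is not mentioned and stays intact
every_other = da.isel(y=slice(None, None, 2))

print(every_other.shape)            # (2, 3)
print(every_other.values.tolist())  # [[0, 2, 4], [5, 7, 9]]

# The NumPy equivalent needs an explicit placeholder for the first axis
same_as_numpy = np.array_equal(every_other.values, np_array[:, ::2])
print(same_as_numpy)                # True
```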
Label-based indexing
Remembering the axis order can be challenging even with 2D arrays: is np_array[0, 3] the first row and fourth column, or the first column and fourth row? Did I store these samples by row or by column when I saved the data? The difficulty is compounded with added dimensions. Xarray objects eliminate much of this mental overhead by adding coordinate labels:
arr = xr.DataArray(
data=np.arange(48).reshape(4, 2, 6),
dims=("u", "v", "time"),
coords={
"u": [-3.2, 2.1, 5.3, 6.5],
"v": [-1, 2.6],
"time": pd.date_range("2009-01-05", periods=6, freq="M"),  # month-end frequency; pandas 2.2+ prefers the alias "ME"
},
)
arr
<xarray.DataArray (u: 4, v: 2, time: 6)> array([[[ 0, 1, 2, 3, 4, 5], [ 6, 7, 8, 9, 10, 11]], [[12, 13, 14, 15, 16, 17], [18, 19, 20, 21, 22, 23]], [[24, 25, 26, 27, 28, 29], [30, 31, 32, 33, 34, 35]], [[36, 37, 38, 39, 40, 41], [42, 43, 44, 45, 46, 47]]]) Coordinates: * u (u) float64 -3.2 2.1 5.3 6.5 * v (v) float64 -1.0 2.6 * time (time) datetime64[ns] 2009-01-31 2009-02-28 ... 2009-06-30
To select data by coordinate labels instead of integer indices, we can use the same syntax, with sel instead of isel:
arr.sel(u=5.3, time="2009-04-30") # or arr.loc[{"u": 5.3, "time": "2009-04-30"}]
<xarray.DataArray (v: 2)> array([27, 33]) Coordinates: u float64 5.3 * v (v) float64 -1.0 2.6 time datetime64[ns] 2009-04-30
This requires us to specify exact coordinate values. If we don’t have those, we can use the method parameter (see Dataset.sel for documentation):
arr.sel(u=5, time="2009-04-28", method="nearest")
<xarray.DataArray (v: 2)> array([27, 33]) Coordinates: u float64 5.3 * v (v) float64 -1.0 2.6 time datetime64[ns] 2009-04-30
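To make the difference between exact and nearest-neighbour lookup concrete, here is a hedged sketch. It rebuilds `arr` with the month-end timestamps written out explicitly (an assumption made so that the example does not depend on pandas' frequency aliases) and also uses the `tolerance` argument of `sel` to limit how far a nearest match may be. The flag names `exact_ok` and `tight_ok` are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Month-end timestamps spelled out explicitly (same values as in the repr above)
times = pd.to_datetime(
    ["2009-01-31", "2009-02-28", "2009-03-31", "2009-04-30", "2009-05-31", "2009-06-30"]
)
arr = xr.DataArray(
    np.arange(48).reshape(4, 2, 6),
    dims=("u", "v", "time"),
    coords={"u": [-3.2, 2.1, 5.3, 6.5], "v": [-1, 2.6], "time": times},
)

# Without `method`, a label that is not in the index raises KeyError
try:
    arr.sel(u=5)
    exact_ok = True
except KeyError:
    exact_ok = False

# Nearest-neighbour lookup, accepted only if the match is within 0.5 of the request
near = arr.sel(u=5, method="nearest", tolerance=0.5)

# The same request with a tighter tolerance finds no acceptable match
try:
    arr.sel(u=5, method="nearest", tolerance=0.1)
    tight_ok = True
except KeyError:
    tight_ok = False

print(exact_ok, float(near.u), tight_ok)  # False 5.3 False
```

Setting a tolerance is a reasonable safeguard when a silently chosen far-away neighbour would be worse than an error.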
We can also select multiple values:
arr.sel(u=[-3.2, 6.5], time=slice("2009-02-28", "2009-05-31"))
<xarray.DataArray (u: 2, v: 2, time: 4)> array([[[ 1, 2, 3, 4], [ 7, 8, 9, 10]], [[37, 38, 39, 40], [43, 44, 45, 46]]]) Coordinates: * u (u) float64 -3.2 6.5 * v (v) float64 -1.0 2.6 * time (time) datetime64[ns] 2009-02-28 2009-03-31 2009-04-30 2009-05-31
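Note that label-based slices include both endpoints, whereas positional slices exclude the stop position. Date strings may also be partial, such as a year and month only. The sketch below illustrates both points; it rebuilds `arr` with explicitly listed month-end timestamps (an assumption to keep the example independent of pandas' frequency aliases), and the result names are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.to_datetime(
    ["2009-01-31", "2009-02-28", "2009-03-31", "2009-04-30", "2009-05-31", "2009-06-30"]
)
arr = xr.DataArray(
    np.arange(48).reshape(4, 2, 6),
    dims=("u", "v", "time"),
    coords={"u": [-3.2, 2.1, 5.3, 6.5], "v": [-1, 2.6], "time": times},
)

# Label-based slice: both "2009-02-28" and "2009-05-31" are included
by_label = arr.sel(time=slice("2009-02-28", "2009-05-31"))

# Positional slice: position 4 (the stop) is excluded
by_position = arr.isel(time=slice(1, 4))

# Partial date strings: everything from February through the end of April
by_month = arr.sel(time=slice("2009-02", "2009-04"))

print(by_label.sizes["time"])     # 4
print(by_position.sizes["time"])  # 3
print(by_month.sizes["time"])     # 3
```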
If instead of selecting data we want to drop it, we can use drop_sel:
arr.drop_sel(u=[-3.2, 6.5])
<xarray.DataArray (u: 2, v: 2, time: 6)> array([[[12, 13, 14, 15, 16, 17], [18, 19, 20, 21, 22, 23]], [[24, 25, 26, 27, 28, 29], [30, 31, 32, 33, 34, 35]]]) Coordinates: * u (u) float64 2.1 5.3 * v (v) float64 -1.0 2.6 * time (time) datetime64[ns] 2009-01-31 2009-02-28 ... 2009-06-30
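Like `sel`, `drop_sel` returns a new object and leaves the original untouched. The sketch below checks this and also shows the positional counterpart `drop_isel`, assuming a reasonably recent xarray version that provides it. The array is a reduced version of `arr` without the time dimension, built only for illustration.

```python
import numpy as np
import xarray as xr

# Reduced, illustrative version of `arr` with only the "u" and "v" dimensions
arr = xr.DataArray(
    np.arange(8).reshape(4, 2),
    dims=("u", "v"),
    coords={"u": [-3.2, 2.1, 5.3, 6.5], "v": [-1, 2.6]},
)

dropped = arr.drop_sel(u=[-3.2, 6.5])      # drop by label
dropped_pos = arr.drop_isel(u=[0, 3])      # drop by position

print(dropped.u.values.tolist())    # [2.1, 5.3]
print(arr.sizes["u"])               # 4, the original is unchanged
print(dropped.equals(dropped_pos))  # True
```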
Exercises
Practice the syntax you’ve learned with the xarray tutorial dataset!
ds = xr.tutorial.open_dataset("air_temperature")
ds
<xarray.Dataset> Dimensions: (lat: 25, time: 2920, lon: 53) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float32 ... Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
Select the first 30 entries of latitude and the 20th to 40th entries of longitude (the dataset has only 25 latitudes, so the slice returns all of them)
ds.isel(lat=slice(None, 30), lon=slice(20, 40))
<xarray.Dataset> Dimensions: (lat: 25, time: 2920, lon: 20) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 250.0 252.5 255.0 257.5 ... 290.0 292.5 295.0 297.5 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float32 ... Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
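The result keeps all 25 latitudes because a positional slice that runs past the end of a dimension is simply truncated, as in NumPy. This can be checked without downloading anything; the `toy` dataset below is a hypothetical stand-in that only mimics the latitude/longitude grid of the tutorial data, with placeholder values.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in with the same lat/lon grid as the tutorial dataset
lat = np.arange(75, 14, -2.5)    # 75.0 down to 15.0, 25 values
lon = np.arange(200, 331, 2.5)   # 200.0 up to 330.0, 53 values
toy = xr.Dataset(
    {"air": (("lat", "lon"), np.zeros((lat.size, lon.size)))},  # placeholder data
    coords={"lat": lat, "lon": lon},
)

subset = toy.isel(lat=slice(None, 30), lon=slice(20, 40))

print(subset.sizes["lat"], subset.sizes["lon"])     # 25 20
print(float(subset.lon[0]), float(subset.lon[-1]))  # 250.0 297.5
```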
Select all data at 75 degrees north and between Jan 1, 2013 and Oct 15, 2013
ds.sel(lat=75, time=slice("2013-01-01", "2013-10-15"))
<xarray.Dataset> Dimensions: (time: 1152, lon: 53) Coordinates: lat float32 75.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2013-10-15T18:00:00 Data variables: air (time, lon) float32 ... Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
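The time axis of the result ends at 2013-10-15T18:00:00 because a date-only end label covers that entire day. The following sketch reproduces the count of 1152 time steps on a hypothetical stand-in dataset (`toy`) whose 6-hourly time axis is built to match the one shown in the dataset summary; the `air` values are placeholders.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in: 2920 six-hourly timestamps starting on 2013-01-01
time = pd.date_range("2013-01-01", periods=2920, freq=pd.Timedelta(hours=6))
toy = xr.Dataset(
    {"air": (("time",), np.zeros(time.size))},  # placeholder data
    coords={"time": time},
)

subset = toy.sel(time=slice("2013-01-01", "2013-10-15"))

# 288 days times 4 samples per day
print(subset.sizes["time"])                   # 1152
print(pd.Timestamp(subset.time.values[-1]))   # 2013-10-15 18:00:00
```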
Remove all entries at 260 and 270 degrees longitude
ds.drop_sel(lon=[260, 270])
<xarray.Dataset> Dimensions: (lat: 25, time: 2920, lon: 51) Coordinates: * lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0 * lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0 * time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float32 ... Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...