Indexing and Selecting Data#
Learning Objectives#
Select data by position using .isel with values or slices
Select data by label using .sel with values or slices
Select timeseries data by date/time with values or slices
Use nearest-neighbor lookups with .sel
Why do we need label-based indexing?#
Scientific data is inherently labeled. For example, time series data includes timestamps that label individual periods or points in time, spatial data has coordinates (e.g. longitude, latitude, elevation), and model or laboratory experiments are often identified by unique identifiers.
import xarray as xr
%config InlineBackend.figure_format='retina'
ds = xr.open_dataset("../../data/sst.mnmean.nc")
ds
<xarray.Dataset> Size: 8MB
Dimensions: (lat: 89, lon: 180, time: 128)
Coordinates:
* lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
* lon (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
* time (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Data variables:
sst (time, lat, lon) float32 8MB ...
Attributes: (12/37)
climatology: Climatology is based on 1971-2000 SST, Xue, Y....
description: In situ data: ICOADS2.5 before 2007 and NCEP i...
keywords_vocabulary: NASA Global Change Master Directory (GCMD) Sci...
keywords: Earth Science > Oceans > Ocean Temperature > S...
instrument: Conventional thermometers
source_comment: SSTs were observed by conventional thermometer...
... ...
creator_url_original: https://www.ncei.noaa.gov
license: No constraints on data access or use
comment: SSTs were observed by conventional thermometer...
summary: ERSST.v5 is developed based on v4 after revisi...
dataset_title: NOAA Extended Reconstructed SST V5
data_modified: 2020-09-07
NumPy Positional Indexing#
When working with NumPy arrays, indexing is done purely by position (scalars, slices, or ranges), so you have to keep track of which axis is which yourself.
t = ds["sst"].data # numpy array
t
array([[[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]],
[[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]],
[[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]],
...,
[[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]],
[[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]],
[[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]]],
shape=(128, 89, 180), dtype=float32)
t.shape
(128, 89, 180)
# extract a time-series for one spatial location
t[:, 20, 40]
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
dtype=float32)
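To make the cost of position-only indexing concrete, here is a minimal sketch on a synthetic array. It assumes the same axis order (time, lat, lon) and 2-degree grid as the SST data, but the values are random and the names `i_lat` and `i_lon` are illustrative only. Labels have to be translated into positions by hand before the array can be indexed.

```python
import numpy as np

# Synthetic stand-in for the SST grid (random values, not real data).
lat = np.arange(88.0, -90.0, -2.0)  # 88, 86, ..., -88
lon = np.arange(0.0, 360.0, 2.0)    # 0, 2, ..., 358
rng = np.random.default_rng(0)
t = rng.random((12, lat.size, lon.size)).astype("float32")

# Translate the labels 48N, 80E into integer positions by hand ...
i_lat = int(np.argmax(lat == 48.0))
i_lon = int(np.argmax(lon == 80.0))

# ... and remember that axis 1 is latitude and axis 2 is longitude.
series = t[:, i_lat, i_lon]
print(i_lat, i_lon, series.shape)
```

Nothing in `t` itself records which axis is latitude or what location position 20 refers to; that bookkeeping lives in separate arrays and in the reader's head.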
Indexing with xarray#
xarray offers extremely flexible indexing routines that combine the best features of NumPy and pandas for data selection.
da = ds["sst"] # Extract data array
da
<xarray.DataArray 'sst' (time: 128, lat: 89, lon: 180)> Size: 8MB
array([[[-1.8, -1.8, ..., -1.8, -1.8],
[-1.8, -1.8, ..., -1.8, -1.8],
...,
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]],
[[-1.8, -1.8, ..., -1.8, -1.8],
[-1.8, -1.8, ..., -1.8, -1.8],
...,
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]],
...,
[[-1.8, -1.8, ..., -1.8, -1.8],
[-1.8, -1.8, ..., -1.8, -1.8],
...,
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]],
[[-1.8, -1.8, ..., -1.8, -1.8],
[-1.8, -1.8, ..., -1.8, -1.8],
...,
[ nan, nan, ..., nan, nan],
[ nan, nan, ..., nan, nan]]], shape=(128, 89, 180), dtype=float32)
Coordinates:
* lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
* lon (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
* time (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Attributes:
long_name: Monthly Means of Sea Surface Temperature
units: degC
var_desc: Sea Surface Temperature
level_desc: Surface
statistic: Mean
dataset: NOAA Extended Reconstructed SST V5
parent_stat: Individual Values
actual_range: [-1.8 42.32636]
valid_range: [-1.8 45. ]
NumPy style indexing still works (but preserves the labels/metadata)
da[:, 20, 40]
<xarray.DataArray 'sst' (time: 128)> Size: 512B
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan], dtype=float32)
Coordinates:
lat float32 4B 48.0
lon float32 4B 80.0
* time (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Attributes:
long_name: Monthly Means of Sea Surface Temperature
units: degC
var_desc: Sea Surface Temperature
level_desc: Surface
statistic: Mean
dataset: NOAA Extended Reconstructed SST V5
parent_stat: Individual Values
actual_range: [-1.8 42.32636]
valid_range: [-1.8 45. ]
Positional indexing using dimension names
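A minimal sketch of positional indexing by dimension name with `.isel`. The array `demo` below is a synthetic stand-in built for illustration (all zeros, 2-degree grid, twelve monthly steps), not the SST data itself.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the SST DataArray (zeros instead of real values).
lat = np.arange(88.0, -90.0, -2.0)
lon = np.arange(0.0, 360.0, 2.0)
time = pd.date_range("2010-01-01", periods=12, freq="MS")
demo = xr.DataArray(
    np.zeros((time.size, lat.size, lon.size), dtype="float32"),
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
    name="sst",
)

# Integer positions are passed by dimension name, so axis order no longer matters.
point = demo.isel(lat=20, lon=40)

# Slices follow Python conventions: the end position is excluded.
first_three = demo.isel(time=slice(0, 3))

print(point.dims, float(point.lat), float(point.lon))
print(first_three.sizes["time"])
```

Because the positions are keyed by name, `demo.isel(lon=40, lat=20)` selects the same point regardless of how the dimensions are ordered in memory.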
Label-based indexing
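With `.sel`, selections are written in terms of coordinate values instead of integer positions. The sketch below uses a synthetic array (`demo`, an assumption for illustration, with zeros in place of real temperatures) to show three common forms: exact labels, partial date strings, and label slices.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: 24 monthly steps covering 2019 and 2020.
lat = np.arange(88.0, -90.0, -2.0)
lon = np.arange(0.0, 360.0, 2.0)
time = pd.date_range("2019-01-01", periods=24, freq="MS")
demo = xr.DataArray(
    np.zeros((time.size, lat.size, lon.size), dtype="float32"),
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
)

# Exact labels for lat/lon; a partial date string selects every step in that year.
one_year = demo.sel(lat=50.0, lon=200.0, time="2020")

# Label slices include BOTH endpoints, unlike positional slices.
window = demo.sel(time=slice("2019-05", "2020-07"))

# lat is stored from north to south, so the slice runs from high to low.
band = demo.sel(lat=slice(10, -10))

print(one_year.sizes["time"], window.sizes["time"], band.sizes["lat"])
```

The inclusive endpoints explain why May 2019 through July 2020 yields 15 time steps rather than 14.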
da.sel(lat=50.0, lon=200.0, time="2020")
<xarray.DataArray 'sst' (time: 8)> Size: 32B
array([ 5.501727, 5.015851, 4.808821, 5.837058, 7.285223, 8.64473 ,
11.524967, 12.405846], dtype=float32)
Coordinates:
lat float32 4B 50.0
lon float32 4B 200.0
* time (time) datetime64[ns] 64B 2020-01-01 2020-02-01 ... 2020-08-01
Attributes:
long_name: Monthly Means of Sea Surface Temperature
units: degC
var_desc: Sea Surface Temperature
level_desc: Surface
statistic: Mean
dataset: NOAA Extended Reconstructed SST V5
parent_stat: Individual Values
actual_range: [-1.8 42.32636]
valid_range: [-1.8 45. ]
# demonstrate slicing
ds.sel(time=slice("2019-05", "2020-07"))
<xarray.Dataset> Size: 962kB
Dimensions: (lat: 89, lon: 180, time: 15)
Coordinates:
* lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
* lon (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
* time (time) datetime64[ns] 120B 2019-05-01 2019-06-01 ... 2020-07-01
Data variables:
sst (time, lat, lon) float32 961kB -1.8 -1.8 -1.8 -1.8 ... nan nan nan
Attributes: (12/37)
climatology: Climatology is based on 1971-2000 SST, Xue, Y....
description: In situ data: ICOADS2.5 before 2007 and NCEP i...
keywords_vocabulary: NASA Global Change Master Directory (GCMD) Sci...
keywords: Earth Science > Oceans > Ocean Temperature > S...
instrument: Conventional thermometers
source_comment: SSTs were observed by conventional thermometer...
... ...
creator_url_original: https://www.ncei.noaa.gov
license: No constraints on data access or use
comment: SSTs were observed by conventional thermometer...
summary: ERSST.v5 is developed based on v4 after revisi...
dataset_title: NOAA Extended Reconstructed SST V5
data_modified: 2020-09-07
Nearest Neighbor Lookups
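By default `.sel` requires an exact match, so a coordinate value that is not on the grid raises a `KeyError`. Passing `method="nearest"` snaps the request to the closest grid point, and the optional `tolerance` argument limits how far that snap may go. A minimal sketch on a synthetic 2-degree grid (the array `demo` and the flag names are illustrative assumptions):

```python
import numpy as np
import xarray as xr

# Synthetic 2-degree grid filled with zeros.
lat = np.arange(88.0, -90.0, -2.0)
lon = np.arange(0.0, 360.0, 2.0)
demo = xr.DataArray(
    np.zeros((lat.size, lon.size), dtype="float32"),
    coords={"lat": lat, "lon": lon},
    dims=("lat", "lon"),
)

# An off-grid label fails when an exact match is required.
try:
    demo.sel(lat=52.25, lon=251.8998)
    exact_lookup_failed = False
except KeyError:
    exact_lookup_failed = True

# method="nearest" snaps to the closest grid point instead.
nearest = demo.sel(lat=52.25, lon=251.8998, method="nearest")

# tolerance bounds the snap distance; 52.25 is 0.25 degrees from the grid.
try:
    demo.sel(lat=52.25, method="nearest", tolerance=0.1)
    within_tolerance = True
except KeyError:
    within_tolerance = False

print(exact_lookup_failed, float(nearest.lat), float(nearest.lon), within_tolerance)
```

Checking the returned `lat` and `lon` coordinates, as the last line does, is a cheap way to confirm which grid point was actually selected.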
da.sel(lat=52.25, lon=251.8998, method="nearest")
<xarray.DataArray 'sst' (time: 128)> Size: 512B
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan], dtype=float32)
Coordinates:
lat float32 4B 52.0
lon float32 4B 252.0
* time (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Attributes:
long_name: Monthly Means of Sea Surface Temperature
units: degC
var_desc: Sea Surface Temperature
level_desc: Surface
statistic: Mean
dataset: NOAA Extended Reconstructed SST V5
parent_stat: Individual Values
actual_range: [-1.8 42.32636]
valid_range: [-1.8 45. ]
All of these indexing methods work on the dataset too:
ds.sel(lat=52.25, lon=251.8998, method="nearest")
<xarray.Dataset> Size: 2kB
Dimensions: (time: 128)
Coordinates:
lat float32 4B 52.0
lon float32 4B 252.0
* time (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Data variables:
sst (time) float32 512B nan nan nan nan nan nan ... nan nan nan nan nan
Attributes: (12/37)
climatology: Climatology is based on 1971-2000 SST, Xue, Y....
description: In situ data: ICOADS2.5 before 2007 and NCEP i...
keywords_vocabulary: NASA Global Change Master Directory (GCMD) Sci...
keywords: Earth Science > Oceans > Ocean Temperature > S...
instrument: Conventional thermometers
source_comment: SSTs were observed by conventional thermometer...
... ...
creator_url_original: https://www.ncei.noaa.gov
license: No constraints on data access or use
comment: SSTs were observed by conventional thermometer...
summary: ERSST.v5 is developed based on v4 after revisi...
dataset_title: NOAA Extended Reconstructed SST V5
data_modified: 2020-09-07
Vectorized Indexing#
Like numpy and pandas, xarray supports indexing many array elements at once in a vectorized manner:
# generate coordinates for a transect of points
lat_points = xr.DataArray([60, 80, 90], dims="points")
lon_points = xr.DataArray([250, 250, 250], dims="points")
lat_points
<xarray.DataArray (points: 3)> Size: 24B
array([60, 80, 90])
Dimensions without coordinates: points
lon_points
<xarray.DataArray (points: 3)> Size: 24B
array([250, 250, 250])
Dimensions without coordinates: points
# nearest neighbor selection along the transect
da.sel(lat=lat_points, lon=lon_points, method="nearest").plot();
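The sketch below, on a synthetic grid (the array `demo` is an illustrative stand-in, not the SST data), contrasts the two ways indexers can combine. DataArray indexers that share a dimension are paired point by point, while plain lists select every combination of the given labels.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the SST DataArray.
lat = np.arange(88.0, -90.0, -2.0)
lon = np.arange(0.0, 360.0, 2.0)
time = pd.date_range("2010-01-01", periods=4, freq="MS")
demo = xr.DataArray(
    np.zeros((time.size, lat.size, lon.size), dtype="float32"),
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
)

# Indexers sharing the "points" dimension are matched pairwise: 3 locations.
lat_points = xr.DataArray([60, 80, 90], dims="points")
lon_points = xr.DataArray([250, 250, 250], dims="points")
transect = demo.sel(lat=lat_points, lon=lon_points, method="nearest")

# Plain lists give the outer product instead: a 2 x 3 block of grid cells.
block = demo.sel(lat=[60.0, 80.0], lon=[250.0, 252.0, 254.0])

print(dict(transect.sizes), transect.lat.values)
print(dict(block.sizes))
```

Note that latitude 90 is not on this grid, so the nearest-neighbor lookup returns the point at 88 for the last transect location.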
Indexing with where()#
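`where()` keeps values where a boolean condition is true and replaces the rest, by default with NaN, so the result has the same shape as the input unless `drop=True` is passed. A minimal sketch on a tiny hand-made array (the values and names are illustrative only):

```python
import numpy as np
import xarray as xr

# Tiny array with two missing values.
demo = xr.DataArray(
    [[1.0, np.nan], [np.nan, 4.0]],
    coords={"lat": [10.0, 0.0], "lon": [100.0, 102.0]},
    dims=("lat", "lon"),
)

# Keep valid values, replace everything else with a placeholder.
filled = demo.where(demo.notnull(), -99)

# Without a second argument, masked-out values become NaN.
masked = demo.where(demo > 2)

# drop=True also trims rows/columns where the condition is false everywhere.
dropped = demo.where(demo > 2, drop=True)

print(filled.values.tolist())
print(masked.shape, dropped.shape)
```

For the specific case of replacing missing values, `demo.fillna(-99)` gives the same result as the first call.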
# Let's replace the missing values (nan) with some placeholder
ds.sst.where(ds.sst.notnull(), -99)
<xarray.DataArray 'sst' (time: 128, lat: 89, lon: 180)> Size: 8MB
array([[[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[-99. , -99. , -99. , ..., -99. , -99. , -99. ],
[-99. , -99. , -99. , ..., -99. , -99. , -99. ],
[-99. , -99. , -99. , ..., -99. , -99. , -99. ]],
[[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[-99. , -99. , -99. , ..., -99. , -99. , -99. ],
[-99. , -99. , -99. , ..., -99. , -99. , -99. ],
[-99. , -99. , -99. , ..., -99. , -99. , -99. ]],
[[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
...
[-99. , -99. , -99. , ..., -99. , -99. , -99. ],
[-99. , -99. , -99. , ..., -99. , -99. , -99. ],
[-99. , -99. , -99. , ..., -99. , -99. , -99. ]],
[[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[-99. , -99. , -99. , ..., -99. , -99. , -99. ],
[-99. , -99. , -99. , ..., -99. , -99. , -99. ],
[-99. , -99. , -99. , ..., -99. , -99. , -99. ]],
[[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
[ -1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
...,
[-99. , -99. , -99. , ..., -99. , -99. , -99. ],
[-99. , -99. , -99. , ..., -99. , -99. , -99. ],
[-99. , -99. , -99. , ..., -99. , -99. , -99. ]]],
shape=(128, 89, 180), dtype=float32)
Coordinates:
* lat (lat) float32 356B 88.0 86.0 84.0 82.0 ... -82.0 -84.0 -86.0 -88.0
* lon (lon) float32 720B 0.0 2.0 4.0 6.0 8.0 ... 352.0 354.0 356.0 358.0
* time (time) datetime64[ns] 1kB 2010-01-01 2010-02-01 ... 2020-08-01
Attributes:
long_name: Monthly Means of Sea Surface Temperature
units: degC
var_desc: Sea Surface Temperature
level_desc: Surface
statistic: Mean
dataset: NOAA Extended Reconstructed SST V5
parent_stat: Individual Values
actual_range: [-1.8 42.32636]
valid_range: [-1.8 45. ]
Going Further#

