Xarray’s Data structures#
In this lesson, we cover the basics of Xarray data structures. By the end of the lesson, we will be able to:
Learning Goals
Understand the basic Xarray data structures DataArray and Dataset
Customize the display of Xarray data structures
Understand the connection between pandas and Xarray data structures
Data structures#
Xarray provides two data structures: the DataArray and the Dataset. The DataArray class attaches dimension names, coordinates and attributes to multi-dimensional arrays, while Dataset combines multiple DataArrays.
Both classes are most commonly created by reading data. To learn how to create a DataArray or Dataset manually, see the Creating Data Structures tutorial.
import numpy as np
import xarray as xr
import pandas as pd
# When working in a Jupyter Notebook you might want to customize Xarray display settings to your liking
# The following settings reduce the amount of data displayed by default
xr.set_options(display_expand_attrs=False, display_expand_data=False)
np.set_printoptions(threshold=10, edgeitems=2)
Dataset#
Dataset objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.
Xarray has a few small real-world tutorial datasets hosted in the pydata/xarray-data GitHub repository.
We’ll use the xarray.tutorial.load_dataset convenience function to download and open the air_temperature (National Centers for Environmental Prediction) Dataset by name.
ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset> Size: 31MB
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes: (5)
We can access “layers” of the Dataset (individual DataArrays) with dictionary syntax
ds["air"]
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)
We can save some typing by using the “attribute” or “dot” notation. This won’t work for variable names that clash with built-in method names (for example, mean).
ds.air
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)
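As a minimal sketch of the dictionary-like behaviour and of the name-clash caveat, the example below builds a tiny synthetic Dataset instead of downloading the tutorial data. The object name `toy`, the variable names and all values are made up for illustration.

```python
import numpy as np
import xarray as xr

# Tiny synthetic Dataset; names and values are illustrative only
toy = xr.Dataset(
    {
        "air": (("time", "lat"), np.zeros((2, 3))),
        "mean": (("time", "lat"), np.ones((2, 3))),  # clashes with Dataset.mean
    }
)

# Dictionary-style behaviour: membership tests and iteration over variable names
has_air = "air" in toy
names = list(toy.data_vars)

# Dot notation returns the same variable as bracket notation
# when the name is not already taken by a method or property
same_object = toy.air.identical(toy["air"])

# "mean" is shadowed by the built-in method, so only bracket access reaches the variable
dot_is_method = callable(toy.mean)
bracket_is_variable = isinstance(toy["mean"], xr.DataArray)

print(names, same_object, dot_is_method, bracket_is_variable)
```

Bracket notation is therefore the safer choice in reusable code, while dot notation is mainly a convenience for interactive work.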
HTML vs text representations#
Xarray has two representation types: "html" (which is only available in notebooks) and "text". To choose between them, use the display_style option.
So far, our notebook has automatically displayed the "html" representation (which we will continue using). The "html" representation is interactive, allowing you to collapse sections (▶) and view attributes and values for each variable (📄 and ≡).
with xr.set_options(display_style="html"):
    display(ds)
<xarray.Dataset> Size: 31MB
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes: (5)
☝️ From top to bottom the output consists of:
Dimensions: summary of all dimensions of the Dataset (lat: 25, time: 2920, lon: 53): this tells us that the first dimension is named lat and has a size of 25, the second dimension is named time and has a size of 2920, and the third dimension is named lon and has a size of 53. Because we will access the dimensions by name, the order doesn’t matter.
Coordinates: an unordered list of coordinates or dimensions with coordinates, with one item per line. Each item has a name, one or more dimensions in parentheses, a dtype and a preview of the values. If it is a dimension coordinate, it is printed in bold font. Non-dimension coordinates appear in plain font (there are none in this example, but you might imagine a ‘mask’ coordinate that has a value assigned at every point).
Data variables: names of each nD measurement in the dataset, followed by its dimensions (time, lat, lon), dtype, and a preview of values.
Indexes: each dimension with coordinates is backed by an “Index”. In this example, each dimension is backed by a PandasIndex.
Attributes: an unordered list of metadata (for example, a paragraph describing the dataset).
Compare that with the string representation, which is very similar except that dimension coordinates are given a * prefix instead of bold and you cannot collapse or expand the outputs.
with xr.set_options(display_style="text"):
    display(ds)
<xarray.Dataset> Size: 31MB
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes: (5)
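The scoping of the display_style option can be sketched as follows: used as a context manager, xr.set_options changes the setting only inside the with block. The synthetic Dataset `toy` is an assumption made for illustration, and xr.get_options is used here only to inspect the current setting.

```python
import numpy as np
import xarray as xr

# Small synthetic Dataset, used only to have something to display
toy = xr.Dataset({"air": (("time",), np.arange(3.0))})

before = xr.get_options()["display_style"]

# The option is changed only for the duration of the with block
with xr.set_options(display_style="text"):
    inside = xr.get_options()["display_style"]

after = xr.get_options()["display_style"]

# repr() returns the plain-text form regardless of display_style
text = repr(toy)
print(inside, after == before)
print(text.splitlines()[0])
```

Calling xr.set_options(...) without a with block, as done at the top of this lesson, changes the setting for the rest of the session instead.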
To understand each of the components better, we’ll explore the “air” variable of our Dataset.
DataArray#
The DataArray class consists of an array (data) and its associated dimension names, labels, and attributes (metadata).
da = ds["air"]
da
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)
String representations#
We can use the same two representations ("html", which is only available in notebooks, and "text") to display our DataArray.
with xr.set_options(display_style="html"):
    display(da)
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)
with xr.set_options(display_style="text"):
    display(da)
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)
In the string representation of a DataArray (versus a Dataset), we also see:
the DataArray name (‘air’)
a preview of the array data (collapsible in the "html" representation)
We can also access the underlying array directly:
ds.air.data # (or equivalently, `da.data`)
array([[[241.2 , 242.5 , ..., 235.5 , 238.6 ],
[243.8 , 244.5 , ..., 235.3 , 239.3 ],
...,
[295.9 , 296.2 , ..., 295.9 , 295.2 ],
[296.29, 296.79, ..., 296.79, 296.6 ]],
[[242.1 , 242.7 , ..., 233.6 , 235.8 ],
[243.6 , 244.1 , ..., 232.5 , 235.7 ],
...,
[296.2 , 296.7 , ..., 295.5 , 295.1 ],
[296.29, 297.2 , ..., 296.4 , 296.6 ]],
...,
[[245.79, 244.79, ..., 243.99, 244.79],
[249.89, 249.29, ..., 242.49, 244.29],
...,
[296.29, 297.19, ..., 295.09, 294.39],
[297.79, 298.39, ..., 295.49, 295.19]],
[[245.09, 244.29, ..., 241.49, 241.79],
[249.89, 249.29, ..., 240.29, 241.69],
...,
[296.09, 296.89, ..., 295.69, 295.19],
[297.69, 298.09, ..., 296.19, 295.69]]])
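A minimal sketch of what `.data` returns, using a small synthetic DataArray (the name `small` and its values are illustrative). `.data` hands back the wrapped array as it is stored, here a NumPy array, while `.values` always converts to a NumPy array.

```python
import numpy as np
import xarray as xr

# Small synthetic DataArray; values are illustrative only
small = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("lat", "lon"), name="air")

raw = small.data          # the wrapped array, returned without conversion
as_numpy = small.values   # always a numpy.ndarray

print(type(raw).__name__, raw.shape, raw.dtype)
```

The plain array carries no dimension names or coordinates, so any label-based operation has to happen on the DataArray before dropping down to `.data`.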
Named dimensions#
.dims are the named axes of your data. They may (dimension coordinates) or may not (dimensions without coordinates) have associated values. Names can be anything that fits into a Python set (i.e. calling hash() on it doesn’t raise an error), but to be useful they should be strings.
In this case we have two spatial dimensions (latitude and longitude are stored with shorthand names lat and lon) and one temporal dimension (time).
ds.air.dims
('time', 'lat', 'lon')
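To illustrate why named dimensions are useful, the sketch below reduces and reorders a small synthetic array by dimension name rather than by axis position. The array `small` and its shape are assumptions made for the example.

```python
import numpy as np
import xarray as xr

# Synthetic array with the same dimension names as the air temperature data
small = xr.DataArray(np.arange(24.0).reshape(2, 3, 4), dims=("time", "lat", "lon"))

sizes = dict(small.sizes)  # mapping from dimension name to length

# Reduce over a dimension by name instead of remembering its axis number
time_mean = small.mean(dim="time")

# Reorder axes by name
reordered = small.transpose("lon", "lat", "time")

print(small.dims, sizes)
print(time_mean.dims, reordered.shape)
```

The result of `small.mean(dim="time")` matches a NumPy reduction over axis 0 of the underlying array, but the code keeps working if the axis order of the input changes.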
Coordinates#
.coords is a simple dict-like data container for mapping coordinate names to values. These values can be:
another DataArray object
a tuple of the form (dims, data, attrs) where attrs is optional. This is roughly equivalent to creating a new DataArray object with DataArray(dims=dims, data=data, attrs=attrs)
a 1-dimensional numpy array (or anything that can be coerced to one using numpy.array, such as a list) containing numbers, datetime objects, strings, etc. to label each point.
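The three accepted forms listed above can be sketched in a single constructor call. Everything here is synthetic: the names `grid`, `lat_coord` and `region`, the values, and the attribute text are assumptions made for illustration.

```python
import numpy as np
import xarray as xr

# Form 1: another DataArray object
lat_coord = xr.DataArray([10.0, 20.0], dims="lat")

grid = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={
        "lat": lat_coord,
        # Form 3: a plain list, coerced to a 1-dimensional NumPy array
        "lon": [100.0, 110.0, 120.0],
        # Form 2: a (dims, data, attrs) tuple; attrs is optional
        "region": ("lon", ["west", "mid", "east"], {"note": "hypothetical label"}),
    },
)

coord_names = set(grid.coords)
print(grid.coords)
```

Here `lat` and `lon` share their names with dimensions and so become dimension coordinates, while `region` is attached to the `lon` dimension as a non-dimension coordinate.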
Here we see the actual timestamps and spatial positions of our air temperature data:
ds.air.coords
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
The difference between the dimension labels (dimension coordinates) and normal coordinates is that, for now, indexing operations (sel, reindex, etc.) are only possible with dimension coordinates. Also, while coordinates can have arbitrary dimensions, dimension coordinates have to be one-dimensional.
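A short sketch of label-based indexing with dimension coordinates, on a synthetic grid (the names `grid` and `region` and all values are illustrative). For the non-dimension coordinate, the example filters with a boolean condition through where rather than sel, which is one possible approach and not the only one.

```python
import numpy as np
import xarray as xr

grid = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={
        "lat": [10.0, 20.0],
        "lon": [100.0, 110.0, 120.0],
        "region": ("lon", ["west", "mid", "east"]),  # non-dimension coordinate
    },
)

# Select by coordinate label along dimension coordinates
row = grid.sel(lat=20.0)
point = grid.sel(lat=20.0, lon=110.0)

# Inexact labels can be matched to the nearest available one
nearest = grid.sel(lon=112.0, method="nearest")

# Filter on the non-dimension coordinate with a boolean condition
east = grid.where(grid["region"] == "east", drop=True)

print(row.dims, point.item(), nearest["lon"].item(), east.shape)
```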
Attributes#
.attrs is a dictionary that can contain arbitrary Python objects (strings, lists, integers, dictionaries, etc.) containing information about your data. Your only limitation is that some attributes may not be writeable to certain file formats.
ds.air.attrs
{'long_name': '4xDaily Air temperature at sigma level 995',
'units': 'degK',
'precision': 2,
'GRIB_id': 11,
'GRIB_name': 'TMP',
'var_desc': 'Air temperature',
'dataset': 'NMC Reanalysis',
'level_desc': 'Surface',
'statistic': 'Individual Obs',
'parent_stat': 'Other',
'actual_range': array([185.16, 322.1 ], dtype=float32)}
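Because .attrs is an ordinary dictionary, it can be edited like one. The sketch below uses a synthetic DataArray; the attribute keys and values are made up for illustration.

```python
import numpy as np
import xarray as xr

small = xr.DataArray(np.arange(3.0), dims="x", name="air", attrs={"units": "degK"})

# In-place edit, exactly as with a regular dict
small.attrs["long_name"] = "Air temperature"

# assign_attrs returns a new object and leaves the original untouched
labelled = small.assign_attrs(source="synthetic example")

print(small.attrs)
print(labelled.attrs)
```

The non-mutating assign_attrs form is convenient in method chains, while direct dictionary edits are simpler for one-off changes.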
To Pandas and back#
DataArray and Dataset objects are frequently created by converting from other libraries such as pandas or by reading from data storage formats such as NetCDF or Zarr.
To convert from / to pandas, we can use the to_xarray methods on pandas objects or the to_pandas methods on xarray objects:
series = pd.Series(np.ones((10,)), index=list("abcdefghij"))
series
a 1.0
b 1.0
c 1.0
d 1.0
e 1.0
f 1.0
g 1.0
h 1.0
i 1.0
j 1.0
dtype: float64
arr = series.to_xarray()
arr
<xarray.DataArray (index: 10)> Size: 80B 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Coordinates: * index (index) object 80B 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j'
arr.to_pandas()
index
a 1.0
b 1.0
c 1.0
d 1.0
e 1.0
f 1.0
g 1.0
h 1.0
i 1.0
j 1.0
dtype: float64
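The round trip can also be sketched with a named index, which shows where the dimension name comes from: an unnamed index becomes a dimension called `index` (as above), while a named index lends its name to the dimension. The names `letter` and `score` are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

scores = pd.Series(
    np.arange(3.0),
    index=pd.Index(["a", "b", "c"], name="letter"),
    name="score",
)

# The index name becomes the dimension name, the Series name the DataArray name
arr_named = scores.to_xarray()

# A 1-dimensional DataArray converts back to a pandas.Series
back = arr_named.to_pandas()

print(arr_named.dims, arr_named.name)
print(type(back).__name__)
```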
We can also control what pandas object is used by calling to_series / to_dataframe:
to_series#
This will always convert DataArray objects to pandas.Series, using a MultiIndex for higher dimensions.
ds.air.to_series()
time lat lon
2013-01-01 00:00:00 75.0 200.0 241.20
202.5 242.50
205.0 243.50
207.5 244.00
210.0 244.10
...
2014-12-31 18:00:00 15.0 320.0 297.39
322.5 297.19
325.0 296.49
327.5 296.19
330.0 295.69
Name: air, Length: 3869000, dtype: float64
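On a small synthetic grid the MultiIndex is easier to inspect than on the full dataset. In this sketch (names and values are illustrative), each dimension becomes one level of the index, and a value is looked up by a tuple of coordinate labels.

```python
import numpy as np
import pandas as pd
import xarray as xr

grid = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [10.0, 20.0], "lon": [100.0, 110.0, 120.0]},
    name="air",
)

flat = grid.to_series()

# One index level per dimension, one row per array element
levels = list(flat.index.names)
value = flat.loc[(20.0, 110.0)]

# The MultiIndex carries enough information to rebuild the array
restored = flat.to_xarray()

print(levels, len(flat), value)
```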
to_dataframe#
This will always convert DataArray or Dataset objects to a pandas.DataFrame. Note that DataArray objects have to be named for this. Since columns in a DataFrame need to have the same index, variables with fewer dimensions are broadcast against the full index.
ds.air.to_dataframe()
| time | lat | lon | air |
|---|---|---|---|
| 2013-01-01 00:00:00 | 75.0 | 200.0 | 241.20 |
| | | 202.5 | 242.50 |
| | | 205.0 | 243.50 |
| | | 207.5 | 244.00 |
| | | 210.0 | 244.10 |
| ... | ... | ... | ... |
| 2014-12-31 18:00:00 | 15.0 | 320.0 | 297.39 |
| | | 322.5 | 297.19 |
| | | 325.0 | 296.49 |
| | | 327.5 | 296.19 |
| | | 330.0 | 295.69 |

3869000 rows × 1 columns
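The broadcasting mentioned above only becomes visible when variables have different dimensions. The sketch below uses a synthetic Dataset in which `temp` depends on (x, y) but `elev` depends on x alone; all names and values are assumptions made for illustration. It also shows the `name` argument of to_dataframe for an unnamed DataArray.

```python
import numpy as np
import xarray as xr

toy = xr.Dataset(
    {
        "temp": (("x", "y"), np.arange(6.0).reshape(2, 3)),
        "elev": ("x", [100.0, 200.0]),  # one value per x, no y dimension
    },
    coords={"x": [0, 1], "y": [10, 20, 30]},
)

# elev is repeated for every y so that both columns share the (x, y) index
df = toy.to_dataframe()
print(df)

# An unnamed DataArray needs a column name supplied at conversion time
unnamed = xr.DataArray([1.0, 2.0], dims="x")
df_named = unnamed.to_dataframe(name="v")
```

Broadcasting can inflate memory use considerably for large datasets, so converting a selected subset is often preferable to converting everything.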