Xarray’s Data structures#
In this lesson, we cover the basics of Xarray data structures. By the end of the lesson, we will be able to:
Learning Goals
Understand the basic Xarray data structures DataArray and Dataset
Customize the display of Xarray data structures
Understand the connection between Pandas and Xarray data structures
Data structures#
Xarray provides two data structures: the DataArray and Dataset. The
DataArray class attaches dimension names, coordinates and attributes to
multi-dimensional arrays while Dataset combines multiple DataArrays.
Both classes are most commonly created by reading data. To learn how to create a DataArray or Dataset manually, see the Creating Data Structures tutorial.
import numpy as np
import xarray as xr
import pandas as pd
# When working in a Jupyter Notebook you might want to customize Xarray display settings to your liking.
# The following settings reduce the amount of data displayed by default.
xr.set_options(display_expand_attrs=False, display_expand_data=False)
np.set_printoptions(threshold=10, edgeitems=2)
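The options above are set globally. As a minimal sketch, the same function can also be used as a context manager, in which case the previous value is restored when the block ends. The example assumes a reasonably recent Xarray in which `xr.get_options` is available; the width of 40 is an arbitrary illustrative value.

```python
import xarray as xr

# Remember the current setting so we can compare afterwards.
width_before = xr.get_options()["display_width"]

# Used as a context manager, the change applies only inside the block.
with xr.set_options(display_width=40):
    width_inside = xr.get_options()["display_width"]

# After the block, the earlier value is back in place.
width_after = xr.get_options()["display_width"]

print(width_before, width_inside, width_after)
```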
Dataset#
Dataset objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.
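To make the dictionary analogy concrete, here is a small sketch using a hand-built Dataset. The variable names `temperature` and `pressure` and their values are hypothetical and only serve as an illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: two variables sharing the dimensions ("x", "y").
temperature = xr.DataArray(np.zeros((2, 3)), dims=("x", "y"))
pressure = xr.DataArray(np.ones((2, 3)), dims=("x", "y"))

toy = xr.Dataset({"temperature": temperature, "pressure": pressure})

# Dictionary-style operations work on the data variables.
print(list(toy))             # variable names
print("pressure" in toy)     # membership test
for name, variable in toy.items():
    print(name, variable.dims)

# Indexing by name returns the DataArray stored under that key.
first = toy["temperature"]
```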
Xarray has a few small real-world tutorial datasets hosted in the pydata/xarray-data GitHub repository.
We’ll use the xarray.tutorial.load_dataset convenience function to download and open the air_temperature (National Centers for Environmental Prediction) Dataset by name.
ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset> Size: 31MB
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes: (5)
We can access “layers” of the Dataset (individual DataArrays) with dictionary syntax:
ds["air"]
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)
We can save some typing by using the “attribute” or “dot” notation. This won’t work for variable names that clash with built-in
method names (for example, mean).
ds.air
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)
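The name clash mentioned above can be reproduced with a small sketch. The Dataset below is hypothetical: it deliberately contains a variable called `mean` to show that attribute access then returns the Dataset method, while dictionary syntax still reaches the variable.

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset with one variable whose name shadows a method.
toy = xr.Dataset(
    {
        "air": ("x", np.array([1.0, 2.0, 3.0])),
        "mean": ("x", np.array([10.0, 20.0, 30.0])),
    }
)

# Both notations reach the "air" variable.
same = toy.air.identical(toy["air"])

# "mean" is also the name of a Dataset method, so attribute access
# returns the method; only dictionary syntax reaches the variable.
attribute_is_method = callable(toy.mean)
variable = toy["mean"]

print(same, attribute_is_method, type(variable).__name__)
```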
HTML vs text representations#
Xarray has two representation types: "html" (which is only available in
notebooks) and "text". To choose between them, use the display_style option.
So far, our notebook has automatically displayed the "html" representation (which we will continue using).
The "html" representation is interactive, allowing you to collapse sections (▶) and
view attributes and values for each variable (📄 and ≡).
with xr.set_options(display_style="html"):
    display(ds)
<xarray.Dataset> Size: 31MB
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes: (5)
☝️ From top to bottom the output consists of:
Dimensions: a summary of all dimensions of the Dataset (lat: 25, time: 2920, lon: 53). This tells us that the first dimension is named lat and has a size of 25, the second dimension is named time and has a size of 2920, and the third dimension is named lon and has a size of 53. Because we will access the dimensions by name, the order doesn’t matter.
Coordinates: an unordered list of coordinates, with one item per line. Each item has a name, one or more dimensions in parentheses, a dtype and a preview of the values. If it is a dimension coordinate, it is printed in bold font. Non-dimension coordinates appear in plain font (there are none in this example, but you might imagine a ‘mask’ coordinate that has a value assigned at every point).
Data variables: names of each nD measurement in the dataset, followed by its dimensions (time, lat, lon), dtype, and a preview of values.
Indexes: each dimension with coordinates is backed by an “Index”. In this example, each dimension is backed by a PandasIndex.
Attributes: an unordered list of metadata (for example, a paragraph describing the dataset).
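The different kinds of entries described in this list can be seen side by side in a small hand-built Dataset. This is only a sketch: the names `signal` and `mask` and all values are hypothetical. The dimension `x` has a dimension coordinate, `mask` is a non-dimension coordinate, and `y` is a dimension without coordinates.

```python
import numpy as np
import xarray as xr

toy = xr.Dataset(
    data_vars={"signal": (("x", "y"), np.arange(6.0).reshape(3, 2))},
    coords={
        # Dimension coordinate: same name as the dimension it labels.
        "x": [10, 20, 30],
        # Non-dimension coordinate: a value at every (x, y) point.
        "mask": (("x", "y"), np.ones((3, 2), dtype=bool)),
    },
)

print(toy)                  # text repr; "y" has no coordinate
print(list(toy.indexes))    # only "x" is backed by an index
```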
Compare that with the text representation, which is very similar except that dimension coordinates are marked with a * prefix instead of bold font, and you cannot collapse or expand the outputs.
with xr.set_options(display_style="text"):
    display(ds)
<xarray.Dataset> Size: 31MB
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes: (5)
To understand each of the components better, we’ll explore the “air” variable of our Dataset.
DataArray#
The DataArray class consists of an array (data) and its associated dimension names, labels, and attributes (metadata).
da = ds["air"]
da
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)
String representations#
We can use the same two representations ("html", which is only available in
notebooks, and "text") to display our DataArray.
with xr.set_options(display_style="html"):
    display(da)
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)
with xr.set_options(display_style="text"):
    display(da)
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)
In the string representation of a DataArray (versus a Dataset), we also see:
the DataArray name (‘air’)
a preview of the array data (collapsible in the "html" representation)
We can also access the data array directly:
ds.air.data # (or equivalently, `da.data`)
array([[[241.2 , 242.5 , ..., 235.5 , 238.6 ],
[243.8 , 244.5 , ..., 235.3 , 239.3 ],
...,
[295.9 , 296.2 , ..., 295.9 , 295.2 ],
[296.29, 296.79, ..., 296.79, 296.6 ]],
[[242.1 , 242.7 , ..., 233.6 , 235.8 ],
[243.6 , 244.1 , ..., 232.5 , 235.7 ],
...,
[296.2 , 296.7 , ..., 295.5 , 295.1 ],
[296.29, 297.2 , ..., 296.4 , 296.6 ]],
...,
[[245.79, 244.79, ..., 243.99, 244.79],
[249.89, 249.29, ..., 242.49, 244.29],
...,
[296.29, 297.19, ..., 295.09, 294.39],
[297.79, 298.39, ..., 295.49, 295.19]],
[[245.09, 244.29, ..., 241.49, 241.79],
[249.89, 249.29, ..., 240.29, 241.69],
...,
[296.09, 296.89, ..., 295.69, 295.19],
[297.69, 298.09, ..., 296.19, 295.69]]], shape=(2920, 25, 53))
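As a small sketch of what `.data` gives back, the following uses a hypothetical NumPy-backed DataArray. It also shows `.values`, which always returns a `numpy.ndarray`; for a NumPy-backed array the two hold the same numbers.

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("x", "y"),
    name="toy",
)

# .data is the wrapped array itself, without labels or metadata.
raw = toy.data
print(type(raw), raw.shape)

# .values converts to a numpy.ndarray; here the contents are identical.
as_numpy = toy.values
```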
Named dimensions#
.dims are the named axes of your data. They may (dimension coordinates) or may not (dimensions without coordinates) have associated values. Names can be any hashable object (that is, anything that fits into a Python set because calling hash() on it doesn’t raise an error), but to be
useful they should be strings.
In this case we have 2 spatial dimensions (latitude and longitude are stored with shorthand names lat and lon) and one temporal dimension (time).
ds.air.dims
('time', 'lat', 'lon')
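Named dimensions let us refer to axes by name instead of by position. The sketch below uses a hypothetical array with the same dimension names as the `air` variable and checks that a reduction over `time` gives the same result regardless of the axis order.

```python
import numpy as np
import xarray as xr

# Hypothetical array with the same dimension names as the "air" variable.
toy = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
)

print(toy.sizes)    # mapping from dimension name to length

# Reduce over a dimension by name rather than by axis number.
time_mean = toy.mean(dim="time")

# Reordering the axes does not change the result of a named reduction.
reordered = toy.transpose("lon", "lat", "time")
same_result = time_mean.equals(
    reordered.mean(dim="time").transpose("lat", "lon")
)
print(time_mean.dims, same_result)
```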
Coordinates#
.coords is a simple dict-like data container
for mapping coordinate names to values. These values can be:
another DataArray object
a tuple of the form (dims, data, attrs) where attrs is optional. This is roughly equivalent to creating a new DataArray object with DataArray(dims=dims, data=data, attrs=attrs)
a 1-dimensional numpy array (or anything that can be coerced to one using numpy.array, such as a list) containing numbers, datetime objects, strings, etc. to label each point.
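The three kinds of values listed above can be combined in one constructor call. This is a minimal sketch with hypothetical dimension names and labels, not data from the tutorial dataset.

```python
import numpy as np
import xarray as xr

# 1. Another DataArray object.
x_labels = xr.DataArray([10.0, 20.0], dims="x")

toy = xr.DataArray(
    np.zeros((2, 3, 4)),
    dims=("x", "y", "z"),
    coords={
        "x": x_labels,
        # 2. A tuple of the form (dims, data, attrs).
        "y": ("y", [1, 2, 3], {"units": "m"}),
        # 3. A list that is coerced to a 1-dimensional array.
        "z": ["a", "b", "c", "d"],
    },
)

print(toy.coords)
```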
Here we see the actual timestamps and spatial positions of our air temperature data:
ds.air.coords
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
The difference between the dimension labels (dimension coordinates) and normal
coordinates is that, for now, it is only possible to use indexing operations
(sel, reindex, etc.) with dimension coordinates. Also, while coordinates can
have arbitrary dimensions, dimension coordinates have to be one-dimensional.
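As a brief sketch of the indexing operations mentioned above, the following applies `sel` and `reindex` to a hypothetical one-dimensional array whose dimension `x` has a dimension coordinate.

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.array([1.0, 2.0, 3.0]),
    dims="x",
    coords={"x": [10, 20, 30]},
)

# Label-based selection uses the values of the dimension coordinate.
picked = toy.sel(x=20)

# "nearest" matching is one of the lookup methods sel supports.
nearest = toy.sel(x=19, method="nearest")

# reindex conforms the array to new labels, filling gaps with NaN.
conformed = toy.reindex(x=[10, 20, 40])

print(picked.item(), nearest.item(), conformed.values)
```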
Attributes#
.attrs is a dictionary that can contain arbitrary Python objects (strings, lists, integers, dictionaries, etc.) containing information about your data. Your only
limitation is that some attributes may not be writeable to certain file formats.
ds.air.attrs
{'long_name': '4xDaily Air temperature at sigma level 995',
'units': 'degK',
'precision': np.int16(2),
'GRIB_id': np.int16(11),
'GRIB_name': 'TMP',
'var_desc': 'Air temperature',
'dataset': 'NMC Reanalysis',
'level_desc': 'Surface',
'statistic': 'Individual Obs',
'parent_stat': 'Other',
'actual_range': array([185.16, 322.1 ], dtype=float32)}
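Because `.attrs` is a dictionary, it can be read and updated with the usual dictionary operations. The keys and values in this sketch are hypothetical.

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(np.zeros(3), dims="x", attrs={"units": "degK"})

# attrs behaves like a regular dictionary and accepts arbitrary objects.
toy.attrs["long_name"] = "Air temperature"
toy.attrs["valid_range"] = np.array([185.0, 325.0])

for key, value in toy.attrs.items():
    print(key, "->", value)
```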
To Pandas and back#
DataArray and Dataset objects are frequently created by converting from other libraries such as pandas or by reading from data storage formats such as NetCDF or zarr.
To convert from / to pandas, we can use the to_xarray methods on pandas objects or the to_pandas methods on xarray objects:
series = pd.Series(np.ones((10,)), index=list("abcdefghij"))
series
a 1.0
b 1.0
c 1.0
d 1.0
e 1.0
f 1.0
g 1.0
h 1.0
i 1.0
j 1.0
dtype: float64
arr = series.to_xarray()
arr
<xarray.DataArray (index: 10)> Size: 80B
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Coordinates:
* index (index) object 80B 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j'
arr.to_pandas()
index
a 1.0
b 1.0
c 1.0
d 1.0
e 1.0
f 1.0
g 1.0
h 1.0
i 1.0
j 1.0
dtype: float64
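The round trip can be sketched end to end with a small hypothetical Series. The example also shows that `to_pandas` picks the pandas type from the number of dimensions, so a two-dimensional DataArray becomes a DataFrame.

```python
import numpy as np
import pandas as pd
import xarray as xr

small = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# The unnamed pandas index becomes a dimension called "index".
as_xarray = small.to_xarray()

# A 1-dimensional DataArray converts back to a pandas.Series.
back = as_xarray.to_pandas()

# A 2-dimensional DataArray maps to a DataFrame instead.
table = xr.DataArray(np.zeros((2, 3)), dims=("row", "col")).to_pandas()

print(as_xarray.dims, type(back).__name__, type(table).__name__)
```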
We can also control what pandas object is used by calling to_series /
to_dataframe:
to_series#
This will always convert DataArray objects to pandas.Series, using a MultiIndex for higher dimensions
ds.air.to_series()
time lat lon
2013-01-01 00:00:00 75.0 200.0 241.20
202.5 242.50
205.0 243.50
207.5 244.00
210.0 244.10
...
2014-12-31 18:00:00 15.0 320.0 297.39
322.5 297.19
325.0 296.49
327.5 296.19
330.0 295.69
Name: air, Length: 3869000, dtype: float64
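On a smaller, hypothetical array the structure of the result is easier to inspect. In this sketch each dimension becomes one level of the MultiIndex, and a value can be looked up by its tuple of labels.

```python
import numpy as np
import pandas as pd
import xarray as xr

toy = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20], "y": ["a", "b", "c"]},
    name="toy",
)

flat = toy.to_series()

print(isinstance(flat.index, pd.MultiIndex))
print(list(flat.index.names))    # one index level per dimension
print(flat.loc[(20, "b")])       # look up a value by its (x, y) labels
```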
to_dataframe#
This will always convert DataArray or Dataset objects to a pandas.DataFrame. Note that DataArray objects have to be named for this. Since columns in a DataFrame need to share the same index, they are
broadcast against each other.
ds.air.to_dataframe()
| time | lat | lon | air |
|---|---|---|---|
| 2013-01-01 00:00:00 | 75.0 | 200.0 | 241.20 |
| | | 202.5 | 242.50 |
| | | 205.0 | 243.50 |
| | | 207.5 | 244.00 |
| | | 210.0 | 244.10 |
| ... | ... | ... | ... |
| 2014-12-31 18:00:00 | 15.0 | 320.0 | 297.39 |
| | | 322.5 | 297.19 |
| | | 325.0 | 296.49 |
| | | 327.5 | 296.19 |
| | | 330.0 | 295.69 |
3869000 rows × 1 columns
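The broadcasting and the naming requirement can both be seen in a small sketch. The Dataset below is hypothetical: variable `a` depends on both dimensions while `b` depends only on `x`, so `b` is repeated along `y` in the resulting DataFrame. The second part shows an unnamed DataArray, for which a column name is passed through the `name` argument.

```python
import numpy as np
import xarray as xr

toy = xr.Dataset(
    {
        "a": (("x", "y"), np.arange(6.0).reshape(2, 3)),
        "b": ("x", np.array([100.0, 200.0])),
    },
    coords={"x": [10, 20], "y": ["p", "q", "r"]},
)

# dim_order fixes the order of the MultiIndex levels.
df = toy.to_dataframe(dim_order=["x", "y"])
print(df)

# An unnamed DataArray needs a column name supplied explicitly.
unnamed = xr.DataArray(np.zeros(2), dims="x")
named_df = unnamed.to_dataframe(name="value")
print(named_df)
```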