Xarray’s Data structures#

In this lesson, we cover the basics of Xarray data structures. By the end of the lesson, we will be able to:

Learning Goals

  • Understand the basic Xarray data structures DataArray and Dataset

  • Customize the display of Xarray data structures

  • Understand the connection between Pandas and Xarray data structures

Data structures#

Xarray provides two data structures: the DataArray and Dataset. The DataArray class attaches dimension names, coordinates and attributes to multi-dimensional arrays while Dataset combines multiple DataArrays.

Both classes are most commonly created by reading data. To learn how to create a DataArray or Dataset manually, see the Creating Data Structures tutorial.
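For orientation, here is a minimal sketch of building both structures by hand. The array values, coordinate labels, and the air_celsius variable are made up for illustration; only the xr.DataArray and xr.Dataset constructors come from Xarray itself.

```python
import numpy as np
import xarray as xr

# Hypothetical values: 2 time steps on a 3 x 4 lat/lon grid.
temperature = 280.0 + np.arange(24, dtype="float64").reshape(2, 3, 4)

# A DataArray attaches dimension names, coordinates and attributes to the array.
da = xr.DataArray(
    temperature,
    dims=("time", "lat", "lon"),
    coords={
        "time": np.array(["2013-01-01", "2013-01-02"], dtype="datetime64[ns]"),
        "lat": [75.0, 72.5, 70.0],
        "lon": [200.0, 202.5, 205.0, 207.5],
    },
    name="air",
    attrs={"units": "degK"},
)

# A Dataset maps variable names to DataArrays that share dimensions.
ds = xr.Dataset({"air": da, "air_celsius": da - 273.15})

print(da.dims)             # ('time', 'lat', 'lon')
print(list(ds.data_vars))  # ['air', 'air_celsius']
```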

import numpy as np
import xarray as xr
import pandas as pd

# When working in a Jupyter Notebook you might want to customize Xarray display settings to your liking
# The following settings reduce the amount of data displayed by default
xr.set_options(display_expand_attrs=False, display_expand_data=False)
np.set_printoptions(threshold=10, edgeitems=2)

Dataset#

Dataset objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.

Xarray has a few small real-world tutorial datasets hosted in the pydata/xarray-data GitHub repository. We’ll use the xarray.tutorial.load_dataset convenience function to download and open the air_temperature dataset (from the National Centers for Environmental Prediction) by name.

ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset> Size: 31MB
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes: (5)

We can access “layers” of the Dataset (individual DataArrays) with dictionary syntax:

ds["air"]
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)

We can save some typing by using the “attribute” or “dot” notation. This won’t work for variable names that clash with built-in method names (for example, mean).

ds.air
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)
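The name clash mentioned above can be sketched with a small Dataset built by hand. The variable names and values here are hypothetical; one variable is deliberately called "mean" so that it collides with the Dataset.mean method.

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset with one variable deliberately named "mean".
ds = xr.Dataset(
    {
        "air": ("x", np.array([1.0, 2.0, 3.0])),
        "mean": ("x", np.array([10.0, 20.0, 30.0])),
    }
)

# Dictionary-like behaviour: membership tests and listing variable names.
print("air" in ds)         # True
print(list(ds.data_vars))  # ['air', 'mean']

# Dictionary syntax always returns the variable.
print(type(ds["mean"]).__name__)  # DataArray

# Attribute syntax resolves to the Dataset.mean method instead of the variable.
print(callable(ds.mean))  # True

# For names that do not clash, both notations return the same DataArray.
print(ds.air.identical(ds["air"]))  # True
```

When in doubt, dictionary syntax is the safer choice, since it never depends on which method names exist.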

HTML vs text representations#

Xarray has two representation types: "html" (which is only available in notebooks) and "text". To choose between them, use the display_style option.

So far, our notebook has automatically displayed the "html" representation (which we will continue using). The "html" representation is interactive, allowing you to collapse sections (▶) and view attributes and values for each variable (📄 and ≡).

with xr.set_options(display_style="html"):
    display(ds)
<xarray.Dataset> Size: 31MB
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes: (5)

☝️ From top to bottom the output consists of:

  • Dimensions: summary of all dimensions of the Dataset (lat: 25, time: 2920, lon: 53): this tells us that the first dimension is named lat and has a size of 25, the second dimension is named time and has a size of 2920, and the third dimension is named lon and has a size of 53. Because we will access the dimensions by name, the order doesn’t matter.

  • Coordinates: an unordered list of coordinates, with one item per line. Each item has a name, one or more dimensions in parentheses, a dtype, and a preview of the values. Dimension coordinates are printed in bold font; non-dimension coordinates appear in plain font (there are none in this example, but you might imagine a ‘mask’ coordinate that has a value assigned at every point).

  • Data variables: names of each nD measurement in the dataset, followed by its dimensions (time, lat, lon), dtype, and a preview of values.

  • Indexes: Each dimension with coordinates is backed by an “Index”. In this example, each dimension is backed by a PandasIndex.

  • Attributes: an unordered list of metadata (for example, a paragraph describing the dataset)

Compare that with the string representation, which is very similar except that dimension coordinates are given a * prefix instead of bold, and you cannot collapse or expand the outputs.

with xr.set_options(display_style="text"):
    display(ds)
<xarray.Dataset> Size: 31MB
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes: (5)
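Outside a notebook, the text form is what repr() and print() produce. The sketch below uses a small hand-made Dataset (all names and values are hypothetical) with an extra two-dimensional ‘mask’ coordinate, to show that only dimension coordinates receive the * prefix.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 grid with a 2-D "mask" coordinate that is not a dimension coordinate.
ds = xr.Dataset(
    {"air": (("lat", "lon"), np.zeros((2, 3)))},
    coords={
        "lat": [75.0, 72.5],
        "lon": [200.0, 202.5, 205.0],
        "mask": (("lat", "lon"), np.ones((2, 3), dtype=bool)),
    },
)

# repr() always yields the plain-text form, regardless of display_style.
text = repr(ds)
print(text)

# Collect the names on lines that carry the "*" marker.
starred = [
    line.split()[1]
    for line in text.splitlines()
    if line.strip().startswith("* ")
]
print(sorted(starred))  # ['lat', 'lon']
```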

To understand each of the components better, we’ll explore the “air” variable of our Dataset.

DataArray#

The DataArray class consists of an array (data) and its associated dimension names, labels, and attributes (metadata).

da = ds["air"]
da
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)

String representations#

We can use the same two representations ("html", which is only available in notebooks, and "text") to display our DataArray.

with xr.set_options(display_style="html"):
    display(da)
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)

with xr.set_options(display_style="text"):
    display(da)
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes: (11)

In the string representation of a DataArray (versus a Dataset), we also see:

  • the DataArray name (‘air’)

  • a preview of the array data (collapsible in the "html" representation)

We can also access the underlying array of values directly:

ds.air.data  # (or equivalently, `da.data`)
array([[[241.2 , 242.5 , ..., 235.5 , 238.6 ],
        [243.8 , 244.5 , ..., 235.3 , 239.3 ],
        ...,
        [295.9 , 296.2 , ..., 295.9 , 295.2 ],
        [296.29, 296.79, ..., 296.79, 296.6 ]],

       [[242.1 , 242.7 , ..., 233.6 , 235.8 ],
        [243.6 , 244.1 , ..., 232.5 , 235.7 ],
        ...,
        [296.2 , 296.7 , ..., 295.5 , 295.1 ],
        [296.29, 297.2 , ..., 296.4 , 296.6 ]],

       ...,

       [[245.79, 244.79, ..., 243.99, 244.79],
        [249.89, 249.29, ..., 242.49, 244.29],
        ...,
        [296.29, 297.19, ..., 295.09, 294.39],
        [297.79, 298.39, ..., 295.49, 295.19]],

       [[245.09, 244.29, ..., 241.49, 241.79],
        [249.89, 249.29, ..., 240.29, 241.69],
        ...,
        [296.09, 296.89, ..., 295.69, 295.19],
        [297.69, 298.09, ..., 296.19, 295.69]]])
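A short sketch of what .data returns, using a small hand-made array rather than the tutorial dataset. It also shows .values, which always returns a NumPy array; for in-memory data like this the two hold the same numbers.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 array standing in for the air temperature data.
da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    name="air",
)

raw = da.data         # the wrapped array (NumPy here; other array types are possible)
as_numpy = da.values  # always a NumPy array

print(type(raw).__name__)     # ndarray
print(raw.shape == da.shape)  # True
```

Note that the plain array carries no dimension names, coordinates, or attributes; those live on the DataArray.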

Named dimensions#

.dims are the named axes of your data. They may (dimension coordinates) or may not (dimensions without coordinates) have associated values. Names can be anything that fits into a Python set (i.e. calling hash() on it doesn’t raise an error), but to be useful they should be strings.

In this case we have two spatial dimensions (latitude and longitude are stored with shorthand names lat and lon) and one temporal dimension (time).

ds.air.dims
('time', 'lat', 'lon')
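To illustrate why names matter more than axis order, here is a minimal sketch on a made-up array. Only "lat" is given coordinate values, so "time" and "lon" are dimensions without coordinates.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 x 4 array; only "lat" gets coordinate values.
da = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={"lat": [75.0, 72.5, 70.0]},
)

print(da.dims)         # ('time', 'lat', 'lon')
print(dict(da.sizes))  # {'time': 2, 'lat': 3, 'lon': 4}

# Reduce over a dimension by name instead of by axis number.
time_mean = da.mean(dim="time")
print(time_mean.dims)  # ('lat', 'lon')

# "lat" is a dimension coordinate; "time" is a dimension without coordinates.
print("lat" in da.coords)   # True
print("time" in da.coords)  # False
```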

Coordinates#

.coords is a simple dict-like data container for mapping coordinate names to values. These values can be:

  • another DataArray object

  • a tuple of the form (dims, data, attrs) where attrs is optional. This is roughly equivalent to creating a new DataArray object with DataArray(dims=dims, data=data, attrs=attrs)

  • a 1-dimensional numpy array (or anything that can be coerced to one using numpy.array, such as a list) containing numbers, datetime objects, strings, etc. to label each point.
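The three forms listed above can be sketched in a single constructor call. The labels, the units attribute, and the zero-filled data are placeholders chosen for illustration.

```python
import numpy as np
import xarray as xr

# Form 1: another DataArray object (hypothetical latitude labels).
lat_labels = xr.DataArray([75.0, 72.5], dims="lat")

da = xr.DataArray(
    np.zeros((2, 2, 3)),
    dims=("time", "lat", "lon"),
    coords={
        "lat": lat_labels,
        # Form 2: a (dims, data, attrs) tuple; attrs is optional.
        "lon": ("lon", [200.0, 202.5, 205.0], {"units": "degrees_east"}),
        # Form 3: a 1-dimensional array (or list) labelling each point.
        "time": np.array(["2013-01-01", "2013-01-02"], dtype="datetime64[ns]"),
    },
)

print(sorted(da.coords))       # ['lat', 'lon', 'time']
print(da.coords["lon"].attrs)  # {'units': 'degrees_east'}
```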

Here we see the actual timestamps and spatial positions of our air temperature data:

ds.air.coords
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00

The difference between dimension coordinates (the dimension labels) and other coordinates is that, for now, indexing operations (sel, reindex, etc.) are only possible with dimension coordinates. Also, while coordinates can have arbitrary dimensions, dimension coordinates have to be one-dimensional.
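As a small illustration of label-based indexing with dimension coordinates, the following sketch calls sel on a hand-made array. The grid and values are hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 grid with dimension coordinates on both axes.
da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [75.0, 72.5], "lon": [200.0, 202.5, 205.0]},
)

# Exact label lookup on both dimension coordinates.
point = da.sel(lat=72.5, lon=205.0)
print(float(point))  # 5.0

# Nearest-neighbour lookup when the requested label is not on the grid.
nearest = da.sel(lon=203.0, method="nearest")
print(float(nearest.lon))  # 202.5
print(nearest.dims)        # ('lat',)
```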

Attributes#

.attrs is a dictionary that can contain arbitrary Python objects (strings, lists, integers, dictionaries, etc.) containing information about your data. Your only limitation is that some attributes may not be writeable to certain file formats.

ds.air.attrs
{'long_name': '4xDaily Air temperature at sigma level 995',
 'units': 'degK',
 'precision': 2,
 'GRIB_id': 11,
 'GRIB_name': 'TMP',
 'var_desc': 'Air temperature',
 'dataset': 'NMC Reanalysis',
 'level_desc': 'Surface',
 'statistic': 'Individual Obs',
 'parent_stat': 'Other',
 'actual_range': array([185.16, 322.1 ], dtype=float32)}
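Since .attrs behaves like an ordinary dictionary, metadata can be read and updated with the usual dict operations. A minimal sketch with made-up values:

```python
import numpy as np
import xarray as xr

# Hypothetical DataArray carrying a single attribute at creation time.
da = xr.DataArray(np.zeros(3), dims="x", name="air", attrs={"units": "degK"})

# Attributes can be added afterwards and may hold arbitrary Python objects.
da.attrs["long_name"] = "Air temperature"
da.attrs["actual_range"] = np.array([185.16, 322.1], dtype="float32")

print(sorted(da.attrs))   # ['actual_range', 'long_name', 'units']
print(da.attrs["units"])  # degK
```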

To Pandas and back#

DataArray and Dataset objects are frequently created by converting from other libraries such as pandas or by reading from data storage formats such as NetCDF or Zarr.

To convert from / to pandas, we can use the to_xarray methods on pandas objects or the to_pandas methods on xarray objects:

series = pd.Series(np.ones((10,)), index=list("abcdefghij"))
series
a    1.0
b    1.0
c    1.0
d    1.0
e    1.0
f    1.0
g    1.0
h    1.0
i    1.0
j    1.0
dtype: float64

arr = series.to_xarray()
arr
<xarray.DataArray (index: 10)> Size: 80B
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Coordinates:
  * index    (index) object 80B 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j'

arr.to_pandas()
index
a    1.0
b    1.0
c    1.0
d    1.0
e    1.0
f    1.0
g    1.0
h    1.0
i    1.0
j    1.0
dtype: float64
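The same to_xarray method exists on DataFrames, where it returns a Dataset with one data variable per column and one dimension per index level. The sketch below uses a small made-up table; the column names and values are hypothetical.

```python
import pandas as pd
import xarray as xr

# Hypothetical table indexed by a two-level (lat, lon) MultiIndex.
index = pd.MultiIndex.from_product(
    [[72.5, 75.0], [200.0, 202.5, 205.0]], names=["lat", "lon"]
)
df = pd.DataFrame(
    {
        "air": [243.8, 244.5, 244.7, 241.2, 242.5, 243.5],
        "n_obs": [4, 4, 3, 4, 2, 4],
    },
    index=index,
)

# Each index level becomes a dimension; each column becomes a data variable.
ds = df.to_xarray()

print(dict(ds.sizes))      # {'lat': 2, 'lon': 3}
print(list(ds.data_vars))  # ['air', 'n_obs']
print(float(ds["air"].sel(lat=75.0, lon=202.5)))  # 242.5
```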

We can also control what pandas object is used by calling to_series / to_dataframe:

to_series#

This will always convert DataArray objects to pandas.Series, using a MultiIndex for higher dimensions.

ds.air.to_series()
time                 lat   lon  
2013-01-01 00:00:00  75.0  200.0    241.20
                           202.5    242.50
                           205.0    243.50
                           207.5    244.00
                           210.0    244.10
                                     ...  
2014-12-31 18:00:00  15.0  320.0    297.39
                           322.5    297.19
                           325.0    296.49
                           327.5    296.19
                           330.0    295.69
Name: air, Length: 3869000, dtype: float64
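On a small hand-made array (hypothetical values), the flattening is easier to follow: each element of the array becomes one row, labelled by the combination of its coordinates.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2 x 3 array with coordinates on both dimensions.
da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [75.0, 72.5], "lon": [200.0, 202.5, 205.0]},
    name="air",
)

series = da.to_series()

print(isinstance(series.index, pd.MultiIndex))  # True
print(list(series.index.names))                 # ['lat', 'lon']
print(len(series))                              # 6

# A full (lat, lon) label pair picks out a single value.
value = float(series.loc[(72.5, 205.0)])
```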

to_dataframe#

This will always convert DataArray or Dataset objects to a pandas.DataFrame. Note that DataArray objects have to be named for this. Since columns in a DataFrame need to share the same index, variables with fewer dimensions are broadcast against the others.

ds.air.to_dataframe()
                                   air
time                lat  lon
2013-01-01 00:00:00 75.0 200.0  241.20
                         202.5  242.50
                         205.0  243.50
                         207.5  244.00
                         210.0  244.10
...                 ...  ...       ...
2014-12-31 18:00:00 15.0 320.0  297.39
                         322.5  297.19
                         325.0  296.49
                         327.5  296.19
                         330.0  295.69

3869000 rows × 1 columns
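Both points (broadcasting and the naming requirement) can be sketched with a small made-up Dataset. The variable names elevation and air, and the column name value, are hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset mixing a 1-D and a 2-D variable.
ds = xr.Dataset(
    {
        "elevation": ("x", [5.0, 7.0]),
        "air": (("x", "y"), np.arange(6.0).reshape(2, 3)),
    },
    coords={"x": [0, 1], "y": [10, 20, 30]},
)

df = ds.to_dataframe()
print(df.shape)  # (6, 2)

# The 1-D variable is repeated along "y": one distinct value per "x".
print(df.groupby(level="x")["elevation"].nunique().tolist())  # [1, 1]

# An unnamed DataArray needs a name; it can be supplied at conversion time.
unnamed = xr.DataArray([1.0, 2.0], dims="x")
named_df = unnamed.to_dataframe(name="value")
print(list(named_df.columns))  # ['value']
```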