
Xarray’s Data structures#
In this lesson, we cover the basics of Xarray data structures. Our learning goals are as follows. By the end of the lesson, we will be able to:
Understand the basic data structures (DataArray and Dataset objects) in Xarray
Introduction#
Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”) are an essential part of computational science. They are encountered in a wide range of fields, including physics, astronomy, geoscience, bioinformatics, engineering, finance, and deep learning. In Python, NumPy provides the fundamental data structure and API for working with raw ND arrays. However, real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.
Here is an example of how we might structure a dataset for a weather forecast:

You’ll notice multiple data variables (temperature, precipitation), coordinate variables (latitude, longitude), and dimensions (x, y, t). We’ll cover how these fit into Xarray’s data structures below.
Xarray doesn’t just keep track of labels on arrays – it uses them to provide a powerful and concise interface. For example:
Apply operations over dimensions by name: x.sum('time').
Select values by label (or logical location) instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.
Easily use the split-apply-combine paradigm with groupby: x.groupby('time.dayofyear').mean().
Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer').
Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
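The operations listed above can be tried on a small synthetic array. This is a minimal sketch, not part of the lesson's dataset: the array x, its two sites and its four time steps are made-up values chosen only so that each call has something to act on.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily series: four days spanning a year boundary, two sites.
time = pd.date_range("2013-12-30", periods=4, freq="D")
x = xr.DataArray(
    np.arange(8.0).reshape(4, 2),
    dims=("time", "site"),
    coords={"time": time, "site": ["a", "b"]},
    attrs={"units": "K"},
)

# Reduce over a dimension by name rather than by axis number.
total = x.sum("time")

# Select by label; both spellings pick the same time step.
by_sel = x.sel(time="2014-01-01")
by_loc = x.loc["2014-01-01"]

# Split-apply-combine: group values that share a day of year, then average.
daily_mean = x.groupby("time.dayofyear").mean()

# Outer alignment on coordinate labels pads missing labels with NaN.
y = x.isel(time=slice(1, None))  # y lacks the first time step
x_aligned, y_aligned = xr.align(x, y, join="outer")

# Metadata travels with the array as a plain dictionary.
units = x.attrs["units"]
```

After alignment both arrays cover all four time steps, and the step that y was missing is filled with NaN.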
The N-dimensional nature of Xarray’s data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis numbers (dim='time' instead of axis=0) makes such arrays much more manageable than the raw NumPy ndarray: with Xarray, you don’t need to keep track of the order of an array’s dimensions or insert dummy dimensions of size 1 to align arrays (e.g., using np.newaxis).
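To make the contrast concrete, here is a small sketch that computes the same quantities with NumPy and with Xarray. The array shapes, the dimension names time and space, and the weights are illustrative assumptions.

```python
import numpy as np
import xarray as xr

raw = np.arange(6.0).reshape(2, 3)  # axis 0 is time, axis 1 is space
weights = np.array([1.0, 2.0])      # one weight per time step

# NumPy: remember that time is axis 0 and add a dummy axis by hand.
np_mean = raw.mean(axis=0)
np_weighted = raw * weights[:, np.newaxis]

# Xarray: refer to the dimension by name and let broadcasting match names.
da = xr.DataArray(raw, dims=("time", "space"))
w = xr.DataArray(weights, dims="time")
xr_mean = da.mean(dim="time")
xr_weighted = da * w

# The same expression still works if the dimensions are stored in another order.
xr_weighted_t = da.transpose("space", "time") * w
```

The Xarray version never mentions an axis number, so reordering the stored dimensions does not change the code that uses them.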
The immediate payoff of using xarray is that you’ll write less code. The long-term payoff is that you’ll understand what you were thinking when you come back to look at it weeks or months later.
Data structures#
Xarray provides two data structures: the DataArray and Dataset. The DataArray class attaches dimension names, coordinates and attributes to multi-dimensional arrays while Dataset combines multiple arrays.
Both classes are most commonly created by reading data. To learn how to create a DataArray or Dataset manually, see the Working with labeled data tutorial.
Xarray has a few small real-world tutorial datasets hosted in this GitHub repository pydata/xarray-data.
We’ll use the xarray.tutorial.load_dataset convenience function to download and open the air_temperature (National Centers for Environmental Prediction) Dataset by name.
import numpy as np
import xarray as xr
Dataset#
Dataset objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.
ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
We can access “layers” of the Dataset (individual DataArrays) with dictionary syntax
ds["air"]
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999],
        [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999],
        [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ],
        ...,
        [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998],
        [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998],
        [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]],
       [[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999],
        [243.59999, 244.09999, 244.2 , ..., 231. , 232.5 , 235.7 ],
        [253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 , 238.5 ],
...
        [293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 , 294.29 ],
        [296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 , 294.38998],
        [297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 , 295.19 ]],
       [[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999, 241.79 ],
        [249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 , 241.68999],
        [262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 , 246.29 ],
        ...,
        [293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 , 294.69 ],
        [296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 , 295.19 ],
        [297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 , 295.69 ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
We can save some typing by using the “attribute” or “dot” notation. This won’t work for variable names that clash with built-in method names (for example, mean).
ds.air
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999],
        [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999],
        [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ],
        ...,
        [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998],
        [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998],
        [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]],
       [[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999],
        [243.59999, 244.09999, 244.2 , ..., 231. , 232.5 , 235.7 ],
        [253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 , 238.5 ],
...
        [293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 , 294.29 ],
        [296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 , 294.38998],
        [297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 , 295.19 ]],
       [[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999, 241.79 ],
        [249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 , 241.68999],
        [262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 , 246.29 ],
        ...,
        [293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 , 294.69 ],
        [296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 , 295.19 ],
        [297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 , 295.69 ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
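The name clash mentioned above is easy to reproduce. The sketch below uses a made-up Dataset (demo) with one variable deliberately named mean; nothing here comes from the air_temperature data.

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset with one safe name and one that shadows a method.
demo = xr.Dataset(
    {
        "air": ("x", np.array([1.0, 2.0, 3.0])),
        "mean": ("x", np.array([10.0, 20.0, 30.0])),
    }
)

# Dictionary syntax always returns the stored variable.
stored = demo["mean"]

# Dot notation resolves to the Dataset.mean method, not the variable.
method = demo.mean

# For a name that does not clash, both spellings give the same DataArray.
same = demo.air.identical(demo["air"])
```

Dictionary syntax is therefore the safer choice in code that has to work for arbitrary variable names.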
What is all this anyway? (String representations)#
Xarray has two representation types: "html" (which is only available in notebooks) and "text". To choose between them, use the display_style option.
So far, our notebook has automatically displayed the "html" representation (which we will continue using). The "html" representation is interactive, allowing you to collapse sections (left arrows) and view attributes and values for each variable (right hand sheet icon and data symbol).
with xr.set_options(display_style="html"):
    display(ds)
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
The output consists of:
a summary of all dimensions of the Dataset (lat: 25, time: 2920, lon: 53): this tells us that the first dimension is named lat and has a size of 25, the second dimension is named time and has a size of 2920, and the third dimension is named lon and has a size of 53. Because we will access the dimensions by name, the order doesn’t matter.
an unordered list of coordinates (or dimensions with coordinates), with one item per line. Each item has a name, one or more dimensions in parentheses, a dtype and a preview of the values. Also, if it is a dimension coordinate, it will be marked with a *.
an alphabetically sorted list of dimensions without coordinates (if there are any)
a list of data variables, each with a name, its dimensions in parentheses, a dtype and a preview of the values
an unordered list of attributes, or metadata
Compare that with the string representation, which is very similar except that dimension coordinates are marked with a * prefix instead of bold text and you cannot collapse or expand the outputs.
with xr.set_options(display_style="text"):
    display(ds)
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
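The air_temperature Dataset has no dimensions without coordinates, so that part of the representation never shows up above. As a rough sketch, the made-up Dataset below (demo, with a hypothetical station dimension that has no coordinate values) prints every section in one go.

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset: "lat" has coordinate values, "station" does not.
demo = xr.Dataset(
    data_vars={"air": (("lat", "station"), np.zeros((3, 2)))},
    coords={"lat": [10.0, 20.0, 30.0]},
    attrs={"title": "demo"},
)

# repr() gives the plain-text form, which also works outside notebooks.
text = repr(demo)
print(text)
```

In the printed text, lat carries the * marker because it is a dimension coordinate, while station is listed separately as a dimension without coordinates.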
To understand each of the components better, we’ll explore the “air” variable of our Dataset.
DataArray#
The DataArray class consists of an array (data) and its associated dimension names, labels, and attributes (metadata).
da = ds["air"]
da
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999],
        [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999],
        [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ],
        ...,
        [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998],
        [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998],
        [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]],
       [[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999],
        [243.59999, 244.09999, 244.2 , ..., 231. , 232.5 , 235.7 ],
        [253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 , 238.5 ],
...
        [293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 , 294.29 ],
        [296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 , 294.38998],
        [297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 , 295.19 ]],
       [[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999, 241.79 ],
        [249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 , 241.68999],
        [262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 , 246.29 ],
        ...,
        [293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 , 294.69 ],
        [296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 , 295.19 ],
        [297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 , 295.69 ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
String representations#
We can use the same two representations ("html", which is only available in notebooks, and "text") to display our DataArray.
with xr.set_options(display_style="html"):
    display(da)
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999],
        [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999],
        [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ],
        ...,
        [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998],
        [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998],
        [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]],
       [[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999],
        [243.59999, 244.09999, 244.2 , ..., 231. , 232.5 , 235.7 ],
        [253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 , 238.5 ],
...
        [293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 , 294.29 ],
        [296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 , 294.38998],
        [297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 , 295.19 ]],
       [[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999, 241.79 ],
        [249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 , 241.68999],
        [262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 , 246.29 ],
        ...,
        [293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 , 294.69 ],
        [296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 , 295.19 ],
        [297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 , 295.69 ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
with xr.set_options(display_style="text"):
    display(da)
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999],
        [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999],
        [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ],
        ...,
        [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998],
        [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998],
        [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]],
       [[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999],
        [243.59999, 244.09999, 244.2 , ..., 231. , 232.5 , 235.7 ],
        [253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 , 238.5 ],
...
        [293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 , 294.29 ],
        [296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 , 294.38998],
        [297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 , 295.19 ]],
       [[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999, 241.79 ],
        [249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 , 241.68999],
        [262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 , 246.29 ],
        ...,
        [293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 , 294.69 ],
        [296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 , 295.19 ],
        [297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 , 295.69 ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
In the string representation of a DataArray (versus a Dataset), we also see:
the DataArray name (‘air’)
a preview of the array data (collapsible in the "html" representation)
We can also access the data array directly:
ds.air.data # (or equivalently, `da.data`)
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 ,
238.59999],
[243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999,
239.29999],
[250. , 249.79999, 248.89 , ..., 233.2 , 236.39 ,
241.7 ],
...,
[296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 ,
294.69998],
[295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 ,
295.19998],
[296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 ,
296.6 ]],
[[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999,
235.79999],
[243.59999, 244.09999, 244.2 , ..., 231. , 232.5 ,
235.7 ],
[253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 ,
238.5 ],
...,
[296.4 , 295.9 , 296.19998, ..., 295.4 , 295.1 ,
294.79 ],
[296.19998, 296.69998, 296.79 , ..., 295.6 , 295.5 ,
295.1 ],
[296.29 , 297.19998, 297.4 , ..., 296.4 , 296.4 ,
296.6 ]],
[[242.29999, 242.2 , 242.29999, ..., 234.29999, 236.09999,
238.7 ],
[244.59999, 244.39 , 244. , ..., 230.29999, 232. ,
235.7 ],
[256.19998, 255.5 , 254.2 , ..., 231.2 , 233.2 ,
238.2 ],
...,
[295.6 , 295.4 , 295.4 , ..., 296.29 , 295.29 ,
295. ],
[296.19998, 296.5 , 296.29 , ..., 296.4 , 296. ,
295.6 ],
[296.4 , 296.29 , 296.4 , ..., 297. , 297. ,
296.79 ]],
...,
[[243.48999, 242.98999, 242.09 , ..., 244.18999, 244.48999,
244.89 ],
[249.09 , 248.98999, 248.59 , ..., 240.59 , 241.29 ,
242.68999],
[262.69 , 262.19 , 261.69 , ..., 239.39 , 241.68999,
245.18999],
...,
[294.79 , 295.29 , 297.49 , ..., 295.49 , 295.38998,
294.69 ],
[296.79 , 297.88998, 298.29 , ..., 295.49 , 295.49 ,
294.79 ],
[298.19 , 299.19 , 298.79 , ..., 296.09 , 295.79 ,
295.79 ]],
[[245.79 , 244.79 , 243.48999, ..., 243.29 , 243.98999,
244.79 ],
[249.89 , 249.29 , 248.48999, ..., 241.29 , 242.48999,
244.29 ],
[262.38998, 261.79 , 261.29 , ..., 240.48999, 243.09 ,
246.89 ],
...,
[293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 ,
294.29 ],
[296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 ,
294.38998],
[297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 ,
295.19 ]],
[[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999,
241.79 ],
[249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 ,
241.68999],
[262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 ,
246.29 ],
...,
[293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 ,
294.69 ],
[296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 ,
295.19 ],
[297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 ,
295.69 ]]], dtype=float32)
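What .data hands back is the wrapped array itself, stripped of its labels. The sketch below checks this on a tiny made-up DataArray (da_demo) that is assumed to be NumPy-backed and held in memory; arrays backed by other array libraries would return that library's array type from .data instead.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for ds.air, small enough to inspect by eye.
da_demo = xr.DataArray(
    np.arange(6, dtype="float32").reshape(2, 3),
    dims=("lat", "lon"),
    name="air",
)

raw = da_demo.data       # the wrapped array, without dimension names or coordinates
as_numpy = da_demo.values  # always a NumPy array

print(type(raw), raw.shape, raw.dtype)
print(da_demo.name)
```

Once you drop down to the raw array, the dimension names are gone, so any further indexing has to be positional.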
Named dimensions#
.dims are the named axes of your data. They may (dimension coordinates) or may not (dimensions without coordinates) have associated values. Names can be any hashable object, that is, anything that fits into a Python set (calling hash() on it doesn’t raise an error), but to be useful they should be strings.
In this case we have two spatial dimensions (latitude and longitude are stored with shorthand names lat and lon) and one temporal dimension (time).
ds.air.dims
('time', 'lat', 'lon')
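Alongside .dims, a couple of related accessors are handy when you need sizes or axis positions. The example below is a sketch on a made-up array (da_demo) whose time dimension is deliberately left without coordinate values.

```python
import numpy as np
import xarray as xr

da_demo = xr.DataArray(
    np.zeros((4, 2, 3)),
    dims=("time", "lat", "lon"),
    coords={"lat": [75.0, 72.5], "lon": [200.0, 202.5, 205.0]},
)

print(da_demo.dims)                 # names, in storage order
print(dict(da_demo.sizes))          # mapping from name to length
print(da_demo.get_axis_num("lat"))  # integer axis that belongs to a name

# "time" has no coordinate values, so it is a dimension without coordinates.
has_time_coord = "time" in da_demo.coords
```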
Coordinates#
.coords is a simple dict-like data container for mapping coordinate names to values. These values can be:
another DataArray object
a tuple of the form (dims, data, attrs) where attrs is optional. This is roughly equivalent to creating a new DataArray object with DataArray(dims=dims, data=data, attrs=attrs)
a 1-dimensional numpy array (or anything that can be coerced to one using numpy.array, such as a list) containing numbers, datetime objects, strings, etc. to label each point.
Here we see the actual timestamps and spatial positions of our air temperature data:
ds.air.coords
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
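The three kinds of coordinate values described above can be mixed in a single constructor call. The following is a minimal sketch with invented values and units; it is not how the air_temperature coordinates were built.

```python
import numpy as np
import xarray as xr

# Form 1: another DataArray.
lat = xr.DataArray([10.0, 20.0], dims="lat")

da_demo = xr.DataArray(
    np.zeros((2, 2, 3)),
    dims=("time", "lat", "lon"),
    coords={
        "lat": lat,
        # Form 2: a (dims, data, attrs) tuple; attrs may be left out.
        "lon": ("lon", [100.0, 110.0, 120.0], {"units": "degrees_east"}),
        # Form 3: a plain 1-dimensional list labelling each point.
        "time": ["2013-01-01", "2013-01-02"],
    },
)

print(da_demo.coords)
```

A bare list works here because the key time matches a dimension name, which tells Xarray which dimension the labels belong to.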
The difference between the dimension labels (dimension coordinates) and normal coordinates is that, by default, it is only possible to use indexing operations (sel, reindex, etc.) with dimension coordinates. Also, while coordinates can have arbitrary dimensions, dimension coordinates have to be one-dimensional.
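The difference shows up as soon as you try to select by label. In this sketch, x is a dimension coordinate and label is an ordinary (non-dimension) coordinate on the same dimension; both names and all values are made up. The exact exception raised for the failing call is assumed to vary between Xarray versions, so the example accepts either KeyError or ValueError.

```python
import numpy as np
import xarray as xr

da_demo = xr.DataArray(
    np.array([1.0, 2.0, 3.0]),
    dims="x",
    coords={
        "x": [10, 20, 30],                # dimension coordinate
        "label": ("x", ["a", "b", "c"]),  # non-dimension coordinate
    },
)

# Label-based selection works on the dimension coordinate.
picked = da_demo.sel(x=20)

# The same call on "label" fails because nothing indexes that coordinate.
try:
    da_demo.sel(label="b")
    sel_on_label_failed = False
except (KeyError, ValueError):
    sel_on_label_failed = True

# A boolean mask is one way to select on a non-dimension coordinate.
masked = da_demo.where(da_demo["label"] == "b", drop=True)
```

The mask-based selection keeps the x dimension (with length 1), whereas sel with a scalar label drops it.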
Attributes#
.attrs is a dictionary that can contain arbitrary Python objects (strings, lists, integers, dictionaries, etc.) containing information about your data. Your only limitation is that some attributes may not be writeable to certain file formats.
ds.air.attrs
{'long_name': '4xDaily Air temperature at sigma level 995',
'units': 'degK',
'precision': 2,
'GRIB_id': 11,
'GRIB_name': 'TMP',
'var_desc': 'Air temperature',
'dataset': 'NMC Reanalysis',
'level_desc': 'Surface',
'statistic': 'Individual Obs',
'parent_stat': 'Other',
'actual_range': array([185.16, 322.1 ], dtype=float32)}
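Because .attrs is an ordinary dictionary, it can be read and updated with the usual dictionary operations. The sketch below uses a throwaway array (da_demo) and invented attribute values.

```python
import numpy as np
import xarray as xr

da_demo = xr.DataArray(np.zeros(3), dims="x", attrs={"units": "degK"})

# Add or overwrite entries in place, exactly as with a dict.
da_demo.attrs["long_name"] = "Air temperature"
da_demo.attrs["actual_range"] = np.array([185.16, 322.1], dtype="float32")

# assign_attrs returns a new object and leaves the original untouched.
tagged = da_demo.assign_attrs(source="hypothetical example")

print(sorted(tagged.attrs))
```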
To Pandas and back#
DataArray and Dataset objects are frequently created by converting from other libraries such as pandas or by reading from data storage formats such as NetCDF or zarr.
To convert from / to pandas, we can use the to_xarray methods on pandas objects or the to_pandas methods on xarray objects:
import pandas as pd
series = pd.Series(np.ones((10,)), index=list("abcdefghij"))
series
a 1.0
b 1.0
c 1.0
d 1.0
e 1.0
f 1.0
g 1.0
h 1.0
i 1.0
j 1.0
dtype: float64
arr = series.to_xarray()
arr
<xarray.DataArray (index: 10)> array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]) Coordinates: * index (index) object 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j'
arr.to_pandas()
index
a 1.0
b 1.0
c 1.0
d 1.0
e 1.0
f 1.0
g 1.0
h 1.0
i 1.0
j 1.0
dtype: float64
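The round trip also carries names across. In this sketch the index name letter and the series name score are made-up labels, chosen to show where they end up on the Xarray side; the second half shows what to_pandas returns for a 2-dimensional array.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A named index gives the resulting dimension a meaningful name.
series = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.Index(["a", "b", "c"], name="letter"),
    name="score",
)

arr = series.to_xarray()  # DataArray with dimension "letter"
back = arr.to_pandas()    # 1-D DataArray becomes a pandas.Series again

# A 2-D DataArray converts to a DataFrame instead.
table = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("row", "col"),
    coords={"row": ["r0", "r1"], "col": ["c0", "c1", "c2"]},
).to_pandas()
```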
We can also control what pandas object is used by calling to_series / to_dataframe:
to_series: This will always convert DataArray objects to pandas.Series, using a MultiIndex for higher dimensions
ds.air.to_series()
time lat lon
2013-01-01 00:00:00 75.0 200.0 241.199997
202.5 242.500000
205.0 243.500000
207.5 244.000000
210.0 244.099991
...
2014-12-31 18:00:00 15.0 320.0 297.389984
322.5 297.190002
325.0 296.489990
327.5 296.190002
330.0 295.690002
Name: air, Length: 3869000, dtype: float32
to_dataframe: This will always convert DataArray or Dataset objects to a pandas.DataFrame. Note that DataArray objects have to be named for this.
ds.air.to_dataframe()
| time | lat | lon | air |
|---|---|---|---|
| 2013-01-01 00:00:00 | 75.0 | 200.0 | 241.199997 |
| | | 202.5 | 242.500000 |
| | | 205.0 | 243.500000 |
| | | 207.5 | 244.000000 |
| | | 210.0 | 244.099991 |
| ... | ... | ... | ... |
| 2014-12-31 18:00:00 | 15.0 | 320.0 | 297.389984 |
| | | 322.5 | 297.190002 |
| | | 325.0 | 296.489990 |
| | | 327.5 | 296.190002 |
| | | 330.0 | 295.690002 |

3869000 rows × 1 columns
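The naming requirement and the MultiIndex layout are easier to see on a small array. The sketch below starts from a made-up, unnamed DataArray and shows two ways to supply the missing name.

```python
import numpy as np
import xarray as xr

unnamed = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [75.0, 72.5], "lon": [200.0, 202.5, 205.0]},
)

# to_series flattens the array; the MultiIndex levels follow the dimensions.
s = unnamed.to_series()
print(list(s.index.names), len(s))

# to_dataframe needs a column name, so an unnamed DataArray is rejected.
try:
    unnamed.to_dataframe()
    rejected = False
except ValueError:
    rejected = True

# Either name the array first or pass name= to to_dataframe.
df = unnamed.rename("air").to_dataframe()
df_alt = unnamed.to_dataframe(name="air")
```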
Since columns in a DataFrame need to share the same index, the variables are broadcast against each other before conversion.
ds.to_dataframe()
| lat | time | lon | air |
|---|---|---|---|
| 75.0 | 2013-01-01 00:00:00 | 200.0 | 241.199997 |
| | | 202.5 | 242.500000 |
| | | 205.0 | 243.500000 |
| | | 207.5 | 244.000000 |
| | | 210.0 | 244.099991 |
| ... | ... | ... | ... |
| 15.0 | 2014-12-31 18:00:00 | 320.0 | 297.389984 |
| | | 322.5 | 297.190002 |
| | | 325.0 | 296.489990 |
| | | 327.5 | 296.190002 |
| | | 330.0 | 295.690002 |

3869000 rows × 1 columns
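With a single data variable the broadcasting is invisible, so here is a sketch with two variables of different dimensionality. The Dataset demo and its elevation variable are invented for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset: "elevation" depends on lat only, "air" on lat and time.
demo = xr.Dataset(
    {
        "elevation": ("lat", [100.0, 200.0]),
        "air": (("lat", "time"), np.arange(6.0).reshape(2, 3)),
    },
    coords={"lat": [75.0, 72.5], "time": [0, 1, 2]},
)

df = demo.to_dataframe()

# Every column shares the same two-level MultiIndex, so each elevation
# value is repeated once per time step.
print(df)
```

This repetition is why converting a large Dataset with many dimensions to a DataFrame can use far more memory than the original.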