https://docs.xarray.dev/en/stable/_static/dataset-diagram-logo.png

Creating Data Structures#

import numpy as np
import pandas as pd
import xarray as xr

xr.set_options(display_expand_data=False)

rng = np.random.default_rng(seed=0)  # we'll use this later

In the last lecture, we looked at the following example Dataset. In most cases Xarray Datasets are created by reading a file. We’ll address this in the next lecture. Here we’ll learn how to create Xarray objects from scratch

ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

DataArray#

The DataArray class is used to attach a name, dimension names, labels, and attributes to an array.

Our goal will be to recreate the ds.air DataArray starting with the underlying numpy data

ds.air
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
array = ds.air.data

We do this using the DataArray constructor.

xr.DataArray(array)
<xarray.DataArray (dim_0: 2920, dim_1: 25, dim_2: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Dimensions without coordinates: dim_0, dim_1, dim_2

This works. Notice that the default dimension names are not so useful: dim_0, dim_1, dim_2

Dimension Names#

We can change this by specifying dimension names in the appropriate order using the dims kwarg

xr.DataArray(array, dims=("time", "lat", "lon"))
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Dimensions without coordinates: time, lat, lon

Much better! But notice we have no entries under “Coordinates”.

Coordinates#

While associating names with dimensions (or axes) of an array is quite useful, attaching coordinate labels to DataArrays makes a lot of analysis quite convenient.

First we’ll simply add values for lon using the coords kwarg. For this datasets, longitudes are regularly spaced at 2.5° intervals between 200°E and 330°E.

coords takes a dictionary that maps the name of a dimension to one of

  • another DataArray object

  • a tuple of the form (dims, data, attrs) where attrs is optional. This is roughly equivalent to creating a new DataArray object with DataArray(dims=dims, data=data, attrs=attrs)

  • a numpy array (or anything that can be coerced to one using numpy.array).

We’ll start with the last one

lon_values = np.arange(200, 331, 2.5)
xr.DataArray(array, dims=("time", "lat", "lon"), coords={"lon": lon_values})
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Dimensions without coordinates: time, lat

Assigning a plain numpy array is equivalent to creating a DataArray with those values and the same dimension name

lon_da = xr.DataArray(lon_values, dims="lon")
da = xr.DataArray(array, dims=("time", "lat", "lon"), coords={"lon": lon_da})
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Dimensions without coordinates: time, lat

We can also assign coordinates after a DataArray has been created.

da.coords["lat"] = np.arange(75, 14.9, -2.5)
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Dimensions without coordinates: time

Attributes#

Arbitrary attributes can be assigned using the .attrs property

da.attrs["attribute"] = "hello"
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
Attributes:
    attribute:  hello

or specified in the constructor

da2 = xr.DataArray(
    array, dims=("time", "lat", "lon"), coords={"lon": lon_da}, attrs={"attribute": "hello"}
)
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Dimensions without coordinates: time, lat
Attributes:
    attribute:  hello

Non-dimension coordinates#

Sometimes we want to attach coordinate variables along an existing dimension. Notice that

  1. itime is not bolded and

  2. has a name “time” that is different from the dimension name “time”

itime is an example of a non-dimension coordinate variable i.e. it is a coordinate variable that does not match a dimension name. Here we demonstrate the “tuple” form of assigninment: (dims, data, attrs)

da.coords["itime"] = ("time", np.arange(2920), {"name": "value"})
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
    itime    (time) int64 0 1 2 3 4 5 6 7 ... 2913 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Attributes:
    attribute:  hello

Exercises#

create a DataArray named “height” from random data rng.random((180, 360)) * 400

  1. with dimensions named “latitude” and “longitude”

xr.DataArray(rng.random((180, 360)) * 400, dims=("latitude", "longitude"), name="height")
<xarray.DataArray 'height' (latitude: 180, longitude: 360)>
254.8 107.9 16.39 6.611 325.3 365.1 ... 77.56 224.5 325.4 130.9 101.7 159.5
Dimensions without coordinates: latitude, longitude
  1. with dimension coordinates:

  • “latitude”: -90 to 89 with step size 1

  • “longitude”: -180 to 179 with step size 1

xr.DataArray(
    rng.random((180, 360)) * 400,
    dims=("latitude", "longitude"),
    coords={"latitude": np.arange(-90, 90, 1), "longitude": np.arange(-180, 180, 1)},
)
<xarray.DataArray (latitude: 180, longitude: 360)>
192.4 101.9 45.38 76.42 56.98 3.814 ... 16.73 47.68 199.2 47.88 303.5 121.3
Coordinates:
  * latitude   (latitude) int64 -90 -89 -88 -87 -86 -85 ... 84 85 86 87 88 89
  * longitude  (longitude) int64 -180 -179 -178 -177 -176 ... 176 177 178 179
  1. with metadata for both data and coordinates:

  • height: “type”: “ellipsoid”

  • latitude: “type”: “geodetic”

  • longitude: “prime_meridian”: “greenwich”

xr.DataArray(
    rng.random((180, 360)) * 400,
    dims=("latitude", "longitude"),
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1), {"type": "geodetic"}),
        "longitude": (
            "longitude",
            np.arange(-180, 180, 1),
            {"prime_meridian": "greenwich"},
        ),
    },
    attrs={"type": "ellipsoid"},
    name="height",
)
<xarray.DataArray 'height' (latitude: 180, longitude: 360)>
386.1 179.0 228.3 220.4 128.0 69.01 ... 11.52 5.998 313.2 272.4 333.2 24.13
Coordinates:
  * latitude   (latitude) int64 -90 -89 -88 -87 -86 -85 ... 84 85 86 87 88 89
  * longitude  (longitude) int64 -180 -179 -178 -177 -176 ... 176 177 178 179
Attributes:
    type:     ellipsoid

Dataset#

Dataset objects collect multiple data variables, each with possibly different dimensions.

The constructor of Dataset takes three parameters:

  • data_vars: dict-like mapping names to values. It has the format described in coordinates except we need to use either DataArray objects or the tuple syntax since we have to provide dimension names

  • coords: same as for DataArray

  • attrs: same as for Dataset

Creating an empty Dataset is easy!

xr.Dataset()
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*

Data Variables#

Let’s create a Dataset with two data variables: da and da2

ds = xr.Dataset({"air": da, "air2": da2})
ds
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    itime    (time) int64 0 1 2 3 4 5 6 7 ... 2913 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air2     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7

You can directly assign a new data variables

ds["air3"] = da
ds
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    itime    (time) int64 0 1 2 3 4 5 6 7 ... 2913 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air2     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air3     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7

Coordinates#

Coordinate variables can be assigned using the coords kwarg to xr.Dataset. Here we use date_range from pandas to create a time vector

xr.Dataset(
    {"air": da, "air2": da2},
    coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6H")},
)
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    itime    (time) int64 0 1 2 3 4 5 6 7 ... 2913 2914 2915 2916 2917 2918 2919
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air2     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7

Again we can assign coordinate variables after a Dataset has been created.

ds
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    itime    (time) int64 0 1 2 3 4 5 6 7 ... 2913 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air2     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air3     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
ds.coords["time"] = pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6H")
ds
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    itime    (time) int64 0 1 2 3 4 5 6 7 ... 2913 2914 2915 2916 2917 2918 2919
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air2     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air3     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7

Attributes#

xr.Dataset(
    {"air": da, "air2": da2},
    coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6H")},
    attrs={"key0": "value0"},
)
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
    itime    (time) int64 0 1 2 3 4 5 6 7 ... 2913 2914 2915 2916 2917 2918 2919
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air2     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
    key0:     value0
ds.attrs["key"] = "value"

Exercises#

  1. create a Dataset with two variables along latitude and longitude: height and gravity_anomaly

height = rng.random((180, 360)) * 400
gravity_anomaly = rng.random((180, 360)) * 400 - 200
xr.Dataset(
    {
        "height": (("latitude", "longitude"), height),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly),
    }
)
<xarray.Dataset>
Dimensions:          (latitude: 180, longitude: 360)
Dimensions without coordinates: latitude, longitude
Data variables:
    height           (latitude, longitude) float64 13.55 87.99 ... 74.78 241.7
    gravity_anomaly  (latitude, longitude) float64 176.5 24.42 ... -39.73 -153.0
  1. add coordinates to latitude and longitude:

  • latitude: from -90 to 90 with step size 1

  • longitude: from -180 to 180 with step size 1

xr.Dataset(
    {
        "height": (("latitude", "longitude"), height),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly),
    },
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1)),
        "longitude": ("longitude", np.arange(-180, 180, 1)),
    },
)
<xarray.Dataset>
Dimensions:          (latitude: 180, longitude: 360)
Coordinates:
  * latitude         (latitude) int64 -90 -89 -88 -87 -86 -85 ... 85 86 87 88 89
  * longitude        (longitude) int64 -180 -179 -178 -177 ... 176 177 178 179
Data variables:
    height           (latitude, longitude) float64 13.55 87.99 ... 74.78 241.7
    gravity_anomaly  (latitude, longitude) float64 176.5 24.42 ... -39.73 -153.0
  1. add metadata to coordinates and variables:

  • latitude: “type”: “geodetic”

  • longitude: “prime_meridian”: “greenwich”

  • height: “ellipsoid”: “wgs84”

  • gravity_anomaly: “ellipsoid”: “grs80”

xr.Dataset(
    {
        "height": (("latitude", "longitude"), height, {"ellipsoid": "wgs84"}),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly, {"ellipsoid": "grs80"}),
    },
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1), {"type": "geodetic"}),
        "longitude": (
            "longitude",
            np.arange(-180, 180, 1),
            {"prime_meridian": "greenwich"},
        ),
    },
)
<xarray.Dataset>
Dimensions:          (latitude: 180, longitude: 360)
Coordinates:
  * latitude         (latitude) int64 -90 -89 -88 -87 -86 -85 ... 85 86 87 88 89
  * longitude        (longitude) int64 -180 -179 -178 -177 ... 176 177 178 179
Data variables:
    height           (latitude, longitude) float64 13.55 87.99 ... 74.78 241.7
    gravity_anomaly  (latitude, longitude) float64 176.5 24.42 ... -39.73 -153.0