Creating Data Structures#

import numpy as np
import pandas as pd
import xarray as xr

xr.set_options(display_expand_data=False)

rng = np.random.default_rng(seed=0)  # we'll use this later

In the last lecture, we looked at the following example Dataset. In most cases Xarray Datasets are created by reading a file. We’ll address this in the next lecture. Here we’ll learn how to create Xarray objects from scratch

ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset> Size: 31MB
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

DataArray#

The DataArray class is used to attach a name, dimension names, labels, and attributes to an array.

Our goal will be to recreate the ds.air DataArray starting with the underlying numpy data

ds.air
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
array = ds.air.data

We do this using the DataArray constructor.

xr.DataArray(array)
<xarray.DataArray (dim_0: 2920, dim_1: 25, dim_2: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Dimensions without coordinates: dim_0, dim_1, dim_2

This works. Notice that the default dimension names are not so useful: dim_0, dim_1, dim_2

Dimension Names#

We can change this by specifying dimension names in the appropriate order using the dims kwarg

xr.DataArray(array, dims=("time", "lat", "lon"))
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Dimensions without coordinates: time, lat, lon

Much better! But notice we have no entries under “Coordinates”.

Coordinates#

While associating names with dimensions (or axes) of an array is quite useful, attaching coordinate labels to DataArrays makes a lot of analysis quite convenient.

First we’ll simply add values for lon using the coords kwarg. For this datasets, longitudes are regularly spaced at 2.5° intervals between 200°E and 330°E.

coords takes a dictionary that maps the name of a dimension to one of

  • another DataArray object

  • a tuple of the form (dims, data, attrs) where attrs is optional. This is roughly equivalent to creating a new DataArray object with DataArray(dims=dims, data=data, attrs=attrs)

  • a numpy array (or anything that can be coerced to one using numpy.array).

We’ll start with the last one

lon_values = np.arange(200, 331, 2.5)
xr.DataArray(array, dims=("time", "lat", "lon"), coords={"lon": lon_values})
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Dimensions without coordinates: time, lat

Assigning a plain numpy array is equivalent to creating a DataArray with those values and the same dimension name

lon_da = xr.DataArray(lon_values, dims="lon")
da = xr.DataArray(array, dims=("time", "lat", "lon"), coords={"lon": lon_da})
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Dimensions without coordinates: time, lat

We can also assign coordinates after a DataArray has been created.

da.coords["lat"] = np.arange(75, 14.9, -2.5)
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * lat      (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
Dimensions without coordinates: time

Attributes#

Arbitrary attributes can be assigned using the .attrs property

da.attrs["attribute"] = "hello"
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * lat      (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
Attributes:
    attribute:  hello

or specified in the constructor

da2 = xr.DataArray(
    array, dims=("time", "lat", "lon"), coords={"lon": lon_da}, attrs={"attribute": "hello"}
)
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Dimensions without coordinates: time, lat
Attributes:
    attribute:  hello

Non-dimension coordinates#

Sometimes we want to attach coordinate variables along an existing dimension. Notice that

  1. itime is not bolded and

  2. has a name “time” that is different from the dimension name “time”

itime is an example of a non-dimension coordinate variable i.e. it is a coordinate variable that does not match a dimension name. Here we demonstrate the “tuple” form of assigninment: (dims, data, attrs)

da.coords["itime"] = ("time", np.arange(2920), {"name": "value"})
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * lat      (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
    itime    (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Attributes:
    attribute:  hello

Exercises#

create a DataArray named “height” from random data rng.random((180, 360)) * 400

  1. with dimensions named “latitude” and “longitude”

Hide code cell source
xr.DataArray(rng.random((180, 360)) * 400, dims=("latitude", "longitude"), name="height")
Hide code cell output
<xarray.DataArray 'height' (latitude: 180, longitude: 360)> Size: 518kB
254.8 107.9 16.39 6.611 325.3 365.1 ... 77.56 224.5 325.4 130.9 101.7 159.5
Dimensions without coordinates: latitude, longitude
  1. with dimension coordinates:

  • “latitude”: -90 to 89 with step size 1

  • “longitude”: -180 to 179 with step size 1

Hide code cell source
xr.DataArray(
    rng.random((180, 360)) * 400,
    dims=("latitude", "longitude"),
    coords={"latitude": np.arange(-90, 90, 1), "longitude": np.arange(-180, 180, 1)},
)
Hide code cell output
<xarray.DataArray (latitude: 180, longitude: 360)> Size: 518kB
192.4 101.9 45.38 76.42 56.98 3.814 ... 16.73 47.68 199.2 47.88 303.5 121.3
Coordinates:
  * latitude   (latitude) int64 1kB -90 -89 -88 -87 -86 -85 ... 85 86 87 88 89
  * longitude  (longitude) int64 3kB -180 -179 -178 -177 ... 176 177 178 179
  1. with metadata for both data and coordinates:

  • height: “type”: “ellipsoid”

  • latitude: “type”: “geodetic”

  • longitude: “prime_meridian”: “greenwich”

xr.DataArray(
    rng.random((180, 360)) * 400,
    dims=("latitude", "longitude"),
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1), {"type": "geodetic"}),
        "longitude": (
            "longitude",
            np.arange(-180, 180, 1),
            {"prime_meridian": "greenwich"},
        ),
    },
    attrs={"type": "ellipsoid"},
    name="height",
)
<xarray.DataArray 'height' (latitude: 180, longitude: 360)> Size: 518kB
386.1 179.0 228.3 220.4 128.0 69.01 ... 11.52 5.998 313.2 272.4 333.2 24.13
Coordinates:
  * latitude   (latitude) int64 1kB -90 -89 -88 -87 -86 -85 ... 85 86 87 88 89
  * longitude  (longitude) int64 3kB -180 -179 -178 -177 ... 176 177 178 179
Attributes:
    type:     ellipsoid

Dataset#

Dataset objects collect multiple data variables, each with possibly different dimensions.

The constructor of Dataset takes three parameters:

  • data_vars: dict-like mapping names to values. Values are either DataArray objects or defined with tuples consisting of of dimension names and arrays.

  • coords: same as for DataArray

  • attrs: same as for Dataset

Creating an empty Dataset is easy!

<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*

Data Variables#

Let’s create a Dataset with two data variables: da and da2

ds = xr.Dataset({"air": da, "air2": da2})
ds
<xarray.Dataset> Size: 62MB
Dimensions:  (lon: 53, lat: 25, time: 2920)
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * lat      (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
    itime    (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Data variables:
    air      (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
    air2     (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7

You can directly assign a new data variables

ds["air3"] = da
ds
<xarray.Dataset> Size: 93MB
Dimensions:  (lon: 53, lat: 25, time: 2920)
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * lat      (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
    itime    (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Data variables:
    air      (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
    air2     (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
    air3     (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7

Coordinates#

Coordinate variables can be assigned using the coords kwarg to xr.Dataset. Here we use date_range from pandas to create a time vector

xr.Dataset(
    {"air": da, "air2": da2},
    coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6H")},
)
/tmp/ipykernel_2353/2620785341.py:3: FutureWarning: 'H' is deprecated and will be removed in a future version, please use 'h' instead.
  coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6H")},
<xarray.Dataset> Size: 62MB
Dimensions:  (lon: 53, lat: 25, time: 2920)
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * lat      (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
    itime    (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
    air2     (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7

Again we can assign coordinate variables after a Dataset has been created.

<xarray.Dataset> Size: 93MB
Dimensions:  (lon: 53, lat: 25, time: 2920)
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * lat      (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
    itime    (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Data variables:
    air      (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
    air2     (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
    air3     (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
ds.coords["time"] = pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6H")
ds
/tmp/ipykernel_2353/664840668.py:1: FutureWarning: 'H' is deprecated and will be removed in a future version, please use 'h' instead.
  ds.coords["time"] = pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6H")
<xarray.Dataset> Size: 93MB
Dimensions:  (lon: 53, lat: 25, time: 2920)
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * lat      (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
    itime    (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
    air2     (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
    air3     (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7

Attributes#

xr.Dataset(
    {"air": da, "air2": da2},
    coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6H")},
    attrs={"key0": "value0"},
)
/tmp/ipykernel_2353/3710195814.py:3: FutureWarning: 'H' is deprecated and will be removed in a future version, please use 'h' instead.
  coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6H")},
<xarray.Dataset> Size: 62MB
Dimensions:  (lon: 53, lat: 25, time: 2920)
Coordinates:
  * lon      (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * lat      (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
    itime    (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
    air2     (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes:
    key0:     value0
ds.attrs["key"] = "value"

Exercises#

  1. create a Dataset with two variables along latitude and longitude: height and gravity_anomaly

height = rng.random((180, 360)) * 400
gravity_anomaly = rng.random((180, 360)) * 400 - 200
Hide code cell source
xr.Dataset(
    {
        "height": (("latitude", "longitude"), height),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly),
    }
)
Hide code cell output
<xarray.Dataset> Size: 1MB
Dimensions:          (latitude: 180, longitude: 360)
Dimensions without coordinates: latitude, longitude
Data variables:
    height           (latitude, longitude) float64 518kB 13.55 87.99 ... 241.7
    gravity_anomaly  (latitude, longitude) float64 518kB 176.5 24.42 ... -153.0
  1. add coordinates to latitude and longitude:

  • latitude: from -90 to 90 with step size 1

  • longitude: from -180 to 180 with step size 1

xr.Dataset(
    {
        "height": (("latitude", "longitude"), height),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly),
    },
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1)),
        "longitude": ("longitude", np.arange(-180, 180, 1)),
    },
)
<xarray.Dataset> Size: 1MB
Dimensions:          (latitude: 180, longitude: 360)
Coordinates:
  * latitude         (latitude) int64 1kB -90 -89 -88 -87 -86 ... 85 86 87 88 89
  * longitude        (longitude) int64 3kB -180 -179 -178 -177 ... 177 178 179
Data variables:
    height           (latitude, longitude) float64 518kB 13.55 87.99 ... 241.7
    gravity_anomaly  (latitude, longitude) float64 518kB 176.5 24.42 ... -153.0
  1. add metadata to coordinates and variables:

  • latitude: “type”: “geodetic”

  • longitude: “prime_meridian”: “greenwich”

  • height: “ellipsoid”: “wgs84”

  • gravity_anomaly: “ellipsoid”: “grs80”

Hide code cell source
xr.Dataset(
    {
        "height": (("latitude", "longitude"), height, {"ellipsoid": "wgs84"}),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly, {"ellipsoid": "grs80"}),
    },
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1), {"type": "geodetic"}),
        "longitude": (
            "longitude",
            np.arange(-180, 180, 1),
            {"prime_meridian": "greenwich"},
        ),
    },
)
Hide code cell output
<xarray.Dataset> Size: 1MB
Dimensions:          (latitude: 180, longitude: 360)
Coordinates:
  * latitude         (latitude) int64 1kB -90 -89 -88 -87 -86 ... 85 86 87 88 89
  * longitude        (longitude) int64 3kB -180 -179 -178 -177 ... 177 178 179
Data variables:
    height           (latitude, longitude) float64 518kB 13.55 87.99 ... 241.7
    gravity_anomaly  (latitude, longitude) float64 518kB 176.5 24.42 ... -153.0