Creating Data Structures#
import numpy as np
import pandas as pd
import xarray as xr
xr.set_options(display_expand_data=False)
rng = np.random.default_rng(seed=0) # we'll use this later
In the last lecture, we looked at the following example Dataset. In most cases, Xarray Datasets are created by reading a file; we’ll address that in the next lecture. Here we’ll learn how to create Xarray objects from scratch.
ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset> Size: 31MB
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
DataArray#
The DataArray class is used to attach a name, dimension names, labels, and
attributes to an array.
Our goal will be to recreate the ds.air DataArray, starting with the underlying numpy data.
ds.air
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
* lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name: 4xDaily Air temperature at sigma level 995
units: degK
precision: 2
GRIB_id: 11
GRIB_name: TMP
var_desc: Air temperature
dataset: NMC Reanalysis
level_desc: Surface
statistic: Individual Obs
parent_stat: Other
actual_range: [185.16 322.1 ]
array = ds.air.data
We do this using the DataArray constructor.
xr.DataArray(array)
<xarray.DataArray (dim_0: 2920, dim_1: 25, dim_2: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Dimensions without coordinates: dim_0, dim_1, dim_2
This works, but notice that the default dimension names are not very useful: dim_0, dim_1, dim_2.
Dimension Names#
We can change this by specifying dimension names in the appropriate order using the dims kwarg.
xr.DataArray(array, dims=("time", "lat", "lon"))
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Dimensions without coordinates: time, lat, lon
Much better! But notice we have no entries under “Coordinates”.
Coordinates#
While associating names with dimensions (or axes) of an array is quite useful, attaching coordinate labels to DataArrays makes a lot of analysis quite convenient.
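As a small illustration of why labels help, the sketch below builds a toy array (its shape and values are made up for this example, not taken from the tutorial dataset) and selects a longitude by value with .sel rather than by integer position with .isel.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: 2 time steps x 3 longitudes
toy = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("time", "lon"),
    coords={"lon": [200.0, 202.5, 205.0]},
)

# Select by coordinate label instead of by integer position
by_label = toy.sel(lon=202.5)
by_position = toy.isel(lon=1)

print(by_label.values)
print(by_position.values)
```

Both selections return the same values; the label-based one keeps working if the array is later subset or reordered along lon.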
First we’ll simply add values for lon using the coords kwarg. For this dataset, longitudes are regularly spaced at 2.5° intervals between 200°E and 330°E.
coords takes a dictionary that maps the name of a dimension to one of:
another DataArray object
a tuple of the form (dims, data, attrs), where attrs is optional. This is roughly equivalent to creating a new DataArray object with DataArray(dims=dims, data=data, attrs=attrs)
a numpy array (or anything that can be coerced to one using numpy.array)
We’ll start with the last one.
lon_values = np.arange(200, 331, 2.5)
xr.DataArray(array, dims=("time", "lat", "lon"), coords={"lon": lon_values})
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Dimensions without coordinates: time, lat
Assigning a plain numpy array is equivalent to creating a DataArray with those values and the same dimension name.
lon_da = xr.DataArray(lon_values, dims="lon")
da = xr.DataArray(array, dims=("time", "lat", "lon"), coords={"lon": lon_da})
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Dimensions without coordinates: time, lat
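To see the three accepted coords value types side by side, here is a minimal sketch on a small stand-in array. The array shape, the variable names, and the "units" attribute are assumptions made for illustration only.

```python
import numpy as np
import xarray as xr

data = np.zeros((2, 3))  # hypothetical stand-in for the air temperature array
lon_values = np.array([200.0, 202.5, 205.0])

# 1. another DataArray
from_dataarray = xr.DataArray(
    data,
    dims=("lat", "lon"),
    coords={"lon": xr.DataArray(lon_values, dims="lon")},
)

# 2. a (dims, data, attrs) tuple; attrs is optional
from_tuple = xr.DataArray(
    data,
    dims=("lat", "lon"),
    coords={"lon": ("lon", lon_values, {"units": "degrees_east"})},
)

# 3. a plain numpy array
from_array = xr.DataArray(data, dims=("lat", "lon"), coords={"lon": lon_values})

# .equals compares values, dimensions and coordinates, but ignores attrs
print(from_dataarray.equals(from_array))
print(from_tuple.equals(from_array))
```

Only the tuple form lets us attach attributes to the coordinate in the same call.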
We can also assign coordinates after a DataArray has been created.
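The cell that performs this assignment is not shown here. A minimal sketch of such an assignment follows, on a small stand-in array and assuming latitudes that run from 75 down to 15 in steps of 2.5, as the lat values elsewhere on this page suggest.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in with the same (time, lat, lon) layout but fewer time steps
small = xr.DataArray(np.zeros((4, 25, 53)), dims=("time", "lat", "lon"))

# Assign a coordinate after construction.
# The stop value 14.9 ensures that 15.0 is included in the range.
small.coords["lat"] = np.arange(75, 14.9, -2.5)

print(small.coords["lat"].values[[0, -1]])
```

Because the new coordinate shares its name with the lat dimension, it becomes a dimension coordinate.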
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
Attributes#
Arbitrary attributes can be assigned using the .attrs property
da.attrs["attribute"] = "hello"
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
Attributes:
attribute: hello
or specified in the constructor
da2 = xr.DataArray(
array, dims=("time", "lat", "lon"), coords={"lon": lon_da}, attrs={"attribute": "hello"}
)
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
Dimensions without coordinates: time, lat
Attributes:
attribute: hello
Non-dimension coordinates#
Sometimes we want to attach coordinate variables along an existing dimension. Notice that
itime is not bolded (in the plain-text output it lacks the leading *), and
it has a name “itime” that is different from the dimension name “time”.
itime is an example of a non-dimension coordinate variable, i.e. it is a coordinate variable that does not match a dimension name. Here we demonstrate the “tuple” form of assignment: (dims, data, attrs)
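The assignment cell itself is missing from this page. A minimal sketch of the tuple form follows, using a small stand-in array; the array shape and the "description" attribute are hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in with 4 time steps
small = xr.DataArray(np.zeros((4, 2, 3)), dims=("time", "lat", "lon"))

# Tuple form: (dims, data, attrs). "itime" lives along the "time" dimension
small.coords["itime"] = ("time", np.arange(4), {"description": "integer time index"})

print("itime" in small.coords)   # True: it is a coordinate ...
print("itime" in small.indexes)  # False: ... but not an index for any dimension
```

Since itime does not share its name with a dimension, it is carried along as a label but is not used for label-based indexing by default.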
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Attributes:
attribute: hello
Exercises#
create a DataArray named “height” from random data rng.random((180, 360)) * 400
with dimensions named “latitude” and “longitude”
xr.DataArray(rng.random((180, 360)) * 400, dims=("latitude", "longitude"), name="height")
<xarray.DataArray 'height' (latitude: 180, longitude: 360)> Size: 518kB
254.8 107.9 16.39 6.611 325.3 365.1 ... 77.56 224.5 325.4 130.9 101.7 159.5
Dimensions without coordinates: latitude, longitude
with dimension coordinates:
“latitude”: -90 to 89 with step size 1
“longitude”: -180 to 179 with step size 1
xr.DataArray(
rng.random((180, 360)) * 400,
dims=("latitude", "longitude"),
coords={"latitude": np.arange(-90, 90, 1), "longitude": np.arange(-180, 180, 1)},
)
<xarray.DataArray (latitude: 180, longitude: 360)> Size: 518kB
192.4 101.9 45.38 76.42 56.98 3.814 ... 16.73 47.68 199.2 47.88 303.5 121.3
Coordinates:
* latitude (latitude) int64 1kB -90 -89 -88 -87 -86 -85 ... 85 86 87 88 89
* longitude (longitude) int64 3kB -180 -179 -178 -177 ... 176 177 178 179
with metadata for both data and coordinates:
height: “type”: “ellipsoid”
latitude: “type”: “geodetic”
longitude: “prime_meridian”: “greenwich”
xr.DataArray(
rng.random((180, 360)) * 400,
dims=("latitude", "longitude"),
coords={
"latitude": ("latitude", np.arange(-90, 90, 1), {"type": "geodetic"}),
"longitude": (
"longitude",
np.arange(-180, 180, 1),
{"prime_meridian": "greenwich"},
),
},
attrs={"type": "ellipsoid"},
name="height",
)
<xarray.DataArray 'height' (latitude: 180, longitude: 360)> Size: 518kB
386.1 179.0 228.3 220.4 128.0 69.01 ... 11.52 5.998 313.2 272.4 333.2 24.13
Coordinates:
* latitude (latitude) int64 1kB -90 -89 -88 -87 -86 -85 ... 85 86 87 88 89
* longitude (longitude) int64 3kB -180 -179 -178 -177 ... 176 177 178 179
Attributes:
type: ellipsoid
Dataset#
Dataset objects collect multiple data variables, each with possibly different
dimensions.
The constructor of Dataset takes three parameters:
data_vars: dict-like mapping names to values. Values are either DataArray objects or defined with tuples consisting of dimension names and arrays.
coords: same as for DataArray
attrs: same as for DataArray
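As a sketch of how the three parameters fit together, the example below builds a small Dataset whose two variables have different dimensions. All names, shapes, and attribute values are invented for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical variables: a 2-D field and a 1-D profile sharing the "lat" dimension
example = xr.Dataset(
    data_vars={
        "field": (("lat", "lon"), np.zeros((2, 3))),
        "profile": ("lat", np.ones(2), {"units": "m"}),
    },
    coords={"lat": [10.0, 20.0]},
    attrs={"title": "toy example"},
)

print(dict(example.sizes))
```

The tuple form for data_vars follows the same (dims, data, attrs) pattern used for coordinates, with attrs optional.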
Creating an empty Dataset is easy!
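The call itself is not shown on this page. Presumably it is the constructor with no arguments, as in this minimal sketch (the name empty is ours):

```python
import xarray as xr

empty = xr.Dataset()  # no data variables, coordinates, or attributes
print(len(empty.data_vars))
```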
<xarray.Dataset> Size: 0B
Dimensions: ()
Data variables:
*empty*
Data Variables#
Let’s create a Dataset with two data variables: da and da2
ds = xr.Dataset({"air": da, "air2": da2})
ds
<xarray.Dataset> Size: 62MB
Dimensions: (lon: 53, lat: 25, time: 2920)
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Data variables:
air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
You can directly assign a new data variable
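The assignment cell is missing here. A minimal sketch of dictionary-style assignment follows, using small hypothetical stand-ins for da and ds; the name air3 is taken from the output that follows.

```python
import numpy as np
import xarray as xr

# Hypothetical small stand-ins for da and the Dataset built above
small = xr.DataArray(np.zeros((4, 2, 3)), dims=("time", "lat", "lon"))
toy_ds = xr.Dataset({"air": small, "air2": small})

# Dictionary-style assignment adds a new data variable in place
toy_ds["air3"] = small

print(list(toy_ds.data_vars))
```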
<xarray.Dataset> Size: 93MB
Dimensions: (lon: 53, lat: 25, time: 2920)
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Data variables:
air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
air3 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Coordinates#
Coordinate variables can be assigned using the coords kwarg to xr.Dataset. Here we use date_range from pandas to create a time vector.
xr.Dataset(
{"air": da, "air2": da2},
coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6h")},
)
<xarray.Dataset> Size: 62MB
Dimensions: (lon: 53, lat: 25, time: 2920)
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Again we can assign coordinate variables after a Dataset has been created.
<xarray.Dataset> Size: 93MB
Dimensions: (lon: 53, lat: 25, time: 2920)
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
Dimensions without coordinates: time
Data variables:
air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
air3 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
ds.coords["time"] = pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6h")
ds
<xarray.Dataset> Size: 93MB
Dimensions: (lon: 53, lat: 25, time: 2920)
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
air3 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes#
xr.Dataset(
{"air": da, "air2": da2},
coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6h")},
attrs={"key0": "value0"},
)
<xarray.Dataset> Size: 62MB
Dimensions: (lon: 53, lat: 25, time: 2920)
Coordinates:
* lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
* lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919
* time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes:
key0: value0
ds.attrs["key"] = "value"
Exercises#
create a Dataset with two variables along latitude and longitude: height and gravity_anomaly
height = rng.random((180, 360)) * 400
gravity_anomaly = rng.random((180, 360)) * 400 - 200
xr.Dataset(
{
"height": (("latitude", "longitude"), height),
"gravity_anomaly": (("latitude", "longitude"), gravity_anomaly),
}
)
<xarray.Dataset> Size: 1MB
Dimensions: (latitude: 180, longitude: 360)
Dimensions without coordinates: latitude, longitude
Data variables:
height (latitude, longitude) float64 518kB 13.55 87.99 ... 241.7
gravity_anomaly (latitude, longitude) float64 518kB 176.5 24.42 ... -153.0
add coordinates to latitude and longitude:
latitude: from -90 to 89 with step size 1
longitude: from -180 to 179 with step size 1
xr.Dataset(
{
"height": (("latitude", "longitude"), height),
"gravity_anomaly": (("latitude", "longitude"), gravity_anomaly),
},
coords={
"latitude": ("latitude", np.arange(-90, 90, 1)),
"longitude": ("longitude", np.arange(-180, 180, 1)),
},
)
<xarray.Dataset> Size: 1MB
Dimensions: (latitude: 180, longitude: 360)
Coordinates:
* latitude (latitude) int64 1kB -90 -89 -88 -87 -86 ... 85 86 87 88 89
* longitude (longitude) int64 3kB -180 -179 -178 -177 ... 177 178 179
Data variables:
height (latitude, longitude) float64 518kB 13.55 87.99 ... 241.7
gravity_anomaly (latitude, longitude) float64 518kB 176.5 24.42 ... -153.0
add metadata to coordinates and variables:
latitude: “type”: “geodetic”
longitude: “prime_meridian”: “greenwich”
height: “ellipsoid”: “wgs84”
gravity_anomaly: “ellipsoid”: “grs80”
xr.Dataset(
{
"height": (("latitude", "longitude"), height, {"ellipsoid": "wgs84"}),
"gravity_anomaly": (("latitude", "longitude"), gravity_anomaly, {"ellipsoid": "grs80"}),
},
coords={
"latitude": ("latitude", np.arange(-90, 90, 1), {"type": "geodetic"}),
"longitude": (
"longitude",
np.arange(-180, 180, 1),
{"prime_meridian": "greenwich"},
),
},
)
<xarray.Dataset> Size: 1MB
Dimensions: (latitude: 180, longitude: 360)
Coordinates:
* latitude (latitude) int64 1kB -90 -89 -88 -87 -86 ... 85 86 87 88 89
* longitude (longitude) int64 3kB -180 -179 -178 -177 ... 177 178 179
Data variables:
height (latitude, longitude) float64 518kB 13.55 87.99 ... 241.7
gravity_anomaly (latitude, longitude) float64 518kB 176.5 24.42 ... -153.0