Creating Data Structures#
import numpy as np
import pandas as pd
import xarray as xr
xr.set_options(display_expand_data=False)
rng = np.random.default_rng(seed=0) # we'll use this later
In the last lecture, we looked at the following example Dataset. In most cases Xarray Datasets are created by reading a file. We’ll address this in the next lecture. Here we’ll learn how to create Xarray objects from scratch.
ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset> Size: 31MB Dimensions: (lat: 25, time: 2920, lon: 53) Coordinates: * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 Attributes: Conventions: COARDS title: 4x daily NMC reanalysis (1948) description: Data is from NMC initialized reanalysis\n(4x/day). These a... platform: Model references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
DataArray#
The DataArray class is used to attach a name, dimension names, labels, and attributes to an array.
Our goal will be to recreate the ds.air DataArray starting with the underlying numpy data.
ds.air
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)> Size: 31MB 241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7 Coordinates: * lat (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 * lon (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00 Attributes: long_name: 4xDaily Air temperature at sigma level 995 units: degK precision: 2 GRIB_id: 11 GRIB_name: TMP var_desc: Air temperature dataset: NMC Reanalysis level_desc: Surface statistic: Individual Obs parent_stat: Other actual_range: [185.16 322.1 ]
array = ds.air.data
We do this using the DataArray constructor.
xr.DataArray(array)
<xarray.DataArray (dim_0: 2920, dim_1: 25, dim_2: 53)> Size: 31MB 241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7 Dimensions without coordinates: dim_0, dim_1, dim_2
This works. Notice that the default dimension names are not so useful: dim_0, dim_1, dim_2.
Dimension Names#
We can change this by specifying dimension names in the appropriate order using the dims kwarg.
xr.DataArray(array, dims=("time", "lat", "lon"))
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB 241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7 Dimensions without coordinates: time, lat, lon
Much better! But notice we have no entries under “Coordinates”.
Coordinates#
While associating names with dimensions (or axes) of an array is quite useful, attaching coordinate labels to DataArrays makes a lot of analysis quite convenient.
First we’ll simply add values for lon using the coords kwarg. For this dataset, longitudes are regularly spaced at 2.5° intervals between 200°E and 330°E.
coords takes a dictionary that maps the name of a dimension to one of:
another DataArray object
a tuple of the form (dims, data, attrs), where attrs is optional. This is roughly equivalent to creating a new DataArray object with DataArray(dims=dims, data=data, attrs=attrs)
a numpy array (or anything that can be coerced to one using numpy.array).
We’ll start with the last one.
lon_values = np.arange(200, 331, 2.5)
xr.DataArray(array, dims=("time", "lat", "lon"), coords={"lon": lon_values})
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB 241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7 Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 Dimensions without coordinates: time, lat
Assigning a plain numpy array is equivalent to creating a DataArray with those values and the same dimension name
lon_da = xr.DataArray(lon_values, dims="lon")
da = xr.DataArray(array, dims=("time", "lat", "lon"), coords={"lon": lon_da})
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB 241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7 Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 Dimensions without coordinates: time, lat
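To make the three accepted forms of a coords value concrete, here is a minimal sketch on a small stand-in array. The array values, the variable names a, b and c, and the "units" attribute are illustrative assumptions, not part of the tutorial data.

```python
import numpy as np
import xarray as xr

# Small stand-in array (hypothetical values, not the tutorial data)
data = np.zeros((2, 3))
lon_values = np.array([200.0, 202.5, 205.0])

# Form 1: a plain numpy array
a = xr.DataArray(data, dims=("lat", "lon"), coords={"lon": lon_values})

# Form 2: a (dims, data, attrs) tuple; attrs is optional
b = xr.DataArray(
    data,
    dims=("lat", "lon"),
    coords={"lon": ("lon", lon_values, {"units": "degrees_east"})},
)

# Form 3: another DataArray
c = xr.DataArray(
    data,
    dims=("lat", "lon"),
    coords={"lon": xr.DataArray(lon_values, dims="lon")},
)

# All three carry the same coordinate values
same_values = bool(
    (a.lon.values == b.lon.values).all() and (a.lon.values == c.lon.values).all()
)
```

Only the tuple form lets you attach attributes to the coordinate in the same step; the other two leave the coordinate's attrs empty unless the DataArray passed in already has some.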
We can also assign coordinates after a DataArray has been created.
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB 241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7 Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 Dimensions without coordinates: time
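The cell that produced the output above is not shown here. As a hedged sketch of the general pattern, coordinates can be added to an existing object either in place through .coords or by creating a new object with assign_coords. The small array and the names small and small2 are assumptions made for illustration.

```python
import numpy as np
import xarray as xr

# Small stand-in array (hypothetical, not the tutorial data)
small = xr.DataArray(np.zeros((2, 3)), dims=("lat", "lon"))

# In-place assignment through the .coords mapping
small.coords["lat"] = np.array([75.0, 72.5])

# assign_coords returns a new object and leaves `small` unchanged
small2 = small.assign_coords(lon=np.array([200.0, 202.5, 205.0]))
```

After this, small has only a lat coordinate, while small2 has both lat and lon.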
Attributes#
Arbitrary attributes can be assigned using the .attrs
property
da.attrs["attribute"] = "hello"
da
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB 241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7 Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 Dimensions without coordinates: time Attributes: attribute: hello
or specified in the constructor
da2 = xr.DataArray(
    array, dims=("time", "lat", "lon"), coords={"lon": lon_da}, attrs={"attribute": "hello"}
)
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB 241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7 Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 Dimensions without coordinates: time, lat Attributes: attribute: hello
Non-dimension coordinates#
Sometimes we want to attach coordinate variables along an existing dimension. Here we demonstrate the “tuple” form of assignment: (dims, data, attrs).
In the output below, notice that itime is not bolded (in the text output it lacks the * marker) and has a name, “itime”, that is different from the dimension name “time”.
itime is an example of a non-dimension coordinate variable, i.e. it is a coordinate variable that does not match a dimension name.
<xarray.DataArray (time: 2920, lat: 25, lon: 53)> Size: 31MB 241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7 Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919 Dimensions without coordinates: time Attributes: attribute: hello
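The cell behind the output above is not shown here. A minimal sketch of the tuple form, assuming a short one-dimensional array and a made-up "long_name" attribute, could look like this:

```python
import numpy as np
import xarray as xr

# Short stand-in series along "time" (hypothetical data)
series = xr.DataArray(np.zeros(4), dims="time")

# Tuple form: (dims, data, attrs). The coordinate name "itime" differs
# from the dimension name "time", so it is a non-dimension coordinate.
series.coords["itime"] = ("time", np.arange(4), {"long_name": "time step index"})

# Non-dimension coordinates do not get an index
has_index = "itime" in series.indexes
```

Because itime has no index, it shows up without the * marker in the repr.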
Exercises#
Create a DataArray named “height” from random data rng.random((180, 360)) * 400, with dimensions named “latitude” and “longitude”.
xr.DataArray(rng.random((180, 360)) * 400, dims=("latitude", "longitude"), name="height")
<xarray.DataArray 'height' (latitude: 180, longitude: 360)> Size: 518kB 254.8 107.9 16.39 6.611 325.3 365.1 ... 77.56 224.5 325.4 130.9 101.7 159.5 Dimensions without coordinates: latitude, longitude
with dimension coordinates:
“latitude”: -90 to 89 with step size 1
“longitude”: -180 to 179 with step size 1
xr.DataArray(
    rng.random((180, 360)) * 400,
    dims=("latitude", "longitude"),
    coords={"latitude": np.arange(-90, 90, 1), "longitude": np.arange(-180, 180, 1)},
)
<xarray.DataArray (latitude: 180, longitude: 360)> Size: 518kB 192.4 101.9 45.38 76.42 56.98 3.814 ... 16.73 47.68 199.2 47.88 303.5 121.3 Coordinates: * latitude (latitude) int64 1kB -90 -89 -88 -87 -86 -85 ... 85 86 87 88 89 * longitude (longitude) int64 3kB -180 -179 -178 -177 ... 176 177 178 179
with metadata for both data and coordinates:
height: “type”: “ellipsoid”
latitude: “type”: “geodetic”
longitude: “prime_meridian”: “greenwich”
xr.DataArray(
    rng.random((180, 360)) * 400,
    dims=("latitude", "longitude"),
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1), {"type": "geodetic"}),
        "longitude": (
            "longitude",
            np.arange(-180, 180, 1),
            {"prime_meridian": "greenwich"},
        ),
    },
    attrs={"type": "ellipsoid"},
    name="height",
)
<xarray.DataArray 'height' (latitude: 180, longitude: 360)> Size: 518kB 386.1 179.0 228.3 220.4 128.0 69.01 ... 11.52 5.998 313.2 272.4 333.2 24.13 Coordinates: * latitude (latitude) int64 1kB -90 -89 -88 -87 -86 -85 ... 85 86 87 88 89 * longitude (longitude) int64 3kB -180 -179 -178 -177 ... 176 177 178 179 Attributes: type: ellipsoid
Dataset#
Dataset objects collect multiple data variables, each with possibly different dimensions.
The constructor of Dataset takes three parameters:
data_vars: dict-like mapping names to values. Values are either DataArray objects or defined with tuples consisting of dimension names and arrays.
coords: same as for DataArray
attrs: same as for DataArray
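As an illustration of the three parameters together, the following sketch builds a tiny Dataset. The variable names, shapes, and attribute values are hypothetical and chosen only to show both accepted kinds of data_vars values.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data, not the tutorial dataset
temperature = np.zeros((2, 3))

toy = xr.Dataset(
    data_vars={
        # tuple form: (dimension names, array)
        "temperature": (("y", "x"), temperature),
        # DataArray form; this variable only spans the "x" dimension
        "pressure": xr.DataArray(np.ones(3), dims="x"),
    },
    coords={"x": [10, 20, 30]},
    attrs={"title": "toy example"},
)
```

Note that the two variables have different dimensions but share the x coordinate.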
Creating an empty Dataset is easy!
<xarray.Dataset> Size: 0B Dimensions: () Data variables: *empty*
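The cell that produced this output is not shown here; presumably it calls the constructor with no arguments, as in this minimal sketch (the names empty, n_vars and n_dims are illustrative):

```python
import xarray as xr

# Calling the constructor with no arguments gives an empty Dataset
empty = xr.Dataset()

n_vars = len(empty.data_vars)  # number of data variables
n_dims = len(empty.sizes)      # number of dimensions
```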
Data Variables#
Let’s create a Dataset
with two data variables: da
and da2
ds = xr.Dataset({"air": da, "air2": da2})
ds
<xarray.Dataset> Size: 62MB Dimensions: (lon: 53, lat: 25, time: 2920) Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919 Dimensions without coordinates: time Data variables: air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
You can directly assign a new data variable.
<xarray.Dataset> Size: 93MB Dimensions: (lon: 53, lat: 25, time: 2920) Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919 Dimensions without coordinates: time Data variables: air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 air3 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
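The assignment cell for the output above is not shown here. As a hedged sketch on a toy Dataset (all names below are hypothetical), new data variables can be added with dictionary-style assignment, using either a DataArray or a (dims, data) tuple:

```python
import numpy as np
import xarray as xr

# Toy Dataset with a single variable (hypothetical data)
toy = xr.Dataset({"a": (("x",), np.arange(3.0))})

# Assign a DataArray derived from an existing variable
toy["b"] = toy["a"] * 2

# Assign using the (dims, data) tuple form
toy["c"] = (("x",), np.ones(3))
```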
Coordinates#
Coordinate variables can be assigned using the coords
kwarg to xr.Dataset
. Here we use date_range
from pandas to create a time vector
xr.Dataset(
    {"air": da, "air2": da2},
    coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6h")},
)
<xarray.Dataset> Size: 62MB Dimensions: (lon: 53, lat: 25, time: 2920) Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919 * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Again we can assign coordinate variables after a Dataset has been created.
<xarray.Dataset> Size: 93MB Dimensions: (lon: 53, lat: 25, time: 2920) Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919 Dimensions without coordinates: time Data variables: air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 air3 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
ds.coords["time"] = pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6h")
ds
<xarray.Dataset> Size: 93MB Dimensions: (lon: 53, lat: 25, time: 2920) Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919 * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 air3 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7
Attributes#
xr.Dataset(
    {"air": da, "air2": da2},
    coords={"time": pd.date_range("2013-01-01", "2014-12-31 18:00", freq="6h")},
    attrs={"key0": "value0"},
)
<xarray.Dataset> Size: 62MB Dimensions: (lon: 53, lat: 25, time: 2920) Coordinates: * lon (lon) float64 424B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0 * lat (lat) float64 200B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0 itime (time) int64 23kB 0 1 2 3 4 5 6 ... 2914 2915 2916 2917 2918 2919 * time (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00 Data variables: air (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 air2 (time, lat, lon) float64 31MB 241.2 242.5 243.5 ... 296.2 295.7 Attributes: key0: value0
ds.attrs["key"] = "value"
Exercises#
Create a Dataset with two variables along latitude and longitude: height and gravity_anomaly.
height = rng.random((180, 360)) * 400
gravity_anomaly = rng.random((180, 360)) * 400 - 200
xr.Dataset(
    {
        "height": (("latitude", "longitude"), height),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly),
    }
)
<xarray.Dataset> Size: 1MB Dimensions: (latitude: 180, longitude: 360) Dimensions without coordinates: latitude, longitude Data variables: height (latitude, longitude) float64 518kB 13.55 87.99 ... 241.7 gravity_anomaly (latitude, longitude) float64 518kB 176.5 24.42 ... -153.0
Add coordinates to latitude and longitude:
latitude: from -90 to 89 with step size 1
longitude: from -180 to 179 with step size 1
xr.Dataset(
    {
        "height": (("latitude", "longitude"), height),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly),
    },
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1)),
        "longitude": ("longitude", np.arange(-180, 180, 1)),
    },
)
<xarray.Dataset> Size: 1MB Dimensions: (latitude: 180, longitude: 360) Coordinates: * latitude (latitude) int64 1kB -90 -89 -88 -87 -86 ... 85 86 87 88 89 * longitude (longitude) int64 3kB -180 -179 -178 -177 ... 177 178 179 Data variables: height (latitude, longitude) float64 518kB 13.55 87.99 ... 241.7 gravity_anomaly (latitude, longitude) float64 518kB 176.5 24.42 ... -153.0
Add metadata to coordinates and variables:
latitude: “type”: “geodetic”
longitude: “prime_meridian”: “greenwich”
height: “ellipsoid”: “wgs84”
gravity_anomaly: “ellipsoid”: “grs80”
xr.Dataset(
    {
        "height": (("latitude", "longitude"), height, {"ellipsoid": "wgs84"}),
        "gravity_anomaly": (("latitude", "longitude"), gravity_anomaly, {"ellipsoid": "grs80"}),
    },
    coords={
        "latitude": ("latitude", np.arange(-90, 90, 1), {"type": "geodetic"}),
        "longitude": (
            "longitude",
            np.arange(-180, 180, 1),
            {"prime_meridian": "greenwich"},
        ),
    },
)
<xarray.Dataset> Size: 1MB Dimensions: (latitude: 180, longitude: 360) Coordinates: * latitude (latitude) int64 1kB -90 -89 -88 -87 -86 ... 85 86 87 88 89 * longitude (longitude) int64 3kB -180 -179 -178 -177 ... 177 178 179 Data variables: height (latitude, longitude) float64 518kB 13.55 87.99 ... 241.7 gravity_anomaly (latitude, longitude) float64 518kB 176.5 24.42 ... -153.0