This notebook demonstrates one of Xarray’s most powerful features: the ability to wrap dask arrays and allow users to seamlessly execute analysis code in parallel.

By the end of this notebook, you will:

1. Learn that Xarray DataArrays and Datasets are “dask collections”, i.e. you can execute top-level dask functions such as dask.visualize(xarray_object)

2. Learn that all xarray built-in operations can transparently use dask

Performance will depend on the computational infrastructure you’re using (for example, how many CPU cores), how the data you’re working with is structured and stored, and the algorithms and code you’re running. Be sure to review the Dask best-practices if you’re new to Dask!

Dask has two main pieces:

1. dask.array as a drop-in replacement for numpy arrays

2. A “scheduler” that actually runs computations on dask arrays (commonly distributed)
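The two pieces can be seen in a minimal sketch using only dask's built-in local schedulers (no cluster required; the array shape here is arbitrary):

```python
import dask.array as da

# the "collection" piece: a lazy array, nothing computed yet
x = da.ones((1000,), chunks=100)

# the "scheduler" piece actually runs the graph; these local
# schedulers ship with dask itself
r1 = x.sum().compute(scheduler="synchronous")  # single-threaded, handy for debugging
r2 = x.sum().compute(scheduler="threads")      # the default for dask.array
```

Both schedulers execute the same task graph, so they produce the same result; only *how* the tasks run differs.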

Dask Array implements a subset of the NumPy ndarray interface using blocked algorithms, cutting up the large array into many small arrays (blocks or chunks). This lets us compute on arrays larger than memory using all of our cores. We coordinate these blocked algorithms using Dask graphs.
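The blocked-algorithm idea can be sketched with plain NumPy (a toy illustration of the concept, not dask's actual implementation): reduce each block separately, then combine the partial results.

```python
import numpy as np

data = np.arange(12, dtype="float64")
blocks = np.split(data, 3)                     # three "chunks" of size 4

# per-block partial results: (sum, count) for each chunk
partial = [(b.sum(), b.size) for b in blocks]

# combine the partials into the final mean
total = sum(s for s, _ in partial) / sum(n for _, n in partial)
```

Because only one block needs to be in memory at a time, the same pattern scales to arrays that do not fit in RAM.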

import dask.array

dasky = dask.array.ones((10, 5), chunks=(2, 2))
dasky


Array: shape (10, 5), 400 B, float64, numpy.ndarray
Chunk: shape (2, 2), 32 B — 15 chunks in 1 graph layer

Dask arrays let you:

1. Use parallel resources to speed up computation

2. Work with datasets bigger than RAM (“out-of-core”)

“dask lets you scale from memory-sized datasets to disk-sized datasets”

Operations are not computed until you explicitly request them.

dasky.mean(axis=-1)

Array: shape (10,), 80 B, float64, numpy.ndarray
Chunk: shape (2,), 16 B — 5 chunks in 3 graph layers


So what did dask do when you called .mean? It added that operation to its “graph”, a blueprint of operations to execute later.
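You can watch the graph grow as you add operations. A small sketch using dask's collections interface (`__dask_graph__` is part of dask's public collection protocol and returns a mapping of task keys to computations):

```python
import dask.array as da

x = da.ones((10, 5), chunks=(2, 2))
y = x.mean(axis=-1)  # no computation happens here

# building y only recorded more tasks in the graph
n_before = len(dict(x.__dask_graph__()))
n_after = len(dict(y.__dask_graph__()))
```

`n_after` is larger than `n_before`: the mean added per-chunk reduction and combine tasks on top of the tasks that create the chunks.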

dask.visualize(dasky.mean(axis=-1))

dasky.mean(axis=-1).compute()

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])


More#

Remember that Xarray can wrap many different array types. So Xarray can wrap dask arrays too.

We use Xarray so that we can express our analysis in terms of our metadata.

The chunks argument to both open_dataset and open_mfdataset allows you to read datasets as dask arrays.

import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")
ds.air

<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
[3869000 values with dtype=float32]
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
ds = xr.tutorial.open_dataset(
"air_temperature",
chunks={  # this tells xarray to open the dataset as a dask array
"lat": "auto",
"lon": 25,
"time": -1,
},
)
ds

<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air      (time, lat, lon) float32 dask.array<chunksize=(2920, 25, 25), meta=np.ndarray>
Attributes:
Conventions:  COARDS
title:        4x daily NMC reanalysis (1948)
description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
platform:     Model
references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

The representation (“repr” in Python parlance) for the air DataArray shows the very nice HTML dask array repr. You can access the underlying chunk sizes using .chunks:

ds.air.chunks

((2920,), (25,), (25, 25, 3))

ds

<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air      (time, lat, lon) float32 dask.array<chunksize=(2920, 25, 25), meta=np.ndarray>
Attributes:
Conventions:  COARDS
title:        4x daily NMC reanalysis (1948)
description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
platform:     Model
references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Tip: Variables in a Dataset do not all need to have the same chunk size along common dimensions.
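For example, in a small hand-constructed Dataset (illustrative names, not the tutorial data), two variables sharing a dimension can be chunked differently:

```python
import dask.array as da
import xarray as xr

ds2 = xr.Dataset(
    {
        "a": ("x", da.zeros(8, chunks=4)),  # two chunks of 4 along x
        "b": ("x", da.zeros(8, chunks=2)),  # four chunks of 2 along the same x
    }
)
```

Each variable keeps its own chunking; you can inspect them independently via `ds2["a"].chunks` and `ds2["b"].chunks`.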

Extracting underlying data#

There are two ways to pull out the underlying array object in an xarray object.

1. .to_numpy or .values will always return a NumPy array. For dask-backed xarray objects, this means that .compute() is called automatically

2. .data will return a Dask array

Tip: Use .to_numpy or .as_numpy instead of .values so that your code generalizes to other array types (like CuPy arrays or sparse arrays).
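A quick sketch of the difference, using a small hand-made dask-backed DataArray rather than the tutorial dataset:

```python
import dask.array as da
import numpy as np
import xarray as xr

arr = xr.DataArray(da.ones((4, 3), chunks=(2, 3)), dims=("x", "y"))

lazy = arr.data            # the underlying dask array; nothing is computed
concrete = arr.to_numpy()  # triggers compute and returns a numpy array
```

`lazy` stays a dask array you can keep building on, while `concrete` is plain numpy.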

ds.air.data  # dask array, not numpy

Array: shape (2920, 25, 53), 14.76 MiB, float32, numpy.ndarray
Chunk: shape (2920, 25, 25), 6.96 MiB — 3 chunks in 2 graph layers
ds.air.as_numpy().data  # numpy array

array([[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
238.59999],
[243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
239.29999],
[250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   ,
241.7    ],
...,
[296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
294.69998],
[295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
295.19998],
[296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
296.6    ]],

[[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999,
235.79999],
[243.59999, 244.09999, 244.2    , ..., 231.     , 232.5    ,
235.7    ],
[253.2    , 252.89   , 252.09999, ..., 230.79999, 233.39   ,
238.5    ],
...,
[296.4    , 295.9    , 296.19998, ..., 295.4    , 295.1    ,
294.79   ],
[296.19998, 296.69998, 296.79   , ..., 295.6    , 295.5    ,
295.1    ],
[296.29   , 297.19998, 297.4    , ..., 296.4    , 296.4    ,
296.6    ]],

[[242.29999, 242.2    , 242.29999, ..., 234.29999, 236.09999,
238.7    ],
[244.59999, 244.39   , 244.     , ..., 230.29999, 232.     ,
235.7    ],
[256.19998, 255.5    , 254.2    , ..., 231.2    , 233.2    ,
238.2    ],
...,
[295.6    , 295.4    , 295.4    , ..., 296.29   , 295.29   ,
295.     ],
[296.19998, 296.5    , 296.29   , ..., 296.4    , 296.     ,
295.6    ],
[296.4    , 296.29   , 296.4    , ..., 297.     , 297.     ,
296.79   ]],

...,

[[243.48999, 242.98999, 242.09   , ..., 244.18999, 244.48999,
244.89   ],
[249.09   , 248.98999, 248.59   , ..., 240.59   , 241.29   ,
242.68999],
[262.69   , 262.19   , 261.69   , ..., 239.39   , 241.68999,
245.18999],
...,
[294.79   , 295.29   , 297.49   , ..., 295.49   , 295.38998,
294.69   ],
[296.79   , 297.88998, 298.29   , ..., 295.49   , 295.49   ,
294.79   ],
[298.19   , 299.19   , 298.79   , ..., 296.09   , 295.79   ,
295.79   ]],

[[245.79   , 244.79   , 243.48999, ..., 243.29   , 243.98999,
244.79   ],
[249.89   , 249.29   , 248.48999, ..., 241.29   , 242.48999,
244.29   ],
[262.38998, 261.79   , 261.29   , ..., 240.48999, 243.09   ,
246.89   ],
...,
[293.69   , 293.88998, 295.38998, ..., 295.09   , 294.69   ,
294.29   ],
[296.29   , 297.19   , 297.59   , ..., 295.29   , 295.09   ,
294.38998],
[297.79   , 298.38998, 298.49   , ..., 295.69   , 295.49   ,
295.19   ]],

[[245.09   , 244.29   , 243.29   , ..., 241.68999, 241.48999,
241.79   ],
[249.89   , 249.29   , 248.39   , ..., 239.59   , 240.29   ,
241.68999],
[262.99   , 262.19   , 261.38998, ..., 239.89   , 242.59   ,
246.29   ],
...,
[293.79   , 293.69   , 295.09   , ..., 295.29   , 295.09   ,
294.69   ],
[296.09   , 296.88998, 297.19   , ..., 295.69   , 295.69   ,
295.19   ],
[297.69   , 298.09   , 298.09   , ..., 296.49   , 296.19   ,
295.69   ]]], dtype=float32)


Exercise#

Try calling ds.air.values and ds.air.data. Do you understand the difference?

ds.air.to_numpy()

array([[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
238.59999],
[243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
239.29999],
[250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   ,
241.7    ],
...,
[296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
294.69998],
[295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
295.19998],
[296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
296.6    ]],

[[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999,
235.79999],
[243.59999, 244.09999, 244.2    , ..., 231.     , 232.5    ,
235.7    ],
[253.2    , 252.89   , 252.09999, ..., 230.79999, 233.39   ,
238.5    ],
...,
[296.4    , 295.9    , 296.19998, ..., 295.4    , 295.1    ,
294.79   ],
[296.19998, 296.69998, 296.79   , ..., 295.6    , 295.5    ,
295.1    ],
[296.29   , 297.19998, 297.4    , ..., 296.4    , 296.4    ,
296.6    ]],

[[242.29999, 242.2    , 242.29999, ..., 234.29999, 236.09999,
238.7    ],
[244.59999, 244.39   , 244.     , ..., 230.29999, 232.     ,
235.7    ],
[256.19998, 255.5    , 254.2    , ..., 231.2    , 233.2    ,
238.2    ],
...,
[295.6    , 295.4    , 295.4    , ..., 296.29   , 295.29   ,
295.     ],
[296.19998, 296.5    , 296.29   , ..., 296.4    , 296.     ,
295.6    ],
[296.4    , 296.29   , 296.4    , ..., 297.     , 297.     ,
296.79   ]],

...,

[[243.48999, 242.98999, 242.09   , ..., 244.18999, 244.48999,
244.89   ],
[249.09   , 248.98999, 248.59   , ..., 240.59   , 241.29   ,
242.68999],
[262.69   , 262.19   , 261.69   , ..., 239.39   , 241.68999,
245.18999],
...,
[294.79   , 295.29   , 297.49   , ..., 295.49   , 295.38998,
294.69   ],
[296.79   , 297.88998, 298.29   , ..., 295.49   , 295.49   ,
294.79   ],
[298.19   , 299.19   , 298.79   , ..., 296.09   , 295.79   ,
295.79   ]],

[[245.79   , 244.79   , 243.48999, ..., 243.29   , 243.98999,
244.79   ],
[249.89   , 249.29   , 248.48999, ..., 241.29   , 242.48999,
244.29   ],
[262.38998, 261.79   , 261.29   , ..., 240.48999, 243.09   ,
246.89   ],
...,
[293.69   , 293.88998, 295.38998, ..., 295.09   , 294.69   ,
294.29   ],
[296.29   , 297.19   , 297.59   , ..., 295.29   , 295.09   ,
294.38998],
[297.79   , 298.38998, 298.49   , ..., 295.69   , 295.49   ,
295.19   ]],

[[245.09   , 244.29   , 243.29   , ..., 241.68999, 241.48999,
241.79   ],
[249.89   , 249.29   , 248.39   , ..., 239.59   , 240.29   ,
241.68999],
[262.99   , 262.19   , 261.38998, ..., 239.89   , 242.59   ,
246.29   ],
...,
[293.79   , 293.69   , 295.09   , ..., 295.29   , 295.09   ,
294.69   ],
[296.09   , 296.88998, 297.19   , ..., 295.69   , 295.69   ,
295.19   ],
[297.69   , 298.09   , 298.09   , ..., 296.49   , 296.19   ,
295.69   ]]], dtype=float32)


Lazy computation#

Xarray seamlessly wraps dask so all computation is deferred until explicitly requested.

mean = ds.air.mean("time")
mean

<xarray.DataArray 'air' (lat: 25, lon: 53)>
dask.array<mean_agg-aggregate, shape=(25, 53), dtype=float32, chunksize=(25, 25), chunktype=numpy.ndarray>
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0

Dask actually constructs a graph of the required computation. Here it’s pretty simple: the full array is subdivided into 3 chunks. Dask will load each of these subarrays in a separate thread using the default single-machine scheduler. You can visualize dask “task graphs”, which represent the requested computation:

mean.data  # dask array

Array: shape (25, 53), 5.18 kiB, float32, numpy.ndarray
Chunk: shape (25, 25), 2.44 kiB — 3 chunks in 4 graph layers
# visualize the graph for the underlying dask array
# we ask it to visualize the graph from left to right because it looks nicer
dask.visualize(mean.data, rankdir="LR")


Getting concrete values#

At some point, you will want to actually get concrete values (usually a numpy array) from dask.

There are two ways to compute values on dask arrays.

1. .compute() returns a new xarray object backed by concrete (numpy) values, leaving the original object lazy — just like .compute() on a dask array

2. .load() replaces the dask array in the xarray object with a numpy array. This is equivalent to ds = ds.compute()

Tip: There is a third option: “persisting”. .persist() loads the values into distributed RAM. The values are computed but remain distributed across workers, so ds.air.persist() still returns a dask array. This is useful if you will be repeatedly computing on a dataset that is too large to load into local memory but fits in cluster memory. You will see persist tasks on the dashboard. See the dask user guide for more on persisting.
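The compute/load distinction can be sketched with a tiny dask-backed DataArray (illustrative, not the tutorial data):

```python
import dask.array as da
import numpy as np
import xarray as xr

arr = xr.DataArray(da.zeros(6, chunks=3), dims="x")

computed = arr.compute()                     # new numpy-backed object
still_lazy = isinstance(arr.data, da.Array)  # True: `arr` itself is untouched

arr.load()                                   # computes *in place*
now_numpy = isinstance(arr.data, np.ndarray) # True: `arr` is now numpy-backed
```

In short: .compute() gives you a computed copy, .load() mutates the object you call it on.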

Exercise#

Try running mean.compute and then examine mean after that. Is it still a dask array?

mean

<xarray.DataArray 'air' (lat: 25, lon: 53)>
dask.array<mean_agg-aggregate, shape=(25, 53), dtype=float32, chunksize=(25, 25), chunktype=numpy.ndarray>
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
mean.compute()

<xarray.DataArray 'air' (lat: 25, lon: 53)>
array([[260.37564, 260.1826 , 259.88593, ..., 250.81511, 251.93733,
253.43741],
[262.7337 , 262.7936 , 262.7489 , ..., 249.75496, 251.5852 ,
254.35849],
[264.7681 , 264.3271 , 264.0614 , ..., 250.60707, 253.58247,
257.71475],
...,
[297.64932, 296.95294, 296.62912, ..., 296.81033, 296.28793,
295.81622],
[298.1287 , 297.93646, 297.47006, ..., 296.8591 , 296.77686,
296.44348],
[298.36594, 298.38593, 298.11386, ..., 297.33777, 297.28104,
297.30502]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
mean

<xarray.DataArray 'air' (lat: 25, lon: 53)>
dask.array<mean_agg-aggregate, shape=(25, 53), dtype=float32, chunksize=(25, 25), chunktype=numpy.ndarray>
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0

Exercise#

Now repeat that exercise with mean.load.

mean.load()

<xarray.DataArray 'air' (lat: 25, lon: 53)>
array([[260.37564, 260.1826 , 259.88593, ..., 250.81511, 251.93733,
253.43741],
[262.7337 , 262.7936 , 262.7489 , ..., 249.75496, 251.5852 ,
254.35849],
[264.7681 , 264.3271 , 264.0614 , ..., 250.60707, 253.58247,
257.71475],
...,
[297.64932, 296.95294, 296.62912, ..., 296.81033, 296.28793,
295.81622],
[298.1287 , 297.93646, 297.47006, ..., 296.8591 , 296.77686,
296.44348],
[298.36594, 298.38593, 298.11386, ..., 297.33777, 297.28104,
297.30502]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
mean

<xarray.DataArray 'air' (lat: 25, lon: 53)>
array([[260.37564, 260.1826 , 259.88593, ..., 250.81511, 251.93733,
253.43741],
[262.7337 , 262.7936 , 262.7489 , ..., 249.75496, 251.5852 ,
254.35849],
[264.7681 , 264.3271 , 264.0614 , ..., 250.60707, 253.58247,
257.71475],
...,
[297.64932, 296.95294, 296.62912, ..., 296.81033, 296.28793,
295.81622],
[298.1287 , 297.93646, 297.47006, ..., 296.8591 , 296.77686,
296.44348],
[298.36594, 298.38593, 298.11386, ..., 297.33777, 297.28104,
297.30502]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0

Distributed Clusters#

As your data volumes grow and your algorithms get more complex, it can be hard to print out task graph representations and understand what Dask is doing behind the scenes. Luckily, you can use Dask’s “distributed” scheduler to get very useful diagnostic information.

First let’s set up a LocalCluster using dask.distributed.

You can use any kind of Dask cluster. This step is completely independent of xarray. While not strictly necessary, the dashboard provides a nice learning tool.

from dask.distributed import Client

# This configuration is only needed to get a working dashboard link on
# mybinder.org or other JupyterHub deployments
import os
import dask

if os.environ.get('JUPYTERHUB_USER'):
    dask.config.set(
        **{"distributed.dashboard.link": "/user/{JUPYTERHUB_USER}/proxy/{port}/status"}
    )

client = Client(local_directory='/tmp')
client


Client

Client-ccc47260-e890-11ed-8d3d-6045bdb045d7

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster Info

☝️ Click the Dashboard link above.

👈 Or click the “Search” 🔍 button in the dask-labextension dashboard.

NOTE: if you are using the dask-labextension, you should disable the ‘Simple’ JupyterLab interface (View -> Simple Interface) so that you can drag and rearrange whichever dashboards you want. The Workers and Task Stream dashboards are good ones to check that everything is working!

import dask.array

dask.array.ones((1000, 4), chunks=(2, 1)).compute()  # should see activity in dashboard

array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
...,
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])


Computation#

Let’s go back to our Xarray Dataset. In addition to computing the mean, other operations such as indexing will automatically use whichever Dask cluster we are connected to!

ds.air.isel(lon=1, lat=20)

<xarray.DataArray 'air' (time: 2920)>
Coordinates:
lat      float32 25.0
lon      float32 202.5
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]

and more complicated operations…

rolling_mean = ds.air.rolling(time=5).mean()  # no activity on dashboard
rolling_mean  # contains a dask array

<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
dask.array<truediv, shape=(2920, 25, 53), dtype=float32, chunksize=(2920, 25, 25), chunktype=numpy.ndarray>
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
timeseries = rolling_mean.isel(lon=1, lat=20)  # no activity on dashboard
timeseries  # contains a dask array

<xarray.DataArray 'air' (time: 2920)>
Coordinates:
lat      float32 25.0
lon      float32 202.5
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]
computed = rolling_mean.compute()  # activity on dashboard
computed  # has real numpy values

<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
...,
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan]],

[[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
[      nan,       nan,       nan, ...,       nan,       nan,
nan],
...
[295.69   , 295.53   , 296.23   , ..., 296.00998, 295.59198,
294.83197],
[296.872  , 297.67   , 297.812  , ..., 295.93   , 295.71198,
294.952  ],
[298.152  , 298.81198, 298.63202, ..., 296.472  , 296.09   ,
295.612  ]],

[[243.98999, 243.56999, 242.95   , ..., 244.68999, 245.12997,
245.50998],
[248.98999, 248.90999, 248.56999, ..., 240.70999, 241.68999,
243.32999],
[262.47   , 261.83   , 261.11002, ..., 239.38998, 242.20999,
246.35   ],
...,
[295.19   , 295.28998, 296.37   , ..., 295.71002, 295.39   ,
294.72998],
[296.57   , 297.57   , 297.93   , ..., 295.78998, 295.62997,
294.95   ],
[297.95   , 298.69   , 298.62997, ..., 296.45   , 296.07   ,
295.65   ]]], dtype=float32)
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]

Note that rolling_mean still contains a dask array

rolling_mean

<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
dask.array<truediv, shape=(2920, 25, 53), dtype=float32, chunksize=(2920, 25, 25), chunktype=numpy.ndarray>
Coordinates:
* lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name:     4xDaily Air temperature at sigma level 995
units:         degK
precision:     2
GRIB_id:       11
GRIB_name:     TMP
var_desc:      Air temperature
dataset:       NMC Reanalysis
level_desc:    Surface
statistic:     Individual Obs
parent_stat:   Other
actual_range:  [185.16 322.1 ]

Tip: While these operations all work, not all of them are necessarily the optimal implementation for parallelism. Analysis pipelines usually need some tinkering and tweaking to work well. In particular, read the user guide recommendations for chunking and performance.

Xarray data structures are first-class dask collections.#

This means you can do things like dask.compute(xarray_object), dask.visualize(xarray_object), or dask.persist(xarray_object). This works for both DataArrays and Datasets.
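For example, with a minimal constructed Dataset (not the tutorial data), a top-level dask function accepts the xarray object directly:

```python
import dask
import dask.array as da
import xarray as xr

ds3 = xr.Dataset({"a": ("x", da.arange(4, chunks=2))})

# dask.compute works on xarray objects because they are dask collections;
# it returns a tuple of computed results
(result,) = dask.compute(ds3)
```

`result` is the same Dataset, now backed by numpy arrays instead of dask arrays.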

Exercise#

Visualize the task graph for a few different computations on ds.air!

Finish up#

Gracefully shut down our connection to the Dask cluster. This becomes more important when you are running on large HPC or Cloud servers rather than a laptop!

client.close()


Next#

See the Xarray user guide on dask.