Dask¶

level6

It’s parallel processing time: Level 6 - spyndex + dask!

Remember to install spyndex!

[ ]:

!pip install -U spyndex

Now, let’s start!

First, import spyndex and dask:

[1]:

import spyndex
import dask

`dask.Array`¶

In Level 4 we worked with xarray and DataArrays. Well, let’s do the same tutorial but now working in parallel!

Let’s use the xarray.DataArray that is stored in the spyndex datasets: sentinel:

[2]:

da = spyndex.datasets.open("sentinel")

As you already know, this data array is very simple. We have 3 dimensions: band, x and y. Each band is one of the 10 m spectral bands of a Sentinel-2 image.

[3]:

da

[3]:

<xarray.DataArray (band: 4, x: 300, y: 300)>
array([[[ 299,  276,  280, ...,  510,  516,  521],
        [ 287,  285,  284, ...,  503,  476,  469],
        [ 287,  292,  288, ...,  454,  411,  337],
        ...,
        [ 502,  508,  520, ...,  683,  670,  791],
        [ 486,  518,  532, ...,  688,  696,  693],
        [ 486,  506,  515, ...,  659,  671,  664]],

       [[ 469,  446,  466, ...,  695,  711,  728],
        [ 469,  437,  469, ...,  683,  694,  666],
        [ 460,  460,  460, ...,  628,  595,  527],
        ...,
        [ 804,  808,  832, ...,  920,  872, 1023],
        [ 787,  803,  822, ...,  890,  882,  871],
        [ 787,  799,  822, ...,  893,  832,  834]],

       [[ 319,  293,  328, ..., 1054, 1090, 1110],
        [ 327,  318,  345, ..., 1044, 1004,  952],
        [ 339,  355,  323, ...,  922,  784,  652],
        ...,
        [1528, 1516, 1516, ..., 1250, 1246, 1420],
        [1470, 1502, 1498, ..., 1316, 1200, 1162],
        [1394, 1480, 1472, ..., 1288, 1144, 1122]],

       [[2164, 2128, 2206, ..., 1796, 1837, 1816],
        [2110, 2017, 2228, ..., 1795, 1839, 1788],
        [2050, 2112, 2062, ..., 1816, 1789, 1864],
        ...,
        [1910, 1942, 1942, ..., 2105, 1898, 2102],
        [1836, 1874, 1916, ..., 2075, 1792, 1747],
        [1778, 1844, 1870, ..., 2087, 1830, 1675]]])
Coordinates:
  * band     (band) <U3 'B02' 'B03' 'B04' 'B08'
Dimensions without coordinates: x, y

The data is stored as int16, so let’s convert everything to float. The scale: 10000.

[4]:

da = da / 10000

You can easily visualize the image with rasterio:

[5]:

from rasterio import plot

Let’s see the RGB visualization:

[6]:

plot.show((da.sel(band = ["B04","B03","B02"]).data / 0.3).clip(0,1),title = "RGB")

[6]:

<AxesSubplot:title={'center':'RGB'}>

Great! Everything is going great!

Now, let’s convert our normal xarray.DataArray to a rechunked one:

[8]:

da = da.chunk({"band":1,"x":100,"y":100})

Let’s see what we have:

[9]:

da

[9]:

<xarray.DataArray (band: 4, x: 300, y: 300)>
dask.array<xarray-<this-array>, shape=(4, 300, 300), dtype=float64, chunksize=(1, 100, 100), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) <U3 'B02' 'B03' 'B04' 'B08'
Dimensions without coordinates: x, y

xarray.DataArray

band: 4
x: 300
y: 300

dask.array<chunksize=(1, 100, 100), meta=np.ndarray>

	Array	Chunk
Bytes	2.75 MiB	78.12 kiB
Shape	(4, 300, 300)	(1, 100, 100)
Count	36 Tasks	36 Chunks
Type	float64	numpy.ndarray

Coordinates: (1)
- band
  (band)
  <U3
  'B02' 'B03' 'B04' 'B08'
```
array(['B02', 'B03', 'B04', 'B08'], dtype='<U3')
```
Attributes: (0)

Nice! Now let’s compute the NDVI:

[11]:

NDVI = spyndex.computeIndex("NDVI",{"N": da.sel(band = "B08"),"R": da.sel(band = "B04")})

Let’s check our result:

[12]:

NDVI

[12]:

<xarray.DataArray (x: 300, y: 300)>
dask.array<truediv, shape=(300, 300), dtype=float64, chunksize=(100, 100), chunktype=numpy.ndarray>
Dimensions without coordinates: x, y

xarray.DataArray

x: 300
y: 300

dask.array<chunksize=(100, 100), meta=np.ndarray>

Array Chunk

Bytes 703.12 kiB 78.12 kiB

Shape (300, 300) (100, 100)

Count 81 Tasks 9 Chunks

Type float64 numpy.ndarray
Coordinates: (0)
Attributes: (0)

It’s a chunked xarray.DataArray! This means that it is not computed yet, so let’s compute it!

[13]:

NDVI = NDVI.compute()

Now, if we take a look at our NDVI, you’ll see that it is already computed! And it was done in parallel with dask!

[14]:

NDVI

[14]:

<xarray.DataArray (x: 300, y: 300)>
array([[0.74305276, 0.75795126, 0.74112076, ..., 0.26035088, 0.25521011,
        0.24128503],
       [0.73163726, 0.72762313, 0.73183055, ..., 0.26452976, 0.29370383,
        0.30510949],
       [0.71619925, 0.71220105, 0.72914046, ..., 0.3265157 , 0.39059464,
        0.48171701],
       ...,
       [0.11111111, 0.1231926 , 0.1231926 , ..., 0.25484352, 0.20737913,
        0.19363998],
       [0.1107078 , 0.11018957, 0.12243702, ..., 0.22382778, 0.19786096,
        0.20110003],
       [0.12105927, 0.10950662, 0.11909037, ..., 0.23674074, 0.23066577,
        0.19771183]])
Dimensions without coordinates: x, y

Let’s visualize it!

[15]:

plot.show(NDVI,title = "NDVI")

[15]:

<AxesSubplot:title={'center':'NDVI'}>

Amazing!

`dask.Series`¶

In Level 3 we worked with pandas.Series. Well, let’s do the same tutorial but now working in parallel again!

Let’s use the pandas.DataFrame that is stored in the spyndex datasets: spectral:

[31]:

df = spyndex.datasets.open("spectral")

As you already know, each column of this dataset is the Surface Reflectance from Landsat 8 for 3 different classes. The samples were taken over Oporto:

[32]:

df.head()

[32]:

	SR_B1	SR_B2	SR_B3	SR_B4	SR_B5	SR_B6	SR_B7	ST_B10	class
0	0.089850	0.100795	0.132227	0.165764	0.269054	0.306206	0.251949	297.328396	Urban
1	0.073859	0.086990	0.124404	0.160979	0.281264	0.267596	0.217917	297.107934	Urban
2	0.072938	0.086028	0.120994	0.140203	0.284220	0.258384	0.200098	297.436064	Urban
3	0.087733	0.103916	0.135981	0.163976	0.254479	0.259580	0.216735	297.203638	Urban
4	0.090593	0.109306	0.150350	0.181260	0.269535	0.273234	0.219554	297.097680	Urban

Here you can see the classes stored in the class column:

[33]:

df["class"].unique()

[33]:

array(['Urban', 'Water', 'Vegetation'], dtype=object)

Each column of the data frame is a pandas.Series data type:

[34]:

type(df["SR_B2"])

[34]:

pandas.core.series.Series

Everything great, right?

Now, let’s create a dask data frame:

[35]:

df = dask.dataframe.from_pandas(df,npartitions = 10)

Let’s see what we have:

[36]:

df

[36]:

Dask DataFrame Structure:

	SR_B1	SR_B2	SR_B3	SR_B4	SR_B5	SR_B6	SR_B7	ST_B10	class
npartitions=10
0	float64	float64	float64	float64	float64	float64	float64	float64	object
109	...	...	...	...	...	...	...	...	...
...	...	...	...	...	...	...	...	...	...
89	...	...	...	...	...	...	...	...	...
99	...	...	...	...	...	...	...	...	...

Dask Name: from_pandas, 10 tasks

That’s a dask data frame! Now, let’s compute our indices!

[37]:

indicesToCompute = ["NDVI","NDWI","NDBI"]

Ready?… Go!

[38]:

df[indicesToCompute] = spyndex.computeIndex(
    index = indicesToCompute,
    params = {
        "G": df["SR_B3"],
        "R": df["SR_B4"],
        "N": df["SR_B5"],
        "S1": df["SR_B6"],
    }
)

Let’s see our result!

[39]:

df

[39]:

Dask DataFrame Structure:

	SR_B1	SR_B2	SR_B3	SR_B4	SR_B5	SR_B6	SR_B7	ST_B10	class	NDVI	NDWI	NDBI
npartitions=10
0	float64	float64	float64	float64	float64	float64	float64	float64	object	float64	float64	float64
109	...	...	...	...	...	...	...	...	...	...	...	...
...	...	...	...	...	...	...	...	...	...	...	...	...
89	...	...	...	...	...	...	...	...	...	...	...	...
99	...	...	...	...	...	...	...	...	...	...	...	...

Dask Name: assign, 270 tasks

Our result is also a dask data frame! Now, just as we did with the chunked xarray.DataArray, we have to actually compute our indices!

[40]:

df = df.compute()

Now, let’s see what we have:

[41]:

df.head()

[41]:

	SR_B1	SR_B2	SR_B3	SR_B4	SR_B5	SR_B6	SR_B7	ST_B10	class	NDVI	NDWI	NDBI
0	0.089850	0.100795	0.132227	0.165764	0.269054	0.306206	0.251949	297.328396	Urban	0.237548	-0.340973	0.064584
1	0.073859	0.086990	0.124404	0.160979	0.281264	0.267596	0.217917	297.107934	Urban	0.271989	-0.386671	-0.024902
10	0.109801	0.124184	0.166369	0.202229	0.317330	0.314731	0.237525	298.668260	Urban	0.221537	-0.312098	-0.004112
100	0.022805	0.026105	0.051735	0.034822	0.255455	0.114627	0.054650	291.818548	Vegetation	0.760074	-0.663173	-0.380530
101	0.021856	0.024345	0.050635	0.034341	0.270058	0.118175	0.052354	291.423766	Vegetation	0.774367	-0.684215	-0.391215

OUR INDICES! COMPUTED IN PARALLEL! :D

Let’s visualize them!

[42]:

import seaborn as sns
import matplotlib.pyplot as plt

Define some colors for each one of the classes:

[43]:

colors = ["#E33F62","#3FDDE3","#4CBA4B"]

Now, let’s create our gorgeous pair grid!

[44]:

plt.figure(figsize = (15,15))
g = sns.PairGrid(df[['NDVI', 'NDWI', 'NDBI','class']],hue = "class",palette = sns.color_palette(colors))
g.map_lower(sns.scatterplot)
g.map_upper(sns.kdeplot,fill = True,alpha = .5)
g.map_diag(sns.kdeplot,fill = True)
g.add_legend()
plt.show()

<Figure size 1080x1080 with 0 Axes>

Splendid!

Dask¶

dask.Array¶

dask.Series¶

`dask.Array`¶

`dask.Series`¶