Snow entity learning with Random Forest

This example follows the example 'nbex_skl_snow', which uses a Support Vector Machine. That previous example builds a snow model on one hypercube with SVM and tries to classify the snow feature on the six other hypercubes. This time the setup is different and closer to the way machine learning is usually applied. First, we concatenate all seven hypercubes into one big data set. We keep 25% of the data to learn the snow model and 75% to test it (although the visual inspection at the end runs on all of the data). We use a Random Forest and there is no hyperparameter tuning. For the characteristics of the hyperspectral images used, see the previous example 'nbex_skl_snow'.
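
For readers who prefer to see the setup in plain scikit-learn terms, it boils down to something like this minimal sketch, with random dummy data standing in for the stacked spectra and labels that the notebook builds below:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the real data: 1000 'pixels', 31 'bands', binary labels.
rng = np.random.RandomState(0)
X = rng.rand(1000, 31)
y = rng.randint(0, 2, 1000)

# 25% for training, the rest for testing, no hyperparameter tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.25, random_state=5)
clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
print('test accuracy:', clf.score(X_test, y_test))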

In [3]:
%matplotlib inline
from __future__ import print_function
import os
import os.path as osp
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pysptools.util as util
import pysptools.skl as skl


# Build and display the ROIs for one image.
def build_ROI(data, coords, title):
    r = util.ROIs(data.shape[0], data.shape[1])
    for coord in coords:
        r.add(*coord)
    r.display(colorMap='Paired', suffix='Snow '+title)
    return r


# Use HyperGaussianNB to create a class map (a mask) for img1
# with a supervised clustering.
def build_img1_cmap(data, r, title):
    model = skl.HyperGaussianNB()
    model.fit_rois(data, r)
    model.classify(data)
    model.display(labels=r.get_labels(), interpolation=None, colorMap='Paired', suffix=title)
    return model.get_class_map()


# Use HyperLogisticRegression to create a class map for img2
# with a supervised clustering.
def build_img2_cmap(data, r, title):
    model = skl.HyperLogisticRegression(class_weight={0:1.0,1:5})
    model.fit_rois(data, r)
    model.classify(data)
    model.display(labels=r.get_labels(), interpolation=None, colorMap='Paired', suffix=title)
    return model.get_class_map()


# It's more complex for imgc7.
# The goal is to create a class map made of two classes: snow and a wall of wood.
# First, we use HyperSVC to create a class map with a supervised clustering.
# Next, since the cmap's upper half is not what we want, we clean it (we make it
# background; the background is always zero).
def build_imgc7_cmap(data, r, title):
    def display_img(img):
        import matplotlib.pyplot as plt
        plt.ioff()
        plt.imshow(img)
        plt.show()
        plt.close()
    model = skl.HyperSVC(class_weight={0:1,1:10,2:1}, gamma=0.5)
    model.fit_rois(data, r)
    model.classify(data)
    model.display(labels=r.get_labels(), interpolation=None, colorMap='Paired', suffix=title)
    cmap = model.get_class_map()
    # Clean the top half
    cmap[0:50,0:cmap.shape[1]] = 0
    # And print it
    print('imgc7 class map cleaned:')
    display_img(cmap)
    return cmap


# Create and fit the model with HyperRandomForestClassifier.
def fit_model(X, Y):
    model = skl.HyperRandomForestClassifier(n_estimators=25)
    model.fit(X, Y)
    return model


# We test the model on all the hypercubes.
def batch_classify(spath, model, samples):
    for s in samples:
        M = util.load_mat_file(osp.join(spath, s))   
        Ms = util.shrink(util.shrink(util.shrink(M)))
        util.display_linear_stretch(Ms, 19, 13, 3, suffix='shrink x 3 '+s)
        M = load_reduce_and_scale(spath, s)
        model.classify(M)
        model.display(suffix='batch '+s)

        
# The images are too large for my humble computer and take too long to process.
# The solution is to cut the number of pixels. Each call to shrink() cuts the
# number of pixels by half and saves hours of processing.
def load_reduce_and_scale(path, file_name):
    """
    Load the mat file,
    reduce the number of pixels by 7/8
    and scale the result.
    """
    M = util.load_mat_file(osp.join(path, file_name))
    return skl.hyper_scale(util.shrink(util.shrink(util.shrink(M))))
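
If you want to check the effect of the shrinking on your own machine, comparing the pixel counts before and after is enough. This small sketch assumes the CZ_hsdb data sits in the same place as in the next cell:

import os
import os.path as osp
import pysptools.util as util

# Same location as used below; adjust to your setup.
spath = osp.join(os.environ['HOME'], 'data/CZ_hsdb')
M = util.load_mat_file(osp.join(spath, 'img1'))
Ms = util.shrink(util.shrink(util.shrink(M)))
print('pixels before:', M.shape[0] * M.shape[1])
print('pixels after :', Ms.shape[0] * Ms.shape[1])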

First we load the seven hyperspectral images. All images are scaled. 'img1', 'img2' and 'imgc7' have snow, the others do not.

In [4]:
home_path = os.environ['HOME']
source_path = osp.join(home_path, 'data/CZ_hsdb')

# These images have snow
snow_samples = ['img1','img2','imgc7']
# These have no snow
nosnow_samples = ['imga1','imgb1','imgb6','imga7']

# Load
M_img1 = load_reduce_and_scale(source_path, snow_samples[0])
M_img2 = load_reduce_and_scale(source_path, snow_samples[1])
M_imgc7 = load_reduce_and_scale(source_path, snow_samples[2])
M_imga1 = load_reduce_and_scale(source_path, nosnow_samples[0])
M_imgb1 = load_reduce_and_scale(source_path, nosnow_samples[1])
M_imgb6 = load_reduce_and_scale(source_path, nosnow_samples[2])
M_imga7 = load_reduce_and_scale(source_path, nosnow_samples[3])

Note that the word 'feature' is used here with its earth science meaning, not its machine learning meaning. In the machine learning sense, the features are the band values that define a spectrum (the sample).
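
To make the distinction concrete, here is a small numpy sketch (the shapes are made up) of how a hypercube maps to the usual scikit-learn sample/feature matrix:

import numpy as np

# A made-up hypercube: 100 x 120 pixels, 31 bands.
cube = np.zeros((100, 120, 31))
# Each pixel spectrum is one sample; each band value is one machine learning 'feature'.
X = cube.reshape(-1, cube.shape[2])
print(X.shape)   # (12000, 31): n_samples x n_features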

We build the masks in two steps.

First, we create a ROI for each of the three images containing the snow feature.

Second, we expand the ROIs with a supervised clustering and in this way capture the snow cover.

To add a degree of complexity, a second feature is added: the wall of wood that we find in the imgc7 image.

In [5]:
# The 'snow' images ROIs:
ROI_img1 = [['Snow',{'rec':(41,79,49,100)}]]
ROI_img2 = [['Snow',{'rec':(83,50,100,79)},{'rec':(107,151,111,164)}]]
ROI_imgc7 = [['Snow',{'rec':(104,4,126,34)},{'rec':(111,79,124,101)}],
             ['Wall',{'rec':(59,103,91,162)},
                     {'poly':((55,6),(56,99),(66,94),(71,31),(99,32),(98,5))}]]

# We build and display ROIs.
roi1 = build_ROI(M_img1, ROI_img1, 'img1')
roi2 = build_ROI(M_img2, ROI_img2, 'img2')
roi3 = build_ROI(M_imgc7, ROI_imgc7, 'imgc7')

Next we build the snow cmaps.

In [6]:
cmap_1 = build_img1_cmap(M_img1, roi1, 'cmap img1')
cmap_2 = build_img2_cmap(M_img2, roi2, 'cmap img2')
cmap_c7 = build_imgc7_cmap(M_imgc7, roi3, 'cmap imgc7')
# The next class map is a background cmap; we use it for the images without snow.
bkg_cmap = np.zeros((M_img1.shape[0],M_img1.shape[1]))
imgc7 class map cleaned:
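
Before going further, it is worth a quick sanity check on the class maps: the labels should be 0 for the background, plus one value per class (snow for img1 and img2, snow and wall for imgc7). A simple way to check, reusing the cmaps built above:

# Quick sanity check on the class maps built above.
print('img1  labels:', np.unique(cmap_1))
print('img2  labels:', np.unique(cmap_2))
print('imgc7 labels:', np.unique(cmap_c7))
# In img1 the only non-background class is snow, so label 1 counts the snow pixels.
print('snow pixels in img1:', int(np.sum(cmap_1 == 1)))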

With the seven hypercubes we create the training and the test data.

In [7]:
# Concatenate the seven hypercubes and the corresponding cmaps.
X,y = skl.shape_to_XY([M_img1,M_img2,M_imgc7,              # snow hypercubes
                    M_imga1,M_imgb1,M_imgb6,M_imga7],      # no snow hypercubes
                    [cmap_1,cmap_2,cmap_c7,                # snow cmaps
                    bkg_cmap,bkg_cmap,bkg_cmap,bkg_cmap])  # no snow cmaps

# We take 25% for training
seed = 5
train_size = 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, random_state=seed)
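
For reference, shape_to_XY flattens each hypercube to a (n_pixels, n_bands) matrix, flattens the corresponding cmap to a label vector, and concatenates everything. A rough numpy equivalent (a sketch, not the library code) would be:

import numpy as np

def shape_to_XY_sketch(cube_list, cmap_list):
    # Flatten each (rows, cols, bands) cube to (rows*cols, bands)
    # and each (rows, cols) cmap to (rows*cols,), then concatenate.
    X = np.vstack([c.reshape(-1, c.shape[2]) for c in cube_list])
    y = np.concatenate([m.ravel() for m in cmap_list])
    return X, y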

The training data is fit with a Random Forest classifier. No cross-validation is done. The training uses the default parameters for the most part, except for n_estimators, which is set to 25.

In [8]:
model = fit_model(X_train, y_train)
# A dump of the feature importances follows.
model.display_feature_importances(suffix='snow - no snow')
model.display_feature_importances(sort=True, suffix='sorted')
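
Since no cross-validation is done here, a quick way to get a more robust estimate would be to cross-validate the underlying scikit-learn estimator on the training split. A sketch, reusing the X_train and y_train arrays from above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation with the same settings as fit_model().
scores = cross_val_score(RandomForestClassifier(n_estimators=25), X_train, y_train, cv=5)
print('CV accuracy: %.2f%% +/- %.2f%%' % (scores.mean() * 100, scores.std() * 100))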

We run the model on all the data for a visual inspection.

In [9]:
batch_classify(source_path, model, snow_samples+nosnow_samples)

Let's look at the accuracy score.

In [10]:
# make predictions for test data
y_pred = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Accuracy: 97.17%
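
Accuracy alone can be flattering when most pixels are background. A per-class view is easy to get from the same y_test and y_pred arrays; a sketch using scikit-learn:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class breakdown of the test predictions made above.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))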