Machine Learning (alpha)

Warning: the ml module is experimental and may be subject to backward-incompatible changes.

This module supports LightGBM and XGBoost.

See also

See the file test_ml_models.py for an example.

LightGBM

Function

pysptools.ml.load_lgbm_model(fname)[source]

Load a LightGBM model that was saved as a file with the HyperLGBMClassifier.save method.

The model spans two files:

  • The first file contains the model saved with the Booster class; this file has no extension.
  • The second file contains the parameters used to create the model; this file has the extension ‘.p’.

Parameters: fname (path) – The file name without extension.
Returns: a model instance
Return type: HyperLGBMClassifier
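
A minimal load sketch (the base name ‘lgbm_model’ is hypothetical; the loader reads ‘lgbm_model’, the Booster file, and ‘lgbm_model.p’, the pickled parameters):

    import pysptools.ml as ml

    # Pass the file name without extension; both files must be present.
    model = ml.load_lgbm_model('lgbm_model')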

Class

class pysptools.ml.HyperLGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=1, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=-1, silent=True)[source]

LightGBM classifier for Hyperspectral Imaging. This class implements the scikit-learn API and is part of the pysptools ml submodule.

This class adds save and load model functionality. What follows is copied from the LGBMModel documentation.

Construct a gradient boosting model.

boosting_type : string, optional (default=”gbdt”)
‘gbdt’, traditional Gradient Boosting Decision Tree. ‘dart’, Dropouts meet Multiple Additive Regression Trees. ‘goss’, Gradient-based One-Side Sampling. ‘rf’, Random Forest.
num_leaves : int, optional (default=31)
Maximum tree leaves for base learners.
max_depth : int, optional (default=-1)
Maximum tree depth for base learners, -1 means no limit.
learning_rate : float, optional (default=0.1)
Boosting learning rate.
n_estimators : int, optional (default=100)
Number of boosted trees to fit.
subsample_for_bin : int, optional (default=200000)
Number of samples for constructing bins.
objective : string, callable or None, optional (default=None)
Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). default: ‘regression’ for LGBMRegressor, ‘binary’ or ‘multiclass’ for LGBMClassifier, ‘lambdarank’ for LGBMRanker.
class_weight : dict, ‘balanced’ or None, optional (default=None)
Weights associated with classes in the form {class_label: weight}. Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters. The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). If None, all classes are supposed to have weight one. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
min_split_gain : float, optional (default=0.)
Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight : float, optional (default=1e-3)
Minimum sum of instance weight (hessian) needed in a child (leaf).
min_child_samples : int, optional (default=20)
Minimum number of data needed in a child (leaf).
subsample : float, optional (default=1.)
Subsample ratio of the training instance.
subsample_freq : int, optional (default=1)
Frequency of subsampling; <=0 means subsampling is disabled.
colsample_bytree : float, optional (default=1.)
Subsample ratio of columns when constructing each tree.
reg_alpha : float, optional (default=0.)
L1 regularization term on weights.
reg_lambda : float, optional (default=0.)
L2 regularization term on weights.
random_state : int or None, optional (default=None)
Random number seed. Will use default seeds in C++ code if set to None.
n_jobs : int, optional (default=-1)
Number of parallel threads.
silent : bool, optional (default=True)
Whether to print messages while running boosting.
n_features_ : int
The number of features of the fitted model.
classes_ : array of shape = [n_classes]
The class label array (only for classification problem).
n_classes_ : int
The number of classes (only for classification problem).
best_score_ : dict or None
The best score of the fitted model.
best_iteration_ : int or None
The best iteration of the fitted model if early_stopping_rounds has been specified.
objective_ : string or callable
The concrete objective used while fitting this model.
booster_ : Booster
The underlying Booster of this model.
evals_result_ : dict or None
The evaluation results if early_stopping_rounds has been specified.
feature_importances_ : array of shape = [n_features]
The feature importances (the higher, the more important the feature).

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess or objective(y_true, y_pred, group) -> grad, hess:

y_true: array-like of shape = [n_samples]
The target values.
y_pred: array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
The predicted values.
group: array-like
Group/query data, used for ranking task.
grad: array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
The value of the gradient for each sample point.
hess: array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class task)
The value of the second derivative for each sample point.

For the multi-class task, y_pred is grouped by class_id first, then by row_id. To get the i-th row of y_pred for the j-th class, use y_pred[j * num_data + i]; grad and hess must be grouped the same way.
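
A small sketch of this layout with synthetic numbers (NumPy only, no training involved):

    import numpy as np

    num_data, n_classes = 4, 3
    y_pred = np.arange(num_data * n_classes, dtype=float)  # flat, grouped by class

    i, j = 2, 1  # i-th row, j-th class
    value = y_pred[j * num_data + i]

    # Equivalent view: reshape to (n_classes, num_data) and index [j, i].
    assert y_pred.reshape(n_classes, num_data)[j, i] == value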

classify(M, raw_score=False, num_iteration=0)[source]

Classify a hyperspectral cube.

Parameters: M (numpy array) – A HSI cube (m x n x p).
Returns: a class map (m x n x 1)
Return type: numpy array
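
A usage sketch with a hypothetical cube; model is assumed to be an already fitted HyperLGBMClassifier (see fit_rois below):

    import numpy as np

    M = np.random.rand(50, 60, 200)  # hypothetical 50 x 60 cube with 200 bands
    cmap = model.classify(M)         # one class label per pixel
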
display_feature_importances(n_labels='all', sort=False, suffix='')[source]

Display the feature importances. The output can be split into n graphs.

Parameters:
  • n_labels (string or integer) – The number of labels to output per graph. If the value is ‘all’, only one graph is generated.
  • sort (boolean [default False]) – If True, the feature importances are sorted.
  • suffix (string [default '']) – Add a suffix to the file name.
fit(X, y, sample_weight=None, init_score=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_class_weight=None, eval_init_score=None, eval_metric='logloss', early_stopping_rounds=None, verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None)[source]

Build a gradient boosting model from the training set (X, y).

Parameters:
  • X (array-like or sparse matrix of shape = [n_samples, n_features]) – Input feature matrix.
  • y (array-like of shape = [n_samples]) – The target values (class labels in classification, real numbers in regression).
  • sample_weight (array-like of shape = [n_samples] or None, optional (default=None)) – Weights of training data.
  • init_score (array-like of shape = [n_samples] or None, optional (default=None)) – Init score of training data.
  • group (array-like of shape = [n_samples] or None, optional (default=None)) – Group data of training data.
  • eval_set (list or None, optional (default=None)) – A list of (X, y) tuple pairs to use as validation sets for early-stopping.
  • eval_names (list of strings or None, optional (default=None)) – Names of eval_set.
  • eval_sample_weight (list of arrays or None, optional (default=None)) – Weights of eval data.
  • eval_class_weight (list or None, optional (default=None)) – Class weights of eval data.
  • eval_init_score (list of arrays or None, optional (default=None)) – Init score of eval data.
  • eval_group (list of arrays or None, optional (default=None)) – Group data of eval data.
  • eval_metric (string, list of strings, callable or None, optional (default='logloss')) – If string, it should be a built-in evaluation metric to use. If callable, it should be a custom evaluation metric, see note for more details.
  • early_stopping_rounds (int or None, optional (default=None)) – Activates early stopping. The model will train until the validation score stops improving. Validation error needs to decrease at least every early_stopping_rounds round(s) to continue training.
  • verbose (bool, optional (default=True)) – If True and an evaluation set is used, writes the evaluation progress.
  • feature_name (list of strings or 'auto', optional (default="auto")) – Feature names. If ‘auto’ and data is pandas DataFrame, data columns names are used.
  • categorical_feature (list of strings or int, or 'auto', optional (default="auto")) – Categorical features. If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas categorical columns are used.
  • callbacks (list of callback functions or None, optional (default=None)) – List of callback functions that are applied at each iteration. See Callbacks in Python API for more information.
Returns: self – Returns self.
Return type: object

Note

A custom eval function expects a callable with one of the following signatures: func(y_true, y_pred), func(y_true, y_pred, weight) or func(y_true, y_pred, weight, group). It returns (eval_name, eval_result, is_bigger_better) or a list of (eval_name, eval_result, is_bigger_better):

y_true: array-like of shape = [n_samples]
The target values.
y_pred: array-like of shape = [n_samples] or shape = [n_samples * n_classes] (for multi-class)
The predicted values.
weight: array-like of shape = [n_samples]
The weight of samples.
group: array-like
Group/query data, used for ranking task.
eval_name: str
The name of evaluation.
eval_result: float
The eval result.
is_bigger_better: bool
Whether the eval result is better when bigger, e.g. AUC.

For the multi-class task, y_pred is grouped by class_id first, then by row_id. To get the i-th row of y_pred for the j-th class, use y_pred[j * num_data + i].
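
A sketch of a custom eval_metric following this contract: mean absolute error, where lower is better, hence is_bigger_better=False (the X_train, y_train, X_val and y_val arrays are hypothetical):

    import numpy as np

    def mae_metric(y_true, y_pred):
        return 'mae', float(np.mean(np.abs(y_true - y_pred))), False

    # Hypothetical usage:
    # model.fit(X_train, y_train,
    #           eval_set=[(X_val, y_val)],
    #           eval_metric=mae_metric,
    #           early_stopping_rounds=10)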

fit_rois(M, ROIs)[source]

Fit the model on the HSI cube M using ROIs, as shown in the sketch after this list.

Parameters:
  • M (numpy array) – A HSI cube (m x n x p).
  • ROIs (ROIs class type) – Regions of interest instance.
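
A workflow sketch, assuming a HyperLGBMClassifier instance model and the ROIs class from pysptools.util (the region coordinates are hypothetical):

    import numpy as np
    import pysptools.util as util

    M = np.random.rand(100, 100, 200)  # hypothetical HSI cube
    rois = util.ROIs(100, 100)         # the cube spatial dimensions
    rois.add('Target', {'rec': (10, 10, 30, 30)})
    rois.add('Background', {'rec': (60, 60, 90, 90)})

    model.fit_rois(M, rois)
    cmap = model.classify(M)
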
partial_fit(X, y, sample_weight=None, init_score=None, eval_set=None, eval_names=None, eval_sample_weight=None, eval_init_score=None, eval_metric='logloss', early_stopping_rounds=None, verbose=True, feature_name='auto', categorical_feature='auto', callbacks=None)[source]

See the fit() method doc.

plot_feature_importances(path, n_labels='all', sort=False, suffix='')[source]

Plot the feature importances. The output can be split into n graphs.

Parameters:
  • path (string) – The path where to save the plot.
  • n_labels (string or integer) – The number of labels to output per graph. If the value is ‘all’, only one graph is generated.
  • sort (boolean [default False]) – If True, the feature importances are sorted.
  • suffix (string [default '']) – Add a suffix to the file name.
save(fname, n_features, n_classes)[source]

Save the model and its parameters in two files. When the model is loaded, it instantiates an object of class HyperLGBMClassifier. See the load_lgbm_model function doc.

Parameters:
  • fname (path) – The model file name.
  • n_features (int) – The number of features of the model.
  • n_classes (int) – The number of classes of the model, e.g. for a binary model n_classes = 2 (the background counts as a class in pysptools).
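
A save sketch (the base name ‘lgbm_model’ and the counts are hypothetical; two files are written, ‘lgbm_model’ and ‘lgbm_model.p’):

    # 200 bands, one target class plus the background class.
    model.save('lgbm_model', n_features=200, n_classes=2)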

XGBoost

Function

pysptools.ml.load_xgb_model(fname)[source]

Load an XGBoost model that was saved as a file with the HyperXGBClassifier.save method.

The model spans two files:

  • The first file contains the model saved with the Booster class; this file has no extension.
  • The second file contains the parameters used to create the model; this file has the extension ‘.p’.

Parameters: fname (path) – The file name without extension.
Returns: a model instance
Return type: HyperXGBClassifier

Class

class pysptools.ml.HyperXGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='reg:linear', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None)[source]

XGBoost classifier for Hyperspectral Imaging. This class implements the scikit-learn API and is part of the pysptools ml submodule.

This class adds save and load model functionality.

What follows is copied from the XGBModel documentation.

Implementation of the Scikit-Learn API for XGBoost.

Parameters:
  • max_depth (int) – Maximum tree depth for base learners.
  • learning_rate (float) – Boosting learning rate (xgb’s “eta”).
  • n_estimators (int) – Number of boosted trees to fit.
  • silent (boolean) – Whether to print messages while running boosting.
  • objective (string or callable) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
  • booster (string) – Specify which booster to use: gbtree, gblinear or dart.
  • nthread (int) – Number of parallel threads used to run xgboost. (Deprecated, please use n_jobs)
  • n_jobs (int) – Number of parallel threads used to run xgboost. (replaces nthread)
  • gamma (float) – Minimum loss reduction required to make a further partition on a leaf node of the tree.
  • min_child_weight (int) – Minimum sum of instance weight (hessian) needed in a child.
  • max_delta_step (int) – Maximum delta step we allow each tree’s weight estimation to be.
  • subsample (float) – Subsample ratio of the training instance.
  • colsample_bytree (float) – Subsample ratio of columns when constructing each tree.
  • colsample_bylevel (float) – Subsample ratio of columns for each split, in each level.
  • reg_alpha (float (xgb's alpha)) – L1 regularization term on weights.
  • reg_lambda (float (xgb's lambda)) – L2 regularization term on weights.
  • scale_pos_weight (float) – Balancing of positive and negative weights.
  • base_score – The initial prediction score of all instances, global bias.
  • seed (int) – Random number seed. (Deprecated, please use random_state)
  • random_state (int) – Random number seed. (replaces seed)
  • missing (float, optional) – Value in the data to be treated as missing. If None, it defaults to np.nan.

Note

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

y_true: array_like of shape [n_samples]
The target values.
y_pred: array_like of shape [n_samples]
The predicted values.
grad: array_like of shape [n_samples]
The value of the gradient for each sample point.
hess: array_like of shape [n_samples]
The value of the second derivative for each sample point.
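
A sketch of a custom objective with this signature: the gradient and hessian of the binary logistic loss on raw scores (the name logistic_obj is illustrative):

    import numpy as np

    def logistic_obj(y_true, y_pred):
        p = 1.0 / (1.0 + np.exp(-y_pred))  # sigmoid of the raw score
        grad = p - y_true                  # first derivative of the loss
        hess = p * (1.0 - p)               # second derivative of the loss
        return grad, hess

    # Hypothetical usage: HyperXGBClassifier(objective=logistic_obj)
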
classify(M, output_margin=False, ntree_limit=0)[source]

Classify a hyperspectral cube.

Parameters: M (numpy array) – A HSI cube (m x n x p).
Returns: a class map (m x n x 1)
Return type: numpy array
display_feature_importances(n_labels='all', sort=False, suffix='')[source]

Display the feature importances. The output can be split into n graphs.

Parameters:
  • n_labels (string or integer) – The number of labels to output per graph. If the value is ‘all’, only one graph is generated.
  • sort (boolean [default False]) – If True, the feature importances are sorted.
  • suffix (string [default '']) – Add a suffix to the file name.
fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)[source]

Fit the gradient boosting classifier. A sketch follows the parameter list below.

Parameters:
  • X (array_like) – Feature matrix
  • y (array_like) – Labels
  • sample_weight (array_like) – Weight for each instance
  • eval_set (list, optional) – A list of (X, y) pairs to use as a validation set for early-stopping
  • eval_metric (str, callable, optional) – If a str, should be a built-in evaluation metric to use. See doc/parameter.md. If callable, a custom evaluation metric. The call signature is func(y_predicted, y_true) where y_true will be a DMatrix object such that you may need to call the get_label method. It must return a str, value pair where the str is a name for the evaluation and value is the value of the evaluation function. This objective is always minimized.
  • early_stopping_rounds (int, optional) – Activates early stopping. Validation error needs to decrease at least every <early_stopping_rounds> round(s) to continue training. Requires at least one item in evals. If there’s more than one, will use the last. Returns the model from the last iteration (not the best one). If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. (Use bst.best_ntree_limit to get the correct value if num_parallel_tree and/or num_class appears in the parameters)
  • verbose (bool) – If verbose and an evaluation set is used, writes the evaluation metric measured on the validation set to stderr.
  • xgb_model (str or Booster) – File name of a stored XGBoost model or a Booster instance to be loaded before training (allows training continuation).
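
A fit sketch with a validation set and early stopping (the synthetic spectra below stand in for real training data):

    import numpy as np
    from pysptools.ml import HyperXGBClassifier

    # Synthetic data: 300 samples, 200 bands, 3 classes.
    rng = np.random.RandomState(0)
    X = rng.rand(300, 200)
    y = rng.randint(0, 3, 300)
    X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

    model = HyperXGBClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              eval_metric='mlogloss',
              early_stopping_rounds=20,
              verbose=False)
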
fit_rois(M, ROIs)[source]

Fit the model on the HSI cube M using ROIs.

Parameters:
  • M (numpy array) – A HSI cube (m x n x p).
  • ROIs (ROIs class type) – Regions of interest instance.
partial_fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)[source]

See the fit() method doc.

plot_feature_importances(path, n_labels='all', sort=False, suffix='')[source]

Plot the feature importances. The output can be split into n graphs.

Parameters:
  • path (string) – The path where to save the plot.
  • n_labels (string or integer) – The number of labels to output per graph. If the value is ‘all’, only one graph is generated.
  • sort (boolean [default False]) – If True, the feature importances are sorted.
  • suffix (string [default '']) – Add a suffix to the file name.
save(fname, n_features, n_classes)[source]

Save the model and its parameters in two files. When the model is loaded, it instantiates an object of class HyperXGBClassifier. See the load_xgb_model function doc.

Parameters:
  • fname (path) – The model file name.
  • n_features (int) – The number of features of the model.
  • n_classes (int) – The number of classes of the model, e.g. for a binary model n_classes = 2 (the background counts as a class in pysptools).
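
A save-and-restore sketch mirroring load_xgb_model above (the base name ‘xgb_model’ and the counts are hypothetical):

    import pysptools.ml as ml

    # Writes 'xgb_model' (the Booster file) and 'xgb_model.p' (the parameters).
    model.save('xgb_model', n_features=200, n_classes=3)

    restored = ml.load_xgb_model('xgb_model')
    cmap = restored.classify(M)  # M is an (m x n x p) HSI cube, as above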