Using the pyChemometrics objects

pyChemometrics is a Python 3.5 library for multivariate chemometric data analysis.

The main objects, ChemometricsPCA, ChemometricsPLS and ChemometricsPLSDA, are wrappers for the scikit-learn Principal Component Analysis and Partial Least Squares Regression objects. They have been designed to mimic scikit-learn classifiers as closely as possible, including their internal properties, and can therefore be interfaced with other components of scikit-learn, such as the sklearn Pipeline.

These wrappers contain implementations of various routines and metrics commonly seen in the chemometric and metabonomic literature: PRESS and Q2Y estimation, permutation testing, Hotelling T2 for outlier detection of scores, VIP scores for variable importance, and Pareto and Unit-Variance scaling.
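As an illustration of one of these metrics, the relationship between PRESS and Q2Y can be sketched in a few lines of numpy. This is not the pyChemometrics implementation: the function name q2y and the example values are hypothetical, and the total sum of squares is taken around the overall mean of Y, which is a simplifying assumption.

```python
import numpy as np

def q2y(y_true, y_cv_pred):
    """Q2Y = 1 - PRESS / SSY, using predictions made for held-out samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_cv_pred = np.asarray(y_cv_pred, dtype=float)
    # PRESS: sum of squared errors of the cross-validated predictions.
    press = np.sum((y_true - y_cv_pred) ** 2)
    # SSY: total sum of squares of Y around its mean (assumption: overall mean).
    ssy = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - press / ssy

# Hypothetical values: y_cv holds the prediction each sample received
# when it was left out of the training set.
y = [1.0, 2.0, 3.0, 4.0]
y_cv = [1.5, 1.5, 3.5, 3.5]
q2 = q2y(y, y_cv)
```

A Q2Y close to 1 means the held-out predictions are nearly as good as the fitted values, while values near or below 0 suggest the model does not generalise.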

Each of these objects uses ChemometricsScaler objects to automatically handle the scaling of the X and Y data matrices.

Scaling

The ChemometricsScaler object handles the scaling of the data matrices. The choice of the scaling power determines the type of scaling. For example, scaling_power = 0 performs column centering only, scaling_power = 1/2 performs Pareto scaling and scaling_power = 1 performs UV scaling (Unit-Variance scaling, or standardisation).
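The effect of the scaling power can be sketched with plain numpy. This is a minimal illustration rather than the ChemometricsScaler code: the helper power_scale is hypothetical, the sample standard deviation (ddof=1) is an assumption, and columns are assumed to be non-constant.

```python
import numpy as np

def power_scale(X, scaling_power=1.0):
    """Centre each column, then divide by its standard deviation ** scaling_power."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    # Assumption: sample standard deviation (ddof=1); columns must not be constant.
    std = X.std(axis=0, ddof=1)
    # std ** 0 = 1 (centering only), std ** 0.5 (Pareto), std ** 1 (Unit Variance).
    return centred / (std ** scaling_power)

X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [3.0, 50.0]])

centred_only = power_scale(X, 0)
pareto = power_scale(X, 0.5)
unit_variance = power_scale(X, 1)
```

Pareto scaling sits between the two extremes: it reduces the dominance of high-variance variables without giving every variable exactly the same weight.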

When creating a ChemometricsPCA object, the scaler parameter expects a ChemometricsScaler object, here one with the default options and Unit-Variance scaling:

pca_model = pyChemometrics.ChemometricsPCA(…)

The pyChemometrics objects contain methods similar to the ones defined in the scikit-learn Transformer, Classifier and Regressor Mixins, for example .fit, .transform, .predict and .score.

pca_model.fit(X)

# Obtain the scores (T), the lower dimensional representation of the data.
t_scores = pca_model.transform(X)

# Obtain the reconstructed dataset from the T scores.
pca_model.inverse_transform(t_scores)

Principal Component Analysis

Principal Component Analysis is provided by the ChemometricsPCA object.

When creating a ChemometricsPCA object, the scaler parameter expects a ChemometricsScaler object, here one with the default options and Unit-Variance scaling:

pca_model = pyChemometrics.ChemometricsPCA(…)

pca_model.fit(X)

t_scores = pca_model.transform(X)

pca_model.inverse_transform(t_scores)

The scores and loadings obtained for each component upon calling the .fit method are set as attributes of the model.
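To make the roles of scores and loadings concrete, the following numpy sketch computes them with a singular value decomposition and reconstructs the data from the scores. It is an illustration under stated assumptions, not the library's code: the data are random, only column centering is applied, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # hypothetical data: 20 samples, 5 variables

# Column-centre the data before decomposition.
mean = X.mean(axis=0)
Xc = X - mean

# SVD of the centred matrix: Xc = U * S * Vt.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

n_components = 2
loadings = Vt[:n_components]    # shape (n_components, n_variables)
scores = Xc @ loadings.T        # shape (n_samples, n_components)

# Equivalent of an inverse transform: map the scores back to the
# original variable space and add the column means again.
X_hat = scores @ loadings + mean
```

With fewer components than variables, X_hat is only an approximation of X; the part that is not reconstructed is the model residual.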

The modelParameters dictionary contains the following keys:
  • VarExp: Total variance explained by the model, per component
  • VarExpRatio: % of variance explained, per component
  • R2X: The variance explained by the model in the fitting/training set. Calculated using the model residuals.
  • S0: The denominator for calculation of the Normalised DmodX score.
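The first three quantities can be sketched with numpy as follows. The variable names are hypothetical and the exact conventions used by pyChemometrics (for example, fraction versus percentage, or the degrees of freedom) are assumptions here; the point is only to show how R2X follows from the model residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))   # hypothetical data
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_components = 2
P = Vt[:n_components]          # loadings of the retained components

# Variance explained per component (eigenvalues of the covariance matrix).
var_exp = S[:n_components] ** 2 / (Xc.shape[0] - 1)

# Fraction of the total variance explained per component.
var_exp_ratio = S[:n_components] ** 2 / np.sum(S ** 2)

# R2X from the residuals: 1 - (residual sum of squares / total sum of squares).
X_hat = (Xc @ P.T) @ P
residual_ss = np.sum((Xc - X_hat) ** 2)
total_ss = np.sum(Xc ** 2)
r2x = 1.0 - residual_ss / total_ss
```

For PCA fitted and evaluated on the same centred data, R2X computed from the residuals equals the sum of the per-component variance ratios.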

Performing model cross-validation using the cross_validation() method generates another dictionary attribute, cvParameters. It contains the mean and standard deviation values obtained from the multiple folds or sampling repeats performed during cross-validation and, if the cross_validation method was called with outputdist = True, also the whole distribution obtained by CV for each parameter.

The cvParameters dictionary contains these keys:
  • Mean_Loadings: Average loading vectors during cross-validation
  • Stdev_Loadings: Standard deviation of the loading vectors
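The idea behind these two keys can be sketched by refitting a model on each training fold and summarising the loading vectors. This is an assumption-laden illustration, not the cross_validation() implementation: the helper first_loading, the simulated data, the fold scheme and the sign-alignment step are all hypothetical, and the key names are reused only for readability.

```python
import numpy as np

def first_loading(X):
    """First principal component loading of column-centred X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(2)
# Hypothetical data with one dominant latent direction plus noise.
latent = rng.normal(size=(40, 1))
X = latent @ np.array([[1.0, 0.8, 0.6, 0.4]]) + 0.1 * rng.normal(size=(40, 4))

reference = first_loading(X)                    # loading from the full dataset
folds = np.array_split(rng.permutation(40), 5)  # 5 folds of test indices

cv_loadings = []
for test_idx in folds:
    train = np.delete(X, test_idx, axis=0)      # leave the fold out
    p = first_loading(train)
    # The sign of a loading vector is arbitrary: align it with the reference.
    if p @ reference < 0:
        p = -p
    cv_loadings.append(p)

cv_loadings = np.vstack(cv_loadings)            # shape (n_folds, n_variables)
cv_summary = {
    "Mean_Loadings": cv_loadings.mean(axis=0),
    "Stdev_Loadings": cv_loadings.std(axis=0, ddof=1),
}
```

Small standard deviations indicate that the loadings are stable across folds; the stacked cv_loadings array is the kind of per-fold distribution that an outputdist-style option would expose.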

If the outputdist option is set to True when performing cross validation, cvParameters will contain extra keys with numpy.ndarrays containing all the model parameters (scores, loadings, goodness of fit metrics, etc) obtained for each model fitted during CV.

The methods provided by the pyChemometrics objects follow a logic similar to that of scikit-learn.

Partial Least Squares Regression

Standard Partial Least Squares regression is provided by the ChemometricsPLS object.

The scores and loadings obtained for each component upon calling the .fit method are set as attributes of the model.

  • scores_t:
  • scores_u:
  • weights_w:
  • weights_c:
  • loadings_p:
  • loadings_q:
  • rotations_ws:
  • rotations_cs:
  • b_u:
  • b_t:
  • beta_coeffs:
  • logistic_coefs:
  • n_classes:

The modelParameters dictionary contains the following keys:
  • R2Y: Variance in Y explained by the model in the fitting/training set
  • R2X: Variance in X explained by the model in the fitting/training set
  • SSX: Sum of squares of X
  • SSY: Sum of squares of Y
  • SSXcomp: Sum of squares of X, per component
  • SSYcomp: Sum of squares of Y, per component
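How R2X and R2Y relate to sums of squares can be sketched with a single-component PLS model for one response variable. This is a textbook-style illustration in numpy, not the ChemometricsPLS code: the data are simulated, only centering is applied, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 6))   # hypothetical predictors
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

# Centre X and y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# One PLS component for a single response.
w = Xc.T @ yc
w /= np.linalg.norm(w)         # X weights (unit length)
t = Xc @ w                     # X scores
p = Xc.T @ t / (t @ t)         # X loadings
c = (yc @ t) / (t @ t)         # regression of y on the scores

# Reconstructions of X and y from the single component.
x_hat = np.outer(t, p)
y_hat = t * c

# Total sums of squares and the explained fractions.
ssx = np.sum(Xc ** 2)
ssy = np.sum(yc ** 2)
r2x = 1.0 - np.sum((Xc - x_hat) ** 2) / ssx
r2y = 1.0 - np.sum((yc - y_hat) ** 2) / ssy
```

A model can explain much of Y while explaining little of X (or the reverse), which is why both values are usually reported together.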

Performing model cross-validation using the cross_validation() method generates another dictionary attribute, cvParameters. It contains the mean and standard deviation values obtained from the multiple folds or sampling repeats performed during cross-validation and, if the cross_validation method was called with outputdist = True, also the whole distribution obtained by CV for each parameter.

The cvParameters dictionary contains these keys:
  • Mean_Loadings: Average loading vectors during cross-validation
  • Stdev_Loadings: Standard deviation of the loading vectors

If the outputdist option is set to True when performing cross validation, cvParameters will contain extra keys with numpy.ndarrays containing all the model parameters (scores, loadings, goodness of fit metrics, etc) obtained for each model fitted during CV.


Partial Least Squares - Discriminant Analysis

The ChemometricsPLSDA object shares many features with the ChemometricsPLS object.

Calling the fit method will fill in these attributes:

  • scores_t:
  • scores_u:
  • weights_w:
  • weights_c:
  • loadings_p:
  • loadings_q:
  • rotations_ws:
  • rotations_cs:
  • b_u:
  • b_t:
  • beta_coeffs:
  • logistic_coefs:
  • n_classes:

However, this object expects either a single Y vector encoding class membership or a dummy matrix. The single Y vector is re-coded as a dummy matrix of dimensions [n observations x m classes] as part of the algorithm.
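The re-coding of a class vector into a dummy matrix can be sketched with numpy as follows. The class labels are hypothetical and the column order (sorted unique labels) is an assumption of this sketch, not necessarily the order used by ChemometricsPLSDA.

```python
import numpy as np

# Hypothetical class membership vector for 5 observations.
y = np.array(["control", "treated", "control", "placebo", "treated"])

# Unique classes (sorted) and, for each observation, the index of its class.
classes, class_index = np.unique(y, return_inverse=True)

# Dummy matrix of dimensions [n observations x m classes]:
# a 1 in the column of the observation's class, 0 elsewhere.
dummy = np.zeros((y.size, classes.size), dtype=int)
dummy[np.arange(y.size), class_index] = 1
```

Each row contains exactly one 1, so the original labels can be recovered with classes[dummy.argmax(axis=1)].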

The scores and loadings obtained for each component upon calling the .fit method are set as attributes of the model.

The modelParameters dictionary contains the following subdictionaries:

The ‘PLS’ subdictionary contains all the values pertaining to the PLS regression algorithm.
  • R2Y: Variance in Y explained by the model in the fitting/training set
  • R2X: Variance in X explained by the model in the fitting/training set
  • SSX: Sum of squares of X
  • SSY: Sum of squares of Y
  • SSXcomp: Sum of squares of X, per component
  • SSYcomp: Sum of squares of Y, per component

The ‘DA’ subdictionary contains the classification metrics obtained by scoring the class predictions with the known truth.
  • Balanced accuracy:
  • F1 measure:
  • Precision:
  • Recall:
  • ROC curve:
  • AUC:
  • 01-Loss:
  • MCC:
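Several of the classification metrics can be computed directly from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions with hypothetical labels; it does not show how pyChemometrics stores or computes these values.

```python
import math

# Hypothetical known classes and predicted classes (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)                        # sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2
zero_one_loss = (fp + fn) / len(y_true)        # fraction misclassified
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
```

Balanced accuracy and MCC are less sensitive to class imbalance than plain accuracy, which matters when one class dominates the dataset.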

Performing model cross-validation using the cross_validation() method generates another dictionary attribute, cvParameters. It contains the mean and standard deviation values obtained from the multiple folds or sampling repeats performed during cross-validation and, if the cross_validation method was called with outputdist = True, also the whole distribution obtained by CV for each parameter.

The cvParameters dictionary contains these keys:
  • Mean_Loadings: Average loading vectors during cross-validation
  • Stdev_Loadings: Standard deviation of the loading vectors

Additionally, the cvParameters dictionary contains the mean and standard deviation of the parameters for the DA component.
  • Mean_Accuracy:
  • Stdev_Accuracy:

If the outputdist option is set to True when performing cross validation, cvParameters will contain extra keys with numpy.ndarrays containing all the model parameters (scores, loadings, goodness of fit metrics, etc) obtained for each model fitted during CV.

Partial Least Squares - Logistic Regression

The ChemometricsPLS_Logistic object shares many features with the ChemometricsPLS object.

  • scores_t:
  • scores_u:
  • weights_w:
  • weights_c:
  • loadings_p:
  • loadings_q:
  • rotations_ws:
  • rotations_cs:
  • b_u:
  • b_t:
  • beta_coeffs:
  • logistic_coefs:
  • n_classes:

Calling the fit method will fill in the attributes listed above.

Partial Least Squares - Linear Discriminant Analysis

The ChemometricsPLS_LDA object shares many features with the ChemometricsPLS object.

  • scores_t:
  • scores_u:
  • weights_w:
  • weights_c:
  • loadings_p:
  • loadings_q:
  • rotations_ws:
  • rotations_cs:
  • b_u:
  • b_t:
  • beta_coeffs:
  • logistic_coefs:
  • n_classes:

Calling the fit method will fill in the attributes listed above.