fit_logistic_model() uses TensorFlow to fit a logistic regression model that classifies cells by an attribute in the colData slot of sce. Most arguments have sensible defaults. Use the history plot to determine whether the model is learning and performs equally well on the training and validation sets.

plot_confusion() plots a confusion matrix: the number of cells assigned to each observed class and predicted class.

predict_logistic_prob() predicts class assignments of cells in a SingleCellExperiment object. Genes used to train the model are selected automatically.

fit_logistic_model(sce, y = NULL, assay_slot = "logcounts",
  y_col = "Cell_class", activation = "softmax",
  loss = c("categorical_crossentropy", "kullback_leibler_divergence")[1],
  regularizer = keras::regularizer_l1, penalty = 0.01,
  initializer = "random_uniform",
  optimizer = keras::optimizer_sgd(lr = 0.01, nesterov = TRUE),
  metrics = list("accuracy"), epochs = 100,
  validation_split = 0.3, validation_split_per_class = TRUE,
  callback_early_stopping_patience = 50, batch_size = 1000,
  shuffle = TRUE, verbose = TRUE, model = NULL)

plot_confusion(confusion, normalize = FALSE, text_color = "grey60")

predict_logistic_prob(sce, model_res, assay_slot = "logcounts",
  ref_y = NULL, ref_y_col = NULL, batch_size = 1000,
  verbose = TRUE)

Arguments

sce

SingleCellExperiment object.

y

Optionally, you can provide your own cell labels as a y matrix (dim = cells * classes) where each row sums to 1. These can be continuous cell labels such as those from archetypal analysis (fit_pch) or NMF (non-negative matrix factorisation). In that case, kullback_leibler_divergence is a more suitable cost function.
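For illustration, a minimal sketch of building a valid y matrix by row-normalising non-negative per-cell class scores (the scores here are simulated, not from a real analysis):

# Simulated scores: 100 cells x 3 classes (illustration only)
scores <- matrix(runif(300), nrow = 100, ncol = 3)
# Row-normalise so that each cell's class weights sum to 1
y <- scores / rowSums(scores)
stopifnot(all(abs(rowSums(y) - 1) < 1e-8))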

assay_slot

Assay slot in sce containing gene expression matrix (default is logcounts).

y_col

Column in colData(sce) containing cell labels.

activation

Activation function. Using "softmax" gives logistic regression.

loss

Loss (cost) function that evaluates the difference between predictions and true labels and is minimised during model training. Use "categorical_crossentropy" for discrete class labels and "kullback_leibler_divergence" for continuous labels that sum to 1.
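A small base-R illustration of the distinction (not part of the package): for one-hot labels the two losses coincide because the label entropy is 0, while for soft labels they differ, making KL divergence the natural choice.

crossentropy <- function(p, q) -sum(p * log(q))
kl <- function(p, q) sum(ifelse(p > 0, p * log(p / q), 0))
q <- c(0.6, 0.3, 0.1)        # model predictions
crossentropy(c(1, 0, 0), q)  # one-hot label: equals kl(c(1, 0, 0), q)
kl(c(1, 0, 0), q)            # because the entropy of a one-hot label is 0
kl(c(0.7, 0.2, 0.1), q)      # soft label: use KL divergence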

regularizer

Function to penalise high values of model parameters/weights and reduce overfitting (memorisation of the dataset). Details: regularizer_l1. L1 regularisation tends to push most weights to 0 (thus acting as a feature selection method) and enforces sparse weights. L2 regularisation also reduces overfitting by keeping most weights small but does not shrink them to 0. Set to NULL to disable regularisation. Both weights and bias are regularised in the same way.

penalty

Regularisation penalty between 0 and 1. The higher the penalty, the more stringent the regularisation. Very high values can lead to poor model performance due to high bias (limited flexibility). Sensible values: 0.01 for regularizer_l1 and 0.5 for regularizer_l2. Adjust this parameter based on the history plot to make sure the model performs equally well on the training and validation sets.
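For example, a hedged sketch of switching from the default L1 penalty to L2 with the suggested value (assuming sce carries labels in the Cell_class column):

# Default is L1, penalty = 0.01 (sparse weights, implicit feature selection).
# L2 keeps weights small but non-zero; penalty = 0.5 is suggested for it.
fit_logistic_model(sce, y_col = "Cell_class",
                   regularizer = keras::regularizer_l2, penalty = 0.5)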

initializer

Method of initialising weights and bias. You do not normally need to change this. See https://keras.io/initializers/ for details.

optimizer

Which optimiser should be used to fit the model. You do not normally need to change this. See https://keras.io/optimizers/ for details.

metrics

Metrics that evaluate the performance of the model. Usually this is accuracy and the loss function. See https://keras.io/metrics/ for details.

epochs

Number of training epochs. You do not normally need to change this.

validation_split

What proportion of cells should be used for validation.

validation_split_per_class

Do the validation split within each class to maintain the proportions of classes in the training and validation sets (TRUE)?
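Conceptually (a simplified sketch, not the package's internal implementation), a per-class split samples validation cells within each class:

# Simplified sketch of a stratified (per-class) validation split
split_per_class <- function(labels, validation_split = 0.3) {
  unlist(lapply(split(seq_along(labels), labels), function(idx) {
    sample(idx, size = ceiling(length(idx) * validation_split))
  }), use.names = FALSE)
}
labels <- rep(c("B cell", "T cell"), times = c(90, 10))
val_idx <- split_per_class(labels, 0.3)
table(labels[val_idx])  # validation set mirrors the 90:10 class proportions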

callback_early_stopping_patience

Number of epochs to wait for improvement before stopping early.

batch_size

Process the data in batches of batch_size cells. All batches are seen in each epoch.

shuffle

Logical: should the training data be shuffled before each epoch? Details: fit.keras.engine.training.Model.

verbose

Logical. Show and plot diagnostic output (TRUE)?

model

Provide your own keras/TensorFlow model. Output units must equal the number of classes (columns of y), and input_shape must equal nrow(sce). This can be used to extend the logistic regression model by adding hidden layers.
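A minimal sketch of such a custom model, assuming sce and a y matrix are already defined (the hidden-layer size of 64 is illustrative):

library(keras)
custom_model <- keras_model_sequential() %>%
  # hidden layer turns logistic regression into a 1-hidden-layer network
  layer_dense(units = 64, activation = "relu", input_shape = nrow(sce)) %>%
  # output layer: one softmax unit per class, as required
  layer_dense(units = ncol(y), activation = "softmax")
model_res <- fit_logistic_model(sce, y = y, model = custom_model)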

confusion

The confusion table generated by table(), or the output of ParetoTI::fit_logistic_model(), or the output of ParetoTI::predict_logistic_prob().

normalize

Normalise so that the cells of each observed class sum to 1?
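In base R terms, this normalisation corresponds to row proportions (a conceptual illustration with toy labels, not the package internals):

observed <- c("B", "B", "T", "T", "T")
predicted <- c("B", "T", "T", "T", "B")
confusion <- table(observed, predicted)
prop.table(confusion, margin = 1)  # each observed-class row sums to 1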

text_color

Color of on-plot text showing absolute numbers of cells.

model_res

Output of ParetoTI::fit_logistic_model(), an object of class "logistic_model_fit_TF".

ref_y

Reference cell labels as a ref_y matrix (dim = cells * classes) where each row sums to 1. Optional; can be used to match continuous cell labels across datasets.

ref_y_col

Reference class column in colData(sce), for example, unannotated clusters assigned by a clustering algorithm. Optional; can be used to match discrete cell labels across datasets.
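For instance, a hedged sketch (sce2 and its "cluster" colData column are hypothetical names for a second dataset):

# Match predicted classes to existing cluster labels in a second dataset
pred <- predict_logistic_prob(sce2, model_res, ref_y_col = "cluster")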

Examples

# download PBMC data as SingleCellExperiment object
# split in 2 parts
# fit logistic regression model to 1st part
# use this model to predict cell types in the second part
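A hedged sketch of this workflow; the data-loading step is a placeholder, and "Cell_class" is the assumed label column (any SingleCellExperiment with a logcounts assay and cell labels in colData works):

library(ParetoTI)
library(SingleCellExperiment)
# sce <- ...  # load a labelled PBMC dataset as a SingleCellExperiment
# split in 2 parts
set.seed(4354)
train_idx <- sample(seq_len(ncol(sce)), size = floor(ncol(sce) / 2))
sce_train <- sce[, train_idx]
sce_test  <- sce[, -train_idx]
# fit logistic regression model to the 1st part
model_res <- fit_logistic_model(sce_train, y_col = "Cell_class")
plot_confusion(model_res)  # check performance on training/validation cells
# use this model to predict cell types in the second part
pred <- predict_logistic_prob(sce_test, model_res)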