find_decreasing() Fits GAM models to find features that are a decreasing function of distance from an archetype. Both the fitted GAM functions and their first derivatives can be visualised using the plot() method.

fit_arc_gam_1() Fits a single GAM model and finds its first derivative for a single feature and one or several archetypes.

get_top_decreasing() Finds genes highest at each archetype that pass the p-value threshold, and prints the top 12 genes and top 3 gene sets for each archetype.

find_decreasing_wilcox() Finds features that are a decreasing function of distance from an archetype by finding features with the highest value (median) in the bin closest to the archetype (1 vs all Wilcoxon test).

bin_cells_by_arch() Finds which cells are in the bin closest to each archetype.

find_tradeoff_wilcox() Finds features that differ most between 2 archetypes (features at a tradeoff; differentially expressed, DE, genes) by finding features with the highest value (median) in the bin closest to an archetype (1 vs all Wilcoxon test).
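
The binning and 1 vs all test described above can be sketched in base R. This is a minimal illustration on simulated data, not the package's implementation: the variable names, the simulated values and the one-sided alternative are assumptions, and stats::wilcox.test stands in for BioQC::wmwTest.

```r
set.seed(1)
n_cells <- 500

# Hypothetical data: distance of each cell to one archetype, and a feature
# whose value falls off with that distance
dist_to_arc <- runif(n_cells)
expr <- 2 * exp(-4 * dist_to_arc) + rnorm(n_cells, sd = 0.3)

bin_prop <- 0.1  # proportion of cells to put in the bin closest to the archetype

# Cells whose distance is within the closest bin_prop fraction of all cells.
# With a fixed distance cutoff one would instead use: dist_to_arc <= dist_cutoff
cutoff <- quantile(dist_to_arc, probs = bin_prop)
in_bin <- dist_to_arc <= cutoff

# 1 vs all Wilcoxon rank-sum test: closest bin against all other cells
# (the one-sided alternative is an assumption made for this sketch)
test <- wilcox.test(expr[in_bin], expr[!in_bin], alternative = "greater")

# Effect size: difference of means between the closest bin and all other cells
mean_diff <- mean(expr[in_bin]) - mean(expr[!in_bin])

test$p.value
mean_diff
```

A small p-value together with a positive mean_diff marks the feature as highest near the archetype in this toy setting.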

find_decreasing(data_attr, arc_col, features = c("Gpx1", "Alb", "Cyp2e1",
  "Apoa2")[3], min.sp = c(60), N_smooths = 4, n_points = 200,
  d = 1/n_points, weights = c(rep(1, each = n_points), rep(c(1, 0),
  each = n_points/2))[1], return_only_summary = FALSE,
  stop_at_10 = TRUE, one_arc_per_model = TRUE, type = c("s", "m",
  "cmq")[1], clust_options = list(), ...)

fit_arc_gam_1(feature, col, N_smooths, data_attr, min.sp, ..., d, n_points,
  weights)

get_top_decreasing(summary_genes, summary_sets = NULL,
  cutoff_genes = 0.01, cutoff_sets = 0.01,
  cutoff_metric = "wilcoxon_p_val", p.adjust.method = c("fdr",
  "none")[1], gam_fit_pval = 0.01, invert_cutoff = FALSE,
  order_by = "mean_diff", order_decreasing = TRUE,
  min_max_diff_cutoff_g = 0.3, min_max_diff_cutoff_f = 0.1)

find_decreasing_wilcox(data_attr, arc_col, features = c("Gpx1", "Alb",
  "Cyp2e1", "Apoa2")[3], bin_prop = 0.1, dist_cutoff = NULL,
  na.rm = FALSE, type = c("s", "m", "cmq")[1],
  clust_options = list(), method = c("BioQC", "r_stats")[1])

bin_cells_by_arch(data_attr, arc_col, bin_prop = 0.1,
  dist_cutoff = NULL, return_names = FALSE)

find_tradeoff_wilcox(data_attr, arc_col = c("archetype_1",
  "archetype_2"), features = c("Gpx1", "Alb", "Cyp2e1", "Apoa2")[3],
  bin_prop = 0.1, na.rm = FALSE)
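
The one_arc_per_model argument in the find_decreasing() usage switches between one model per archetype and a single additive model over all archetypes. A hedged sketch of how the two sets of formulas could be assembled is shown here; the column names are placeholders and the package may build its formulas differently.

```r
# Placeholder names for illustration only
feature   <- "Cyp2e1"
arc_col   <- c("archetype_1", "archetype_2", "archetype_3")
N_smooths <- 4

# One smooth term per archetype distance column
smooth_terms <- paste0("s(", arc_col, ", k = ", N_smooths, ")")

# Separate models: one formula per archetype
separate <- lapply(smooth_terms, function(term) {
  as.formula(paste(feature, "~", term))
})

# Combined model: feature ~ s(arc1) + s(arc2) + ... + s(arcN)
combined <- as.formula(
  paste(feature, "~", paste(smooth_terms, collapse = " + "))
)

separate[[1]]
combined
```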

Arguments

data_attr

data.table of dim(examples, dimensions) that includes the distance of each example to each archetype in the columns given by arc_col and feature values in the columns given by features

arc_col

character vector, columns that give distance to archetypes (column per archetype)

features

character vector (1L), column that contains feature values

min.sp

lower bound for the smoothing parameter; see gam for details. The default value of 60 works well to stabilise the curve shape near the minimum and maximum distance

N_smooths

number of bases used to represent the smooth term (s), 4 for cubic splines

n_points

number of points at which to evaluate derivative

d

numeric vector (1L), finite difference interval

weights

how to weight points along the x axis when calculating the mean (integral) probability. Useful if you care that the function is decreasing near the archetype but not far away. The two suggested defaults either weight all points equally or discard the bottom 50 percent of points.

return_only_summary

return only summary data.table containing p-values for each feature at each archetype and effect-size measures (average derivative).

stop_at_10

prevents find_decreasing() from fitting too many features

one_arc_per_model

If TRUE fit separate gam models for each archetype. If FALSE combine all archetypes in one model: feature ~ s(arc1) + s(arc2) + ... + s(arcN).

type

one of s, m, cmq. s means single-core processing using lapply. m means multi-core parallel processing using parLapply. cmq means multi-node parallel processing on a computing cluster using the clustermq package.

clust_options

list of options for parallel processing. The default for "m" is list(cores = parallel::detectCores()-1, cluster_type = "PSOCK"). The default for "cmq" is list(memory = 2000, template = list(), n_jobs = 10, fail_on_error = FALSE). Change these options as required.

...

arguments passed to gam

summary_genes

gam_deriv summary data.table for decreasing genes

summary_sets

gam_deriv summary data.table for decreasing gene sets

cutoff_genes

value of cutoff_metric (lower bound) for genes

cutoff_sets

value of cutoff_metric (lower bound) for gene sets

cutoff_metric

probability metric for selecting decreasing genes: mean_prob, prod_prob, mean_prob_excl or prod_prob_excl

p.adjust.method

choose method for correcting p-value for multiple hypothesis testing. See p.adjust.methods and p.adjust for details.

gam_fit_pval

smooth term probability in gam fit (upper bound)

invert_cutoff

invert cutoff for genes and sets. If FALSE p < cutoff_genes, if TRUE p > cutoff_genes.

order_by

measure used to order the list of decreasing features, taken from the summary tables. The default is mean_diff, the average difference between cells in the bin closest to the archetype and all other cells. When using GAM instead of the Wilcoxon test, set this to one of c("deriv100", "deriv50", "deriv20"), the average value of the derivative at the 100/50/20 percent of points closest to the archetype.

order_decreasing

order significant categories using order_by

min_max_diff_cutoff_g

required mean difference (log-ratio when y is in log-space) in gene expression between the point closest to the archetype and the point furthest from it. When the Wilcoxon method was used, this is the difference between the mean of the bin closest to the archetype and the mean of all other cells. By default, at least 0.3 for genes and 0.1 for functions.

min_max_diff_cutoff_f

see min_max_diff_cutoff_g

bin_prop

proportion of data to put in bin closest to archetype

dist_cutoff

cutoff of cell distances to archetypes (upper bound) used to put cells into the bin closest to the archetype.

method

which test implementation find_decreasing_wilcox() uses: "BioQC" for BioQC::wmwTest or "r_stats" for wilcox.test. BioQC::wmwTest can be up to 1000 times faster, so it is the default.

return_names

if TRUE, return a list of cell names; if FALSE, return a list of cell indices.
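
To make the roles of min.sp, N_smooths, n_points, d and weights more concrete, here is a minimal sketch that fits one GAM with mgcv and estimates its first derivative by finite differences. It is an illustration on simulated data, not the code of fit_arc_gam_1(): the data, the variable names, the forward-difference scheme and the use of a weighted mean of the derivative are all assumptions.

```r
library(mgcv)  # recommended package distributed with R

set.seed(1)
# Hypothetical data: a feature that decays with distance from an archetype
sim <- data.frame(dist_to_arc = runif(300))
sim$expr <- exp(-3 * sim$dist_to_arc) + rnorm(300, sd = 0.1)

N_smooths <- 4             # basis dimension of the smooth term
n_points  <- 200           # number of points at which to evaluate the derivative
d         <- 1 / n_points  # finite-difference interval

# Fit feature ~ s(distance); min.sp is a lower bound on the smoothing parameter
fit <- gam(expr ~ s(dist_to_arc, k = N_smooths), data = sim, min.sp = 60)

# Evaluate the fitted curve on a grid of distances and at grid + d
grid <- seq(min(sim$dist_to_arc), max(sim$dist_to_arc), length.out = n_points)
y0 <- as.numeric(predict(fit, newdata = data.frame(dist_to_arc = grid)))
y1 <- as.numeric(predict(fit, newdata = data.frame(dist_to_arc = grid + d)))

# Forward finite-difference estimate of the first derivative
deriv <- (y1 - y0) / d

# Two weightings along the distance axis: all points equally, or only the half
# of the grid nearest the archetype (the grid is sorted by increasing distance)
w_equal <- rep(1, n_points)
w_half  <- rep(c(1, 0), each = n_points / 2)

mean_deriv_all  <- weighted.mean(deriv, w_equal)
mean_deriv_near <- weighted.mean(deriv, w_half)
```

A negative average derivative indicates that the fitted curve decreases with distance from the archetype. The second weighting only considers the region near the archetype.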

Value

find_decreasing() list (S3 object, gam_deriv) containing summary p-values for features and each archetype, function call and (optionally) a data.table with values of the first derivative

fit_arc_gam_1() list containing function call, 1st derivative values of GAM model (derivs), summary of GAM model (p-value and r^2, gam_sm)

get_top_decreasing() prints a summary to output and returns a list with a character vector containing one element for each archetype, and two data.tables with the selection of enriched genes and functions.

find_decreasing_wilcox() data.table containing p-values for each feature at each archetype and effect-size measures (average difference between bins). When log(counts) was used mean_diff reflects log-fold change.

bin_cells_by_arch() list of indices of cells or names of cells that are in bin closest to each archetype

find_tradeoff_wilcox() data.table containing p-values for each feature at each archetype and effect-size measures (average difference between bins). When log(counts) was used mean_diff reflects log-fold change.
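
The kind of selection get_top_decreasing() performs on such a summary table (multiple-testing correction, a p-value cutoff, an effect-size cutoff, then ordering) can be sketched with a plain data.frame. The column names gene and p and all values below are made up for illustration and are not the package's actual output.

```r
# Hypothetical summary for one archetype (values invented for illustration)
summary_genes <- data.frame(
  gene      = c("Alb", "Cyp2e1", "Gpx1", "Apoa2"),
  p         = c(0.0001, 0.004, 0.2, 0.002),
  mean_diff = c(1.2, 0.5, 0.9, 0.1),
  stringsAsFactors = FALSE
)

cutoff_genes          <- 0.01
min_max_diff_cutoff_g <- 0.3

# Correct p-values for multiple hypothesis testing (Benjamini-Hochberg FDR)
summary_genes$p_adj <- p.adjust(summary_genes$p, method = "fdr")

# Keep features that pass both the p-value and the effect-size cutoffs
keep <- summary_genes$p_adj < cutoff_genes &
  summary_genes$mean_diff > min_max_diff_cutoff_g
top <- summary_genes[keep, ]

# Order the remaining features by effect size, largest first
top <- top[order(top$mean_diff, decreasing = TRUE), ]
top$gene  # "Alb" "Cyp2e1"
```

In this toy table Gpx1 is dropped by the p-value cutoff and Apoa2 by the effect-size cutoff.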