Find features that are decreasing functions of distance from archetype

find_decreasing() Fits GAM models to find features that are a decreasing function of distance from archetype. Both the GAM functions and their first derivatives can be visualised using the plot() method.

fit_arc_gam_1() Finds a single GAM model fit and its first derivative for a single feature and one or several archetypes.

get_top_decreasing() Finds genes that are highest at each archetype and pass the p-value threshold, and prints the top 12 genes and top 3 gene sets for each archetype.

find_decreasing_wilcox() Finds features that are a decreasing function of distance from archetype by finding features with the highest value (median) in the bin closest to archetype (1 vs all Wilcox test).

bin_cells_by_arch() Finds which cells are in the bin closest to archetype.

find_tradeoff_wilcox() Finds features that differ most between 2 archetypes (features at a tradeoff, i.e. differentially expressed (DE) genes) by finding features with the highest value (median) in the bin closest to archetype (1 vs all Wilcox test).
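
The GAM approach described above can be sketched with mgcv directly. This is a minimal illustration on simulated data, not the package's implementation: the variable names (dist_to_arc, expr) and the data are assumptions, and the first derivative is approximated by a forward finite difference with interval d.

```r
library(mgcv)

set.seed(1)
n_cells <- 300
# Hypothetical data: distance of each cell to one archetype, and a feature
# whose value decays with that distance.
dist_to_arc <- runif(n_cells)
expr <- exp(-3 * dist_to_arc) + rnorm(n_cells, sd = 0.1)
dat <- data.frame(dist_to_arc = dist_to_arc, expr = expr)

# Smooth of distance with 4 basis functions and a lower bound on the
# smoothing parameter, mirroring N_smooths = 4 and min.sp = 60.
fit <- gam(expr ~ s(dist_to_arc, k = 4), data = dat, min.sp = 60)

# Evaluate the fitted curve on a grid and approximate the first derivative
# with a forward finite difference.
n_points <- 200
d <- 1 / n_points
grid <- seq(min(dist_to_arc), max(dist_to_arc), length.out = n_points)
f0 <- predict(fit, newdata = data.frame(dist_to_arc = grid))
f1 <- predict(fit, newdata = data.frame(dist_to_arc = grid + d))
deriv <- as.numeric(f1 - f0) / d

# A negative average derivative indicates a decreasing function of distance.
mean_deriv <- mean(deriv)
```

In this sketch a feature would be called decreasing when mean_deriv is negative; how the package turns derivative values into p-values is not shown here.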

find_decreasing(data_attr, arc_col, features = c("Gpx1", "Alb", "Cyp2e1",
  "Apoa2")[3], min.sp = c(60), N_smooths = 4, n_points = 200,
  d = 1/n_points, weights = c(rep(1, each = n_points), rep(c(1, 0),
  each = n_points/2))[1], return_only_summary = FALSE,
  stop_at_10 = TRUE, one_arc_per_model = TRUE, type = c("s", "m",
  "cmq")[1], clust_options = list(), ...)

fit_arc_gam_1(feature, col, N_smooths, data_attr, min.sp, ..., d, n_points,
  weights)

get_top_decreasing(summary_genes, summary_sets = NULL,
  cutoff_genes = 0.01, cutoff_sets = 0.01,
  cutoff_metric = "wilcoxon_p_val", p.adjust.method = c("fdr",
  "none")[1], gam_fit_pval = 0.01, invert_cutoff = FALSE,
  order_by = "mean_diff", order_decreasing = TRUE,
  min_max_diff_cutoff_g = 0.3, min_max_diff_cutoff_f = 0.1)

find_decreasing_wilcox(data_attr, arc_col, features = c("Gpx1", "Alb",
  "Cyp2e1", "Apoa2")[3], bin_prop = 0.1, dist_cutoff = NULL,
  na.rm = FALSE, type = c("s", "m", "cmq")[1],
  clust_options = list(), method = c("BioQC", "r_stats")[1])

bin_cells_by_arch(data_attr, arc_col, bin_prop = 0.1,
  dist_cutoff = NULL, return_names = FALSE)

find_tradeoff_wilcox(data_attr, arc_col = c("archetype_1",
  "archetype_2"), features = c("Gpx1", "Alb", "Cyp2e1", "Apoa2")[3],
  bin_prop = 0.1, na.rm = FALSE)

Arguments

data_attr	data.table with dimensions (examples, dimensions) that includes the distance of each example to each archetype in the columns given by `arc_col` and feature values in the columns given by `features`
arc_col	character vector, columns that give distance to archetypes (column per archetype)
features	character vector (1L), column that contains feature values
min.sp	lower bound for the smoothing parameter, see gam for details. The default value of 60 works well to stabilise the curve shape near the minimum and maximum distance
N_smooths	number of bases used to represent the smooth term (s), 4 for cubic splines
n_points	number of points at which to evaluate derivative
d	numeric vector (1L), finite difference interval
weights	how to weight points along the x axis when calculating the mean (integral) probability. Useful if you care that the function is decreasing near the archetype but not far away. The two defaults shown in the usage either weight all points equally or discard the bottom 50 percent.
return_only_summary	return only summary data.table containing p-values for each feature at each archetype and effect-size measures (average derivative).
stop_at_10	prevents `find_decreasing()` from fitting too many features
one_arc_per_model	If TRUE fit separate gam models for each archetype. If FALSE combine all archetypes in one model: feature ~ s(arc1) + s(arc2) + ... + s(arcN).
type	one of s, m, cmq. s means single-core processing using lapply. m means multi-core parallel processing using parLapply. cmq means multi-node parallel processing on a computing cluster using the clustermq package.
clust_options	list of options for parallel processing. The default for "m" is list(cores = parallel::detectCores()-1, cluster_type = "PSOCK"). The default for "cmq" is list(memory = 2000, template = list(), n_jobs = 10, fail_on_error = FALSE). Change these options as required.
...	arguments passed to gam
summary_genes	gam_deriv summary data.table for decreasing genes
summary_sets	gam_deriv summary data.table for decreasing gene sets
cutoff_genes	value of cutoff_metric (lower bound) for genes
cutoff_sets	value of cutoff_metric (lower bound) for gene sets
cutoff_metric	probability metric for selecting decreasing genes: mean_prob, prod_prob, mean_prob_excl or prod_prob_excl. The default shown in the usage is "wilcoxon_p_val".
p.adjust.method	choose method for correcting p-value for multiple hypothesis testing. See p.adjust.methods and p.adjust for details.
gam_fit_pval	smooth term probability in gam fit (upper bound)
invert_cutoff	invert cutoff for genes and sets. If FALSE p < cutoff_genes, if TRUE p > cutoff_genes.
order_by	order the list of decreasing features by this measure in the summary tables. The default is mean_diff, the average difference between cells in the bin closest to archetype and all other cells. When using GAM instead of the Wilcox test, set this to one of c("deriv100", "deriv50", "deriv20"), the average value of the derivative at the 100/50/20 percent of points closest to archetype.
order_decreasing	order significant categories using `order_by`
min_max_diff_cutoff_g	minimum mean difference (log-ratio, when y is in log-space) of gene expression at the point closest to archetype compared to the point furthest from archetype. When the Wilcox method was used, it is the difference between the mean of the bin closest to archetype and the mean of all other cells. By default, at least 0.3 for genes and 0.1 for functions.
min_max_diff_cutoff_f	see min_max_diff_cutoff_g
bin_prop	proportion of data to put in bin closest to archetype
dist_cutoff	cutoff on cell distances to archetypes (upper bound) for putting cells into the bin closest to archetype.
method	which implementation of the Wilcoxon test find_decreasing_wilcox() should use: wmwTest or wilcox.test. BioQC::wmwTest can be up to 1000 times faster, so it is the default.
return_names	return names of cells instead of a list of indices of cells?
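
To illustrate the `weights` argument, the sketch below averages values over n_points grid positions using the two weighting schemes from the usage. The derivative values are made up, the grid is assumed to be ordered from nearest to furthest from the archetype, and applying the weights with weighted.mean() is an assumption for illustration, not the package's internal code.

```r
n_points <- 200

# Hypothetical derivative values along the distance axis: negative
# (decreasing) for the closest 40% of points, positive further away.
x <- seq(0, 1, length.out = n_points)
deriv <- x - 0.4

# Scheme 1: weight all points equally.
w_equal <- rep(1, n_points)
# Scheme 2: keep the first half of the grid, discard the second half.
w_near <- rep(c(1, 0), each = n_points / 2)

mean_all <- weighted.mean(deriv, w_equal)
mean_near <- weighted.mean(deriv, w_near)
```

With these made-up values the equally weighted average is positive while the average restricted to the half of the grid nearest the archetype is negative, which is the situation the `weights` argument is meant to handle.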

Value

find_decreasing() list (S3 object, gam_deriv) containing summary p-values for features and each archetype, function call and (optionally) a data.table with values of the first derivative

fit_arc_gam_1() list containing function call, 1st derivative values of GAM model (derivs), summary of GAM model (p-value and r^2, gam_sm)

get_top_decreasing() prints a summary to output and returns a list containing a character vector with one element for each archetype and 2 data.tables with the selection of enriched genes and functions.

find_decreasing_wilcox() data.table containing p-values for each feature at each archetype and effect-size measures (average difference between bins). When log(counts) was used mean_diff reflects log-fold change.

bin_cells_by_arch() list of indices of cells or names of cells that are in bin closest to each archetype

find_tradeoff_wilcox() data.table containing p-values for each feature at each archetype and effect-size measures (average difference between bins). When log(counts) was used mean_diff reflects log-fold change.
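
The binning and 1 vs all testing that produce these tables can be sketched in base R. This is an illustration under stated assumptions rather than the package's code: the column names (archetype_1, gene_a, gene_b) and data are simulated, the bin is defined with quantile(), the test is one-sided, and stats::wilcox.test stands in for BioQC::wmwTest.

```r
set.seed(1)
n_cells <- 200
# Hypothetical input: one distance column and two feature columns.
dat <- data.frame(archetype_1 = runif(n_cells))
dat$gene_a <- 2 * (1 - dat$archetype_1) + rnorm(n_cells, sd = 0.3) # decreasing
dat$gene_b <- rnorm(n_cells)                                       # unrelated

# Cells in the bin closest to the archetype: the bin_prop fraction of cells
# with the smallest distance.
bin_prop <- 0.1
in_bin <- dat$archetype_1 <= quantile(dat$archetype_1, bin_prop)

# One-sided Wilcoxon test of bin vs all other cells for each feature, with
# the mean difference between the two groups as an effect size.
features <- c("gene_a", "gene_b")
res <- do.call(rbind, lapply(features, function(f) {
  test <- wilcox.test(dat[[f]][in_bin], dat[[f]][!in_bin],
                      alternative = "greater")
  data.frame(feature = f,
             p = test$p.value,
             mean_diff = mean(dat[[f]][in_bin]) - mean(dat[[f]][!in_bin]))
}))

# FDR correction, then keep significant features ordered by effect size.
res$p_adj <- p.adjust(res$p, method = "fdr")
top <- res[res$p_adj < 0.01, ]
top <- top[order(top$mean_diff, decreasing = TRUE), ]
```

The last three lines mirror the kind of filtering and ordering controlled by cutoff_genes, p.adjust.method, order_by and order_decreasing in get_top_decreasing().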
