Statistical functions¶
This page describes the statistical functions that are available in Phonometrica.
Array statistics¶
- mean(x[, dim])¶
Returns the mean of the array x. If dim is specified, returns an Array in which each element
represents the mean over the given dimension in a two-dimensional array. If dim is equal to 1, the calculation is performed
over rows. If it is equal to 2, it is performed over columns.
- std(x[, dim])¶
Returns the standard deviation of the array x. If dim is specified, returns an Array in which each element
represents the standard deviation over the given dimension in a two-dimensional array. If dim is equal to 1, the calculation is performed
over rows. If it is equal to 2, it is performed over columns.
- sum(x[, dim])¶
Returns the sum of the elements in the array x. If dim is specified, returns an Array in which each element
represents the sum over the given dimension in a two-dimensional array. If dim is equal to 1, the summation is performed
over rows. If it is equal to 2, summation is performed over columns.
- vrc(x[, dim])¶
Returns the sample variance of the array x. If dim is specified, returns an Array in which each element
represents the variance over the given dimension in a two-dimensional array. If dim is equal to 1, the calculation is performed
over rows. If it is equal to 2, it is performed over columns.
See also: std()
Model fitting¶
- fit(formula, data[, family])¶
Fits a frequentist statistical model from a formula string and a data table (concordance or dataset). This is the main entry point for model fitting in Phonometrica.
formula is an R-style formula string (e.g. "f1 ~ vowel + gender + (1|speaker)"). data is a
DataTable object (a concordance or dataset). family is an optional string specifying the
distributional family; the default is "gaussian".
Supported families:
"gaussian": linear regression / LMM (identity link)"binomial": logistic regression / logistic GLMM (logit link)"poisson": Poisson regression / Poisson GLMM (log link)"negbin": negative binomial regression / NB GLMM (log link)"beta": beta regression / beta GLMM (logit link), for proportions in (0, 1)"student": Student t regression / t mixed model (identity link), for continuous outcomes with heavy-tailed residuals (e.g. formant measurements with tracking errors). The scale parameter σ and the degrees-of-freedom parameter ν are estimated jointly with the regression coefficients. Observations with large residuals are automatically down-weighted, making the estimates robust to outliers. When ν → ∞, the model reduces to Gaussian regression.
Returns a Model object (see Model fields below).
Example:
let ds = load("my_data.csv")
let m = fit("f1 ~ vowel + gender + (1|speaker)", ds)
summarize(m)
print "AIC = " & m.aic
let m2 = fit("voicing ~ consonant + position + (1|speaker)", ds, "beta")
summarize(m2)
let m3 = fit("f1 ~ vowel + gender + (1|speaker)", ds, "student")
summarize(m3)
print "sigma = " & m3.sigma
print "nu = " & m3.nu
- fit(formula, data, prior[, prior])¶
Fits a Bayesian model using approximate Bayesian inference (INLA-style). The prior
argument is a Prior object (see Prior specification below) that controls the
prior distributions on all model parameters.
When family is omitted, "gaussian" is used. The function returns a Model object
with estimation set to "Bayesian" and the posterior summary fields populated
(see Bayesian model fields below).
At fit time, any prior fields left at their auto-scaled defaults are replaced by data-dependent weakly informative priors (see Prior specification).
Example:
let ds = load("my_data.csv")
// Bayesian fit with default priors
let prior = Prior()
let m = fit("f1 ~ vowel + (1|speaker)", ds, prior)
summarize(m)
print "pd for vowel[i] = " & m.pd[2]
// Bayesian fit with custom fixed-effects prior
let prior2 = Prior()
set_fixed(prior2, 0, 5) // N(0, 5) for all slopes
let m2 = fit("count ~ group + (1|subject)", ds, "poisson", prior2)
summarize(m2)
- summarize(model)¶
Prints a summary of a fitted Model object. The output adapts to the estimation method:
Frequentist models: fixed-effects coefficients (estimates, standard errors, z/t-values, and p-values), random-effects variance components (if present), and overall fit statistics (AIC, BIC, log-likelihood).
Bayesian models: fixed-effects posterior summaries (posterior mean, posterior SD, 95% credible interval bounds, and probability of direction with significance codes), hyperparameter posteriors (variance component SDs, dispersion parameters) with credible intervals when available from grid integration, and Bayesian fit statistics (WAIC, LOO-IC with Pareto k diagnostics, LPPD, log-marginal likelihood).
Example:
let m = fit("f1 ~ vowel + (1|speaker)", ds)
summarize(m)
// Bayesian
let prior = Prior()
let mb = fit("f1 ~ vowel + (1|speaker)", ds, prior)
summarize(mb)
- get_coef(model)¶
Prints a formatted table of the estimated fixed-effects coefficients and returns the coefficient array from a fitted model.
- compare(model1, model2)¶
Compares two fitted models. The comparison method depends on the estimation type:
Frequentist models: likelihood-ratio test (LRT) with a table of information criteria (AIC, BIC, log-likelihood, deviance) and the LRT chi-squared statistic with p-value. The models should be nested (one should be a special case of the other).
Bayesian models: WAIC and LOO-IC comparison tables (with ΔWAIC, ΔLOO-IC and their standard errors), Pareto k diagnostic summaries, and log Bayes factors. Cannot be mixed with frequentist models.
Example:
let m1 = fit("f1 ~ vowel + (1|speaker)", ds)
let m2 = fit("f1 ~ vowel + gender + (1|speaker)", ds)
compare(m1, m2)
- filter(table as DataTable, expression as String[, label as String])¶
Returns a new dataset containing only the rows that match the filter expression.
If label is provided, the resulting dataset will have the given label.
Example:
let ds = load("data.csv")
let females = filter(ds, "gender == 'F'")
Post-hoc analysis¶
- emmeans(model, factor[, adjustment])¶
Computes and prints estimated marginal means (EMMs) for the given categorical factor. EMMs are population-averaged predictions at each level of the factor, with other categorical factors balanced and numeric covariates held at their observed means.
If adjustment is provided ("holm", "bonferroni", or "none"), pairwise contrasts
between all levels of the factor are computed and printed as well.
For Bayesian models, the EMM table reports credible intervals instead of confidence
intervals, and the contrast table reports the probability of direction (pd) instead of
adjusted p-values. Multiplicity adjustment is not applied to pd values; the adjustment
argument is ignored.
Example:
let m = fit("f1 ~ vowel + gender + (1|speaker)", ds)
emmeans(m, "vowel", "holm")
- emtrends(model, factor, variable[, adjustment])¶
Estimates the slope (trend) of a continuous variable at each level of a categorical factor. This is
useful when the model includes an interaction between a numeric covariate and a factor (e.g.
f2 ~ frequency * group).
Results are reported on the link scale. If adjustment is provided, pairwise contrasts
of the slopes across factor levels are computed and printed.
Example:
let m = fit("f2 ~ frequency * group + (1|speaker)", ds)
emtrends(m, "group", "frequency", "holm")
Diagnostics¶
- test_residuals(model)¶
Computes simulation-based residual diagnostics similar to the DHARMa package in R [HAR2022] and prints the results. Three tests are performed:
Uniformity test (Kolmogorov-Smirnov): tests whether the scaled residuals follow a uniform distribution, as expected if the model is correctly specified.
Dispersion test: checks for over- or under-dispersion by comparing the observed variance of the scaled residuals to its expected value.
Outlier test: counts observations whose response falls entirely outside the simulated range.
For frequentist models, the simulations are drawn from the fitted model (conditional simulation with 1000 replicates).
For Bayesian models, the function performs posterior predictive checking (PPC) for the uniformity and dispersion tests: coefficient vectors are drawn from the posterior (200 replicates), new response vectors are simulated, and the test statistics are compared between observed and simulated data. The resulting p-values are Bayesian p-values (proportion of replicates where the simulated statistic exceeds the observed).
Example:
let m = fit("count ~ condition + (1|subject)", ds, "poisson")
test_residuals(m)
Model fields¶
A Model object returned by fit() has the following fields:
- formula¶
The formula string used to fit the model.
- family¶
The family name (e.g. "gaussian", "binomial", "poisson", "negbin", "beta", "student").
- link¶
The link function name (e.g. "identity", "logit", "log").
- nobs¶
Number of observations.
- aic¶
Akaike Information Criterion.
- bic¶
Bayesian Information Criterion.
- loglik¶
Log-likelihood at convergence.
- deviance¶
Residual deviance.
- r2¶
R² (Gaussian fixed-effects models only).
- adj_r2¶
Adjusted R² (Gaussian fixed-effects models only).
- r2_marginal¶
Nakagawa marginal R² (mixed models only).
- r2_conditional¶
Nakagawa conditional R² (mixed models only).
- rse¶
Residual standard error (Gaussian only).
- df¶
Residual degrees of freedom.
- theta¶
Overdispersion parameter (negative binomial only; 0 otherwise).
- phi¶
Precision parameter (beta only; 0 otherwise).
- sigma¶
Scale parameter (Student t only; 0 otherwise).
- nu¶
Degrees of freedom (Student t only; 0 otherwise).
- converged¶
Boolean indicating whether the optimizer converged.
- niter¶
Number of iterations (0 for OLS).
- fitted¶
Array of fitted values from the model.
- residuals¶
Array of response residuals (observed − fitted) from the model.
- estimation¶
A string indicating the estimation method: "Frequentist" or "Bayesian".
Available on all models.
- waic¶
WAIC (Watanabe–Akaike information criterion). NaN for frequentist models.
- loo_ic¶
LOO-IC (PSIS leave-one-out information criterion). NaN for frequentist models.
- p_waic¶
Effective number of parameters from WAIC. NaN for frequentist models.
- p_loo¶
Effective number of parameters from LOO-IC. NaN for frequentist models.
- se_waic¶
Standard error of WAIC. NaN for frequentist models.
- se_loo¶
Standard error of LOO-IC. NaN for frequentist models.
- pareto_k¶
Array of per-observation Pareto k diagnostics from PSIS-LOO. Empty for frequentist models. Values below 0.7 indicate that LOO-IC is reliable; values above 0.7 suggest that the LOO approximation may be poor for those observations.
- log_marginal¶
Log-marginal likelihood (Laplace approximation). NaN for frequentist models.
Prior specification¶
A Prior object controls the prior distributions used for Bayesian model fitting. Create
one by calling the Prior constructor, then optionally configure individual priors before
passing it to fit().
When a prior field is left at its default, Phonometrica replaces it at fit time with a data-dependent weakly informative prior scaled to the response variable (following the approach of brms). Setting a prior explicitly disables auto-scaling for that component.
Warning
Match the prior scale to the response scale. Custom priors supplied via
set_fixed(), set_variance(), and set_residual() are applied on the
link scale of the model, which for identity-link families (Gaussian, Student-t) is
the raw response scale. Data on a Hz scale (e.g. F1 formant values in the hundreds)
will have slope coefficients of similar magnitude, so a tight prior like N(0, 10)
acts as a very strong shrinkage toward zero — pulling coefficients away from their
data-supported values and, for Student-t specifically, causing the optimizer to
inflate sigma and push nu to its upper bound as it compensates.
For logit-link families (Binomial, Beta) and log-link families (Poisson, Negative
Binomial), coefficients live on a transformed scale and are typically O(1), so
N(0, 10) is a loose prior by default. Custom priors on these families can
usually use moderate scales without issue.
When in doubt, omit the set_fixed call and let the auto-scaled defaults apply
(the constructor default is to auto-scale every field). Phonometrica will emit a
prior_warning on the fitted model when the residual scale of an identity-link
fit exceeds 1.5 × sd(y), which reliably flags this class of misconfiguration.
- Prior()¶
Creates a new prior specification with all fields set to auto-scaled defaults:
Fixed effects (slopes): Normal(0, 2.5 × sd(y))
Intercept: Normal(mean(y), 2.5 × sd(y))
Variance components: PC(2.5 × sd(y), 0.05)
Residual SD: PC(2.5 × sd(y), 0.05)
NB θ: Gamma(1, 0.01)
Beta φ: Gamma(1, 0.01)
Example:
let prior = Prior()
- set_fixed(prior, mean, sd)¶
Sets the default Normal prior for all fixed-effect slope coefficients to Normal(mean,
sd). The intercept receives a separate prior centered on the response mean (this is
handled automatically). Disables auto-scaling for fixed effects.
Note
The sd value is on the link scale of the model. For identity-link families
(Gaussian, Student-t) fit to data on a Hz or dB scale, a small sd such as 10
strongly shrinks coefficients toward the prior mean and may produce a degenerate
fit. See prior-scale-awareness for details.
Example:
let prior = Prior()
set_fixed(prior, 0, 5) // N(0, 5) for all slopes
- set_fixed(prior, name, mean, sd)¶
Sets a Normal prior for a specific coefficient identified by name (e.g.
"Intercept", "age"). This overrides the default fixed-effects prior for that
coefficient only.
Example:
let prior = Prior()
set_fixed(prior, "Intercept", 500, 100) // informative prior for intercept (e.g. F1 in Hz)
- set_variance(prior, type, scale)¶
- set_variance(prior, type, param1, param2)
Sets the prior for random-effect standard deviations (variance components). type is one
of "pc" (penalized complexity), "half_cauchy" (Half-Cauchy), or "half_normal"
(Half-Normal). Disables auto-scaling for variance components.
For "pc", two parameters are required: param1 is the upper bound u and param2
is the tail probability α, such that P(σ > u) = α. For "half_cauchy" and
"half_normal", only scale (the scale parameter) is needed.
Example:
let prior = Prior()
set_variance(prior, "pc", 50, 0.05) // PC prior: P(σ > 50) = 0.05
set_variance(prior, "half_cauchy", 25) // Half-Cauchy(25)
- set_residual(prior, type, scale)¶
- set_residual(prior, type, param1, param2)
Sets the prior for the residual standard deviation (Gaussian family only). Same syntax as
set_variance(). Disables auto-scaling for the residual prior.
Example:
let prior = Prior()
set_residual(prior, "pc", 100, 0.05)
- set_negbin_theta(prior, shape, rate)¶
Sets a Gamma(shape, rate) prior for the negative binomial overdispersion parameter θ.
Example:
let prior = Prior()
set_negbin_theta(prior, 2, 0.1) // Gamma(2, 0.1) — more informative than default
- set_beta_phi(prior, shape, rate)¶
Sets a Gamma(shape, rate) prior for the beta regression precision parameter φ.
Example:
let prior = Prior()
set_beta_phi(prior, 1, 0.01)
- set_lkj(prior, eta)¶
Sets an LKJ prior on the correlation matrix of correlated random effects (e.g. a random intercept and random slope for the same grouping factor). The density is
where R is the correlation matrix derived from the random-effect covariance.
eta must be strictly positive. The default is 1.0, which is uniform over
correlation matrices (equivalent to placing no prior on the correlation structure).
Values greater than 1 concentrate posterior mass toward the identity (favouring
independent random terms); values less than 1 push toward strongly correlated random
terms. eta = 2 is a common mildly regularising choice — it downweights extreme
correlations (±1) that would indicate a degenerate covariance — and matches the
convention used in brms and in Vasishth et al. (2018).
The prior is only active when a grouping factor has two or more random terms (e.g. an intercept plus a slope); for intercept-only random effects it has no effect.
Reference: Lewandowski, Kurowicka & Joe (2009), J. Multivariate Anal.
Example:
let prior = Prior()
set_lkj(prior, 2) // LKJ(2): mildly regularising
Bayesian model fields¶
When a model is fitted with Bayesian estimation (by passing a Prior to fit()),
the following additional fields are available on the Model object. These fields are
empty arrays for frequentist models.
- estimation¶
A string indicating the estimation method: "Frequentist" or "Bayesian".
- posterior_mean¶
Array of posterior means for each fixed-effect coefficient (length = number of fixed effects). This is the quantity reported as “Estimate” in the Bayesian summary.
- posterior_mode¶
Array of posterior modes (MAP estimates) for each fixed-effect coefficient. For models with grid integration, this is the conditional mode at the posterior mode of the hyperparameters θ*. For symmetric posteriors, the mode equals the mean; they may differ for small samples.
- posterior_median¶
Array of posterior medians (0.5 quantile of the marginal posterior) for each fixed-effect coefficient. Computed from the mixture CDF for grid-integrated models.
- posterior_sd¶
Array of posterior standard deviations for each fixed-effect coefficient. This is the quantity reported as “Est.Error” in the Bayesian summary.
- ci_lower¶
Array of lower bounds of the 95% credible interval for each fixed-effect coefficient.
- ci_upper¶
Array of upper bounds of the 95% credible interval for each fixed-effect coefficient.
- pd¶
Array of probability-of-direction values for each fixed-effect coefficient. The pd is defined as max(P(β > 0), P(β < 0)) and ranges from 0.5 (no evidence for either direction) to 1.0 (certainty). It is the Bayesian counterpart to a two-sided p-value.
Example:
let prior = Prior()
let m = fit("f1 ~ vowel + gender + (1|speaker)", ds, prior)
print "Estimation: " & m.estimation // "Bayesian"
print "Intercept posterior mean: " & m.posterior_mean[1]
print "Intercept 95% CrI: [" & m.ci_lower[1] & ", " & m.ci_upper[1] & "]"
print "Intercept pd: " & m.pd[1]
// Frequentist fields (AIC, BIC, loglik) remain available
print "AIC = " & m.aic
Note
WAIC, LOO-IC, and log-marginal likelihood are available both through summarize()
and as individual model fields (waic, loo_ic, log_marginal, etc.; see
Model fields).
References¶
Hartig, Florian. 2022. DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package. https://CRAN.R-project.org/package=DHARMa.