ANOVA and Experimental Design

Why ANOVA?

Food sensitivity testing is fundamentally a hypothesis-testing problem: does ingredient X cause symptom Y? The naive approach — eat something, feel bad, conclude it’s the cause — is confounded by dozens of variables and produces unreliable results.

ANOVA (Analysis of Variance) gives us a rigorous framework to ask: is the variance in symptoms explained by food exposure, or just noise?

What ANOVA Does

ANOVA partitions total variance in an outcome (symptom score) into:

  • Between-group variance — variance attributable to different conditions (ingredient exposures)
  • Within-group variance — baseline noise (day-to-day variation in symptoms unrelated to food)

The F-statistic is the ratio of these:

F = Between-group variance / Within-group variance

A high F means the food condition explains substantially more variance than baseline noise. The associated p-value is the probability of seeing an F-statistic at least this large if the food actually had no effect.
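As a concrete sketch (invented numbers, plain Ruby rather than the app's actual code), the F-statistic for three exposure conditions can be computed directly from the variance partition above:

```ruby
# Three exposure conditions with five symptom scores each (invented data).
groups = {
  none:   [1, 2, 1, 2, 1],
  gluten: [3, 4, 3, 5, 4],
  dairy:  [2, 2, 3, 2, 2]
}

all = groups.values.flatten
grand_mean = all.sum.to_f / all.length

# Between-group SS: how far each condition's mean sits from the grand mean
ss_between = groups.values.sum do |g|
  g.length * ((g.sum.to_f / g.length) - grand_mean)**2
end

# Within-group SS: day-to-day spread around each condition's own mean
ss_within = groups.values.sum do |g|
  mean = g.sum.to_f / g.length
  g.sum { |v| (v - mean)**2 }
end

df_between = groups.length - 1          # k - 1 = 2
df_within  = all.length - groups.length # N - k = 12

f_stat = (ss_between / df_between) / (ss_within / df_within)
```

With these numbers the gluten condition's mean sits far from the grand mean relative to day-to-day noise, so F comes out large.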

One-Way vs Factorial ANOVA

  • One-way ANOVA — one factor (e.g. did you eat gluten today?)
  • Factorial ANOVA — multiple factors simultaneously (gluten × dairy × stress level)

Factorial ANOVA is where this gets powerful — it can detect interaction effects. Maybe gluten alone causes mild symptoms, but gluten + high stress causes severe symptoms. That interaction term would be missed by running separate one-way tests.
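A toy 2×2 example (numbers invented) makes the interaction term concrete:

```ruby
# Mean symptom scores in a 2x2 gluten x stress design (invented numbers).
means = {
  [:no_gluten, :low_stress]  => 1.0,
  [:gluten,    :low_stress]  => 2.0,  # gluten alone: mild
  [:no_gluten, :high_stress] => 1.5,  # stress alone: mild
  [:gluten,    :high_stress] => 4.5   # together: severe
}

# Purely additive effects would predict 1.0 + (2.0 - 1.0) + (1.5 - 1.0) = 2.5
# for the gluten + high-stress cell. The interaction contrast measures the
# departure from additivity:
interaction = means[[:gluten, :high_stress]] -
              means[[:gluten, :low_stress]] -
              means[[:no_gluten, :high_stress]] +
              means[[:no_gluten, :low_stress]]
```

A nonzero contrast is exactly the signal that separate one-way tests would miss.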

The Meal Plan as Experimental Design

For ANOVA to work, you need controlled exposure. This is where meal plan design matters.

Latin Squares

A Latin square is a design where each treatment appears exactly once in each row and column. For a 4×4 grid with 4 ingredients (A, B, C, D) across 4 days:

Day 1: A B C D
Day 2: B C D A
Day 3: C D A B
Day 4: D A B C

Every ingredient appears exactly once per day position. This balances order effects — any systematic effect of eating later in the week affects all ingredients equally.
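The rotation pattern above can be generated in a couple of lines (a Ruby sketch):

```ruby
# Each row is the previous row rotated left by one, which guarantees every
# ingredient appears exactly once per row and once per column position.
ingredients = %w[A B C D]
square = ingredients.each_index.map { |day| ingredients.rotate(day) }

square.each_with_index { |row, i| puts "Day #{i + 1}: #{row.join(' ')}" }
```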

For Confidente, we use a Latin square-inspired design (not strictly square) to ensure:

  • Each target ingredient appears ≥3 times across the test period
  • Exposures are spread across the testing window, not clustered
  • Order effects are balanced

Washout Windows

Between exposures to the same ingredient (or same sensitivity category), a washout period is enforced. This ensures:

  • The previous exposure’s effects have cleared before the next
  • We’re measuring acute responses, not cumulative load
  • Category-level effects (e.g. total histamine burden) are controlled

Washout durations are category-dependent

Different food chemicals have different pharmacokinetics. The app should use per-category washout defaults rather than a single global value:

  • Histamine: Plasma histamine half-life is extremely short (~1-20 minutes depending on individual DAO capacity), but this is misleading. The clinical question isn’t how fast free histamine clears — it’s how long downstream effects (mast cell sensitization, mucosal inflammation, DAO depletion from high load) persist. Standard elimination diet protocols use 2-4 weeks for the initial elimination phase and 3-4 day spacing between single-food reintroductions. For the meal plan, 72-96 hours between same-category histamine exposures is the minimum to avoid stacking effects from DAO depletion dynamics — but the user should understand this is a compromise for practical testing, not a full pharmacokinetic washout.
  • FODMAPs: Fermentation-based symptoms typically resolve within 24-48 hours of substrate clearance. 48-72 hour washout is reasonable.
  • Salicylates: COX enzyme inhibition recovers over days (aspirin irreversibly acetylates COX-1; COX-1 turnover in platelets is ~7-10 days; in gut epithelium ~3-5 days). But dietary salicylate doses produce reversible, competitive inhibition — recovery is faster. 72-96 hour washout is reasonable for dietary levels.
  • Oxalates: Crystal deposition is a slow, cumulative process. Acute GI effects from a high-oxalate meal resolve in 24-48 hours, but tissue-level effects accumulate over weeks. 48-72 hour washout for acute GI testing; oxalate effects are better assessed over longer timeframes.
  • Lectins: Often produce delayed reactions (12-48 hours after ingestion). 96 hour washout minimum to avoid confounding the next exposure with a delayed reaction from the previous one.
  • Glutamates: Dose-dependent neurological effects typically resolve within hours. 48 hour washout is likely sufficient.
  • Capsaicin: TRPV1 desensitization and re-sensitization dynamics mean repeated exposure reduces reactivity (this is how capsaicin tolerance develops). 72 hour washout to allow receptor re-sensitization between test exposures.

These values should be configurable and will need calibration against real user data.
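One possible representation of these defaults (hours, taken from the midpoints of the ranges above; the hash layout and names are illustrative, not the app's actual schema):

```ruby
# Per-category washout defaults in hours, derived from the ranges above.
WASHOUT_HOURS = {
  histamine:  84,  # 72-96 h pragmatic compromise (DAO depletion dynamics)
  fodmap:     60,  # 48-72 h
  salicylate: 84,  # 72-96 h at dietary doses
  oxalate:    60,  # 48-72 h, acute GI testing only
  lectin:     96,  # minimum, to clear delayed reactions
  glutamate:  48,
  capsaicin:  72   # allows TRPV1 re-sensitization
}.freeze

DEFAULT_WASHOUT_HOURS = 72  # fallback for uncategorized ingredients

def washout_hours_for(category)
  WASHOUT_HOURS.fetch(category, DEFAULT_WASHOUT_HOURS)
end
```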

Why Not Pure ANOVA?

In practice, Confidente starts with simpler correlation analysis and moves toward mixed effects modeling, not classical ANOVA. Reasons:

  1. Self-reported symptom data is ordinal — 1-5 scale violates ANOVA’s continuous outcome assumption. Non-parametric equivalents (Kruskal-Wallis) or treating the scale as approximately continuous are both defensible. For 5-point scales specifically, the Likert-scale-as-interval debate is well-studied — treating it as continuous is widely accepted in practice when sample sizes are reasonable (n ≥ 30 observations), and simulation studies show robustness of parametric methods to mild violations of the continuity assumption.

  2. Repeated measures — the same person is measured multiple times. Standard ANOVA assumes independence between observations. Repeated-measures ANOVA or mixed effects models handle this correctly. For a single user, observations are also autocorrelated — today’s symptoms are correlated with yesterday’s (through ongoing inflammation, mediator carryover, baseline drift). This is technically a violation of even repeated-measures ANOVA. Mixed effects models with an AR(1) correlation structure handle this properly, but this is a v0.2+ concern.

  3. Multiple comparisons — testing many ingredients simultaneously inflates false positive rates. With 7 sensitivity categories × 12 symptom types = 84 potential correlations, running at α = 0.05, you’d expect ~4 false positives by chance alone. Benjamini-Hochberg FDR correction is the appropriate fix for the MVP — it controls the false discovery rate (expected proportion of false positives among all positives) rather than the family-wise error rate (Bonferroni), which is overly conservative for correlated tests. Implementation is simple: sort p-values, compare each to (rank/total_tests) × target_FDR. This is a one-function addition to SymptomCorrelator, not a v0.2 item.

  4. Confounders — ANCOVA (Analysis of Covariance) extends ANOVA to include continuous covariates. Sleep quality and stress level become covariates that partial out their effects before testing food factors.

What Confidente Actually Uses

MVP: Weighted Correlation with FDR Correction

The MVP uses weighted Pearson correlation, but the weighting needs to be applied correctly:

# WRONG — conflates magnitude with reliability:
# adjusted_symptom_score = raw_symptom_score * quality_score
 
# CORRECT — weight the observation's contribution to the correlation:
# Each (exposure, symptom) pair contributes to the correlation
# weighted by the day's quality_score.
 
# The quality_score does NOT modify the symptom score.
# It modifies how much that day's observation counts in the
# statistical calculation.
 
ingredient_correlation = WeightedPearson(
  x: ingredient_exposure_binary,  # 0 or 1 per day
  y: symptom_scores,              # raw 1-5 per day
  weights: quality_scores          # 0.0-1.0 per day
)

The distinction matters: multiplying the symptom score by a weight silently distorts the data. A high symptom day with poor sleep would have its score reduced, making it look like the symptoms were mild. Weighted correlation keeps the data intact and instead adjusts how much influence each observation has on the fitted line.
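A minimal weighted Pearson implementation consistent with the CORRECT approach above (a sketch; the real SymptomCorrelator interface may differ):

```ruby
# Weighted Pearson correlation: quality_score scales each day's influence
# on the statistic; the symptom values themselves are left untouched.
def weighted_pearson(x, y, weights)
  wsum = weights.sum.to_f
  mx  = x.zip(weights).sum { |xi, wi| xi * wi } / wsum
  my  = y.zip(weights).sum { |yi, wi| yi * wi } / wsum
  cov = x.zip(y, weights).sum { |xi, yi, wi| wi * (xi - mx) * (yi - my) } / wsum
  vx  = x.zip(weights).sum { |xi, wi| wi * (xi - mx)**2 } / wsum
  vy  = y.zip(weights).sum { |yi, wi| wi * (yi - my)**2 } / wsum
  cov / Math.sqrt(vx * vy)
end
```

Two sanity properties follow directly: equal weights reproduce plain Pearson, and a zero-weight day drops out of the calculation entirely, without its symptom score ever being altered.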

After computing correlations across all ingredient-symptom pairs, apply Benjamini-Hochberg:

def apply_fdr_correction(p_values, target_fdr: 0.10)
  sorted = p_values.each_with_index
    .sort_by { |pval, _| pval }

  n = sorted.length

  # Benjamini-Hochberg is a step-up procedure: find the LARGEST rank whose
  # p-value clears its threshold, then everything at or below that rank is
  # significant — not just the individual p-values under their thresholds.
  cutoff = -1
  sorted.each_with_index do |(pval, _), rank|
    cutoff = rank if pval <= ((rank + 1).to_f / n) * target_fdr
  end

  significant = Array.new(p_values.length, false)
  sorted.first(cutoff + 1).each { |_, original_idx| significant[original_idx] = true }
  significant
end

A target FDR of 0.10 (10% false discovery rate) is appropriate for an exploratory consumer app — it’s more permissive than clinical research (typically 0.05) but prevents the worst spurious hits. The user sees “likely” and “possible” associations, not clinical diagnoses.

v0.2+: Mixed Effects Model

symptom_score ~ ingredient_exposure + sleep_quality + stress_level + cycle_phase + (1 | user_id)

Where (1 | user_id) is a random intercept per user — accounting for the fact that everyone has a different baseline symptom level. This is the appropriate model for repeated-measures data with multiple confounders.

For a single user (pre-population data), the random intercept isn’t needed, but the model is still valuable as a multivariate regression:

symptom_score ~ ingredient_exposure + sleep_quality + stress_level + cycle_phase

This directly estimates the ingredient effect while controlling for confounders, which is cleaner than the MVP’s pre-adjustment approach.
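For the single-user case, the regression reduces to ordinary least squares, which Ruby's stdlib Matrix can solve directly. The data below is synthetic and noise-free so the known coefficients come back out; column names are illustrative:

```ruby
require 'matrix'

# Design matrix columns: intercept, ingredient_exposure, sleep_quality,
# stress_level. y was generated as 1.0 + 2.0*exposure + 0.5*sleep - 0.3*stress,
# so OLS should recover exactly those coefficients.
x = Matrix[
  [1, 0, 7, 2], [1, 1, 6, 3], [1, 0, 5, 1],
  [1, 1, 8, 2], [1, 0, 6, 4], [1, 1, 7, 5]
]
y = Vector[3.9, 5.1, 3.2, 6.4, 2.8, 5.0]

# Ordinary least squares: beta = (X'X)^-1 X'y
beta = (x.transpose * x).inverse * x.transpose * y
# beta[1] is the ingredient effect, controlling for sleep and stress.
```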

Ruby gems: statsample provides classical ANOVA. For mixed effects, options include: (1) bridging to R's lme4 (e.g. via the rinruby gem), (2) calling a Python microservice running statsmodels, or (3) staying in pure Ruby with statsample for as long as its models suffice. The Python sidecar adds deployment complexity; prefer keeping the statistics in-process unless the requirements genuinely outgrow Ruby.

Sample Size Considerations

ANOVA needs sufficient observations per condition to detect effects. For a single user:

  • Each ingredient needs ≥5 exposures (not 3) to have reasonable statistical power for detecting medium-to-large effect sizes. With 3 observations, you can detect only very large effects (Cohen’s d > 1.5) with any confidence. 5 observations per condition at α = 0.10 gives ~60% power for d = 1.0 (a “large” effect), which is still modest but pragmatically useful.
  • 14 days minimum of logging before results are meaningful — at this point you have enough data to detect strong signals but not subtle ones
  • 30 days produces substantially more reliable results and enough observations for the FDR correction to work properly
  • 60+ days is where the mixed effects model (v0.2) starts to shine — enough repeated observations to estimate confounder effects reliably

This is why the app explicitly notes low confidence when fewer than 14 days of data exist, and why meal plan design should target ≥5 exposures per ingredient when possible.
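The ~60% power figure can be sanity-checked by simulation. The sketch below assumes a one-sided pooled two-sample t-test at α = 0.10 with 5 observations per condition, which is one reasonable reading of the claim:

```ruby
# Monte Carlo power check: two conditions of n = 5 observations each,
# true effect d = 1.0, one-sided pooled t-test at alpha = 0.10.
# The critical value t(0.90, df = 8) ~= 1.397 is hardcoded.
CRITICAL_T = 1.397

def simulate_power(n: 5, d: 1.0, trials: 20_000, rng: Random.new(42))
  # Box-Muller transform for standard normal draws
  gaussian = lambda do
    Math.sqrt(-2.0 * Math.log(1.0 - rng.rand)) * Math.cos(2.0 * Math::PI * rng.rand)
  end

  hits = trials.times.count do
    control = Array.new(n) { gaussian.call }
    exposed = Array.new(n) { gaussian.call + d }
    m1 = control.sum / n
    m2 = exposed.sum / n
    v1 = control.sum { |v| (v - m1)**2 } / (n - 1)
    v2 = exposed.sum { |v| (v - m2)**2 } / (n - 1)
    se = Math.sqrt((v1 + v2) / n)  # SE of the mean difference (equal n)
    (m2 - m1) / se > CRITICAL_T
  end
  hits.to_f / trials
end

power = simulate_power  # lands in the rough vicinity of the ~60% claim
```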

Honesty about statistical power

At the data volumes a consumer app realistically generates (30-60 days per user), this system can detect strong food-symptom associations — the kind where an ingredient consistently and noticeably worsens symptoms. It cannot detect subtle effects, interaction effects, or dose-response relationships with any reliability at these sample sizes. This is important to communicate to users: “likely association” means the signal is strong enough to surface above noise at this data volume. Absence of a detected association doesn’t mean the food is safe — it means the effect, if present, isn’t large enough to detect with the data available. More data improves sensitivity.

The “Cheats as Data” Insight

Classical experimental design treats off-protocol observations as contamination. In Confidente’s framing they’re unplanned but valid observations — naturalistic variation in the independent variable (food exposure). They add ecological validity and increase n. The only requirement is accurate logging.

This is defensible provided the off-plan meals are treated correctly in the statistical model. Specifically:

  • Off-plan meals may cluster on certain days (weekends, social events) which are also higher stress/alcohol days. The confounder model handles this IF the confounders are logged.
  • Off-plan meals may introduce multiple new exposures simultaneously (a restaurant meal with unknown ingredients). These observations contribute to category-level analysis but are weaker for individual ingredient identification. The model should down-weight meals with high ingredient uncertainty.

Temporal Lag: The Missing Variable

Different sensitivity categories produce symptoms on different timescales. The current MVP correlates same-day exposure with same-day symptoms, but this misses delayed reactions:

Category              | Typical onset | Peak        | Resolution
Histamine (dietary)   | 15-60 min     | 1-2 hours   | 4-12 hours
Histamine (liberator) | 15-60 min     | 1-2 hours   | 4-12 hours
FODMAP                | 2-6 hours     | 6-24 hours  | 24-48 hours
Salicylate            | 2-6 hours     | 6-24 hours  | 24-72 hours
Oxalate               | variable      | variable    | variable
Lectin                | 6-48 hours    | 12-48 hours | 24-96 hours
Glutamate             | 15-60 min     | 1-3 hours   | 4-12 hours
Capsaicin             | 15-60 min     | 30-90 min   | 2-6 hours

For the MVP, correlating each day’s food with that day’s AND next day’s symptoms (a 1-day lag window) captures most reactions. For v0.2, the correlation should run at multiple lag values per category and report the lag with the strongest signal — this is a standard time-series cross-correlation approach.
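A sketch of the lag scan (plain Pearson for brevity; the app would use the weighted version and per-category lag ranges):

```ruby
# Plain Pearson over paired arrays.
def pearson(x, y)
  n = x.length
  mx = x.sum.to_f / n
  my = y.sum.to_f / n
  cov = x.zip(y).sum { |a, b| (a - mx) * (b - my) }
  sx = Math.sqrt(x.sum { |a| (a - mx)**2 })
  sy = Math.sqrt(y.sum { |b| (b - my)**2 })
  cov / (sx * sy)
end

# Correlate exposure on day t with symptoms on day t + lag, for each
# candidate lag, and report the lag with the strongest absolute signal.
def best_lag(exposure, symptoms, max_lag: 2)
  (0..max_lag).map do |lag|
    x = exposure[0...exposure.length - lag]
    y = symptoms[lag..-1]
    [lag, pearson(x, y)]
  end.max_by { |_, r| r.abs }
end

# Invented series where symptoms trail exposure by exactly one day,
# so the scan should pick lag 1.
exposure = [1, 0, 0, 1, 0, 0, 1, 0]
symptoms = [1, 4, 1, 1, 4, 1, 1, 4]
lag, r = best_lag(exposure, symptoms)
```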

The late-phase allergic response (2-6 hours post-exposure, mediated by leukotrienes, prostaglandins, and cytokines rather than histamine) is particularly important to capture because it represents a distinct mediator profile from the immediate response. A food might not cause immediate symptoms but produces a next-day flare via the late-phase pathway.

Further Reading