Variance Decomposition
statistical primitive
slug: topic-map-statistical-primitive-variance-decomposition
Vocabulary:
- variance: Statistical measure of dispersion - how spread out values are
- between_group_variance: Variation across different segment means (SSB - Sum of Squares Between)
- within_group_variance: Variation within each segment around its own mean (SSW - Sum of Squares Within)
- total_variance: Overall variation in the entire dataset (SST - Sum of Squares Total)
- explained_variance: Portion of total variance attributable to grouping factor
- unexplained_variance: Residual variation not explained by the grouping
- r_squared: Proportion of variance explained (SSB/SST) - ranges 0 to 1
- f_statistic: Ratio of between-group to within-group mean squares (MS_between / MS_within); tests whether group means differ significantly
- degrees_of_freedom: Number of independent values that can vary
- mean_square: Sum of squares divided by its degrees of freedom (MS = SS/df)
- eta_squared: Effect size measure (same as R² in one-way ANOVA)
- partitioning: Breaking total variance into additive components
Concepts:
- variance_as_information: Variance tells us how much "information" a dimension contains
- dimension_importance: Higher between-group share of total variance (R²) = dimension matters more
- signal_to_noise: Between variance is signal, within variance is noise
- explained_vs_unexplained: R² tells us how much of the story this dimension explains
- additive_decomposition: SST = SSB + SSW (must sum exactly)
- hierarchical_variance: Can decompose variance at each cube level
- multi_way_decomposition: With multiple dimensions, can partition variance multiple ways
- homogeneity_assumption: Within-group variances should be similar for valid interpretation
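Why the additive decomposition holds exactly can be seen from a short derivation. The notation here is an assumption for illustration: x_ij is observation i in group j, n_j is the size of group j, x̄_j is the group mean, and x̄ is the grand mean.

```latex
\begin{aligned}
\mathrm{SST}
  &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}\right)^2 \\
  &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} \left[\left(x_{ij} - \bar{x}_j\right) + \left(\bar{x}_j - \bar{x}\right)\right]^2 \\
  &= \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2}_{\mathrm{SSW}}
   + \underbrace{\sum_{j=1}^{k} n_j \left(\bar{x}_j - \bar{x}\right)^2}_{\mathrm{SSB}}
   + 2\sum_{j=1}^{k} \left(\bar{x}_j - \bar{x}\right) \sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right) \\
  &= \mathrm{SSW} + \mathrm{SSB}
\end{aligned}
```

The cross term vanishes because deviations from a group's own mean sum to zero within each group, which is why SSB and SSW must add up to SST with no remainder.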
Concepts_advanced:
- interaction_variance: When using multiple dimensions, variance from interaction effects
- nested_variance: Variance within categories that are nested in other categories
- random_vs_fixed_effects: Whether the dimension's values cover all levels of interest (fixed) or are a sample from a larger population of levels (random)
- variance_components: In hierarchical data, how much variance at each level
- intraclass_correlation: Proportion of total variance attributable to differences between groups
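A minimal sketch of how variance components and the intraclass correlation could be estimated from one-way ANOVA mean squares. It assumes a balanced design (every group has the same size) and a one-way random-effects model; the function name `variance_components` and its parameters are hypothetical, not part of any library.

```python
def variance_components(ms_between, ms_within, group_size):
    """Estimate between/within variance components and ICC.

    Assumes a balanced one-way random-effects design, where
    E[MS_between] = sigma2_within + group_size * sigma2_between.
    """
    # Method-of-moments estimate; truncated at zero because sampling
    # noise can make MS_between smaller than MS_within.
    between = max((ms_between - ms_within) / group_size, 0.0)
    within = ms_within
    total = between + within
    # ICC is the between-group share of the estimated total variance.
    icc = between / total if total > 0 else float("nan")
    return between, within, icc


# Example: MS_between = 150, MS_within = 4, three observations per group.
between, within, icc = variance_components(150.0, 4.0, 3)
print(round(between, 3), within, round(icc, 3))
```

Unbalanced designs need an adjusted effective group size, which this sketch does not attempt.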
Procedures:
- calculate_grand_mean: Overall mean across all observations
- calculate_group_means: Mean within each segment
- calculate_SST: Σ(observation - grand_mean)² across all data points
- calculate_SSB: Σ[n_group × (group_mean - grand_mean)²] across groups
- calculate_SSW: SST - SSB (or calculate directly from within-group deviations)
- calculate_r_squared: SSB / SST
- calculate_degrees_of_freedom: df_between = k-1, df_within = N-k (k = number of groups, N = total observations)
- calculate_mean_squares: MS_between = SSB/df_b, MS_within = SSW/df_w
- calculate_f_statistic: MS_between / MS_within
- rank_dimensions: Compare R² across different dimensions to see which explains most
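The procedures above can be sketched end to end in plain Python. This is an illustrative implementation working from raw observations, not a reference one; the function name `one_way_decomposition` and the returned dictionary keys are assumptions made for this example.

```python
from collections import defaultdict


def one_way_decomposition(values, groups):
    """Partition the total sum of squares of `values` by the labels in `groups`."""
    n_total = len(values)
    by_group = defaultdict(list)
    for value, label in zip(values, groups):
        by_group[label].append(value)
    k = len(by_group)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more observations than groups")

    grand_mean = sum(values) / n_total
    sst = sum((v - grand_mean) ** 2 for v in values)

    ssb = 0.0
    ssw = 0.0
    for members in by_group.values():
        group_mean = sum(members) / len(members)
        # Between: each group's squared mean deviation, weighted by group size.
        ssb += len(members) * (group_mean - grand_mean) ** 2
        # Within: computed directly rather than as SST - SSB, so the
        # additive identity can be checked independently.
        ssw += sum((v - group_mean) ** 2 for v in members)

    df_between = k - 1
    df_within = n_total - k
    ms_between = ssb / df_between
    ms_within = ssw / df_within
    return {
        "sst": sst,
        "ssb": ssb,
        "ssw": ssw,
        "r_squared": ssb / sst if sst > 0 else float("nan"),
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f_statistic": ms_between / ms_within if ms_within > 0 else float("inf"),
    }


result = one_way_decomposition(
    [10, 12, 14, 20, 22, 24],
    ["A", "A", "A", "B", "B", "B"],
)
print(result["ssb"], result["ssw"], result["sst"])  # 150.0 16.0 166.0
print(result["f_statistic"])  # 37.5
```

To rank dimensions, the same function can be run once per candidate grouping column and the resulting `r_squared` values compared.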
Procedures_detailed:
- grand_mean_from_cube: Use level=0 row or weight level=1 means by freq
- group_means_from_cube: Extract from level=1 rows for single dimension
- reconstruct_SST: If raw data unavailable, use sample variance × (N-1) (or population variance × N)
- weighted_variance: Account for different group sizes (freq column)
- compare_across_levels: Calculate R² separately for each cube level
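When only cube summary rows are available, the decomposition can be rebuilt from per-segment statistics. The sketch below assumes each level=1 row supplies a freq, a mean, and a sample variance (computed with an n-1 denominator); only the freq column is named in the procedures above, so the other two fields and the function name `decomposition_from_summaries` are hypothetical.

```python
def decomposition_from_summaries(segments):
    """Rebuild SSB, SSW, SST from (freq, mean, sample_variance) tuples.

    Each tuple stands for one level=1 cube row. The variance is assumed
    to be the sample variance (n-1 denominator) of that segment.
    """
    n_total = sum(freq for freq, _, _ in segments)
    # Grand mean as the freq-weighted average of segment means.
    grand_mean = sum(freq * mean for freq, mean, _ in segments) / n_total
    ssb = sum(freq * (mean - grand_mean) ** 2 for freq, mean, _ in segments)
    # Sample variance times (freq - 1) recovers each segment's sum of squares.
    ssw = sum((freq - 1) * var for freq, _, var in segments)
    sst = ssb + ssw
    return {
        "n": n_total,
        "grand_mean": grand_mean,
        "ssb": ssb,
        "ssw": ssw,
        "sst": sst,
        "r_squared": ssb / sst if sst > 0 else float("nan"),
    }


# Two segments of three observations each, means 12 and 22, variance 4.
summary = decomposition_from_summaries([(3, 12.0, 4.0), (3, 22.0, 4.0)])
print(summary["grand_mean"], summary["ssb"], summary["ssw"], summary["sst"])
# 17.0 150.0 16.0 166.0
```

If the level=0 row carries an overall sample variance, multiplying it by (N-1) gives an independent SST that should match SSB + SSW, which makes a useful consistency check on the cube.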
Topics:
- dimension_prioritization
- segmentation_validation
- feature_importance_analysis
- market_structure_analysis
- explained_variance_reporting
- dimension_interaction_effects
- data_quality_assessment
- natural_grouping_detection
- stratification_optimization
- segment_homogeneity_testing
Categories:
- statistical_decomposition
- dimension_analysis
- variance_partitioning
- effect_size_measurement
- segmentation_evaluation
Themes:
- dimension_matters: Quantifying which dimensions drive variation
- signal_extraction: Separating meaningful patterns from noise
- parsimony: Using fewest dimensions that explain most variance
- hierarchical_understanding: Variance at different levels of aggregation
Trends:
- automated_dimension_selection: Algorithm picks dimensions with highest R²
- real_time_variance_monitoring: Track how R² changes over time
- predictive_variance_allocation: Forecast which dimensions will matter most
- interactive_variance_explorer: UI to toggle dimensions and see variance impact
- variance_based_clustering: Group dimensions by similar variance patterns
Use_cases:
- retail: "Category explains 65% of revenue variance, store location only 15% - focus on category strategy"
- marketing: "Channel explains 45% of conversion variance, creative only 8% - channel selection is critical"
- manufacturing: "Production line explains 72% of defect variance - line-level intervention needed"
- healthcare: "Provider explains 55% of cost variance, diagnosis only 25% - provider performance is the key driver"
- saas: "Pricing tier explains 80% of usage variance, industry only 12% - tier design is paramount"
- finance: "Customer segment explains 40% of default variance - segmentation has predictive power"
- logistics: "Origin location explains 62% of shipping cost variance - consolidation opportunity"
- education: "School explains 48% of test score variance, teacher 22%, student background 30%"