Tutorial: Experiment Design and Optimization
This tutorial covers the experiment design capabilities of the aboba library. You'll learn how to optimize experiment parameters, determine optimal sample sizes, and design time-based experiments for maximum statistical power.
Understanding Experiment Design
Experiment design in aboba helps you answer critical questions before running your AB tests:
- How many samples do I need? - Determine the minimum sample size for desired statistical power
- What parameters should I use? - Find optimal configurations for your experiment
- How long should my experiment run? - For time-based experiments, determine the ideal duration
- What's my expected power? - Estimate Type I and Type II error rates
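The sample-size question in particular has a standard closed-form baseline that is independent of any library: the normal-approximation formula for a two-sided, two-sample test. A minimal sketch (the function name is illustrative, not aboba API):

```python
import math
import scipy.stats as sps

def required_n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Normal-approximation sample size per group for a two-sided two-sample test.

    effect_size is the standardized effect: mean difference / standard deviation.
    """
    z_alpha = sps.norm.ppf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = sps.norm.ppf(power)          # quantile matching the target power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)
```

For an effect size of 0.3 at 80% power this gives roughly 175 users per group, a useful sanity check against the simulation-based answers the designers produce.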
The library provides three designer classes:
- BaseExperimentDesigner: Maximum flexibility with custom experiment factories
- BasicExperimentDesigner: Simplified interface for common use cases
- TimeBasedDesigner: Specialized for time-series experiments
Example 1: Custom Experiment Design with BaseExperimentDesigner
The BaseExperimentDesigner gives you complete control over the experiment design process. You define a factory class that generates experiments with different parameters.
Understanding the Workflow
The designer works by:
1. Creating experiments with different parameter combinations
2. Running AA tests to measure Type I error (the false positive rate)
3. Running AB tests with synthetic effects to measure Type II error (the false negative rate, whose complement is power)
4. Finding the optimal parameters that balance these error rates
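The same workflow can be sketched outside the library with plain scipy: repeat an AA draw to estimate the false positive rate, then an AB draw with an injected effect to estimate power (illustrative code, not the designer's internals):

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(0)

def rejection_rate(effect: float, n: int = 200, n_iter: int = 2000, alpha: float = 0.05) -> float:
    """Share of t-tests rejecting H0: effect=0 estimates Type I error, effect>0 estimates power."""
    rejections = 0
    for _ in range(n_iter):
        a = rng.normal(0.0, 1.0, size=n)     # control group
        b = rng.normal(effect, 1.0, size=n)  # treatment group, shifted by `effect`
        rejections += sps.ttest_ind(a, b).pvalue < alpha
    return rejections / n_iter

type_i = rejection_rate(effect=0.0)  # AA run: both groups identical, expect ~alpha
power = rejection_rate(effect=0.3)   # AB run: synthetic effect of 0.3
```

With no effect the rejection rate hovers near alpha; with the synthetic effect it becomes the power estimate the optimizer tries to maximize.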
Step 1: Create an Experiment Factory
First, define a class that generates experiments based on parameters:
```python
import numpy as np
import pandas as pd
import scipy.stats as sps
import typing as tp

from aboba.design.base_designer import (
    ExperimentDesignMetrics,
    IntervalEstimate,
    BaseExperimentDesigner,
)
import aboba


class ExperimentSetup:
    """Factory class that creates experiments with varying sample sizes."""

    def generate_data(self, n_samples: int) -> pd.DataFrame:
        """Generate synthetic data from a normal distribution."""
        data_a = sps.norm.rvs(size=n_samples, loc=0, scale=1)
        data_b = sps.norm.rvs(size=n_samples, loc=0, scale=1)
        # Create a dataset with two columns: value and group
        data = pd.DataFrame({
            'value': np.concatenate([data_a, data_b]),
            'b_group': np.concatenate([
                np.repeat(0, n_samples),
                np.repeat(1, n_samples),
            ]),
        })
        return data

    def generate_test(self) -> aboba.base.base_test.BaseTest:
        """Create the statistical test to use."""
        return aboba.tests.AbsoluteIndependentTTest(
            value_column='value',
        )

    def generate_pipeline(self, n_samples: int) -> aboba.pipeline.Pipeline:
        """Create a sampling pipeline based on the sample size."""
        # Sample 10% of the data, with a minimum of 2 samples per group
        group_size = max(n_samples // 10, 2)
        return aboba.pipeline.Pipeline([
            ("GroupSplitter", aboba.splitters.GroupSplitter(
                size=group_size,
                column='b_group',
            )),
        ])

    def __call__(self, parameters: tp.Dict[str, tp.Any]) -> ExperimentDesignMetrics:
        """
        Run the experiment with the given parameters and return metrics.

        This method is called by the designer for each parameter combination.
        """
        assert sorted(parameters.keys()) == ["n_samples"]
        n_samples = parameters["n_samples"]

        # Generate the experiment components
        data = self.generate_data(n_samples)
        pipeline = self.generate_pipeline(n_samples)
        test = self.generate_test()

        # Create the experiment hub
        experiment = aboba.experiment.AbobaExperiment()

        # Run an AA test to measure Type I error (false positive rate)
        aa_group = experiment.group(
            "AA",
            test=test,
            data=data,
            data_pipeline=pipeline,
            n_iter=1000,
            joblib_kwargs={"n_jobs": -1, "backend": "threading"},
        ).run()

        alpha_level = 0.05
        # Type I error: proportion of false positives across the AA runs
        type_I_error = (aa_group.get_data()["pvalue"] < alpha_level).mean()

        # Run an AB test with a synthetic effect to measure power
        ab_group = experiment.group(
            "AB",
            test=test,
            data=data,
            data_pipeline=pipeline,
            synthetic_effect=aboba.effect_modifiers.GroupModifier(
                effects={1: 0.3},  # Add an effect of 0.3 to group 1
                value_column='value',
                group_column='b_group',
            ),
            n_iter=1000,
        ).run()

        # Power: proportion of runs that correctly detect the effect.
        # Type II error is its complement.
        power = (ab_group.get_data()["pvalue"] < alpha_level).mean()
        type_II_error = 1 - power

        return ExperimentDesignMetrics(
            type_I_error=IntervalEstimate(
                parameter_estimate=type_I_error,
            ),
            type_II_error=IntervalEstimate(
                parameter_estimate=type_II_error,
            ),
        )
```
Step 2: Create and Run the Designer
Now use the designer to find optimal sample sizes:
```python
# Create a designer with sample size constraints
designer = BaseExperimentDesigner(
    experiment_design_factory=ExperimentSetup(),
    constraints={
        # Test 20 different sample sizes from 10 to 100,000
        "n_samples": np.array(np.logspace(1, 5, 20), dtype=np.int32),
    },
)

# Find the optimal parameters using a brute-force search
designer.optimize(method=BaseExperimentDesigner.OptimizerMethod.BRUTE_FORCE)

# Visualize the results
designer.visualize()
```
The visualization will show:
- Type I error rate across different sample sizes (it should stay near 0.05)
- Statistical power (1 - Type II error) across sample sizes
- The optimal sample size that achieves the desired power while controlling the Type I error
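The shape of the power curve can be reproduced with a quick Monte Carlo simulation outside the library: power climbs toward 1 as the group size grows (illustrative code, not the designer's plotting routine):

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(42)

def estimated_power(n: int, effect: float = 0.3, n_iter: int = 1000, alpha: float = 0.05) -> float:
    """Monte Carlo power of a two-sample t-test at a given per-group size."""
    hits = 0
    for _ in range(n_iter):
        a = rng.normal(0.0, 1.0, size=n)
        b = rng.normal(effect, 1.0, size=n)
        hits += sps.ttest_ind(a, b).pvalue < alpha
    return hits / n_iter

sizes = [50, 100, 200, 400]
powers = [estimated_power(n) for n in sizes]  # increases toward 1 with n
```

Plotting `powers` against `sizes` reproduces the kind of curve `designer.visualize()` displays, and makes it easy to see where extra samples stop paying off.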
Example 2: Simplified Design with BasicExperimentDesigner
For most use cases, BasicExperimentDesigner provides a simpler interface without requiring a custom factory class.
Step 1: Prepare Your Data and Test
```python
import numpy as np
import pandas as pd
import scipy.stats as sps

from aboba.design.basic_designer import BasicExperimentDesigner
from aboba.tests import AbsoluteIndependentTTest
from aboba.splitters import GroupSplitter
from aboba.pipeline import Pipeline
from aboba.effect_modifiers import GroupModifier

# Generate sample data
n_samples = 1000
data_a = sps.norm.rvs(size=n_samples, loc=0, scale=1)
data_b = sps.norm.rvs(size=n_samples, loc=0, scale=1)

data = pd.DataFrame({
    'value': np.concatenate([data_a, data_b]),
    'b_group': np.concatenate([
        np.repeat(0, n_samples),
        np.repeat(1, n_samples),
    ]),
})
```
Step 2: Define Test and Effect
```python
# Define the statistical test
test = AbsoluteIndependentTTest(value_column='value')

# Define the synthetic effect to test for
synthetic_effect = GroupModifier(
    effects={1: 0.3},  # Add 0.3 to group 1
    value_column='value',
    group_column='b_group',
)
```
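In plain pandas terms, the configuration above amounts to shifting the treated group's values by a constant. A rough equivalent of the intended behavior (a sketch, not GroupModifier's implementation):

```python
import pandas as pd

# Toy frame with the same columns as the tutorial data
df = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0], 'b_group': [0, 0, 1, 1]})

# Add the synthetic effect of 0.3 to group 1 only
df.loc[df['b_group'] == 1, 'value'] += 0.3
```

Group 0 rows are untouched, so any detected difference between groups is exactly the injected effect.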
Step 3: Create Designer with Parameter Constraints
The key feature is the get_pipeline function that creates pipelines based on parameters:
```python
# Create a designer with parameter constraints
designer = BasicExperimentDesigner(
    data=data,
    test=test,
    get_pipeline=lambda params: Pipeline([
        ('GroupSplitter', GroupSplitter(
            size=params['group_size'],  # Use the parameter value
            column='b_group',
        )),
    ]),
    synthetic_effect=synthetic_effect,
    n_iter=1000,  # Number of iterations for each parameter combination
    constraints={
        "group_size": [50, 100, 200, 500],  # Test these group sizes
    },
)
```
Step 4: Optimize and Analyze
```python
# Find the optimal parameters
designer.optimize()

# Get the best parameters
best_params = designer.get_best_params()
print(f"Best group size: {best_params.parameters['group_size']}")

# Visualize the results
designer.visualize()
```
The designer will:
1. Test each group size (50, 100, 200, 500)
2. Run 1000 iterations for each to estimate the error rates
3. Find the group size that provides the best balance of power and Type I error control
4. Display visualizations showing performance across all tested parameters
Example 3: Time-Based Experiment Design
The TimeBasedDesigner is specialized for experiments with time-series data, where you need to optimize timing parameters like experiment duration and effect start date.
Understanding Time-Based Experiments
Time-based experiments have unique considerations:
- Inclusion Date: When users enter the experiment
- Effect Start Date: When the treatment effect begins
- Experiment Duration: How long to run the experiment
- Time-varying Effects: Effects that change over time
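These concepts reduce to simple date arithmetic. For instance, restricting each user's observations to an experiment window of fixed duration after their inclusion date takes one boolean filter in plain pandas (a generic sketch, not the TimeBasedDesigner's implementation):

```python
import pandas as pd

# Toy per-observation data with an inclusion date per user
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2024-01-05', '2024-02-10', '2024-04-01',
                            '2024-03-01', '2024-03-20']),
    'inclusion_date': pd.to_datetime(['2024-01-01'] * 3 + ['2024-03-01'] * 2),
})

duration = pd.Timedelta(weeks=4)
# Keep only observations inside each user's experiment window
in_window = (df['date'] >= df['inclusion_date']) & \
            (df['date'] < df['inclusion_date'] + duration)
windowed = df[in_window]
```

User 1's observation on 2024-04-01 falls outside their four-week window and is dropped, while user 2's later inclusion date keeps both of their observations.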
Step 1: Generate Time-Series Data
```python
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

from aboba.design.time_based_designer import TimeBasedDesigner
from aboba.tests import AbsoluteIndependentTTest
from aboba.splitters import UserSplitter
from aboba.pipeline import Pipeline
from aboba.effect_modifiers import TimeBasedEffectModifier
from aboba.utils.time_based_data_generator import generate_time_based_data

# Generate sample time-series data.
# This creates user-level data with dates and payments.
data = generate_time_based_data(
    n_users=200,
    start_date='2024-01-01',
    end_date='2024-12-31',
)
```
The generated data includes:
- user_id: Unique user identifier
- date: Date of each observation
- payment: Payment amount (the target metric)
- is_in_b_group: Group assignment
- inclusion_date: When the user entered the experiment
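If `generate_time_based_data` is not available in your environment, a frame with the same schema can be mocked up directly. The column names come from the list above; the sizes and distributions here are arbitrary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_users, days = 20, 30
dates = pd.date_range('2024-01-01', periods=days)

# One row per user per day, mirroring the columns listed above
data = pd.DataFrame({
    'user_id': np.repeat(np.arange(n_users), days),
    'date': np.tile(dates, n_users),
    'payment': rng.exponential(scale=10.0, size=n_users * days),
    'is_in_b_group': np.repeat(rng.integers(0, 2, size=n_users), days),
    'inclusion_date': np.repeat(rng.choice(dates[:10].to_numpy(), size=n_users), days),
})
```

Group membership and inclusion date are constant per user (hence the `np.repeat` over `days`), which is what the user-level splitter below assumes.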
Step 2: Define Test and Pipeline Factory
```python
# Define the statistical test
test = AbsoluteIndependentTTest(value_column='payment')

# Create a pipeline factory that uses the parameters
get_pipeline = lambda params: Pipeline([
    ('UserSplitter', UserSplitter(
        group_column='is_in_b_group',
        user_column='user_id',
        size=params.get('user_sample_size', 50),  # Use the parameter or a default
    )),
])
```
Step 3: Define Time-Based Effect Factory
The effect factory creates effects based on timing parameters:
```python
get_effect = lambda params: TimeBasedEffectModifier(
    effect=5.0 * params.get('effect_multiplier', 1.0),  # Scale the effect
    effect_start_date=pd.to_datetime(params['effect_start_date']),
    value_column='payment',
    date_column='date',
    group_column='is_in_b_group',
    inclusion_date_column='inclusion_date',
)
```
This creates effects that:
- Only apply after the effect_start_date
- Only affect users who were included before the effect started
- Scale with the effect_multiplier parameter
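The first two conditions amount to a boolean mask over the data. A minimal pandas version of that masking (a sketch of the behavior, not TimeBasedEffectModifier's source):

```python
import pandas as pd

df = pd.DataFrame({
    'payment': [10.0, 10.0, 10.0, 10.0],
    'date': pd.to_datetime(['2024-02-01', '2024-04-01', '2024-04-01', '2024-04-01']),
    'is_in_b_group': [1, 1, 1, 0],
    'inclusion_date': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-03-15', '2024-01-01']),
})

effect_start = pd.Timestamp('2024-03-01')
effect = 5.0

# Apply the effect only where all three conditions hold
mask = (
    (df['date'] >= effect_start)             # observation after the effect start date
    & (df['inclusion_date'] < effect_start)  # user included before the effect started
    & (df['is_in_b_group'] == 1)             # treatment group only
)
df.loc[mask, 'payment'] += effect
```

Only the second row satisfies all three conditions: the first is too early, the third user joined after the effect started, and the fourth is in the control group.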
Step 4: Create Designer with Time Constraints
```python
# Create a designer with parameter constraints
designer = TimeBasedDesigner(
    data=data,
    test=test,
    get_pipeline=get_pipeline,
    get_effect=get_effect,
    date_column='date',
    experiment_duration='experiment_duration',  # parameter name
    effect_start_date='effect_start_date',      # parameter name
    n_iter=100,
    constraints={
        'effect_start_date': [
            '2024-03-01',
            '2024-04-01',
            '2024-05-01',
        ],
        'effect_multiplier': [0.8, 1.0, 1.2, 1.5],
        'experiment_duration': [
            pd.Timedelta(weeks=4),
            pd.Timedelta(weeks=6),
            pd.Timedelta(weeks=8),
        ],
    },
)
```
Step 5: Optimize and Visualize
```python
# Find the optimal parameters
designer.optimize()

# Get the best parameters
best_params = designer.get_best_params()
print(f"Best effect start date: {best_params.parameters['effect_start_date']}")
print(f"Best effect multiplier: {best_params.parameters['effect_multiplier']}")
print(f"Best experiment duration: {best_params.parameters['experiment_duration']}")

# Visualize all results
designer.visualize()
```
Advanced Visualization: Fixed Parameters
You can visualize results while fixing certain parameters:
```python
# Visualize with a fixed duration and start date
designer.visualize(
    fixed_parameters={
        'experiment_duration': pd.Timedelta(weeks=4),
        'effect_start_date': '2024-03-01',
    },
)
```
This shows how the remaining parameters (like effect_multiplier) affect performance when other parameters are held constant.
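Conceptually, fixing parameters is just slicing a results table. With a hypothetical per-combination results DataFrame (not aboba's internal format), it reduces to a boolean filter:

```python
import pandas as pd

# Hypothetical per-combination results, one row per tested parameter set
results = pd.DataFrame({
    'experiment_duration': [pd.Timedelta(weeks=4)] * 2 + [pd.Timedelta(weeks=8)] * 2,
    'effect_multiplier': [0.8, 1.2, 0.8, 1.2],
    'power': [0.61, 0.78, 0.72, 0.91],
})

fixed = {'experiment_duration': pd.Timedelta(weeks=4)}
view = results
for column, value in fixed.items():
    view = view[view[column] == value]
# `view` now varies only in effect_multiplier, like a fixed-parameter plot
```

Each fixed parameter removes one dimension, leaving a slice that is easy to plot or inspect.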
Key Concepts in Experiment Design
Type I Error (False Positive Rate)
- Definition: Probability of detecting an effect when none exists
- Target: Should be close to your significance level (typically 0.05)
- Measured by: Running AA tests (both groups from same distribution)
Type II Error and Statistical Power
- Type II Error: Probability of missing a real effect
- Statistical Power: 1 - Type II Error (probability of detecting a real effect)
- Target: Typically aim for 80% power (Type II error = 0.20)
- Measured by: Running AB tests with known synthetic effects
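For a two-sample test, the normal approximation also gives these quantities in closed form, which is handy for sanity-checking simulated estimates (a standard statistical formula, not aboba-specific):

```python
import math
import scipy.stats as sps

def analytic_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample test via the normal approximation."""
    z_alpha = sps.norm.ppf(1 - alpha / 2)
    # Standardized distance between the group means at this sample size
    noncentrality = effect_size * math.sqrt(n_per_group / 2)
    return sps.norm.cdf(noncentrality - z_alpha)

power = analytic_power(0.3, 200)  # roughly the 80%+ regime
type_ii = 1 - power               # Type II error is the complement of power
```

If a simulated power estimate lands far from this value, something in the pipeline (splitting, effect injection, test choice) deserves a closer look.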
Optimization Strategy
The designers find parameters that:
- Maintain the Type I error near the significance level (no inflation)
- Maximize statistical power (minimize the Type II error)
- Balance practical constraints (sample size, duration, cost)
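A brute-force version of this selection rule is short: among candidates whose Type I error stays near alpha, take the cheapest one that reaches the power target (illustrative selection logic with made-up numbers, not the designers' actual optimizer):

```python
candidates = [
    # (n_samples, type_i_error, power) -- e.g. from simulated AA/AB runs
    (50,  0.052, 0.31),
    (100, 0.048, 0.56),
    (200, 0.051, 0.85),
    (400, 0.049, 0.99),
]

def pick(candidates, alpha=0.05, tolerance=0.01, target_power=0.8):
    """Smallest n with a controlled Type I error and sufficient power."""
    feasible = [
        (n, t1, pw) for n, t1, pw in candidates
        if abs(t1 - alpha) <= tolerance and pw >= target_power
    ]
    return min(feasible)[0] if feasible else None

best_n = pick(candidates)  # → 200
```

Real optimizers add interval estimates and smarter search, but the trade-off they encode is exactly this one: control Type I error first, then buy power as cheaply as possible.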