Tutorial: Experiment Design and Optimization

This tutorial covers the experiment design capabilities of the aboba library. You'll learn how to optimize experiment parameters, determine optimal sample sizes, and design time-based experiments for maximum statistical power.

Understanding Experiment Design

Experiment design in aboba helps you answer critical questions before running your AB tests:

  • How many samples do I need? - Determine the minimum sample size for desired statistical power
  • What parameters should I use? - Find optimal configurations for your experiment
  • How long should my experiment run? - For time-based experiments, determine the ideal duration
  • What's my expected power? - Estimate Type I and Type II error rates

The library provides three designer classes:

  • BaseExperimentDesigner: Maximum flexibility with custom experiment factories
  • BasicExperimentDesigner: Simplified interface for common use cases
  • TimeBasedDesigner: Specialized for time-series experiments

Example 1: Custom Experiment Design with BaseExperimentDesigner

The BaseExperimentDesigner gives you complete control over the experiment design process. You define a factory class that generates experiments with different parameters.

Understanding the Workflow

The designer works by:

  1. Creating experiments with different parameter combinations

  2. Running AA tests to measure Type I error (false positive rate)

  3. Running AB tests with known synthetic effects to measure Type II error (the false negative rate; statistical power is 1 minus this)

  4. Finding the optimal parameters that balance these error rates
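The workflow above can be sketched with plain numpy/scipy. This is a simplified stand-in for what the designer automates, not aboba's actual implementation: repeated AA draws estimate the Type I error, and repeated AB draws with a synthetic shift estimate power.

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(0)
alpha, n, n_iter, effect = 0.05, 200, 2000, 0.3

# Steps 2-3: repeat AA and AB draws, collecting p-values
aa_pvalues, ab_pvalues = [], []
for _ in range(n_iter):
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    aa_pvalues.append(sps.ttest_ind(a, b).pvalue)           # no real effect
    ab_pvalues.append(sps.ttest_ind(a, b + effect).pvalue)  # synthetic effect

type_I_error = np.mean(np.array(aa_pvalues) < alpha)  # should sit near alpha
power = np.mean(np.array(ab_pvalues) < alpha)         # 1 - Type II error
print(f"Type I error: {type_I_error:.3f}, power: {power:.3f}")
```

Step 4 then amounts to repeating this for each candidate parameter value and picking the one with acceptable error rates.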

Step 1: Create an Experiment Factory

First, define a class that generates experiments based on parameters:

import numpy as np
import pandas as pd
import scipy.stats as sps
import typing as tp
from aboba.design.base_designer import (
    ExperimentDesignMetrics, 
    IntervalEstimate, 
    BaseExperimentDesigner
)
import aboba

class ExperimentSetup:
    """Factory class that creates experiments with varying sample sizes."""

    def generate_data(self, n_samples: int) -> pd.DataFrame:
        """Generate synthetic data from normal distribution."""
        data_a = sps.norm.rvs(size=n_samples, loc=0, scale=1)
        data_b = sps.norm.rvs(size=n_samples, loc=0, scale=1)

        # Create dataset with two columns: value and group
        data = pd.DataFrame({
            'value': np.concatenate([data_a, data_b]),
            'b_group': np.concatenate([
                np.repeat(0, n_samples),
                np.repeat(1, n_samples),
            ]),
        })

        return data

    def generate_test(self) -> aboba.base.base_test.BaseTest:
        """Create the statistical test to use."""
        return aboba.tests.AbsoluteIndependentTTest(
            value_column='value',
        )

    def generate_pipeline(self, n_samples: int) -> aboba.pipeline.Pipeline:
        """Create sampling pipeline based on sample size."""
        # Sample 10% of data, minimum 2 samples per group
        group_size = max(n_samples // 10, 2)
        return aboba.pipeline.Pipeline([
            ("GroupSplitter", aboba.splitters.GroupSplitter(
                size=group_size, 
                column='b_group'
            )),
        ])

    def __call__(self, parameters: tp.Dict[str, tp.Any]) -> ExperimentDesignMetrics:
        """
        Run experiment with given parameters and return metrics.

        This method is called by the designer for each parameter combination.
        """
        assert sorted(parameters.keys()) == ["n_samples"]

        n_samples = parameters["n_samples"]

        # Generate experiment components
        data = self.generate_data(n_samples)
        pipeline = self.generate_pipeline(n_samples)
        test = self.generate_test()

        # Create experiment hub
        experiment = aboba.experiment.AbobaExperiment()

        # Run AA test to measure Type I error (false positive rate)
        aa_group = experiment.group(
            "AA",
            test=test,
            data=data,
            data_pipeline=pipeline,
            n_iter=1000,
            joblib_kwargs={"n_jobs": -1, "backend": "threading"}
        ).run()

        alpha_level = 0.05

        # Calculate Type I error: proportion of false positives
        type_I_error = (aa_group.get_data()["pvalue"] < alpha_level).mean()

        # Run AB test with synthetic effect to measure power
        ab_group = experiment.group(
            "AB",
            test=test,
            data=data,
            data_pipeline=pipeline,
            synthetic_effect=aboba.effect_modifiers.GroupModifier(
                effects={1: 0.3},  # Add effect of 0.3 to group 1
                value_column='value',
                group_column='b_group',
            ),
            n_iter=1000,
        ).run()

        # Calculate Type II error: proportion of missed effects (1 - power)
        type_II_error = 1 - (ab_group.get_data()["pvalue"] < alpha_level).mean()

        return ExperimentDesignMetrics(
            type_I_error=IntervalEstimate(
                parameter_estimate=type_I_error
            ),
            type_II_error=IntervalEstimate(
                parameter_estimate=type_II_error
            ),
        )

Step 2: Create and Run the Designer

Now use the designer to find optimal sample sizes:

# Create designer with sample size constraints
designer = BaseExperimentDesigner(
    experiment_design_factory=ExperimentSetup(),
    constraints={
        # Test 20 different sample sizes from 10 to 100,000
        "n_samples": np.array(np.logspace(1, 5, 20), dtype=np.int32)
    }
)

# Find optimal parameters using brute force search
designer.optimize(method=BaseExperimentDesigner.OptimizerMethod.BRUTE_FORCE)

# Visualize the results
designer.visualize()

The visualization will show:

  • Type I error rate across different sample sizes (should stay near 0.05)

  • Statistical power (1 - Type II error) across sample sizes

  • The optimal sample size that achieves desired power while controlling Type I error
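As a sanity check on the brute-force result, the required per-group sample size for a plain two-sample t-test can also be computed analytically. This uses statsmodels, not aboba, and only covers the simple case without a sampling pipeline:

```python
from statsmodels.stats.power import TTestIndPower

# Per-group sample size for a standardized effect size of 0.3,
# alpha = 0.05, target power = 0.8
n_required = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"~{n_required:.0f} samples per group")  # roughly 175 per group
```

Simulation-based designers like the one above become necessary once the pipeline, test, or effect is too complex for a closed-form power calculation.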

Example 2: Simplified Design with BasicExperimentDesigner

For most use cases, BasicExperimentDesigner provides a simpler interface without requiring a custom factory class.

Step 1: Prepare Your Data and Test

import numpy as np
import pandas as pd
import scipy.stats as sps
from aboba.design.basic_designer import BasicExperimentDesigner
from aboba.tests import AbsoluteIndependentTTest
from aboba.splitters import GroupSplitter
from aboba.pipeline import Pipeline
from aboba.effect_modifiers import GroupModifier

# Generate sample data
n_samples = 1000
data_a = sps.norm.rvs(size=n_samples, loc=0, scale=1)
data_b = sps.norm.rvs(size=n_samples, loc=0, scale=1)

data = pd.DataFrame({
    'value': np.concatenate([data_a, data_b]),
    'b_group': np.concatenate([
        np.repeat(0, n_samples),
        np.repeat(1, n_samples),
    ]),
})

Step 2: Define Test and Effect

# Define the statistical test
test = AbsoluteIndependentTTest(value_column='value')

# Define the synthetic effect to test for
synthetic_effect = GroupModifier(
    effects={1: 0.3},  # Add 0.3 to group 1
    value_column='value',
    group_column='b_group',
)

Step 3: Create Designer with Parameter Constraints

The key feature is the get_pipeline function that creates pipelines based on parameters:

# Create designer with parameter constraints
designer = BasicExperimentDesigner(
    data=data,
    test=test,
    get_pipeline=lambda params: Pipeline([
        ('GroupSplitter', GroupSplitter(
            size=params['group_size'],  # Use parameter value
            column='b_group'
        )),
    ]),
    synthetic_effect=synthetic_effect,
    n_iter=1000,  # Number of iterations for each parameter combination
    constraints={
        "group_size": [50, 100, 200, 500],  # Test these group sizes
    }
)

Step 4: Optimize and Analyze

# Find optimal parameters
designer.optimize()

# Get the best parameters
best_params = designer.get_best_params()
print(f"Best group size: {best_params.parameters['group_size']}")

# Visualize results
designer.visualize()

The designer will:

  • Test each group size (50, 100, 200, 500)

  • Run 1000 iterations for each to estimate error rates

  • Find the group size that provides the best balance of power and Type I error control

  • Display visualizations showing performance across all tested parameters

Example 3: Time-Based Experiment Design

The TimeBasedDesigner is specialized for experiments with time-series data, where you need to optimize timing parameters like experiment duration and effect start date.

Understanding Time-Based Experiments

Time-based experiments have unique considerations:

  • Inclusion Date: When users enter the experiment
  • Effect Start Date: When the treatment effect begins
  • Experiment Duration: How long to run the experiment
  • Time-varying Effects: Effects that change over time

Step 1: Generate Time-Series Data

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from aboba.design.time_based_designer import TimeBasedDesigner
from aboba.tests import AbsoluteIndependentTTest
from aboba.splitters import UserSplitter
from aboba.pipeline import Pipeline
from aboba.effect_modifiers import TimeBasedEffectModifier
from aboba.utils.time_based_data_generator import generate_time_based_data

# Generate sample time-series data
# This creates user-level data with dates and payments
data = generate_time_based_data(
    n_users=200,
    start_date='2024-01-01',
    end_date='2024-12-31'
)

The generated data includes:

  • user_id: Unique user identifier

  • date: Date of each observation

  • payment: Payment amount (target metric)

  • is_in_b_group: Group assignment

  • inclusion_date: When user entered the experiment

Step 2: Define Test and Pipeline Factory

# Define the statistical test
test = AbsoluteIndependentTTest(value_column='payment')

# Create pipeline factory that uses parameters
get_pipeline = lambda params: Pipeline([
    ('UserSplitter', UserSplitter(
        group_column='is_in_b_group',
        user_column='user_id',
        size=params.get('user_sample_size', 50)  # Use parameter or default
    ))
])

Step 3: Define Time-Based Effect Factory

The effect factory creates effects based on timing parameters:

get_effect = lambda params: TimeBasedEffectModifier(
    effect=5.0 * params.get('effect_multiplier', 1.0),  # Scale effect
    effect_start_date=pd.to_datetime(params['effect_start_date']),
    value_column='payment',
    date_column='date',
    group_column='is_in_b_group',
    inclusion_date_column='inclusion_date'
)

This creates effects that:

  • Only apply after the effect_start_date

  • Only affect users who were included before the effect started

  • Scale by the effect_multiplier parameter
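Conceptually, these three conditions reduce to a boolean mask over rows. The snippet below is a sketch of that logic in plain pandas, not aboba's actual implementation:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-02-15', '2024-03-15', '2024-03-20']),
    'inclusion_date': pd.to_datetime(['2024-01-10', '2024-01-10', '2024-03-10']),
    'is_in_b_group': [1, 1, 1],
    'payment': [10.0, 10.0, 10.0],
})
effect_start = pd.Timestamp('2024-03-01')

# Treated rows, on/after the effect start date, for users
# who were already included before the effect began
mask = (
    (df['is_in_b_group'] == 1)
    & (df['date'] >= effect_start)
    & (df['inclusion_date'] < effect_start)
)
df.loc[mask, 'payment'] += 5.0 * 1.0  # base effect scaled by multiplier
print(df['payment'].tolist())  # [10.0, 15.0, 10.0]
```

Only the middle row is modified: the first falls before the effect start, and the third belongs to a user included after the effect began.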

Step 4: Create Designer with Time Constraints

# Create designer with parameter constraints
designer = TimeBasedDesigner(
    data=data,
    test=test,
    get_pipeline=get_pipeline,
    get_effect=get_effect,
    date_column='date',
    experiment_duration='experiment_duration',  # parameter name
    effect_start_date='effect_start_date',  # parameter name
    n_iter=100,
    constraints={
        'effect_start_date': [
            '2024-03-01',
            '2024-04-01',
            '2024-05-01'
        ],
        'effect_multiplier': [0.8, 1.0, 1.2, 1.5],
        'experiment_duration': [
            pd.Timedelta(weeks=4),
            pd.Timedelta(weeks=6),
            pd.Timedelta(weeks=8)
        ]
    }
)

Step 5: Optimize and Visualize

# Find optimal parameters
designer.optimize()

# Get the best parameters
best_params = designer.get_best_params()
print(f"Best effect start date: {best_params.parameters['effect_start_date']}")
print(f"Best effect multiplier: {best_params.parameters['effect_multiplier']}")
print(f"Best experiment duration: {best_params.parameters['experiment_duration']}")

# Visualize all results
designer.visualize()

Advanced Visualization: Fixed Parameters

You can visualize results while fixing certain parameters:

# Visualize with fixed duration and start date
designer.visualize(
    fixed_parameters={
        'experiment_duration': pd.Timedelta(weeks=4),
        'effect_start_date': '2024-03-01',
    }
)

This shows how the remaining parameters (like effect_multiplier) affect performance when other parameters are held constant.
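Conceptually, fixing parameters is just slicing the grid of results. With any tabular summary of the runs you could do the same filter by hand (the table below is hypothetical; the designer's internal result format is not shown here):

```python
import pandas as pd

# Hypothetical results grid: one row per tested parameter combination
results = pd.DataFrame({
    'experiment_duration': ['4w', '4w', '8w', '8w'],
    'effect_multiplier':   [1.0, 1.5, 1.0, 1.5],
    'power':               [0.62, 0.81, 0.79, 0.93],
})

# Hold duration fixed and inspect power vs. effect_multiplier
fixed = results[results['experiment_duration'] == '4w']
print(fixed[['effect_multiplier', 'power']].to_string(index=False))
```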

Key Concepts in Experiment Design

Type I Error (False Positive Rate)

  • Definition: Probability of detecting an effect when none exists
  • Target: Should be close to your significance level (typically 0.05)
  • Measured by: Running AA tests (both groups from same distribution)

Type II Error and Statistical Power

  • Type II Error: Probability of missing a real effect
  • Statistical Power: 1 - Type II Error (probability of detecting a real effect)
  • Target: Typically aim for 80% power (Type II error = 0.20)
  • Measured by: Running AB tests with known synthetic effects

Optimization Strategy

The designers find parameters that:

  1. Maintain Type I error near the significance level (no inflation)

  2. Maximize statistical power (minimize Type II error)

  3. Balance practical constraints (sample size, duration, cost)
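Given per-parameter error estimates, this strategy reduces to a simple selection rule. A sketch, assuming a hypothetical list of measurements rather than aboba's internal result objects:

```python
# Pick the cheapest parameter setting that controls Type I error
# and reaches the target power (illustrative selection rule only)
measurements = [
    {'n_samples': 100,  'type_I': 0.052, 'power': 0.41},
    {'n_samples': 500,  'type_I': 0.049, 'power': 0.78},
    {'n_samples': 1000, 'type_I': 0.051, 'power': 0.86},
    {'n_samples': 5000, 'type_I': 0.050, 'power': 0.99},
]

alpha, target_power, tolerance = 0.05, 0.80, 0.01
feasible = [
    m for m in measurements
    if m['type_I'] <= alpha + tolerance and m['power'] >= target_power
]
best = min(feasible, key=lambda m: m['n_samples'])  # smallest feasible size
print(best)  # {'n_samples': 1000, 'type_I': 0.051, 'power': 0.86}
```

Here n_samples = 500 narrowly misses the power target, so the rule settles on 1000: the smallest size that satisfies both constraints.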