Tutorial: Experiment Design and Optimization

This tutorial covers the experiment design capabilities of the aboba library. You'll learn how to optimize experiment parameters, determine optimal sample sizes, and design time-based experiments for maximum statistical power.

Understanding Experiment Design

Experiment design in aboba helps you answer critical questions before running your AB tests:

  • How many samples do I need? - Determine the minimum sample size for desired statistical power
  • What parameters should I use? - Find optimal configurations for your experiment
  • How long should my experiment run? - For time-based experiments, determine the ideal duration
  • What's my expected power? - Estimate Type I and Type II error rates

The library provides three designer classes:

  • BaseExperimentDesigner: Maximum flexibility with custom experiment factories
  • BasicExperimentDesigner: Simplified interface for common use cases
  • TimeBasedDesigner: Specialized for time-series experiments

Example 1: Custom Experiment Design with BaseExperimentDesigner

The BaseExperimentDesigner gives you complete control over the experiment design process. You define a factory class that generates experiments with different parameters.

Understanding the Workflow

The designer works by:

  1. Creating experiments with different parameter combinations

  2. Running AA tests to measure Type I error (false positive rate)

  3. Running AB tests with known synthetic effects to measure Type II error (the false negative rate; statistical power is 1 minus this)

  4. Finding the optimal parameters that balance these error rates
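The workflow above can be sketched with plain numpy/scipy. This is a simplified stand-in for what the designer automates, not aboba's actual implementation: repeated AA draws estimate the Type I error, and repeated AB draws with a synthetic shift estimate power.

```python
import numpy as np
import scipy.stats as sps

rng = np.random.default_rng(0)
alpha, n, n_iter, effect = 0.05, 200, 2000, 0.3

# Steps 2-3: repeat AA and AB draws, collecting p-values
aa_pvalues, ab_pvalues = [], []
for _ in range(n_iter):
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    aa_pvalues.append(sps.ttest_ind(a, b).pvalue)           # no real effect
    ab_pvalues.append(sps.ttest_ind(a, b + effect).pvalue)  # synthetic effect

type_I_error = np.mean(np.array(aa_pvalues) < alpha)  # should sit near alpha
power = np.mean(np.array(ab_pvalues) < alpha)         # 1 - Type II error
print(f"Type I error: {type_I_error:.3f}, power: {power:.3f}")
```

Step 4 then amounts to repeating this for each candidate parameter value and picking the one with acceptable error rates.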

Step 1: Create an Experiment Factory

First, define a class that generates experiments based on parameters:

import numpy as np
import pandas as pd
import scipy.stats as sps
import typing as tp
from aboba.design.base_designer import (
    ExperimentDesignMetrics, 
    IntervalEstimate, 
    BaseExperimentDesigner
)
import aboba

class ExperimentSetup:
    """Factory class that creates experiments with varying sample sizes."""

    def generate_data(self, n_samples: int) -> pd.DataFrame:
        """Generate synthetic data from normal distribution."""
        data_a = sps.norm.rvs(size=n_samples, loc=0, scale=1)
        data_b = sps.norm.rvs(size=n_samples, loc=0, scale=1)

        # Create dataset with two columns: value and group
        data = pd.DataFrame({
            'value': np.concatenate([data_a, data_b]),
            'b_group': np.concatenate([
                np.repeat(0, n_samples),
                np.repeat(1, n_samples),
            ]),
        })

        return data

    def generate_test(self) -> aboba.base.base_test.BaseTest:
        """Create the statistical test to use."""
        return aboba.tests.AbsoluteIndependentTTest(
            value_column='value',
        )

    def generate_pipeline(self, n_samples: int) -> aboba.pipeline.Pipeline:
        """Create sampling pipeline based on sample size."""
        # Sample 10% of data, minimum 2 samples per group
        group_size = max(n_samples // 10, 2)
        return aboba.pipeline.Pipeline([
            ("GroupSplitter", aboba.splitters.GroupSplitter(
                size=group_size, 
                column='b_group'
            )),
        ])

    def __call__(self, parameters: tp.Dict[str, tp.Any]) -> ExperimentDesignMetrics:
        """
        Run experiment with given parameters and return metrics.

        This method is called by the designer for each parameter combination.
        """
        assert sorted(parameters.keys()) == ["n_samples"]

        n_samples = parameters["n_samples"]

        # Generate experiment components
        data = self.generate_data(n_samples)
        pipeline = self.generate_pipeline(n_samples)
        test = self.generate_test()

        # Create experiment hub
        experiment = aboba.experiment.AbobaExperiment()

        # Run AA test to measure Type I error (false positive rate)
        aa_group = experiment.group(
            "AA",
            test=test,
            data=data,
            data_pipeline=pipeline,
            n_iter=1000,
            joblib_kwargs={"n_jobs": -1, "backend": "threading"}
        ).run()

        alpha_level = 0.05

        # Calculate Type I error: proportion of false positives
        type_I_error = (aa_group.get_data()["pvalue"] < alpha_level).mean()

        # Run AB test with synthetic effect to measure power
        ab_group = experiment.group(
            "AB",
            test=test,
            data=data,
            data_pipeline=pipeline,
            synthetic_effect=aboba.effect_modifiers.GroupModifier(
                effects={1: 0.3},  # Add effect of 0.3 to group 1
                value_column='value',
                group_column='b_group',
            ),
            n_iter=1000,
        ).run()

        # Calculate Type II error: proportion of missed effects (1 - power)
        type_II_error = 1 - (ab_group.get_data()["pvalue"] < alpha_level).mean()

        return ExperimentDesignMetrics(
            type_I_error=IntervalEstimate(
                parameter_estimate=type_I_error
            ),
            type_II_error=IntervalEstimate(
                parameter_estimate=type_II_error
            ),
        )

Step 2: Create and Run the Designer

Now use the designer to find optimal sample sizes:

# Create designer with sample size constraints
designer = BaseExperimentDesigner(
    experiment_design_factory=ExperimentSetup(),
    constraints={
        # Test 20 different sample sizes from 10 to 100,000
        "n_samples": np.array(np.logspace(1, 5, 20), dtype=np.int32)
    }
)

# Find optimal parameters using brute force search
designer.optimize(method=BaseExperimentDesigner.OptimizerMethod.BRUTE_FORCE)

# Visualize the results
designer.visualize()

The visualization will show:

  • Type I error rate across different sample sizes (should stay near 0.05)

  • Statistical power (1 - Type II error) across sample sizes

  • The optimal sample size that achieves desired power while controlling Type I error
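As a sanity check on the brute-force result, the required per-group sample size for a plain two-sample t-test can also be computed analytically. This uses statsmodels, not aboba, and only covers the simple case without a sampling pipeline:

```python
from statsmodels.stats.power import TTestIndPower

# Per-group sample size for a standardized effect size of 0.3,
# alpha = 0.05, target power = 0.8
n_required = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"~{n_required:.0f} samples per group")  # roughly 175 per group
```

Simulation-based designers like the one above become necessary once the pipeline, test, or effect is too complex for a closed-form power calculation.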

Example 2: Simplified Design with BasicExperimentDesigner

For most use cases, BasicExperimentDesigner provides a simpler interface without requiring a custom factory class.

Step 1: Prepare Your Data and Test

import numpy as np
import pandas as pd
import scipy.stats as sps
from aboba.design.basic_designer import BasicExperimentDesigner
from aboba.tests import AbsoluteIndependentTTest
from aboba.splitters import GroupSplitter
from aboba.pipeline import Pipeline
from aboba.effect_modifiers import GroupModifier

# Generate sample data
n_samples = 1000
data_a = sps.norm.rvs(size=n_samples, loc=0, scale=1)
data_b = sps.norm.rvs(size=n_samples, loc=0, scale=1)

data = pd.DataFrame({
    'value': np.concatenate([data_a, data_b]),
    'b_group': np.concatenate([
        np.repeat(0, n_samples),
        np.repeat(1, n_samples),
    ]),
})

Step 2: Define Test and Effect

# Define the statistical test
test = AbsoluteIndependentTTest(value_column='value')

# Define the synthetic effect to test for
synthetic_effect = GroupModifier(
    effects={1: 0.3},  # Add 0.3 to group 1
    value_column='value',
    group_column='b_group',
)

Step 3: Create Designer with Parameter Constraints

The key feature is the get_pipeline function that creates pipelines based on parameters:

# Create designer with parameter constraints
designer = BasicExperimentDesigner(
    data=data,
    test=test,
    get_pipeline=lambda params: Pipeline([
        ('GroupSplitter', GroupSplitter(
            size=params['group_size'],  # Use parameter value
            column='b_group'
        )),
    ]),
    synthetic_effect=synthetic_effect,
    n_iter=1000,  # Number of iterations for each parameter combination
    constraints={
        "group_size": [50, 100, 200, 500],  # Test these group sizes
    }
)

Step 4: Optimize and Analyze

# Find optimal parameters
designer.optimize()

# Get the best parameters
best_params = designer.get_best_params()
print(f"Best group size: {best_params.parameters['group_size']}")

# Visualize results
designer.visualize()

The designer will:

  • Test each group size (50, 100, 200, 500)

  • Run 1000 iterations for each to estimate error rates

  • Find the group size that provides the best balance of power and Type I error control

  • Display visualizations showing performance across all tested parameters

Example 3: Time-Based Experiment Design

The TimeBasedDesigner is specialized for experiments with time-series data, where you need to optimize timing parameters like experiment duration and effect start date.

Understanding Time-Based Experiments

Time-based experiments have unique considerations:

  • Inclusion Date: When users enter the experiment
  • Effect Start Date: When the treatment effect begins
  • Experiment Duration: How long to run the experiment
  • Time-varying Effects: Effects that change over time

Step 1: Generate Time-Series Data

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from aboba.design.time_based_designer import TimeBasedDesigner
from aboba.tests import AbsoluteIndependentTTest
from aboba.splitters import UserSplitter
from aboba.pipeline import Pipeline
from aboba.effect_modifiers import TimeBasedEffectModifier
from aboba.utils.time_based_data_generator import generate_time_based_data

# Generate sample time-series data
# This creates user-level data with dates and payments
data = generate_time_based_data(
    n_users=200,
    start_date='2024-01-01',
    end_date='2024-12-31'
)

The generated data includes:

  • user_id: Unique user identifier

  • date: Date of each observation

  • payment: Payment amount (target metric)

  • is_in_b_group: Group assignment

  • inclusion_date: When user entered the experiment

Step 2: Define Test and Pipeline Factory

# Define the statistical test
test = AbsoluteIndependentTTest(value_column='payment')

# Create pipeline factory that uses parameters
get_pipeline = lambda params: Pipeline([
    ('UserSplitter', UserSplitter(
        group_column='is_in_b_group',
        user_column='user_id',
        size=params.get('user_sample_size', 50)  # Use parameter or default
    ))
])

Step 3: Define Time-Based Effect Factory

The effect factory creates effects based on timing parameters:

get_effect = lambda params: TimeBasedEffectModifier(
    effect=5.0 * params.get('effect_multiplier', 1.0),  # Scale effect
    effect_start_date=pd.to_datetime(params['effect_start_date']),
    value_column='payment',
    date_column='date',
    group_column='is_in_b_group',
    inclusion_date_column='inclusion_date'
)

This creates effects that:

  • Only apply after the effect_start_date

  • Only affect users who were included before the effect started

  • Scale by the effect_multiplier parameter
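Conceptually, these three conditions reduce to a boolean mask over rows. The snippet below is a sketch of that logic in plain pandas, not aboba's actual implementation:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-02-15', '2024-03-15', '2024-03-20']),
    'inclusion_date': pd.to_datetime(['2024-01-10', '2024-01-10', '2024-03-10']),
    'is_in_b_group': [1, 1, 1],
    'payment': [10.0, 10.0, 10.0],
})
effect_start = pd.Timestamp('2024-03-01')

# Treated rows, on/after the effect start date, for users
# who were already included before the effect began
mask = (
    (df['is_in_b_group'] == 1)
    & (df['date'] >= effect_start)
    & (df['inclusion_date'] < effect_start)
)
df.loc[mask, 'payment'] += 5.0 * 1.0  # base effect scaled by multiplier
print(df['payment'].tolist())  # [10.0, 15.0, 10.0]
```

Only the middle row is modified: the first falls before the effect start, and the third belongs to a user included after the effect began.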

Step 4: Create Designer with Time Constraints

# Create designer with parameter constraints
designer = TimeBasedDesigner(
    data=data,
    test=test,
    get_pipeline=get_pipeline,
    get_effect=get_effect,
    date_column='date',
    experiment_duration='experiment_duration',  # parameter name
    effect_start_date='effect_start_date',  # parameter name
    n_iter=100,
    constraints={
        'effect_start_date': [
            '2024-03-01',
            '2024-04-01',
            '2024-05-01'
        ],
        'effect_multiplier': [0.8, 1.0, 1.2, 1.5],
        'experiment_duration': [
            pd.Timedelta(weeks=4),
            pd.Timedelta(weeks=6),
            pd.Timedelta(weeks=8)
        ]
    }
)

Step 5: Optimize and Visualize

# Find optimal parameters
designer.optimize()

# Get the best parameters
best_params = designer.get_best_params()
print(f"Best effect start date: {best_params.parameters['effect_start_date']}")
print(f"Best effect multiplier: {best_params.parameters['effect_multiplier']}")
print(f"Best experiment duration: {best_params.parameters['experiment_duration']}")

# Visualize all results
designer.visualize()

Advanced Visualization: Fixed Parameters

You can visualize results while fixing certain parameters:

# Visualize with fixed duration and start date
designer.visualize(
    fixed_parameters={
        'experiment_duration': pd.Timedelta(weeks=4),
        'effect_start_date': '2024-03-01',
    }
)

This shows how the remaining parameters (like effect_multiplier) affect performance when other parameters are held constant.
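Conceptually, fixing parameters is just slicing the grid of results. With any tabular summary of the runs you could do the same filter by hand (the table below is hypothetical; the designer's internal result format is not shown here):

```python
import pandas as pd

# Hypothetical results grid: one row per tested parameter combination
results = pd.DataFrame({
    'experiment_duration': ['4w', '4w', '8w', '8w'],
    'effect_multiplier':   [1.0, 1.5, 1.0, 1.5],
    'power':               [0.62, 0.81, 0.79, 0.93],
})

# Hold duration fixed and inspect power vs. effect_multiplier
fixed = results[results['experiment_duration'] == '4w']
print(fixed[['effect_multiplier', 'power']].to_string(index=False))
```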

Key Concepts in Experiment Design

Type I Error (False Positive Rate)

  • Definition: Probability of detecting an effect when none exists
  • Target: Should be close to your significance level (typically 0.05)
  • Measured by: Running AA tests (both groups from same distribution)

Type II Error and Statistical Power

  • Type II Error: Probability of missing a real effect
  • Statistical Power: 1 - Type II Error (probability of detecting a real effect)
  • Target: Typically aim for 80% power (Type II error = 0.20)
  • Measured by: Running AB tests with known synthetic effects

Optimization Strategy

The designers find parameters that:

  1. Maintain Type I error near the significance level (no inflation)

  2. Maximize statistical power (minimize Type II error)

  3. Balance practical constraints (sample size, duration, cost)
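Given per-parameter error estimates, this strategy reduces to a simple selection rule. A sketch, assuming a hypothetical list of measurements rather than aboba's internal result objects:

```python
# Pick the cheapest parameter setting that controls Type I error
# and reaches the target power (illustrative selection rule only)
measurements = [
    {'n_samples': 100,  'type_I': 0.052, 'power': 0.41},
    {'n_samples': 500,  'type_I': 0.049, 'power': 0.78},
    {'n_samples': 1000, 'type_I': 0.051, 'power': 0.86},
    {'n_samples': 5000, 'type_I': 0.050, 'power': 0.99},
]

alpha, target_power, tolerance = 0.05, 0.80, 0.01
feasible = [
    m for m in measurements
    if m['type_I'] <= alpha + tolerance and m['power'] >= target_power
]
best = min(feasible, key=lambda m: m['n_samples'])  # smallest feasible size
print(best)  # {'n_samples': 1000, 'type_I': 0.051, 'power': 0.86}
```

Here n_samples = 500 narrowly misses the power target, so the rule settles on 1000: the smallest size that satisfies both constraints.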