CRO•2025-12-20
A/B Testing Playbook for Lean Teams
A practical framework for running experiments when you don't have a dedicated data science team.

Most A/B testing advice assumes you have a data science team, a dedicated analytics platform, and months to run experiments. That's not reality for most teams.
We run A/B tests with lean resources. Here's our playbook.
The Hypothesis Template
Every test starts with a hypothesis. We use this template:
"We believe [change] will [outcome] because [reason]. We'll know it worked if [metric] [direction] by [amount]."
Example: "We believe removing the phone number field from checkout will increase conversion rate because it reduces friction. We'll know it worked if checkout completion rate increases by 5%."
This template forces us to:
- Define the change clearly
- Predict the outcome
- Explain why we think it will work
- Set success criteria upfront
Without a clear hypothesis, tests become fishing expeditions. With one, every test teaches us something.
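The template above is really a small structured record, which makes it easy to keep every idea in the backlog in the same shape. A minimal sketch (the class and field names are our illustration, not a standard):

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One A/B test hypothesis, mirroring the fill-in-the-blank template."""
    change: str     # what we're changing
    outcome: str    # what we expect to happen
    reason: str     # why we believe it will work
    metric: str     # the success metric
    direction: str  # "increases" or "decreases"
    amount: str     # minimum change that counts as a win

    def statement(self) -> str:
        """Render the hypothesis back into the template sentence."""
        return (
            f"We believe {self.change} will {self.outcome} because {self.reason}. "
            f"We'll know it worked if {self.metric} {self.direction} by {self.amount}."
        )
```

Filling it with the checkout example above reproduces the sentence exactly, so the record and the prose never drift apart.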
Test Sizing: How Many Visitors?
You need enough visitors to detect a difference. But how many is enough?
We use a simple calculator:
- Baseline conversion rate: What's current performance?
- Minimum detectable effect: What's the smallest improvement we care about?
- Statistical significance: Usually 95%
- Statistical power: Usually 80%
For most checkout tests, we need 5,000-10,000 visitors per variant to detect a 5% relative lift. Detecting smaller effects takes more traffic.
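That calculator is a few lines of code. This is the standard two-proportion z-test sample-size formula; the 40% baseline in the usage note is an assumed figure for illustration, not a number from our data:

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variant to detect a relative lift in conversion rate,
    using a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% significance
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)
```

With an assumed 40% checkout-completion baseline and a 5% relative lift, this lands around 9,500 visitors per variant, consistent with the range above; halving the detectable lift roughly quadruples the requirement.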
If we don't have enough traffic, we either:
- Run the test longer (if we can wait)
- Focus on higher-traffic pages (homepage, product pages)
- Skip the test and ship the change (if it's low risk)
Prioritization: What to Test First
Not everything is worth testing. We prioritize based on:
1. Impact: How much revenue does this page/flow drive?
2. Confidence: How sure are we this will work?
3. Effort: How hard is it to build and test?
4. Risk: What happens if it breaks?
High impact + high confidence + low effort = test it now.
High impact + low confidence + low effort = test it soon.
Low impact + anything = probably skip it.
We maintain a backlog of test ideas, ranked by this framework.
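One way to turn those four factors into a sortable score: rate each on a 1-5 scale, multiply the things you want more of, divide by the things you want less of. The scales and the ratio are our illustrative choice, not a standard formula:

```python
def priority_score(impact: int, confidence: int, effort: int, risk: int) -> float:
    """Rank a test idea: impact and confidence push it up,
    effort and risk push it down. Each input is a 1-5 rating."""
    return (impact * confidence) / (effort * risk)


# Hypothetical backlog entries, scored and ranked.
backlog = {
    "remove phone field": priority_score(impact=5, confidence=4, effort=1, risk=2),
    "redesign homepage": priority_score(impact=5, confidence=2, effort=5, risk=3),
    "tweak footer links": priority_score(impact=1, confidence=3, effort=1, risk=1),
}
ranked = sorted(backlog, key=backlog.get, reverse=True)
```

The ranking matches the heuristics above: the high-impact, high-confidence, low-effort idea comes out on top, and the expensive low-confidence redesign sinks to the bottom.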
Stopping Rules: When to End a Test
Tests should end when one of the following happens:
- We reach the planned sample size and hit statistical significance (p < 0.05)
- We reach the planned sample size without significance
- The lift (or drop) is so large the decision is obvious, even before significance
- We hit a time limit (usually 2-4 weeks max)
Outside those rules, we don't peek at results. We set the sample size upfront, then wait: repeatedly checking for significance and stopping at the first p < 0.05 inflates the false-positive rate.
If results are inconclusive, we either:
- Run the test longer (if we have time)
- Ship the variant (if it's not worse)
- Keep the control (if the variant is worse)
The key is making a decision, not running tests forever.
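Checking significance at the planned sample size takes only a few lines. This is the standard pooled two-proportion z-test (two-sided); the visitor and conversion counts in the test are made up for illustration:

```python
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

If the p-value comes in under 0.05 at the planned sample size, ship the winner; if not, fall back to the decision rules above rather than extending the test indefinitely.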
Tools: What We Use
We don't use expensive A/B testing platforms. Instead, we use:
- Google Optimize (free, integrated with GA4) while it lasted; Google sunset it in September 2023, so newer tests run on the flag-based setup below
- Vercel Edge Config (for feature flags)
- Custom analytics (for tracking)
For most tests, a feature-flag traffic split plus our own tracking is enough. For more complex tests, we build custom solutions.
The key is using tools that fit our workflow, not tools that require us to change our workflow.
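Whatever serves the variants, assignment should be deterministic: a returning visitor must always see the same variant. A generic sketch of the underlying idea using a hash of the visitor ID (this is not Vercel's API, just the bucketing technique feature-flag tools implement):

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministically bucket a visitor: the same ID and experiment name
    always hash to the same variant, with no assignment table to store."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

Keying the hash on the experiment name as well as the visitor ID means the same visitor can land in different buckets across different experiments, which keeps tests independent of each other.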
Learning: What We Do With Results
Every test teaches us something, even if it "fails."
We document:
- What we tested
- What happened
- Why we think it happened
- What we'll test next
This creates a knowledge base that makes future tests better.
We also share results with the team. Even "failed" tests are valuable if they help us understand user behavior.
The Bottom Line
A/B testing with lean resources is about focus: clear hypotheses, smart prioritization, proper sizing, and disciplined stopping rules.
You don't need a data science team to run good tests. You need a process, some basic stats knowledge, and the discipline to follow through.
The best teams test constantly, learn quickly, and ship what works. That's the playbook.