The 30-day AI experiment framework
Stop strategising. A 30-day framework to test whether AI works for your business: one workflow, one team, one outcome, one decision at the end.
- Most AI rollouts fail because they start as strategy decks instead of experiments.
- Pick one workflow, one team, one outcome metric, one decision date. 30 days from start to "scale or kill."
- Week 1: scope + baseline. Week 2: build / configure. Week 3: run. Week 4: measure + decide.
- If you cannot define the success criteria up front, you are not ready to experiment yet. Go re-read the readiness checklist.
Most teams' AI projects look like this: "we should do AI" → six months of meetings → strategy deck → procurement → two-year rollout → the discovery that it does not work in your context. Cost: 12-18 months. Lesson learned: almost nothing you can use.
A 30-day experiment is the antidote. Below is the framework.
The four constraints
Every experiment defines these four before starting:
- One workflow. Not the whole department. Not "AI in customer support." Specifically: "AI drafts replies to refund-request emails for the e-commerce team."
- One team. 2-5 people who run the experiment. Their job for 30 days is the experiment.
- One outcome metric. The number that decides scale-or-kill. Not three KPIs. One.
- One decision date. Day 30 from start. At that meeting, you decide.
Resist scope expansion. Every additional workflow, team, or metric reduces the chance that the experiment gives you a clean result.
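The four constraints can be written down as a small record that rejects vague entries. The sketch below is one possible shape, not part of the framework itself: the class name `ExperimentCharter`, its field names, the example team, and the start date are all hypothetical, and it assumes Day 1 is the start date.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ExperimentCharter:
    """Hypothetical one-page charter: four constraints plus the success threshold."""

    workflow: str            # one specific workflow, not a whole department
    team: tuple[str, ...]    # the 2-5 people who run the experiment
    metric: str              # the single outcome metric
    threshold: float         # value the metric must reach to justify scaling
    start: date              # Day 1 of the experiment

    def __post_init__(self) -> None:
        if not self.workflow.strip() or not self.metric.strip():
            raise ValueError("name exactly one workflow and one metric")
        if not 2 <= len(self.team) <= 5:
            raise ValueError("the team should be 2-5 people")

    @property
    def decision_date(self) -> date:
        # Assumption: Day 1 is the start date, so Day 30 falls 29 days later.
        return self.start + timedelta(days=29)


# Illustrative values only.
charter = ExperimentCharter(
    workflow="AI drafts replies to refund-request emails for the e-commerce team",
    team=("agent_1", "agent_2", "team_lead"),
    metric="reduction in agent time per ticket",
    threshold=0.50,
    start=date(2025, 3, 3),
)
print(charter.decision_date)  # 2025-04-01
```

Because the record is frozen, nobody can quietly swap the metric or the threshold halfway through.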
The week-by-week
Week 1 (days 1-7): scope + baseline
- Day 1-2: write the experiment charter. One page. The four constraints + the success threshold.
- Day 3-4: measure the baseline. Without AI, how does this workflow perform today? Get 30 days of historical data if available.
- Day 5-7: define the eval set. 50-100 representative cases with the right answer noted. This is how you will measure quality.
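The baseline and eval-set steps of Week 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the median is used as the baseline statistic, the eval set is drawn by seeded random sampling (the framework only asks for "representative" cases, so you may prefer to stratify by case type), and the record keys `input` and `expected` are hypothetical.

```python
import random
from statistics import median


def baseline_median(values: list[float]) -> float:
    """Median of the historical measurements (assumed baseline statistic)."""
    if not values:
        raise ValueError("no historical data to baseline against")
    return median(values)


def draw_eval_set(cases: list[dict], size: int = 50, seed: int = 0) -> list[dict]:
    """Draw a reproducible sample of cases that have the right answer noted."""
    labelled = [c for c in cases if c.get("expected") is not None]
    if len(labelled) < size:
        raise ValueError(f"need {size} labelled cases, found {len(labelled)}")
    # A fixed seed means the same cases are drawn every time.
    return random.Random(seed).sample(labelled, size)


# Made-up historical records for illustration.
history = [{"input": f"ticket {i}", "expected": f"reply {i}"} for i in range(200)]
eval_set = draw_eval_set(history, size=50, seed=42)
print(len(eval_set))                        # 50
print(baseline_median([12.0, 9.5, 14.0]))   # 12.0
```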
Week 2 (days 8-14): build / configure
- Day 8-10: choose the model + the prompt strategy + the integration. Resist over-engineering.
- Day 11-12: build the workflow. If you are buying, configure the vendor.
- Day 13: run the eval set. Quality should reach at least 70% of the target threshold. If it falls short, fix the prompt or pick a different model. Do not go to week 3 with bad quality.
- Day 14: stakeholder walkthrough. Get sign-off to run live.
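The Day 13 check is simple arithmetic: score the eval set, then compare the score with 70% of the target. A minimal sketch, assuming exact-match scoring (many workflows will need a rubric or human grading instead) and hypothetical function names:

```python
def exact_match_accuracy(outputs: list[str], expected: list[str]) -> float:
    """Share of eval cases where the output equals the noted right answer."""
    if len(outputs) != len(expected) or not expected:
        raise ValueError("outputs and expected must be non-empty and equal length")
    hits = sum(o.strip() == e.strip() for o, e in zip(outputs, expected))
    return hits / len(expected)


def passes_day13_gate(accuracy: float, target: float, fraction: float = 0.70) -> bool:
    """True if quality is at least `fraction` of the target threshold."""
    return accuracy >= fraction * target


# Illustrative values: 3 of 4 cases match, target is 95%.
accuracy = exact_match_accuracy(["a", "b", "x", "d"], ["a", "b", "c", "d"])
print(accuracy)                                   # 0.75
print(passes_day13_gate(accuracy, target=0.95))   # True
```

With a 95% target, the gate sits at 66.5%. Anything below that goes back to prompt or model changes, not to Week 3.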
Week 3 (days 15-21): run
- Day 15: launch with the chosen team. Limited scope.
- Day 15-21: AI handles the workflow. An operator reviews every output for the first 2 days, then samples 20% of outputs.
- Daily: 10-minute standup. What broke, what worked, what to tweak.
- Mid-week: prompt tuning if quality is drifting.
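The review rule for the live run (every output for the first 2 days, then a 20% sample) can be expressed as a small routing function. This is a sketch only; `needs_review` and `day_of_run` are hypothetical names, and `day_of_run` is assumed to be 1 on launch day (Day 15 of the experiment).

```python
import random


def needs_review(day_of_run: int, rng: random.Random, sample_rate: float = 0.20) -> bool:
    """Decide whether an operator should review this output."""
    if day_of_run <= 2:
        return True                      # first 2 days: review everything
    return rng.random() < sample_rate    # afterwards: random 20% sample


rng = random.Random(7)  # seeded so the sample can be reproduced
print(all(needs_review(day, rng) for day in (1, 2)))  # True
sampled = sum(needs_review(3, rng) for _ in range(10_000))
print(sampled / 10_000)  # close to 0.20
```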
Week 4 (days 22-30): measure + decide
- Day 22-26: outcome data accumulates. Track the chosen metric.
- Day 27-28: synthesise the findings. Outcome vs baseline. Cost vs benefit. Team feedback.
- Day 29: write the decision memo. Scale, kill, or extend by 30 days (only if there is a specific learning question).
- Day 30: decision meeting. One decision. No "let's think about it for another quarter."
The success threshold
Define before you start. "What number on the outcome metric, by Day 30, would convince us to scale?"
Examples:
- "Refund email drafts reduce agent time per ticket by 50%+."
- "AI-suggested practice plans achieve 80%+ teacher acceptance."
- "AI receipt OCR achieves 95%+ accuracy on the eval set."
- "AI follow-ups generate 20%+ more replies than the previous template."
Vague thresholds ("we'll see if it helps") guarantee a vague outcome. Specific thresholds guarantee a clean decision.
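A threshold such as "reduce agent time per ticket by 50%+" only works if everyone computes the reduction the same way. A minimal sketch, assuming the reduction is measured relative to the Week 1 baseline; the function names and the minute values are illustrative, not from a real experiment.

```python
def relative_reduction(baseline: float, observed: float) -> float:
    """Fractional reduction versus baseline, e.g. 0.5 means 50% lower."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - observed) / baseline


def clears_threshold(baseline: float, observed: float, required: float = 0.50) -> bool:
    """True if the observed value beats the baseline by at least `required`."""
    return relative_reduction(baseline, observed) >= required


# Baseline of 12 minutes per ticket (made-up numbers).
print(clears_threshold(12.0, 5.0))  # True  (about a 58% reduction)
print(clears_threshold(12.0, 7.0))  # False (about a 42% reduction)
```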
The three decisions at Day 30
Scale
Outcome cleared the threshold. Roll out to more teams and more workflows. Allocate the next 90 days to the rollout.
Kill
Outcome did not clear the threshold. Document what was learned. Pick a different workflow + run another 30-day experiment.
Extend
Outcome was ambiguous. Use this only with a specific learning question, e.g. "the metric is borderline because the eval set was the wrong one — extend 30 days with a corrected eval set."
Do not extend just because you are afraid to decide.
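The Day 30 rule can be written as a function with exactly three outputs. The sketch below is one reading of the rule, under these assumptions: the metric is "higher is better", whether the result is ambiguous is a judgment passed in by the team, and an extension is only granted when a specific learning question is supplied. All names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    SCALE = "scale"
    KILL = "kill"
    EXTEND = "extend"


def decide(
    outcome: float,
    threshold: float,
    ambiguous: bool = False,
    learning_question: Optional[str] = None,
) -> Decision:
    """Return the single Day 30 decision."""
    has_question = bool(learning_question and learning_question.strip())
    if ambiguous and has_question:
        return Decision.EXTEND
    # No specific question means no extension: the number decides.
    return Decision.SCALE if outcome >= threshold else Decision.KILL


print(decide(0.58, 0.50))                  # Decision.SCALE
print(decide(0.42, 0.50, ambiguous=True))  # Decision.KILL
print(decide(0.48, 0.50, ambiguous=True,
             learning_question="Was the eval set the wrong one?"))  # Decision.EXTEND
```

An ambiguous result with no learning question falls through to scale or kill, which is the point: being unsure is not enough reason to extend.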
Common failures
Scope creep
Day 12, someone says "what if we also did X?" The right answer is "in the next experiment, not this one."
Bad baseline
If you cannot quantify the workflow's current performance, you cannot prove the AI improved it. Spend the time in Week 1 to measure it.
Eval set drift
The eval set must be fixed in Week 1 and left unchanged. Changing it mid-experiment to make the AI look good is cheating yourself.
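One way to make drift visible is to record a fingerprint of the eval set in Week 1 and check it before every later eval run. A minimal sketch using a SHA-256 hash of the JSON-serialised cases; the function name and the record keys are hypothetical, and the approach assumes the cases are JSON-serialisable.

```python
import hashlib
import json


def eval_set_fingerprint(cases: list[dict]) -> str:
    """Stable hash of the eval set; any edit, addition, or reorder changes it."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Week 1: record the fingerprint alongside the charter.
week1_cases = [
    {"input": "refund request A", "expected": "approve"},
    {"input": "refund request B", "expected": "escalate"},
]
frozen = eval_set_fingerprint(week1_cases)

# Later: a quietly "corrected" label no longer matches.
edited_cases = [dict(c) for c in week1_cases]
edited_cases[1]["expected"] = "approve"
print(eval_set_fingerprint(week1_cases) == frozen)   # True
print(eval_set_fingerprint(edited_cases) == frozen)  # False
```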
Skipping the decision meeting
"Let's revisit next quarter" is the silent killer of experiments. Day 30 is the day. Schedule it on Day 1.
When you are not ready
You are not ready for an experiment if:
- You cannot name the workflow specifically.
- You cannot get 2-5 people to commit 30 days to it.
- You cannot measure the outcome metric reliably.
- You will not actually decide at Day 30.
If any of these is true, read our AI readiness checklist before running the experiment.
What this means for you
- 30-day experiments beat 6-month strategies. Pick a workflow, run it, decide.
- Four constraints up front: one workflow, one team, one metric, one date.
- Week 1 baseline + eval set. Week 2 build. Week 3 run. Week 4 decide.
- The decision is scale, kill, or extend-with-specific-question. Not "think about it."
- Read our rollout playbook if the experiment lands as a "scale" decision.
Want to run a 30-day AI experiment? Book a 30-minute call. We will help you scope it.