From: Olivia Tati
Subject:
Four case study articles about how I've used artificial intelligence and machine learning to strengthen my copywriting process and create marketing knowledge systems that scale.
The goal of each is to empower AI with unique context, so marketing strategy is more objective, while leaving room for invaluable human input.
People assume, "The more context I give AI, the better my outputs will be!"
But this isn't the case.
Context is only the beginning of good AI output.
So, if more context isn't enough, then what do we need?
My hypothesis:
The rest of this page documents the AI copywriting system I built, and the discoveries I made along the way.
First, I built a Chrome extension that scrapes my client's email service provider to get all relevant email data.
I knew I could use the 3 years of emails that I've written in Olivia's voice to create a better AI copywriting workflow. This scraping gave the added benefit of collecting email performance as well.
The emails by themselves gave me raw context, but I doubted this would be good enough.
AI could see what I wrote, but all it could do was try to match my taste as a copywriter. If you've tried this, you probably know it doesn't produce the best results.
To make the context truly intelligent, I needed to be able to label and encode my taste with objective truths.
So I compiled 44 texts across linguistics and copywriting domains. I analyzed each text's major contributions, then ran a cross-text analysis to map the connections between all 44.
Next, I went through every single email at varying levels of depth, and analyzed it against the 44 texts I compiled.
Put simply, the analysis process looks like this:
This intelligent context is what enables scalable copywriting.
Two libraries. One network.
On the left: the canon representing the 44 texts.
On the right: a sample of her body of work.
The lines between them show how the canon texts inform specific writing patterns from the emails.
The cross-reference at work
The canon
Her body of work
Or hover any book or email to explore the connections. Click to read the explanation.
3 connections in her body of work
Hover any phrase to see the move it's making.
Because unlike the spreadsheet bros who think “mindset” is a dirty word, I know that the biggest thing standing between most women and their first property isn't a math problem. It's a belief problem. A confidence problem. A “I've been saving carefully for YEARS and I refuse to throw that at something I'm not sure about” problem.
From “Engineer + Vibes = Fund Your Freedom”, launch email, March 2026.
Pick any passage. Hover over any phrase to see the moves it's making.
What if it's possible that you don't have anxiety, you actually just HATE your job? And I don't say that to invalidate mental health issues at all, but as someone without any history of mental health issues, I feel like I woke up one day, a decade into my engineering job, and realized I was depressed and anxious ALL the time.
From: "Escape your anxiety for ONE day!" newsletter, April 4, 2024
Rhetorical-question opener that reframes a category
Names a familiar interior experience, then sideloads a different cause underneath it. Reframing through question, not declaration, disarms the reader's defense.
From Aristotle, On Rhetoric
Anti-objection inside the move
Before the reader can object ("you're being dismissive"), the writer labels the objection out loud and dispatches it. The dash break is doing labor.
From Voss, Never Split the Difference
Specificity tax: "a decade," "engineering job"
The number and the job title make the autobiographical claim auditable. Concrete specifics make the writer trustworthy.
From Heath and Heath, Made to Stick
"Do you ever feel like life is holding you back? Like you can't go full speed no matter how hard you try?"
""I can barely trust myself to stay out of credit card debt. How am I supposed to handle owning a whole property?""
"At 25, I was making six figures as a corporate engineer."
Now that I had intelligent context, it was time to use it to produce copy.
I had the intelligent context create a brand voice skill doc, I prompted AI, and I was disappointed…
Was all of this work for nothing? Wasn't good context + skill docs supposed to make AI outputs consistent and scalable?
Turns out no.
So I created a 9-phase testing plan to figure out which agentic writing pattern produced the best output.
You can see those phases and what they tested below.
Tested whether the writer model alone (with just a one-line role prompt) could produce on-brand copy.
Hey [First Name],
I want to tell you about my friend Stephanie. I actually met her years ago. We were both backpacking through Bosnia and ended up at the same hostel. The kind of woman you meet once and just know she’s going places.
Fast forward to earlier this year. Stephanie had been living in Mexico (just for fun, because she could), and she decided she wanted to buy investment property in Florida. She had the income. She had the drive. She had the "I’m a boss and I figure things out" energy in spades.
So naturally, I told her to join Wanderlust Wealth Academy.
Her response? "I don’t think I really need it. Buying a house is easy."
And honestly? I wasn’t even mad.
“Reads like an AI wrote a ‘sounds like a friend’ email. The structure is there. The voice isn’t.”
I built an AI output ablation tester that would create competitors for every testing round.
Each competitor was prompted with differing variables. Some different variables included:
I was blind to which competitor produced which output. I then scored each output and made comments to explain my reasoning behind the score.
Here's a simplified demonstration of how this system works:
All to produce a more scalable AI writing process.
The brief
Mid-launch student-spotlight for Fund Your Freedom. The Masterclass is March 17. Feature Cyan, a 25-year-old engineer in Chicago, single income, used the NACA loan to buy a $634K duplex with zero dollars down, then traveled Europe for a month while her property paid for everything. Speak to the reader who's worried she's not financially disciplined enough to handle real estate. Match Olivia's voice.
AI searched the knowledge base for…
Right now, the only way to write with AI is a one-shot back-and-forth with an AI chat interface.
This is a slow and frustrating way to work because you're iterating on entire pieces of copy at a time.
This interface allows you to take proven copywriting structures and write pieces with AI, one step at a time.
This interface combined with intelligent context and agentic writing patterns is the ultimate way to scale copywriting.
Current AI writing, even when you give it everything, is not being used to its full potential. There are three reasons.
Generic context
Even with infinite memory, the AI doesn't know what makes your writing yours.
One-shot prompting
Prompt in. Output out. AI is writing for you, but without agentic writing patterns, AI isn't thinking while it writes.
Writing with AI chat interfaces
It's inefficient and frustrating to go back and forth on entire pieces at a time.
Intelligent context
Encode your copywriting taste, so AI understands how your voice works and how to match it.
Agentic writing patterns
Spawn sub-agents and assign them jobs so that the AI thinks through its decisions fully.
Specialized AI writing interface
Truly collaborate with AI and iterate on smaller chunks at a time.
Quantifying Brand Voice
Answering the question, "Does this sound like my client?"
Brand voice is the hardest thing to encapsulate and scale.

And since I like doing hard things, I decided I wanted to try to take the "magic" out of brand voice, and turn it into a number that answers one question:
"How closely does this piece of copy resemble the brand voice?"
To do this, I took about 200 emails that I hand-wrote for one of my clients, Olivia, and ran them through an AI model called StyleDistance.
The result is exciting for anybody who's looking to capture and scale brand voice copy.
The reason I cared about quantifying brand voice into a number, not just a feeling, is what happens when you try to write at scale.
If you've ever:
You know just how hard (and incredibly frustrating) this can be.
If I could solve this, the writer could see how close or far they are from matching the brand's voice. And with a number in hand, I could use it as a feedback mechanism for AI:
"If the brand voice match is under X%, retry."
And if we get a few more supporting details (discussed later), we could change that prompt to:
"If the brand voice match is under X%, use the supporting details to figure out why it doesn't match, and list the best edits to make in the next revision."
That's what got me excited enough to build this.
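Here's a rough sketch of what that retry prompt looks like as a loop. It assumes two hypothetical functions that aren't part of anything I've described so far: `score_fn` (returns a 0-to-1 brand voice match) and `revise_fn` (asks the AI for a new draft). The toy stand-ins at the bottom only exist so the loop can run.

```python
from typing import Callable, List, Tuple


def revise_until_on_voice(
    draft: str,
    score_fn: Callable[[str], float],        # hypothetical: returns a 0-1 voice match
    revise_fn: Callable[[str, float], str],  # hypothetical: asks the AI for a new draft
    threshold: float,
    max_rounds: int = 3,
) -> Tuple[str, List[float]]:
    """Re-draft until the voice score clears the threshold or rounds run out."""
    history = [score_fn(draft)]
    for _ in range(max_rounds):
        if history[-1] >= threshold:
            break  # close enough to the brand voice, stop revising
        draft = revise_fn(draft, history[-1])
        history.append(score_fn(draft))
    return draft, history


# Toy stand-ins so the loop can run without a model or an AI call.
toy_score = lambda text: min(1.0, text.count("!") / 4)
toy_revise = lambda text, score: text + "!"

final_draft, scores = revise_until_on_voice("Hi", toy_score, toy_revise, threshold=0.5)
print(final_draft, scores)
```

The `max_rounds` cap is there so a draft that never clears the bar doesn't loop forever.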
Quantifying brand voice is complicated, but not impossible!
How complicated, though? The easiest way I've found to explain this layer is to compare it to a spam filter. A spam filter asks a yes-or-no question. If it's spam, you put it in bucket A. If it's not spam, you put it in bucket B. EASY.
But brand voice has significantly more variables than a simple yes or no.
Instead of two buckets, StyleDistance creates a cloud of 768 dimensions. If you can't picture it, that's okay, neither can I.
The point is that it's much more complex than the spam folder, and that the experts have taken care of the math behind the 768 dimensions (thank goodness).
Spam detector, two bins
Style space, a region not a bin
Under the hood, there's a machine learning model called StyleDistance that does one thing: it reads a piece of writing and pays attention to HOW it's written, not WHAT it's written about.
Put simply, StyleDistance measures style, not topic.
Most models that turn text into numbers do the opposite. They measure the topic and ignore the style.
But I wanted a model that notices when two pieces of text use the same diction, rhythm, tone, mood, energy, and sentence shapes, REGARDLESS of the topic.
The way it was trained is the part I find interesting. The researchers showed it the same content rewritten in different styles, over and over, until it learned to distinguish between the styles.
It read about the same idea in a formal voice, then in a casual voice, then in a clinical one.
This tells the model, "these pieces all say the SAME thing, so tell them apart by HOW they say it."
Do that enough times with enough examples and the model learns STYLE.
Generic embedder
clusters by topic
StyleDistance
clusters by style
Once style is the organizing variable, the rest is math.
What the model gives you for any piece of text is a list of 768 numbers. More specifically, those 768 numbers are actually coordinates in a really high-dimensional style space.
Think of graph paper, but instead of two axes (X and Y), you have 768 of them. A piece of writing then becomes just a single dot somewhere in that space.
Here's the part that makes it all work: Two pieces of writing that sound alike land near each other on that graph. Two pieces that sound different land far apart. The closer together, the closer the brand voice match.
Now it's time to actually use the dang thing to score a draft.
There are three steps to seeing if this worked or not.
One: Take the new draft and run it through StyleDistance. This gives it the 768 numbers it needs and plots it on the same graph as the 200 emails I've written by hand.
Two: Measure how far the draft is from the 200 brand voice target emails. There's some nerdy math stuff here, but basically, you compare the direction of your draft's vector against the direction of the average of the example pieces (a measure called cosine similarity).
If they point in the same direction in style space, the score number is near 1 (good brand voice match). If they point in unrelated directions, the number is near 0 (bad match).
Three: Pull the closest archived pieces to the draft and surface them as evidence.
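Steps two and three can be sketched in a few lines of plain Python. This is a minimal illustration, not my production code: the function names are made up, and the three-number vectors stand in for the real 768-number ones.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity: near 1 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Average the archive vectors into one 'style signature'."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def score_draft(
    draft_vec: Vector, archive: Dict[str, Vector], k: int = 4
) -> Tuple[float, List[Tuple[float, str]]]:
    """Return the voice score plus the k closest archived pieces."""
    score = cosine(draft_vec, centroid(list(archive.values())))
    neighbors = sorted(
        ((cosine(draft_vec, vec), name) for name, vec in archive.items()),
        reverse=True,
    )[:k]
    return score, neighbors


# Toy 3-number vectors (hypothetical) in place of real embeddings.
archive = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.1, 0.0], "c": [0.0, 1.0, 0.0]}
score, neighbors = score_draft([1.0, 0.05, 0.0], archive, k=2)
print(round(score, 3), [name for _, name in neighbors])
```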
These matching pieces pulled from the archive are the "supporting details" I mentioned earlier. They become meaningful context that we can use later to answer questions like:
Now it was time to ask the scariest question of all. Does it actually work??
I picked an email called Dec Newsletter 9, "The Power of Decision." It's the kind of piece that should match the brand voice with flying colors.
I yanked it out of the 200 examples that I used to build the style signature. I rebuilt the signature from the other 199, then scored the test email against it.
It got a score of 0.988 (that's good)!
That puts the test email at the 91st percentile of the cohort, which means only about 9% of the other emails landed closer to their own style signature than this one did.
Put simply, the piece I pulled out of the pile sat back down almost exactly where it should have.
Then the receipt. The four closest neighbors from the rest of the cohort, by score, are:
The closest matches are recognizably the same kind of writing as the test email. Same shape, same rhythm, same job. That's exactly what I want this tool to do!
score(draft, brent_email_centroid) · cohort p5 = 0.94
Dec Newsletter 9, “The Power of Decision”
That's just ONE test though...
A careful reader would (correctly) push back and ask whether Dec Newsletter 9 was a lucky chance.
So to really test the heck out of this thing, I ran the same test on all 200 emails.
For each email in the cohort, I rebuilt the style signature from the other 199 and scored the held-out one.
The full distribution comes out at p5 = 0.937, median = 0.979, p95 = 0.990. Effectively the same shape as the in-cohort distribution.
This just means that the system reliably recognizes Olivia's brand voice across all 200 emails!
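For the curious, the hold-one-out test is simple enough to sketch. The helper names and the tiny two-number vectors below are made up for illustration, and the percentile uses a basic nearest-rank rule, which may differ slightly from how my numbers were computed.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def leave_one_out_scores(vectors: Sequence[Vector]) -> List[float]:
    """Score each piece against a signature built from all the OTHER pieces."""
    total = [sum(col) for col in zip(*vectors)]
    scores = []
    for vec in vectors:
        # The sum of the others points the same way as their average,
        # and cosine similarity only cares about direction.
        others = [t - x for t, x in zip(total, vec)]
        scores.append(cosine(vec, others))
    return scores


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile (one simple convention among several)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Toy cohort (hypothetical): three similar pieces and one obvious outlier.
cohort = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
scores = leave_one_out_scores(cohort)
print([round(s, 3) for s in scores], percentile(scores, 50))
```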
To test it a little bit more, I took some academic text from a guy named Steven Pinker, and tested it against my brand voice email cohort. If everything is working, this should NOT score well.
It landed with a score of 0.76!
Almost all of the 200 emails sit in a tight ring (0.94 to 0.99, the p5-to-p95 range). But Pinker's academic text sits way outside at 0.76.
The Dec Newsletter 9 test email successfully sits right inside at 0.988.
The cohort minimum score is 0.84, and Pinker sits below even that, well under the cohort's p5 of 0.937. CLEANLY not a match. The encoder IS NOT confused about what does and doesn't belong as brand voice for Olivia.
The calibration bands fall out naturally:
This in and of itself is cool, but it's not the end-all-be-all of scaling brand voice.
However, it gives us a concrete number and some extra context that ENABLES us to do much cooler stuff in the future.
Next, I want to work this into an agentic writing workflow as a source of feedback for agents.
AI, as it is now, is not very good at editing because it's honestly not that smart.
But by giving it this extra layer of feedback, we're giving it one more way to reason about whether a piece of copy matches brand voice or not. The score is a quantifiable number that helps AI not hallucinate, and the context it gives explains WHY it does or doesn't match and HOW to make it better.
AI drafts
draft v1
This layer scores it
0.84
AI revises
draft v2
Closest matches as evidence
“Your house is your sugar daddy” · “Become the main character”
Predicting Copy Performance with Machine Learning
Forecasting how a draft is likely to perform before it ships.
The performance of every piece of copy (email, social posts, etc.) can feel like a gamble.
Maybe it's based on hunches you've had from past tests, maybe it's based on data, maybe it's based on something else...
The reality is that, until you publish a piece and get the numbers back, it's basically a guess.
But what if you could make a quantifiable prediction of how well a piece would perform BEFORE publishing it?
Something that says, "This will go viral and get around 1M views, but it won't move people further down the funnel."
If we knew that, it would be the marketing equivalent of time travel. Before we publish anything, we can go back and tweak it until it's optimized for the right thing.
How freakin' cool would that be?!
So that's what I tried to build here!
If you can't figure out what variables in your copy affect outcomes, this project is for you.
Every piece of copy should optimize for ONE goal at a time. The way you find that goal is by asking, "What one action do we want the reader to take?"
If we want more leads coming in top of funnel, our goal should be to optimize for virality and follower counts.
If we want leads going deeper into the funnel, we need to optimize for opt-ins, booked calls, etc.
Therefore, we need this performance predictor to differentiate between goals.
So I started by asking two questions.
Question 1: "Will this piece go viral, but not convert?"
Question 2: "Will the audience actually take the action the copy asks them to take?"
To train the performance predictor on this, we split up pieces of copy based on their goal. Did it include a CTA to go deeper into the funnel? Or was it optimizing for reach?
High engagement · low CTA conversion
“Costa Rica villa tour: what living abroad actually looks like at 32”
Outcome: went viral, but low conversion.
Low engagement · high CTA conversion
“Comment FREEDOM and I’ll DM you the rate sheet I used to close my first duplex.”
Outcome: not much reach, but massive action taking.
Once you have both numbers, you can do something no off-the-shelf software does. You can determine the variables that contribute to reach and conversion, and where they intersect.
To do this, we're using a machine learning model called LightGBM.
To start, the human creates variables that they believe might or might not affect the performance of copy. Those variables become a spreadsheet with a column for every variable and a row for every piece of copy, which is what LightGBM learns from.
Then every single column gets filled out for every single piece of copy.
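Filling in those columns is a feature-extraction job. Here's a minimal sketch for four columns. The column names match ones I use, but the exact definitions below (for example, treating "Comment" followed by an all-caps word as a ManyChat keyword) are simplified assumptions for illustration.

```python
import re
from typing import Dict


def extract_features(caption: str) -> Dict[str, float]:
    """Turn one caption into one spreadsheet row (a handful of columns)."""
    # Note: this simple word pattern also counts the text inside hashtags.
    words = re.findall(r"[A-Za-z']+", caption.lower())
    hashtags = re.findall(r"#\w+", caption)
    # Assumption: a keyword CTA looks like "Comment WORD" in all caps.
    keyword = re.search(r"\b[Cc]omment\s+[A-Z]{2,}\b", caption)
    return {
        "caption_length": len(caption),
        "hashtag_count": len(hashtags),
        "has_manychat_keyword": int(keyword is not None),
        # Share of words that are "to" (assumed meaning of fw_to_ratio).
        "fw_to_ratio": words.count("to") / len(words) if words else 0.0,
    }


caption = ("Comment FREEDOM and I'll DM you the rate sheet "
           "I used to close my first duplex.")
row = extract_features(caption)
print(row)
```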
To give you an idea, here's a sample row for one IG post:
Row for one IG post
caption_length : 187
hashtag_count : 9
has_manychat_keyword : 1
sentiment_arc_p1 : -0.12
sentiment_arc_p10 : +0.34
style_distance_to_centroid : 0.41
fw_to_ratio : 0.018
dim_offer : "lead-magnet"
... about 100 more like this
I decided that I wanted to be able to predict 6 different copywriting objectives across IG, YouTube, Threads, and Email.
Here's the list:
| Model | What it predicts | Trained on |
|---|---|---|
| IG conversion | ManyChat trigger volume | IG posts that contain a keyword |
| IG engagement (ManyChat keyword) | likes + saves + shares | same posts as above |
| IG engagement (no ManyChat keyword) | likes + saves + shares | IG posts without a keyword |
| Email open rate | open % | Brent-authored emails only |
| YouTube engagement | engagement composite | YouTube videos |
| Threads engagement | engagement composite | Threads posts |
After splitting all of the samples along these lines, and filling out the spreadsheet for each piece of copy, it's time to start training this thing!
What does "train" even mean?
Think of it like studying with flashcards (where each piece of copy is a single flashcard).
Eventually the algorithm gets pretty good at guessing the outcome on posts it's already seen, and (if you set it up right) on posts it hasn't.
But it's not really flashcards.
Under the hood, this algorithm uses something called "decision trees".
A decision tree is a short flowchart of yes/no questions about variables, ending in a guess. Here's one the model might actually build for IG conversion:
One tree the model built (illustrative)
Does the caption have a ManyChat keyword?
No → guess: 4 ManyChat triggers
Yes → How many hashtags?
≤ 9 → guess: 12 triggers
> 9 → How long is the caption?
≤ 220 chars → guess: 8 ManyChat triggers
> 220 chars → guess: 6 ManyChat triggers
The power behind LightGBM and these decision trees is that they build on each other's learning.
For example, the first tree makes a guess and finds it was off by 10. Tree 2 focuses on predicting what Tree 1 got wrong. Tree 3 goes after whatever error is still left, and so on. Over time, this process gets closer and closer to finding the patterns that make predictions more accurate.
This is called "Gradient Boosting" (hence the "GB" in LightGBM).
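To make the "each tree fixes the last tree's mistakes" idea concrete, here's a stripped-down sketch of gradient boosting using one-question trees (called stumps). It is not LightGBM, which is far more sophisticated, and the four data rows are made up for illustration.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, float, float]  # feature index, threshold, left guess, right guess


def fit_stump(X: Sequence[Sequence[float]], residuals: Sequence[float]) -> Stump:
    """Pick the one yes/no question that best explains the current errors."""
    mean = sum(residuals) / len(residuals)
    best_err, best = float("inf"), (0, float("inf"), mean, mean)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if err < best_err:
                best_err, best = err, (j, t, lm, rm)
    return best


def boost(X, y, n_trees: int = 50, learning_rate: float = 0.3):
    """Start from the average, then let each stump chase the leftover error."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # what is still wrong
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t, lm, rm))
        preds = [p + learning_rate * (lm if row[j] <= t else rm)
                 for row, p in zip(X, preds)]
    return base, stumps, preds


def mse(y, preds) -> float:
    return sum((a - b) ** 2 for a, b in zip(y, preds)) / len(y)


# Made-up rows: [has_manychat_keyword, hashtag_count] -> ManyChat triggers
X = [[0, 5], [1, 9], [1, 12], [1, 3]]
y = [4, 12, 8, 12]
base, stumps, preds = boost(X, y)
print(mse(y, [base] * len(y)), mse(y, preds))
```

The `learning_rate` makes each tree take a small step instead of a full correction, which is the usual way boosting avoids overreacting to any single tree.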
How do we know LightGBM is actually working though?
We also trained a simpler model, called ElasticNet, on every dataset as a sanity check. ElasticNet draws straight-line rules (every extra word does X to the score, that kind of thing).
LightGBM beat it on every one of the six datasets.
That's because LightGBM successfully catches the complex patterns where performance depends on multiple factors in any given piece.
And to prove that we're ABSOLUTELY SURE about the successful results, I ran the LightGBM process on each copywriting objective 15 times.
The ± number next to each score (something like ±0.025) is the spread across those 15 runs. A small number means a smaller spread and a higher confidence in performance. A bigger number means a wider spread and less confidence.
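If you want to see where a ± number like that comes from, here's a tiny sketch. It assumes the spread is the standard deviation across runs (one common choice), and the three scores are placeholders, not my real results.

```python
from statistics import mean, stdev
from typing import Sequence


def format_score(run_scores: Sequence[float]) -> str:
    """Summarize repeated runs as 'average ± spread'."""
    return f"{mean(run_scores):.3f} ± {stdev(run_scores):.3f}"


# Placeholder scores from three hypothetical runs (I ran 15 per objective).
print(format_score([0.60, 0.62, 0.64]))
```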
The top three are strong. This tells us that the model is able to predict higher performing copy around 64% to 71% of the time, which is statistically good enough to be useful in production.
Middle two (YouTube and Threads) aren't quite there yet, but I honestly haven't spent as much time on them.
The bottom one (IG keyword-path engagement at 0.08) is weak ON PURPOSE. We can use this as a way to predict what DOESN'T perform.
(Also, I left out the email click rate because there just wasn't enough data to get statistically significant predictions.)
When an IG post has a ManyChat keyword in it, two prediction models run on it, so you get two predictions for one post.
This allows you to predict how the post will convert and how much reach it will get. Then you can plot the post on a grid.
Predicted virality on one axis. Predicted conversion on the other.
Quiet but converts. Worth amplifying if you wanted reach too.
Both numbers strong. Whatever your goal, this is going to land.
Weak on both axes. Rework if you wanted either signal.
Big reach, few triggers. Great for top-of-funnel awareness. A trap only if you were trying to convert.
What the percentages mean: when the model puts a draft in this quadrant, how often was it actually in this quadrant on data it had never seen? Above 25% beats random chance.
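Here's a small sketch of the grid logic and of how a quadrant percentage like that can be computed. It assumes the axes are cut at the 50th percentile, which is what makes 25% the random-chance baseline with four boxes. The label names and sample lists are made up.

```python
from typing import List, Optional


def quadrant(reach_pct: float, conversion_pct: float, cutoff: float = 50.0) -> str:
    """Place a post on the grid from its two predicted percentiles."""
    reach = "high_reach" if reach_pct >= cutoff else "low_reach"
    conversion = "high_conversion" if conversion_pct >= cutoff else "low_conversion"
    return f"{reach}/{conversion}"


def quadrant_precision(predicted: List[str], actual: List[str], target: str) -> Optional[float]:
    """Of the posts the model put in `target`, what share really landed there?"""
    hits = [a == target for p, a in zip(predicted, actual) if p == target]
    return sum(hits) / len(hits) if hits else None


trap = quadrant(reach_pct=95, conversion_pct=1)
predicted = [trap, trap, "low_reach/low_conversion", trap]
actual = [trap, "high_reach/high_conversion", trap, trap]
print(trap, quadrant_precision(predicted, actual, trap))
```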
Let's take an actual IG post as an example:
Olivia hosted a women's investing retreat in Sedona last spring. The reel she posted afterward opened with this caption:
"I just hosted a retreat in Sedona and I am still sitting in the afterglow
☀ Comment LEAP and I'll DM you the details on the Leap Year Minimind."
The LEAP keyword was the ManyChat trigger for a Leap Year Minimind launch sequence. The objective of the post was to drive conversion to that launch.
When I ran the prediction model on this post, two predictions came back.
Predicted virality was in the top 5% for the cohort. Predicted conversion was in the bottom 1%.
How did the post actually perform?
Reach: 3,407. Likes: 137. ManyChat triggers: zero.
The model called it. High reach, no conversion. Lots of likes, no leads.
How often is the model right about which box a draft lands in?
On the most recent 20% of posts, which the model never saw during training, its high-reach, low-conversion calls were right 49.1% of the time, against a 25% random-chance baseline.
That's statistically significant precision, meaning the model reliably flags when something needs work. But of course, it's the human's call to push publish in the end.
Imagine writing your next piece of copy inside an editor that scores it as you type. The feedback would look like this:
Remember how LightGBM uses the spreadsheet row of 100+ variables to predict performance for a post?
A standard technique called SHAP cracks the prediction open and tells you which features pushed the score up, which pulled it down, and by how much.
For example, say you have a draft and you run the prediction model on it.
The IG conversion model scores it 0.65.
SHAP breaks that 0.65 score into parts.
Basically, LightGBM gives you a prediction. SHAP gives you the WHY behind the prediction.
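The idea underneath SHAP is the Shapley value: average each feature's contribution over every order in which features could be "switched on." Here's a brute-force sketch on a made-up three-feature model. The SHAP library uses much faster methods for tree models, and the toy model, feature names, and numbers are all hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict

Features = Dict[str, float]


def shapley_values(predict: Callable[[Features], float],
                   row: Features, baseline: Features) -> Dict[str, float]:
    """Exact Shapley values by trying every feature ordering (tiny inputs only)."""
    names = list(row)
    totals = {name: 0.0 for name in names}
    orders = list(permutations(names))
    for order in orders:
        current = dict(baseline)          # start from the "average post"
        previous = predict(current)
        for name in order:
            current[name] = row[name]     # switch this feature to the draft's value
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orders) for name, total in totals.items()}


def toy_model(f: Features) -> float:
    """Made-up scoring rule with one interaction, standing in for a real model."""
    return (0.2 + 0.3 * f["has_keyword"]
            + 0.1 * f["has_keyword"] * f["is_lead_magnet"]
            - 0.01 * f["hashtag_count"])


draft = {"has_keyword": 1, "is_lead_magnet": 1, "hashtag_count": 9}
average_post = {"has_keyword": 0, "is_lead_magnet": 0, "hashtag_count": 3}
parts = shapley_values(toy_model, draft, average_post)
print({name: round(value, 3) for name, value in parts.items()})
```

The useful property is that the parts always add up: baseline prediction plus all contributions equals the draft's prediction.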
Run the same process across every post the model has seen, and you get the features it leans on most. Here's the aggregated view for IG conversion:
What the post is trying to do
Lead-magnet-promo posts drive triggers. Lifestyle posts don't. Dominant by a wide margin.
Caption literally contains a ManyChat trigger word (partial confound)
Captions without the keyword obviously can't trigger ManyChat. The model is partially learning the rule, not the cause.
Which product or offer the post is promoting
The specific offer in the post pushes the prediction down on lower-converting products. Different offers convert at different rates.
How many hashtags the caption uses
Denser hashtag stacks correlate with lower conversion. Reads as desperation, not signal.
How heavily the caption uses the word "to"
Instruction-heavy chains ("click the link to grab the guide to start your...") convert worse than plain talk.
This opens the door to a powerful source of feedback.
You type a draft. The performance model scores it as you write. SHAP runs underneath and surfaces feedback line by line.
And even better, you can easily hook this up to an AI agent to help with planning, ideation, testing, writing, and beyond!
A/B Testing Systems with AI & Machine Learning
Turning tests into a compounding map of what actually causes performance.
Most marketing teams are running A/B tests, but are they running real, RIGOROUS testing systems?
Most people run tests on "hunches" and, when they're done, they end up with "hunches" about what worked or what didn't work.
Hunches don't survive or grow, and a year later, you still don't know what variables influence outcomes.
I want to build a system that fixes that.
An A/B test shouldn't get you a "winning variant". It should get you a more confident answer to a question like, "What variable(s) about that subject line CAUSED it to work?"
Not a "hunch". Not a fluke. A real root cause.
Because the closer you get to discovering root causes, the more results compound.
Below is the system I'm planning to build, the math underneath it, why I'm confident it works, and what it unlocks.
It's a closed loop. Five pieces hand off to each other in order, and the last one feeds back into the first, so the whole system gets a little more confident every cycle.
One more piece sits just outside the loop and feeds it: the hypothesis generator. This is a LightGBM model that reads factor importance across your entire copy archive and proposes brand-new variables worth testing, seeding fresh rows into the store for the loop to chew on.
The output of this isn't standalone "winning variants." Why? Because winning variants are isolated, and they can't be consistently repeated.
Instead, this testing system is a map of cause-and-effect that grows more confident every week. The result is consistently high-performing variants, with the guesswork removed.
1 · Hypothesis Store
the spreadsheet: beliefs + confidence cols
2 · Test Designer
ranks by impact-weighted EIG → which test next?
3 · Variant Gate
control + quality checks
4 · Test Runs
the actual send
5 · Bayesian Updater
tightens CI on tested row + pooled neighbors
+ LightGBM Generator
factor importance → new hypotheses (feeds new rows into the store)
The hypothesis store is a literal spreadsheet that can be imagined to look like this:
| Hypothesis | Best estimate | Confidence | Evidence |
|---|---|---|---|
| Curiosity-gap subjects lift opens vs benefit-forward | +6% | ± 4 pp | 18 tests, ~90K sends |
| Emoji in subject lifts opens | +2% | ± 3 pp | 9 tests, ~41K sends |
| Question subjects lift clicks vs declarative | +11% | ± 2 pp | 24 tests, ~122K sends |
| Sender-name personalization lifts opens | +1% | ± 5 pp | 4 tests, ~18K sends |
| Three-word subject lines lift opens | ? | ± large | 0 tests, untested |
Each row is a claim worth testing.
The best estimate column shows the estimated impact of that claim.
The confidence column quantifies how unsure we are about the estimate (is it a coincidence or a root cause?).
The evidence column tells us how much data we've gathered so far, which shows how far we are from a statistically significant conclusion about that claim.
The job of the system is to:
The math underneath is called a Bayesian hierarchical model with partial pooling.
Put simply: it's the spreadsheet above organized so the rows can talk to each other.
With every test, it runs the math and updates every column automatically.
The "pooling" aspect is cool enough to explain.
Let's say you set out to test the "curiosity-gap on warm subscribers" hypothesis. The results affect that claim most, but "pooling" means they also subtly nudge related claims, like "curiosity gap on cold subscribers."
This means, even if you're sending to smaller segments (which is good practice), your results can still have a significant effect since every test is doing work for every related claim, forever.
You test one claim, and the evidence ripples out to its relatives.
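Here's a heavily simplified sketch of that ripple. It is not the full Bayesian hierarchical model: it assumes normal distributions and a fixed, hand-picked `tau` (how much related claims are allowed to differ), where a real model would estimate that from data. The claim names and numbers are made up.

```python
import math
from typing import Dict, Tuple

Claim = Tuple[float, float]  # (estimated lift, standard error)


def partially_pool(claims: Dict[str, Claim], tau: float) -> Dict[str, Claim]:
    """Shrink each claim toward its family average; noisy claims shrink more."""
    weights = {k: 1.0 / (se ** 2 + tau ** 2) for k, (_, se) in claims.items()}
    family_mean = (sum(weights[k] * est for k, (est, _) in claims.items())
                   / sum(weights.values()))
    pooled = {}
    for name, (est, se) in claims.items():
        precision = 1 / se ** 2 + 1 / tau ** 2
        mean = (est / se ** 2 + family_mean / tau ** 2) / precision
        pooled[name] = (mean, math.sqrt(1 / precision))
    return pooled


# Made-up family: one well-tested claim, one barely-tested sibling.
claims = {"curiosity_gap_warm": (6.0, 1.0), "curiosity_gap_cold": (1.0, 5.0)}
pooled = partially_pool(claims, tau=2.0)
for name, (lift, uncertainty) in pooled.items():
    print(name, round(lift, 2), round(uncertainty, 2))
```

In this toy run the barely-tested sibling moves a lot toward its well-tested relative and its uncertainty shrinks, while the well-tested claim barely moves.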
Say we run a test on our claim about curiosity-gap on warm subscribers. Here's what that single test does to four rows in the hypothesis store. The green part of each bar is how unsure we still are. As it grows shorter, we become more confident about that claim.
Big jump in confidence. This is the row you actually tested.
Nudged a little. A curiosity-gap sibling, so it borrows some of the evidence.
Nudged a little. Same family, different audience, but still related.
It's like compounding interest, but for marketers!
While we're building this testing system, we might as well squeeze as much as we can out of it. Why not have it surface new hypotheses it thinks could work?
That's where LightGBM comes in. This AI model reads every email (or any other piece of copy) you've ever sent and learns which features (subject length, brand-voice consistency, emotional sentiment, emoji density, opening word, P.S., etc.) correlate with which outcomes.
And because it has access to all of those variables and data about how they predict performance, it can discover hypotheses worth testing through making new variable combinations.
LightGBM scoring variable combinations
personalization_token keeps appearing in high-scoring combinations. That becomes a hypothesis worth testing.
It's able to make a test suggestion like, "subject length under six words seems predictive of opens."
And when you combine the automatic hypothesis discovery and the "pooling" features, this system will collect data on this hypothesis in the background while you run other tests.
COOL!
"Okay, but that sounds like A LOT of tests, man... Where do I even start?"
No worries. I got you!
Prioritize the biggest potential impact hypothesis first. Narrow in from there.
We can calculate that with a formula that looks something like this:
The priority formula
expected_impact: size of the lift if the test confirms.
future_sends_affected: how many future sends this answer applies to.
EIG: Expected Information Gain. How much sharper the answer gets.
sends_required: Sends left until we can come to a statistically significant conclusion.
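As a rough sketch, those four terms can be combined as payoff divided by cost. To be clear, the exact way they combine below (multiply the three benefit terms, divide by `sends_required`) is an assumption for illustration, and the two hypotheses and their numbers are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    name: str
    expected_impact: float       # size of the lift if the test confirms
    future_sends_affected: int   # how many future sends the answer applies to
    eig: float                   # Expected Information Gain
    sends_required: int          # sends left until a confident conclusion


def priority(h: Hypothesis) -> float:
    """ASSUMED form: (impact x reach x information) per send it costs to learn."""
    return h.expected_impact * h.future_sends_affected * h.eig / h.sends_required


def rank(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Highest-priority test first."""
    return sorted(hypotheses, key=priority, reverse=True)


curiosity = Hypothesis("curiosity-gap subjects", 0.06, 100_000, 0.5, 10_000)
emoji = Hypothesis("emoji in subject", 0.01, 100_000, 0.9, 30_000)
for h in rank([emoji, curiosity]):
    print(h.name, round(priority(h), 3))
```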
Worked example
If we're running an A/B test, every proposed A/B pair has to pass two checks before it goes out:
"Bandits" are tools that dynamically shift traffic toward whichever variant is winning DURING a test. They're great at squeezing every last conversion out of this week's campaign. This sounds awesome at first...
However, bandits can have a negative effect on testing programs.
Bandits starve losing variants of traffic which kills the precision needed to build a causal claim. Bandits learn which arm pays. They don't care about why.
A true detective tester wants to know which variant pays AND why. And the only way to make a real causal conclusion is to get enough evidence (losing variant data).
Bandits enable this week's revenue, but they kill the compounding effect of next year's learning.
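You can see the starvation effect in a quick simulation. This sketch uses Thompson sampling, which is one common bandit strategy (not necessarily what any given tool uses), with made-up conversion rates. It compares how many sends the losing variant gets under the bandit versus a plain 50/50 split.

```python
import math
import random
from typing import List


def thompson_allocation(true_rates: List[float], total_sends: int, seed: int = 0) -> List[int]:
    """Simulate a Thompson-sampling bandit; return sends given to each variant."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(total_sends):
        # Draw a plausible conversion rate per variant, send to the best draw.
        draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]


def standard_error(rate: float, sends: int) -> float:
    """Uncertainty of a measured rate; fewer sends means a fuzzier answer."""
    return math.sqrt(rate * (1 - rate) / sends) if sends else float("inf")


rates = [0.05, 0.10]  # made-up conversion rates: variant 0 is the loser
bandit_sends = thompson_allocation(rates, total_sends=4000)
even_sends = [2000, 2000]
for label, sends in (("bandit", bandit_sends), ("50/50 split", even_sends)):
    print(label, sends, round(standard_error(rates[0], sends[0]), 4))
```

For simplicity the uncertainty here is computed from the true rate rather than the measured one. The point is only that the loser's estimate stays much fuzzier when the bandit cuts off its traffic.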
I haven't shipped this. So why should you (or I) trust the design? Three reasons.
The math is settled. Bayesian hierarchical models, gradient boosting, partial pooling, and Expected Information Gain are all decades-old, well-tested techniques.
The novel part is intimately applying them to a company's marketing assets, sticking with it over time so it compounds, and using the data as context that can be leveraged by AI for future copy generation.
The shift doesn't come from any singular component.
It's what happens when the full testing loop runs for a full year.
Every send becomes a data point. There's no "throwaway" campaign anymore, because even routine sends quietly sharpen the hypothesis store. Twelve months in, you have a growing map of cause and effect instead of a folder of non-scalable "winning variants".
At enterprise scale, this testing system creates institutional memory. All of the testing knowledge survives a marketer changing roles or a quarter ending. One team can read what another already proved and skip the test.
It also fixes the opportunity cost problem. Since we test the highest impact claims first, wasted sends drop, meaning we waste less revenue. And because this is structured data, it feeds straight into whatever AI writes your copy next.