AI Writing Systems

Four case study articles about how I've used artificial intelligence and machine learning to strengthen my copywriting process and create marketing knowledge systems that scale.

The goal of each is to empower AI with unique context, so marketing strategy is more objective, while leaving room for invaluable human input.

Predicting Copy Performance with Machine Learning

Forecasting how a draft is likely to perform before it ships.

The performance of every piece of copy (email, social posts, etc.) can feel like a gamble.

Maybe it's based on hunches you've had from past tests, maybe it's based on data, maybe it's based on something else...

The reality is that, until you publish a piece and get the numbers back, it's basically a guess.

But what if you could make a quantifiable prediction of how well a piece would perform BEFORE publishing it?

Something that says, "This will go viral and get around 1M views, but it won't move people further down the funnel."

If we knew that, it would be the marketing equivalent of time travel. Before we publish anything, we could go back and tweak it until it's optimized for the right thing.

How freakin' cool would that be?!

So that's what I tried to build here!

If you can't figure out what variables in your copy affect outcomes, this project is for you.

We start with two questions

Every piece of copy should optimize for ONE goal at a time. The way you find that goal is by asking, "What one action do we want the reader to take?"

If we want more leads coming in top of funnel, our goal should be to optimize for virality and follower counts.

If we want leads going deeper into the funnel, we need to optimize for opt-ins, booked calls, etc.

Therefore, we need this performance predictor to differentiate between goals.

So I started by asking two questions.

Question 1: "Will this piece go viral, but not convert?"

Question 2: "Will the audience actually take the action the copy asks them to take?"

To train the performance predictor on this, we split up pieces of copy based on their goal. Did it include a CTA to go deeper into the funnel? Or was it optimizing for reach?

Illustrative

Post A: high engagement · low CTA conversion
“Costa Rica villa tour: what living abroad actually looks like at 32”
Engagement: 88 / 100
CTA conversion: 14 / 100
Outcome: went viral, but low conversion.

Post B: low engagement · high CTA conversion
“Comment FREEDOM and I’ll DM you the rate sheet I used to close my first duplex.”
Engagement: 22 / 100
CTA conversion: 82 / 100
Outcome: not much reach, but massive action taking.

Once you have both numbers, you can do something no off-the-shelf software does. You can determine which variables contribute to reach, which contribute to conversion, and where they intersect.

But HOW? What makes all this possible?

What does the model actually do?

To do this, we're using a machine learning model called LightGBM.

To start, the human creates variables that they believe might or might not affect the performance of copy. Then a feature-extraction step (which runs before LightGBM ever sees the data) builds a spreadsheet with a column for every variable and a row for every piece of copy.

Then you let it loose to fill out every single column for every single piece of copy.

To give you an idea, here's a sample row for one IG post:

Row for one IG post
  caption_length              : 187
  hashtag_count               : 9
  has_manychat_keyword        : 1
  sentiment_arc_p1            : -0.12
  sentiment_arc_p10           : +0.34
  style_distance_to_centroid  : 0.41
  fw_to_ratio                 : 0.018
  dim_offer                   : "lead-magnet"
  ... about 100 more like this
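To make the "one row per piece of copy" idea concrete, here's a minimal sketch of how a few of those columns could be filled in from a raw caption. The function name `extract_features`, the `trigger_words` argument, and the reading of `fw_to_ratio` as "share of words that are 'to'" are assumptions for illustration, not the pipeline actually used here.

```python
import re

def extract_features(caption: str, trigger_words: set[str]) -> dict:
    """Turn one caption into a flat row of numeric features."""
    words = re.findall(r"[A-Za-z']+", caption)
    lowered = [w.lower() for w in words]
    # Case-sensitive match: trigger words are assumed to be written in caps.
    has_keyword = any(w in trigger_words for w in words)
    return {
        "caption_length": len(caption),
        "hashtag_count": len(re.findall(r"#\w+", caption)),
        "has_manychat_keyword": int(has_keyword),
        # Assumed reading of fw_to_ratio: share of words that are "to".
        "fw_to_ratio": lowered.count("to") / len(lowered) if lowered else 0.0,
    }

row = extract_features(
    "Comment LEAP and I'll DM you the details. #investing #retreat",
    trigger_words={"LEAP"},
)
print(row)
```

Columns like the sentiment arc or style distance would need their own models, so they're left out of this sketch.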

I decided that I wanted to be able to predict 6 different copywriting objectives across IG, YouTube, Threads, and Email.

Here's the list:

Model	What it predicts	Trained on
IG conversion	ManyChat trigger volume	IG posts that contain a keyword
IG engagement (ManyChat keyword)	likes + saves + shares	same posts as above
IG engagement (no ManyChat keyword)	likes + saves + shares	IG posts without a keyword
Email open rate	open %	Brent-authored emails only
YouTube engagement	engagement composite	YouTube videos
Threads engagement	engagement composite	Threads posts

After splitting all of the samples along these lines, and filling out the spreadsheet for each piece of copy, it's time to start training this thing!

What does "train" even mean?

Think of it like studying with flashcards (where each piece of copy is a single flashcard).

You take a stack of past pieces of copy whose outcomes are already known.
You show the algorithm one piece of copy (one row from the spreadsheet).
You ask it to guess the outcome.
You tell it the right answer (based on your real performance data).
You let it adjust, so its next guess on that kind of post will be a hair closer to correct.
Then you do it again. Then again. Thousands of times across all past posts.

Eventually the algorithm gets pretty good at guessing the outcome on posts it's already seen, and (if you set it up right) on posts it hasn't.

But it's not really flashcards.

Under the hood, this algorithm uses something called "decision trees".

A decision tree is a short flowchart of yes/no questions about variables, ending in a guess. Here's one the model might actually build for IG conversion:

One tree the model built (illustrative)

  Does the caption have a ManyChat keyword?
    No  → guess: 4 ManyChat triggers
    Yes → How many hashtags?
            ≤ 9   → guess: 12 triggers
            > 9   → How long is the caption?
                      ≤ 220 chars → guess: 8 ManyChat triggers
                      > 220 chars → guess: 6 ManyChat triggers
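Written as code, that illustrative tree is just a few nested if-statements. This is a hand-written sketch of the flowchart above (the function name `guess_triggers` is made up), not something exported from the trained model.

```python
def guess_triggers(has_keyword: bool, hashtag_count: int, caption_length: int) -> int:
    """Walk the illustrative tree from top to bottom and return its guess."""
    if not has_keyword:
        return 4
    if hashtag_count <= 9:
        return 12
    if caption_length <= 220:
        return 8
    return 6

# The sample IG row from earlier: keyword present, 9 hashtags, 187 characters.
print(guess_triggers(True, 9, 187))  # 12
```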

The power behind LightGBM and these decision trees is that they build on each other's learning.

For example, the first tree makes a guess and finds it was off by 10. Tree 2 focuses on figuring out what Tree 1 got wrong. Over time, this process gets closer and closer to finding the patterns that make predictions more accurate.

This is called "Gradient Boosting" (hence the "GB" in LightGBM).
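Here's a toy sketch of that "each tree fixes the last tree's mistakes" idea in plain Python. It uses one-question trees (stumps) on a single variable and made-up numbers, so treat it as a picture of the technique rather than how LightGBM is implemented internally. The names `fit_stump` and `boost`, the learning rate, and the data are all assumptions.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        err = sum((r - (lv if x <= t else rv)) ** 2 for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Start from the average, then let each new stump correct the remaining error."""
    base = mean(ys)
    trees = []
    preds = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still wrong
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)

# Hypothetical data: hashtag counts and the ManyChat triggers each post got.
hashtags = [2, 4, 6, 9, 11, 14, 18, 22]
triggers = [13, 12, 12, 11, 8, 7, 6, 5]

model = boost(hashtags, triggers)
error_before = sum((y - mean(triggers)) ** 2 for y in triggers)
error_after = sum((y - model(x)) ** 2 for x, y in zip(hashtags, triggers))
print(error_before, error_after)
```

The error after boosting is smaller than the error from guessing the average every time, which is the whole point of stacking trees.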

How do we know LightGBM is actually working though?

We also trained a simpler model, called ElasticNet, on every dataset as a sanity check. ElasticNet draws straight-line rules (every extra word does X to the score, that kind of thing).

LightGBM beat it on every one of the six datasets.

That's because LightGBM successfully catches the complex patterns where performance depends on multiple factors in any given piece.

And to make sure those results weren't a fluke, I ran the LightGBM process on each copywriting objective 15 times, each time testing on a different slice of the data (15-fold cross-validation).

The ± number next to each score (something like ±0.025) is the spread across those 15 runs. A small number means a smaller spread and a higher confidence in performance. A bigger number means a wider spread and less confidence.

The results are in!

Spearman rank correlation · 15-fold CV (0 = random · 1 = perfect ranking)

Model	Notes	Spearman	CV spread
IG conversion	keyword path	0.708	±0.025
Email open rate	single-author corpus	0.671	±0.078
IG engagement	non-keyword path	0.645	±0.058
Email click rate	skipped: target too noisy at n=184		
YouTube engagement		0.371	±0.045
Threads engagement		0.236	±0.038
IG keyword-path engagement	weak on purpose; kept because the divergence detector needs both signals to find the trap	0.080	±0.089

Scale: ≥ 0.6 strong · 0.4–0.6 useful · 0.2–0.4 weak · < 0.2 essentially noise

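The top three are strong. A Spearman score isn't a hit rate: it measures how closely the model's ranking of posts matches how they actually ranked. At roughly 0.65 to 0.71, the model puts higher-performing copy near the top of the list far more often than chance would, which is good enough to be useful in production.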

Middle two (YouTube and Threads) aren't quite there yet, but I honestly haven't spent as much time on them.

The bottom one (IG keyword-path engagement at 0.08) is weak ON PURPOSE. We can use this as a way to predict what DOESN'T perform.

(Also, I left out the email click rate because there just wasn't enough data to get statistically significant predictions.)
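If "Spearman rank correlation" is new to you, here's a small from-scratch sketch of what that score measures: rank the predictions, rank the real outcomes, and see how well the two rankings agree. The five posts below are invented numbers, and in practice a library routine would do this for you.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(a, b):
    """Plain correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(predicted, actual):
    """Spearman is just Pearson correlation computed on the ranks."""
    return pearson(ranks(predicted), ranks(actual))

predicted = [0.9, 0.7, 0.4, 0.2, 0.1]  # hypothetical model scores
actual = [120, 95, 30, 40, 5]          # hypothetical real outcomes
print(round(spearman(predicted, actual), 3))  # 0.9
```

The model only swapped the third and fourth posts, so the ranking is nearly perfect and the score lands at 0.9.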

What two predictions together can tell you

When an IG post has a ManyChat keyword in it, two prediction models run on it, so you get two predictions for one post.

This allows you to predict how the post will convert and how much reach it will get. Then you can plot the post on a grid.

Predicted virality on one axis. Predicted conversion on the other.

Quadrant	Virality	Conversion	Precision on held-out data	What it means
Sleeper conversion	low	high	7.3%	Quiet but converts. Worth amplifying if you wanted reach too.
Gold	high	high	30.8%	Both numbers strong. Whatever your goal, this is going to land.
Rework	low	low	62.1%	Weak on both axes. Rework if you wanted either signal.
Engagement trap	high	low	49.1%	Big reach, few triggers. Great for top-of-funnel awareness. A trap only if you were trying to convert.

What the percentages mean: when the model puts a draft in this quadrant, how often was it actually in this quadrant on data it had never seen? Above 25% beats random chance.
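A rough sketch of how a draft could be dropped into one of the four boxes, and how the precision number is counted. It assumes both predictions arrive as percentile scores between 0 and 1 and that the dividing line sits at the median (0.5). Both of those are assumptions for illustration, as are the function names.

```python
def quadrant(virality: float, conversion: float, cutoff: float = 0.5) -> str:
    """Map two percentile scores (0 to 1) onto the four-box grid."""
    high_v, high_c = virality >= cutoff, conversion >= cutoff
    if high_v and high_c:
        return "Gold"
    if high_v:
        return "Engagement trap"
    if high_c:
        return "Sleeper conversion"
    return "Rework"

def precision(predicted, actual, label):
    """Of the drafts predicted to be `label`, what share really were?"""
    hits = [a for p, a in zip(predicted, actual) if p == label]
    return sum(a == label for a in hits) / len(hits) if hits else float("nan")

# A draft predicted in the top 5% for virality and the bottom 1% for conversion.
print(quadrant(0.95, 0.01))  # Engagement trap

# Four hypothetical held-out posts: predicted box vs. the box they landed in.
predicted = ["Engagement trap", "Engagement trap", "Gold", "Rework"]
actual = ["Engagement trap", "Gold", "Gold", "Rework"]
print(precision(predicted, actual, "Engagement trap"))  # 0.5
```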

Let's take an actual IG post as an example:

Olivia hosted a women's investing retreat in Sedona last spring. The reel she posted afterward opened with this caption:

"I just hosted a retreat in Sedona and I am still sitting in the afterglow
☀ Comment LEAP and I'll DM you the details on the Leap Year Minimind."

The LEAP keyword was the ManyChat trigger for a Leap Year Minimind launch sequence. The objective of the post was to drive conversion to that launch.

When I ran the prediction model on this post, two predictions came back.

Predicted virality was in the top 5% for the cohort. Predicted conversion was in the bottom 1%.

How did the post actually perform?

Reach: 3,407. Likes: 137. ManyChat triggers: zero.

The model called it. High reach, no conversion. Lots of likes, no leads.

How often is the model right about which box a draft lands in?

On the most recent 20% of posts, which the model never saw during training, its high-reach, low-conversion calls were correct 49.1% of the time, against a 25% random-chance baseline.

That's roughly double what chance would give you, so the flag is a useful signal that something needs work. But of course, it's the human's call to push publish in the end.

The next layer: a live copy editor

Imagine writing your next piece of copy inside an editor that scores it as you type. The feedback would look like this:

Your first sentence is longer than 90% of high-performers in this cohort.
Your emotional sentiment dips negative around sentence 7. High-performers hold positive longer.
Adding an emoji here historically lifted pieces like this by 0.4 points.
Your hashtag count is one above the high-performer median for promotional reels.

How is that feedback possible?

Remember how LightGBM uses the spreadsheet row of 100+ variables to predict performance for a post?

A standard technique called SHAP cracks the prediction open and tells you which features pushed the score up, which pulled it down, and by how much.

For example, say you have a draft and you run the prediction model on it.

The IG conversion model scores it 0.65.

SHAP breaks that 0.65 score into parts.

+0.18 came from the post's purpose tag (lead-magnet promo)
+0.07 from a ManyChat trigger word in the caption
-0.04 from the hashtag count being one too high

Basically, LightGBM gives you a prediction. SHAP gives you the WHY behind the prediction.
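The key property of SHAP is that the pieces add up: a baseline plus every feature's contribution equals the prediction. Here's that arithmetic for the 0.65 example, using only the three contributions listed above. The remainder is whatever the baseline and the unlisted features account for. It's computed here, not a number from the real model.

```python
prediction = 0.65
contributions = {
    "purpose tag (lead-magnet promo)": +0.18,
    "ManyChat trigger word in caption": +0.07,
    "hashtag count one too high": -0.04,
}

listed = sum(contributions.values())
# Whatever is left over belongs to the baseline plus all unlisted features.
remainder = prediction - listed

# Print the biggest movers first, then the leftover.
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{value:+.2f}  {name}")
print(f"{remainder:+.2f}  baseline plus every feature not listed")
```

So the three named features explain 0.21 of the score, and the other 0.44 comes from the model's starting point and the remaining variables.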

Run the same process across every post the model has seen, and you get the features it leans on most. Here's the aggregated view for IG conversion:

The IG conversion model's top 5 features

↑ pushes prediction toward HIGH conversion · ↓ pushes toward LOW

Rank 1 (↑): What the post is trying to do
Lead-magnet-promo posts drive triggers. Lifestyle posts don't. Dominant by a wide margin.

Rank 2 (↑): Caption literally contains a ManyChat trigger word (partial confound)
Captions without the keyword obviously can't trigger ManyChat. The model is partially learning the rule, not the cause.

Rank 3 (↓): Which product or offer the post is promoting
The specific offer in the post pushes the prediction down on lower-converting products. Different offers convert at different rates.

Rank 4 (↓): How many hashtags the caption uses
Denser hashtag stacks correlate with lower conversion. Reads as desperation, not signal.

Rank 5 (↓): How heavily the caption uses the word "to"
Instruction-heavy chains ("click the link to grab the guide to start your...") convert worse than plain talk.

These are the features the model leans on most. Number two is partially a confound. Captions without the trigger word obviously can't trigger ManyChat. The other four are doing real work.

This opens the door to a powerful source of feedback.

You type a draft. The performance model scores it as you write. SHAP runs underneath and surfaces feedback line by line.

And even better, you can easily hook this up to an AI agent to help with planning, ideation, testing, writing, and beyond!

A/B Testing Systems with AI & Machine Learning

Turning tests into a compounding map of what actually causes performance.

Most marketing teams are running A/B tests, but are they running real, RIGOROUS testing systems?

How do you choose what to A/B test?
Is the test designed to measure the right thing?
How statistically significant were the results?

Most people run tests on "hunches" and, when they're done, they end up with "hunches" about what worked or what didn't work.

Hunches don't survive or grow, and a year later, you still don't know what variables influence outcomes.

I want to build a system that fixes that.

An A/B test shouldn't get you a "winning variant". It should get you a more confident answer to a question like, "What variable(s) about that subject line CAUSED it to work?"

Not a "hunch". Not a fluke. A real root cause.

Because the closer you get to discovering root causes, the more results compound.

Below is the system I'm planning to build, the math underneath it, why I'm confident it works, and what it unlocks.

The building blocks of rigorous testing

It's a closed loop. Five pieces hand off to each other in order, and the last one feeds back into the first, so the whole system gets a little more confident every cycle.

The hypothesis store
Holds every claim about your copy ("curiosity-gap subjects lift opens, +6%") with a confidence number on each one ("± 4 percentage points"). This is where the loop starts and ends.
The test designer
Reads the hypothesis store and ranks every claim by impact, expected information gain, and how many future sends it affects. Then it picks the one test that would teach you the most.
The variant gate
Builds the A/B pair for that test and checks it against a control and against Olivia's voice, so you know the send is actually testing what you think it's testing.
The test runs
The actual send. The audience votes with their opens and clicks, and the result comes back as an outcome.
The Bayesian updater
Takes that outcome, tightens the confidence interval on the claim you just tested, and pools the evidence across sibling hypotheses. Then it writes the sharpened numbers back into the store, closing the loop.

One more piece sits just outside the loop and feeds it: the hypothesis generator. This is a LightGBM model that reads factor importance across your entire copy archive and proposes brand-new variables worth testing, seeding fresh rows into the store for the loop to chew on.

The output of this isn't standalone "winning variants." Why? Because winning variants are isolated, and they can't be consistently repeated.

Instead, this testing system is a map of cause-and-effect that grows more confident every week. The result is consistently high-performing variants, with the guesswork removed.

1 · Hypothesis Store: the spreadsheet (beliefs + confidence columns)
2 · Test Designer: ranks by impact-weighted EIG to decide which test runs next
3 · Variant Gate: control + quality checks
4 · Test Runs: the actual send
5 · Bayesian Updater: tightens the CI on the tested row and its pooled neighbors
+ LightGBM Generator: factor importance → new hypotheses (feeds new rows into the store)

How the spreadsheet works

The hypothesis store is a literal spreadsheet. Picture something like this:

Hypothesis	Best estimate	Confidence	Evidence
Curiosity-gap subjects lift opens vs benefit-forward	+6%	± 4 pp	18 tests ~90K sends
Emoji in subject lifts opens	+2%	± 3 pp	9 tests ~41K sends
Question subjects lift clicks vs declarative	+11%	± 2 pp	24 tests ~122K sends
Sender-name personalization lifts opens	+1%	± 5 pp	4 tests ~18K sends
Three-word subject lines lift opens	?	± large	0 tests untested

Each row is a testable claim. Each confidence column is how wrong the current answer might still be. The job of the system is to make those columns narrower.

Each row is a claim worth testing.

The best estimate column holds the current estimate of that claim's effect.

The confidence column quantifies how unsure we are about the estimate (is it a coincidence or a root cause?).

The evidence column tells us how much data has been gathered so far, which determines how close we are to a statistically significant conclusion about that claim.

The job of the system is to:

Rank which tests would have the largest impact
Increase the confidence of our hypotheses one test at a time
Collect and update performance data in real time (the loop)
Explore new variables and combinations to discover new tests

The math underneath is called a Bayesian hierarchical model with partial pooling.

Put simply: it's the spreadsheet above organized so the rows can talk to each other.

With every test, it runs the math and updates every column automatically.

The "pooling" aspect is cool enough to explain.

Let's say you set out to test the "curiosity-gap on warm subscribers" hypothesis. The results affect that claim most, but "pooling" means they also subtly nudge related claims, like "curiosity-gap on cold subscribers".

This means, even if you're sending to smaller segments (which is good practice), your results can still have a significant effect since every test is doing work for every related claim, forever.

Illustrative

You test one claim, and the evidence ripples out to its relatives.

Say we run a test on our claim about curiosity-gap on warm subscribers. Here's what that single test does to four rows in the hypothesis store. The green part of each bar is how unsure we still are. As it grows shorter, we become more confident about that claim.

Row	Relation to the test	Confidence	What happened
Curiosity-gap on warm subscribers	You tested this	±4 pp → ±2.5 pp	Big jump in confidence. This is the row you actually tested.
Curiosity-gap with question framing	Same family	±4 pp → ±3.7 pp	Nudged a little. A curiosity-gap sibling, so it borrows some of the evidence.
Curiosity-gap on cold subscribers	Same family	±5 pp → ±4.5 pp	Nudged a little. Same family, different audience, but still related.
Emoji in subject lifts opens	Unrelated	±3 pp → ±3 pp	Doesn't move. A different topic, so it gets none of the evidence.
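For the curious, here's a simplified sketch of why the tested row tightens a lot and its siblings only a little. It treats every ± as proportional to a standard deviation and combines normal estimates by adding their precisions. A full hierarchical model does more than this. The two inputs (`test_hw` and `family_spread`) are hypothetical values picked so the output lands near the illustration above.

```python
import math

def tighten(prior_hw: float, evidence_hw: float) -> float:
    """Combine two independent normal estimates: precisions (1 / width^2) add."""
    return 1 / math.sqrt(1 / prior_hw**2 + 1 / evidence_hw**2)

def evidence_for_sibling(test_hw: float, family_spread: float) -> float:
    """Evidence from the tested claim is noisier when applied to a sibling claim."""
    return math.sqrt(test_hw**2 + family_spread**2)

test_hw = 3.2        # hypothetical: the new test alone pins the lift to ±3.2 pp
family_spread = 9.0  # hypothetical: how much sibling claims can differ from each other

tested = tighten(4.0, test_hw)
sibling = tighten(4.0, evidence_for_sibling(test_hw, family_spread))
unrelated = 3.0      # different family, so no evidence is shared

print(round(tested, 1), round(sibling, 1), unrelated)  # 2.5 3.7 3.0
```

The larger `family_spread` is, the less a sibling borrows. That single knob is what "partial" pooling means in this sketch.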

It's like compounding interest, but for marketers!

How new hypotheses are discovered

While we're building this testing system, we might as well squeeze as much as we can out of it. Why not have it surface new hypotheses it thinks could work?

That's where LightGBM comes in. This machine learning model trains on every email (or any other piece of copy) you've ever sent and learns which features (subject length, brand-voice consistency, emotional sentiment, emoji density, opening word, P.S., etc.) correlate with which outcomes.

And because it has access to all of those variables and data about how they predict performance, it can surface hypotheses worth testing by scoring new combinations of variables.

Illustrative

LightGBM scoring variable combinations

Combination 01 (3 features)

curiosity_gap + subject_under_30_chars + personalization_token

Pred. score 0.79

Combination 02 (3 features)

personalization_token + urgency_phrase + question_format

Pred. score 0.82

Combination 03 (2 features)

personalization_token + emoji_in_subject

Pred. score 0.71

Pattern recognized · personalization_token appears in all three

Discovery

personalization_token keeps appearing in high-scoring combinations. That becomes a hypothesis worth testing.

New row added to the spreadsheet: "Personalization token in subject lifts opens" (untested)

The model evaluates thousands of variable combinations and notices when one variable keeps appearing in winners. That variable becomes a hypothesis worth testing, not a conclusion to ship.
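The "notices when one variable keeps appearing in winners" step can be as simple as counting. Here's a sketch using the three illustrative combinations above. The 0.7 score cutoff and the function name `recurring_features` are assumptions, and the scores are the illustrative ones, not real model output.

```python
from collections import Counter

scored_combinations = [
    ({"curiosity_gap", "subject_under_30_chars", "personalization_token"}, 0.79),
    ({"personalization_token", "urgency_phrase", "question_format"}, 0.82),
    ({"personalization_token", "emoji_in_subject"}, 0.71),
]

def recurring_features(combos, min_score=0.7):
    """Count how often each feature shows up in high-scoring combinations."""
    counts = Counter()
    for features, score in combos:
        if score >= min_score:
            counts.update(features)
    return counts.most_common()

top_feature, appearances = recurring_features(scored_combinations)[0]
print(top_feature, appearances)  # personalization_token 3
```

The output is a candidate row for the hypothesis store, still untested, which is exactly how it should be treated.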

It's able to make a test suggestion like, "subject length under six words seems predictive of opens."

And when you combine the automatic hypothesis discovery and the "pooling" features, this system will collect data on this hypothesis in the background while you run other tests.

COOL!

Picking the next test

"Okay, but that sounds like A LOT of tests, man... Where do I even start?"

No worries. I got you!

Prioritize the hypothesis with the biggest potential impact first. Narrow in from there.

We can calculate that with a formula that looks something like this:

Illustrative

The priority formula

priority = expected_impact × future_sends_affected × EIG ÷ sends_required

expected_impact: size of the lift if the test confirms.

future_sends_affected: how many future sends this answer applies to.

EIG: Expected Information Gain. How much sharper the answer gets.

sends_required: Sends left until we can come to a statistically significant conclusion.

Worked example

Hypothesis: "Curiosity-gap subjects lift opens vs benefit-forward"

expected_impact: 6 percentage points (0.06 as a fraction)
future_sends_affected: 2,000,000 sends / year
EIG: 0.8 (increases confidence)
sends_required: 5,000 to reach significance

priority = (0.06 × 2,000,000 × 0.8) ÷ 5,000 = 19.2

Priority score: 19.2. Sort the spreadsheet by this. Run the top row next.
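Here's the same priority formula as a tiny function, so you can sort a whole hypothesis store with it. Note that the 6-point lift goes in as the fraction 0.06, which is what makes the score come out to 19.2. The function name `priority` is illustrative.

```python
def priority(expected_impact, future_sends_affected, eig, sends_required):
    """Bigger lift, wider reach, and more information all raise priority; cost lowers it."""
    return expected_impact * future_sends_affected * eig / sends_required

# Curiosity-gap vs benefit-forward, using the worked example's inputs.
score = priority(
    expected_impact=0.06,          # 6 percentage points as a fraction
    future_sends_affected=2_000_000,
    eig=0.8,
    sends_required=5_000,
)
print(round(score, 1))  # 19.2
```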

Making sure the test is solid

If we're running an A/B test, every proposed A/B pair has to pass two checks before it gets sent:

Variants differ only on the variable under test.
If you're testing curiosity vs. benefit-forward, the two subject lines should be identical in as many ways as possible: same length, same topic, etc. Otherwise you can't attribute the outcome to the variable you thought you were testing.
Both variants are high quality.
Both A and B have to score high on the brand-voice scorer and clear the baseline performance predictor (both are separate tools I've already built). If a variant loses because it's straight up BAD, the variable being tested doesn't have the chance to make a difference.
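Both checks are easy to express in code once each variant has a feature row and a quality score. This is only a sketch: the feature names, the 0.7 quality floor, and the single `quality` number standing in for the brand-voice scorer and the performance predictor are all hypothetical.

```python
def differing_features(a: dict, b: dict) -> set:
    """Names of every feature where the two variants disagree."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}

def passes_gate(a, b, variable_under_test, quality_a, quality_b, min_quality=0.7):
    """Check 1: only the tested variable differs. Check 2: both variants are good."""
    isolated = differing_features(a, b) == {variable_under_test}
    good_enough = min(quality_a, quality_b) >= min_quality
    return isolated and good_enough

variant_a = {"framing": "curiosity", "word_count": 7, "topic": "retreat"}
variant_b = {"framing": "benefit", "word_count": 7, "topic": "retreat"}
print(passes_gate(variant_a, variant_b, "framing", 0.82, 0.78))  # True

# Same pair, but B is also longer, so length is now tangled up with framing.
variant_b_long = {"framing": "benefit", "word_count": 12, "topic": "retreat"}
print(passes_gate(variant_a, variant_b_long, "framing", 0.82, 0.78))  # False
```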

Quality control is important. That's why we build tools for this, and why keeping a human in the loop is crucial to avoid test contamination.

What about multi-armed bandits?

"Bandits" are tools that dynamically shift traffic toward whichever variant is winning DURING a test. They're great at squeezing every last conversion out of this week's campaign. This sounds awesome at first...

However, bandits can have a negative effect on testing programs.

Bandits starve losing variants of traffic, which kills the precision needed to build a causal claim. Bandits learn which arm pays. They don't care about why.

A true detective tester wants to know which variant pays AND why. And the only way to make a real causal conclusion is to get enough evidence, including data on the losing variant.

Bandits enable this week's revenue, but they kill the compounding effect of next year's learning.
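To put a rough number on "starving the losing variant", here's a quick sketch of how the margin of error on the loser's open rate changes when it gets 10% of the traffic instead of 50%. The open rate, send count, and traffic split are hypothetical, and the formula is the standard normal approximation for a proportion.

```python
import math

def margin_of_error(rate: float, sends: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed rate (normal approximation)."""
    return z * math.sqrt(rate * (1 - rate) / sends)

total_sends = 10_000
open_rate = 0.30  # hypothetical open rate for the losing variant

even_split = margin_of_error(open_rate, total_sends // 2)     # fixed 50/50 test
bandit_split = margin_of_error(open_rate, total_sends // 10)  # bandit leaves it 10%

print(f"50/50 split: ±{even_split:.1%}")
print(f"bandit:      ±{bandit_split:.1%}")
```

With a fifth of the data, the uncertainty on the losing arm more than doubles, and that arm is half of the comparison you need for a causal claim.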

Why I'm confident this works (without having built it yet)

I haven't shipped this. So why should you (or I) trust the design? Three reasons.

The math is settled. Bayesian hierarchical models, gradient boosting, partial pooling, and Expected Information Gain are decades-old, well-tested techniques.

LinkedIn uses them for email-marketing personalization.
Booking.com uses them for sequential testing.
Optimizely uses them for their stats engine.

The novel part is intimately applying them to a company's marketing assets, sticking with it over time so it compounds, and using the data as context that can be leveraged by AI for future copy generation.

What this unlocks

The shift doesn't come from any single component.

It's what happens when the full testing loop runs for a full year.

Every send becomes a data point. There's no "throwaway" campaign anymore, because even routine sends quietly sharpen the hypothesis store. Twelve months in, you have a growing map of cause and effect instead of a folder of non-scalable "winning variants".

At enterprise scale, this testing system creates institutional memory. All of the testing knowledge survives a marketer changing roles or a quarter ending. One team can read what another already proved and skip the test.

It also fixes the opportunity cost problem. Since we test the highest impact claims first, wasted sends drop, meaning we waste less revenue. And because this is structured data, it feeds straight into whatever AI writes your copy next.
