AI Writing Systems

Four case study articles about how I've used artificial intelligence and machine learning to strengthen my copywriting process and create marketing knowledge systems that scale.

The goal of each is to empower AI with unique context, so marketing strategy is more objective, while leaving room for invaluable human input.


Predicting Copy Performance with Machine Learning

Forecasting how a draft is likely to perform before it ships.

The performance of every piece of copy (email, social posts, etc.) can feel like a gamble.

Maybe it's based on hunches you've had from past tests, maybe it's based on data, maybe it's based on something else...

The reality is that, until you publish a piece and get the numbers back, it's basically a guess.

But what if you could make a quantifiable prediction of how well a piece would perform BEFORE publishing it?

Something that says, "This will go viral and get around 1M views, but it won't move people further down the funnel."

If we knew that, it would be the marketing equivalent of time travel. Before we publish anything, we can go back and tweak it until it's optimized for the right thing.

How freakin' cool would that be?!

So that's what I tried to build here!

If you can't figure out what variables in your copy affect outcomes, this project is for you. 

We start with two questions

Every piece of copy should optimize for ONE goal at a time. The way you find that goal is by asking, "What one action do we want the reader to take?"

If we want more leads coming in top of funnel, our goal should be to optimize for virality and follower counts.

If we want leads going deeper into the funnel, we need to optimize for opt-ins, booked calls, etc.

Therefore, we need this performance predictor to differentiate between goals.

So I started by asking two questions.

Question 1: "Will this piece go viral, but not convert?"

Question 2: "Will the audience actually take the action the copy asks them to take?"

To train the performance predictor on this, we split up pieces of copy based on their goal. Did it include a CTA to go deeper into the funnel? Or was it optimizing for reach?

Illustrative example: Post B (low engagement · high CTA conversion)

“Comment FREEDOM and I’ll DM you the rate sheet I used to close my first duplex.”

  Engagement      : 22 / 100
  CTA Conversion  : 82 / 100

Outcome: not much reach, but massive action taking.

Once you have both numbers, you can do something no off-the-shelf software does. You can determine which variables contribute to reach, which contribute to conversion, and where they intersect.

But HOW? What makes all this possible?


What does the model actually do?

To do this, we're using a machine learning model called LightGBM.

To start, the human creates variables that they believe might or might not affect the performance of copy. Then a feature-extraction step (this part happens before LightGBM, not inside it) takes those variables and builds a spreadsheet with one column for every variable and one row for every piece of copy.

Then you let it loose to fill out every single column for every single piece of copy. LightGBM is what learns from the finished spreadsheet.

To give you an idea, here's a sample row for one IG post:

Row for one IG post
  caption_length              : 187
  hashtag_count               : 9
  has_manychat_keyword        : 1
  sentiment_arc_p1            : -0.12
  sentiment_arc_p10           : +0.34
  style_distance_to_centroid  : 0.41
  fw_to_ratio                 : 0.018
  dim_offer                   : "lead-magnet"
  ... about 100 more like this
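To make the "spreadsheet row" idea concrete, here is a minimal sketch of how a few of those columns could be computed from a raw caption. Treat it as an illustration under assumptions, not the actual pipeline: the keyword list is hypothetical, and I'm assuming `fw_to_ratio` means the share of words that are the function word "to".

```python
import re

# Hypothetical keyword list; the real ManyChat trigger words are an assumption here.
MANYCHAT_KEYWORDS = {"FREEDOM", "LEAP"}

def extract_features(caption: str) -> dict:
    """Turn one caption into a small feature row (4 columns, not the full ~100)."""
    words = re.findall(r"[A-Za-z']+", caption)
    hashtags = re.findall(r"#\w+", caption)
    to_count = sum(1 for w in words if w.lower() == "to")
    return {
        "caption_length": len(caption),          # characters
        "hashtag_count": len(hashtags),
        "has_manychat_keyword": int(any(w in MANYCHAT_KEYWORDS for w in words)),
        # Assumed meaning: fraction of words that are "to".
        "fw_to_ratio": round(to_count / len(words), 3) if words else 0.0,
    }

row = extract_features(
    "Comment FREEDOM and I'll DM you the rate sheet. #realestate #investing"
)
print(row)
```

Every piece of copy goes through the same function, so every row ends up with the same columns, which is what the model needs.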

I decided that I wanted to be able to predict 6 different copywriting objectives across IG, YouTube, Threads, and Email.

Here's the list:

Model                                 What it predicts          Trained on
IG conversion                         ManyChat trigger volume   IG posts that contain a keyword
IG engagement (ManyChat keyword)      likes + saves + shares    same posts as above
IG engagement (no ManyChat keyword)   likes + saves + shares    IG posts without a keyword
Email open rate                       open %                    Brent-authored emails only
YouTube engagement                    engagement composite      YouTube videos
Threads engagement                    engagement composite      Threads posts
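One way to picture this split is as a small routing rule that decides which models score a given piece of copy. This is only a sketch: the model names and the `models_for` helper are made up for illustration.

```python
# Hypothetical model names; the article lists six models but does not name them in code.
MODELS_BY_PLATFORM = {
    "email": ["email_open_rate"],
    "youtube": ["youtube_engagement"],
    "threads": ["threads_engagement"],
}

def models_for(platform: str, has_manychat_keyword: bool = False) -> list:
    """Return the models that would score a piece of copy on this platform."""
    if platform == "ig":
        if has_manychat_keyword:
            # Keyword posts get two predictions: conversion and engagement.
            return ["ig_conversion", "ig_engagement_keyword"]
        return ["ig_engagement_no_keyword"]
    return MODELS_BY_PLATFORM[platform]

print(models_for("ig", has_manychat_keyword=True))
```

Note that an IG post with a keyword is the only case that gets scored by two models at once.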

After splitting all of the samples along these lines, and filling out the spreadsheet for each piece of copy, it's time to start training this thing!

What does "train" even mean?

Think of it like studying with flashcards (where each piece of copy is a single flashcard).

  1. You take a stack of past pieces of copy whose outcomes are already known.
  2. You show the algorithm one piece of copy (one row from the spreadsheet).
  3. You ask it to guess the outcome.
  4. You tell it the right answer (based on your real performance data).
  5. You let it adjust, so its next guess on that kind of post will be a hair closer to correct.
  6. Then you do it again. Then again. Thousands of times across all past posts.

Eventually the algorithm gets pretty good at guessing the outcome on posts it's already seen, and (if you set it up right) on posts it hasn't.

But it's not really flashcards.

Under the hood, this algorithm uses something called "decision trees".

A decision tree is a short flowchart of yes/no questions about variables, ending in a guess. Here's one the model might actually build for IG conversion:

One tree the model built (illustrative)

  Does the caption have a ManyChat keyword?
    No  → guess: 4 ManyChat triggers
    Yes → How many hashtags?
            ≤ 9   → guess: 12 triggers
            > 9   → How long is the caption?
                      ≤ 220 chars → guess: 8 ManyChat triggers
                      > 220 chars → guess: 6 ManyChat triggers

The power behind LightGBM and these decision trees is that they build on each other's learning.

For example, the first tree makes a guess and finds it was off by 10. Tree 2 focuses on figuring out what Tree 1 got wrong, Tree 3 focuses on what is still wrong after that, and so on. Over time, this process gets closer and closer to finding the patterns that make predictions more accurate.

This is called "Gradient Boosting" (hence the "GB" in LightGBM).
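Here is a toy version of that idea in plain Python, so you can see the "each tree fixes the last tree's mistakes" loop. It is a teaching sketch, not LightGBM: each "tree" is a single yes/no question, the data is made up, and the two columns are assumed to be `has_manychat_keyword` and `hashtag_count`.

```python
from statistics import mean

def fit_stump(X, residuals):
    """Find the one yes/no split (feature, threshold) that best explains the residuals."""
    best = None
    for j in range(len(X[0])):
        # Try every value except the largest, so both sides of the split are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv, rv = mean(left), mean(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    return best[1:]  # assumes at least one feature takes more than one value

def stump_predict(stump, row):
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv

def boost(X, y, n_trees=20, learning_rate=0.3):
    """Start from the average, then add small trees fitted to what is still wrong."""
    base = mean(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # what we still get wrong
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row) for p, row in zip(preds, X)]
    return base, stumps, learning_rate

def predict(model, row):
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)

def training_error(model, X, y):
    return mean([(yi - predict(model, row)) ** 2 for row, yi in zip(X, y)])

# Made-up rows: [has_manychat_keyword, hashtag_count] -> ManyChat triggers
X = [[0, 3], [0, 12], [1, 5], [1, 9], [1, 12], [1, 15]]
y = [4, 4, 12, 12, 7, 7]

one_tree = boost(X, y, n_trees=1)
many_trees = boost(X, y, n_trees=30)
print(training_error(one_tree, X, y), training_error(many_trees, X, y))
```

The error with 30 trees comes out lower than with one, which is the whole point of boosting. Real LightGBM grows deeper trees and adds a lot of engineering on top, but the loop is the same shape.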

How do we know LightGBM is actually working though?

We also trained a simpler version of every model as a sanity check, called ElasticNet. ElasticNet draws straight-line rules (every extra word does X to the score, that kind of thing).

LightGBM beat it on every one of the six datasets.

That's because LightGBM successfully catches the complex patterns where performance depends on multiple factors in any given piece.

And to check that the results aren't a fluke of one lucky split, I ran the LightGBM process on each copywriting objective 15 times (15-fold cross-validation, where each run holds out a different slice of the data for scoring).

The ± number next to each score (something like ±0.025) is the spread across those 15 runs. A small number means a smaller spread and a higher confidence in performance. A bigger number means a wider spread and less confidence.
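A rough sketch of the bookkeeping behind that ± number is below. Two assumptions to flag: I'm treating the spread as the standard deviation of the per-run scores, and the three scores passed to `summarize` are placeholders, not real results.

```python
from statistics import mean, stdev

def kfold_indices(n_items: int, k: int = 15):
    """Yield (train_idx, test_idx) pairs; every item is held out exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx

def summarize(fold_scores):
    """Collapse per-fold scores into 'score ± spread' (spread assumed to be std dev)."""
    return round(mean(fold_scores), 3), round(stdev(fold_scores), 3)

splits = list(kfold_indices(30, k=15))
score, spread = summarize([0.70, 0.72, 0.68])  # placeholder scores
print(f"{score} ±{spread}")
```

In the real run, each of the 15 splits would train a fresh model on `train_idx`, score it on `test_idx`, and feed that score into `summarize`.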

The results are in!

Spearman rank correlation · 15-fold CV (0 = random · 1 = perfect ranking)

  Model                                     Score    CV spread
  IG conversion (keyword path)              0.708    ±0.025
  Email open rate (single-author corpus)    0.671    ±0.078
  IG engagement (non-keyword path)          0.645    ±0.058
  Email click rate                          skipped: target too noisy at n=184
  YouTube engagement                        0.371    ±0.045
  Threads engagement                        0.236    ±0.038
  IG keyword-path engagement                0.080    ±0.089   (weak on purpose)

IG keyword-path engagement is kept on purpose. The divergence detector needs both signals to find the trap.

Scale: ≥ 0.6 strong · 0.4–0.6 useful · 0.2–0.4 weak · < 0.2 essentially noise

The top three are strong. A Spearman score is a rank correlation, not a hit rate, so 0.64 to 0.71 doesn't mean "right 64% to 71% of the time." It means that when the model ranks one draft above another, the real results usually agree, which is good enough to be useful in production.
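For reference, here is what a Spearman rank correlation actually computes: rank the predictions, rank the real outcomes, then correlate the two sets of ranks. This is a small self-contained sketch with made-up numbers, not the evaluation code behind the scores above.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(predicted, actual):
    """Pearson correlation of the ranks: 1 = same ordering, 0 = unrelated, -1 = reversed."""
    rp, ra = average_ranks(predicted), average_ranks(actual)
    mp, ma = sum(rp) / len(rp), sum(ra) / len(ra)
    cov = sum((a - mp) * (b - ma) for a, b in zip(rp, ra))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_a = sum((b - ma) ** 2 for b in ra)
    return cov / (var_p * var_a) ** 0.5

# Made-up example: model scores for four posts vs. their real trigger counts.
rho = spearman([0.2, 0.9, 0.5, 0.7], [3, 40, 25, 12])
print(round(rho, 3))
```

Because only the ordering matters, the model can be off on the exact numbers and still score well, as long as it puts the better posts above the worse ones.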

Middle two (YouTube and Threads) aren't quite there yet, but I honestly haven't spent as much time on them.

The bottom one (IG keyword-path engagement at 0.08) is weak, and it's kept ON PURPOSE. On its own it's essentially noise, but the grid in the next section needs an engagement prediction sitting next to the conversion prediction for keyword posts.

(Also, I left out the email click rate because, at n=184, the target was too noisy to get reliable predictions.)

What two predictions together can tell you

When an IG post has a ManyChat keyword in it, two prediction models run on it, so you get two predictions for one post.

This allows you to predict how the post will convert and how much reach it will get. Then you can plot the post on a grid.

Predicted virality on one axis. Predicted conversion on the other.

The four quadrants, each with its precision on held-out data:

  Sleeper conversion (low virality · high conversion): 7.3%
    Quiet but converts. Worth amplifying if you wanted reach too.

  Gold (high virality · high conversion): 30.8%
    Both numbers strong. Whatever your goal, this is going to land.

  Rework (low virality · low conversion): 62.1%
    Weak on both axes. Rework if you wanted either signal.

  Engagement trap (high virality · low conversion): 49.1%
    Big reach, few triggers. Great for top-of-funnel awareness. A trap only if you were trying to convert.

What the percentages mean: when the model puts a draft in this quadrant, how often was it actually in this quadrant on data it had never seen? Above 25% beats random chance.
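Here is a minimal sketch of how a draft could be dropped into a quadrant and how that per-quadrant precision could be computed. The cut-offs between "low" and "high" are an assumption (the article doesn't say how they were chosen), and the example labels are made up.

```python
def quadrant(virality: float, conversion: float, v_cut: float, c_cut: float) -> str:
    """Place one post on the grid. v_cut and c_cut are assumed thresholds, e.g. medians."""
    high_v, high_c = virality >= v_cut, conversion >= c_cut
    if high_v and high_c:
        return "gold"
    if high_v:
        return "engagement_trap"
    if high_c:
        return "sleeper_conversion"
    return "rework"

def precision_by_quadrant(predicted, actual):
    """Of the posts the model put in each quadrant, what share truly landed there?"""
    out = {}
    for q in set(predicted):
        hits = [a == q for p, a in zip(predicted, actual) if p == q]
        out[q] = sum(hits) / len(hits)
    return out

# Made-up held-out labels, just to show the calculation.
predicted = ["gold", "gold", "rework", "engagement_trap"]
actual = ["gold", "rework", "rework", "gold"]
print(precision_by_quadrant(predicted, actual))
```

The same `quadrant` function runs twice per post: once on the model's two predictions, and once on the real numbers after the post has been live.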

Let's take an actual IG post as an example:

Olivia hosted a women's investing retreat in Sedona last spring. The reel she posted afterward opened with this caption:

"I just hosted a retreat in Sedona and I am still sitting in the afterglow
☀ Comment LEAP and I'll DM you the details on the Leap Year Minimind."

The LEAP keyword was the ManyChat trigger for a Leap Year Minimind launch sequence. The objective of the post was to drive conversion to that launch.

When I ran the prediction model on this post, two predictions came back.

Predicted virality was in the top 5% for the cohort. Predicted conversion was in the bottom 1%.

How did the post actually perform?

Reach: 3,407. Likes: 137. ManyChat triggers: zero.

The model called it. High reach, no conversion. Lots of likes, no leads.

How often is the model right about which box a draft lands in?

On the most recent 20% of posts, which the model never saw during training, its high-reach, low-conversion (engagement trap) calls were right 49.1% of the time, against a 25% random-chance baseline.

That's roughly double what chance would give, so the flag is a useful signal that something needs work, even though it's still wrong about half the time. But of course, it's the human's call to push publish in the end.
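Holding out the most recent 20% of posts can be sketched like this. The `published` field name is hypothetical, and the dates are placeholders.

```python
def time_holdout(posts, holdout_fraction=0.2):
    """Sort by publish date, train on the older posts, test on the newest slice."""
    ordered = sorted(posts, key=lambda p: p["published"])
    n_test = round(len(ordered) * holdout_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]

# Placeholder posts with ISO dates (which sort correctly as strings).
posts = [{"published": f"2024-01-{day:02d}"} for day in range(10, 0, -1)]
train, test = time_holdout(posts)
print(len(train), len(test))
```

Splitting by time instead of at random matters here: it mimics how the model gets used in real life, where it only ever sees the past and has to score what comes next.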

The next layer: a live copy editor

Imagine writing your next piece of copy inside an editor that scores it as you type. The feedback would look like this:

  • Your first sentence is longer than 90% of high-performers in this cohort.
  • Your emotional sentiment dips negative around sentence 7. High-performers hold positive longer.
  • Adding an emoji here historically lifted pieces like this by 0.4 points.
  • Your hashtag count is one above the high-performer median for promotional reels.

How is that feedback possible?

Remember how LightGBM uses the spreadsheet row of 100+ variables to predict performance for a post?

A standard technique called SHAP cracks the prediction open and tells you which features pushed the score up, which pulled it down, and by how much.

For example, say you have a draft and you run the prediction model on it.

The IG conversion model scores it 0.65.

SHAP breaks that 0.65 score into parts.

  • +0.18 came from the post's purpose tag (lead-magnet promo)
  • +0.07 from a ManyChat trigger word in the caption
  • -0.04 from the hashtag count being one too high

Basically, LightGBM gives you a prediction. SHAP gives you the WHY behind the prediction.
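One detail worth seeing in numbers: SHAP contributions are additive. The model's baseline (its average prediction) plus every feature's contribution adds up to the final score. The sketch below uses the three contributions from the example and lumps everything else together. The feature names are hypothetical, and this is plain arithmetic rather than a call to the `shap` library.

```python
def explain(prediction: float, contributions: dict):
    """Rank the listed contributions by size and report what is left unexplained by them."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    # Whatever the listed features don't account for: baseline + all unlisted features.
    unlisted = round(prediction - sum(contributions.values()), 2)
    return ranked, unlisted

ranked, unlisted = explain(
    0.65,
    {
        "post_purpose": 0.18,          # lead-magnet promo (hypothetical column name)
        "has_manychat_keyword": 0.07,
        "hashtag_count": -0.04,
    },
)
for name, value in ranked:
    print(f"{value:+.2f}  {name}")
print(f"{unlisted:+.2f}  baseline + all other features")
```

So the three listed features account for +0.21 of the 0.65, and the remaining 0.44 comes from the baseline and the other features in the row.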

Run the same process across every post the model has seen, and you get the features it leans on most. Here's the aggregated view for IG conversion:

The IG conversion model's top 5 features
(↑ pushes prediction toward HIGH conversion · ↓ pushes toward LOW)

  1. What the post is trying to do
     Lead-magnet-promo posts drive triggers. Lifestyle posts don't. Dominant by a wide margin.

  2. Caption literally contains a ManyChat trigger word (partial confound)
     Captions without the keyword obviously can't trigger ManyChat. The model is partially learning the rule, not the cause.

  3. Which product or offer the post is promoting
     The specific offer in the post pushes the prediction down on lower-converting products. Different offers convert at different rates.

  4. How many hashtags the caption uses
     Denser hashtag stacks correlate with lower conversion. Reads as desperation, not signal.

  5. How heavily the caption uses the word "to"
     Instruction-heavy chains ("click the link to grab the guide to start your...") convert worse than plain talk.

These are the features the model leans on most. Number two is partially a confound. Captions without the trigger word obviously can't trigger ManyChat. The other four are doing real work.
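A common way to build a ranking like this is to average the absolute size of each feature's contribution across all posts. I'm assuming that convention here, since the exact aggregation isn't spelled out, and the per-post numbers below are made up.

```python
def global_importance(per_post_contributions):
    """Rank features by mean absolute contribution across posts (assumed convention)."""
    totals = {}
    for contrib in per_post_contributions:
        for name, value in contrib.items():
            totals[name] = totals.get(name, 0.0) + abs(value)
    n = len(per_post_contributions)
    return sorted(
        ((name, total / n) for name, total in totals.items()),
        key=lambda kv: kv[1],
        reverse=True,
    )

# Made-up contributions for three posts (hypothetical feature names).
per_post = [
    {"post_purpose": 0.20, "hashtag_count": -0.04},
    {"post_purpose": -0.10, "hashtag_count": 0.02},
    {"post_purpose": 0.30, "hashtag_count": -0.06},
]
ranking = global_importance(per_post)
print(ranking)
```

Using absolute values matters: a feature that pushes hard in both directions would otherwise average out to zero and look unimportant.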

This opens the door to a powerful source of feedback.

You type a draft. The performance model scores it as you write. SHAP runs underneath and surfaces feedback line by line.
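As a taste of what one of those feedback rules could look like, here is a sketch of the "first sentence is longer than X% of high-performers" check. Everything here is hypothetical: the helper names, the 90% threshold, and the cohort of word counts.

```python
import re

def percentile_of(value, cohort_values):
    """Percent of the cohort that falls strictly below this value."""
    return 100 * sum(v < value for v in cohort_values) / len(cohort_values)

def first_sentence_feedback(draft, high_performer_lengths, threshold=90):
    """Return a feedback line if the opening sentence is unusually long, else None."""
    first = re.split(r"(?<=[.!?])\s+", draft.strip(), maxsplit=1)[0]
    pct = percentile_of(len(first.split()), high_performer_lengths)
    if pct >= threshold:
        return f"Your first sentence is longer than {pct:.0f}% of high-performers in this cohort."
    return None

# Made-up word counts for the first sentences of past high-performers.
cohort = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
long_note = first_sentence_feedback(
    "I just hosted a retreat in Sedona and I am still sitting in the afterglow. Comment LEAP.",
    cohort,
)
short_note = first_sentence_feedback("Hi there. More to come.", cohort)
print(long_note)
```

A real editor would run a rule like this for each feature SHAP flags, and only surface the ones that move the prediction the most.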

And even better, you can easily hook this up to an AI agent to help with planning, ideation, testing, writing, and beyond!