Whom to send an offer to

Predicting the success of Starbucks marketing offers

Oct 5, 2021

Introduction

The basis for the project I am presenting here is a dataset from Starbucks that I analyzed as part of my Data Science Nanodegree at Udacity.

The data contains information about customers, about different types of offers made to these customers and about the purchase behaviour of the customers after they received these offers. From a marketing perspective it is interesting to know whether it is worth sending offers to certain customers.

In order to solve this problem I took two steps:

A classification model was trained to predict the outcome of an offer depending on features regarding user demographics as well as offer characteristics.
This model was used to predict the outcome of each offer in the offer portfolio for any new or existing user.

The result of the project can be seen as a marketing tool to decide what kind of offers to send to users.

Data exploration

The starting point was a set of three tables:

An offer portfolio containing 10 different offers and information on the offer type, distribution channels, duration, reward for and necessary investment by the customer.
A user table with information on the age, gender, income and membership duration of each user.
A table recording each of the following events including the date it occurred on: offer received, offer viewed, offer completed and transaction.

Below I will give an overview of these three data tables. The details of the data cleaning process are documented in the Jupyter Notebook I used, which can be found in my GitHub repository.

Offer portfolio

In the following table we can see that the mean reward of the offers is $4.20, while the mean investment is $7.70, and most offers have a duration of one week (duration is in days). 90% of the offers are sent via mobile channels, 80% via the web and 60% via social media.

The distribution of offer types shows that most offers are either “buy-one-get-one-free” (bogo) or discount offers. Only two offers are informational.

Users

The age of the users is almost normally distributed between 18 and 101 years, with a slight right skew. The mean age of users is 54 years.

The income has a clearly right-skewed distribution, which is rather typical for income data. The average annual income is about $65,000, the minimum $30,000 and the maximum $120,000.

The figure on the left shows that there are about 2000 more males than females in the user database. Furthermore, only very few users indicated a non-binary gender.

Events

The events that occurred most frequently were transactions, i.e. payments made by users, as can be seen in the figure below. Also, there is a gradual decrease from offer received, to offer viewed, to offer completed. This means that users were lost along this intended pathway of events.

The next figure shows the events as a time series between the starting and end point of the test period. It can clearly be seen that the offers were at first sent in a weekly rhythm. In the second half of the test period the frequency of offers was increased.

Secondly, we can see that the offers were mostly viewed on the day they were received and also that the reception of offers triggered transactions and thus the completion of offers.

Data preprocessing

As stated in the introduction, my aim was to train a classification algorithm to predict the outcome of sending an offer to a user. Before starting with the prediction, the labels and features for the prediction had to be defined. The intention was to predict different offer outcomes (labels) from different characteristics of the offers and users (features).

Labels: Offer outcomes

The outcome of sending an offer to a user can be categorized depending on the combinations of receiving, viewing and completing offers:

Success: The offer had its intended effect, because it was received, viewed and completed in this order.
Waste: The offer was unnecessary because the effect was reached even though the user did not view the offer or viewed it after completing the offer.
Failure: The offer was not completed even though it was viewed.
Potential: The offer was not completed but it was also not viewed. Since we do not know why it was not viewed, we might send it again in the future.

It is very important to note that this categorization does not hold for informational offers, because this offer type by definition is never completed. These offer types thus had to be removed from the training data.

It was quite a challenge to create the dataframe with all offers sent out to users and the respective outcome as defined above. My solution was to take the events dataframe as a basis and select all events that were not associated with a transaction or with an informational offer. I then iterated through all users and received offers in this new dataframe. For each received offer I checked whether it had been viewed and/or completed. By comparing the order of these occurrences I assigned one of the outcomes defined above to the received offer and moved on to the next one.
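The decision for a single received offer can be sketched as a small function. This is only an illustration of the four categories, not the code from the notebook: the function name, the use of plain numbers as timestamps and the choice to treat a view at the same moment as the completion as a success are all assumptions.

```python
from typing import Optional


def classify_outcome(received: float,
                     viewed: Optional[float],
                     completed: Optional[float]) -> str:
    """Label one received offer with one of the four outcomes.

    Arguments are hypothetical event times; None means the event
    never occurred for this offer.
    """
    if completed is not None:
        # Completed after being received and viewed, in this order.
        if viewed is not None and received <= viewed <= completed:
            return "success"
        # Completed without a view, or viewed only after completion.
        return "waste"
    if viewed is not None:
        # Viewed, but never completed.
        return "failure"
    # Neither viewed nor completed.
    return "potential"


# A few made-up event sequences, one per outcome.
examples = [
    (0, 6, 12),       # viewed, then completed
    (0, 18, 12),      # viewed only after completion
    (0, 6, None),     # viewed, never completed
    (0, None, None),  # never viewed, never completed
]
for received, viewed, completed in examples:
    print(received, viewed, completed, classify_outcome(received, viewed, completed))
```

In the full dataset the hard part is matching each view and completion to the right received offer, since a user can receive the same offer more than once; the sketch assumes that matching has already been done.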

Features: Characteristics of offers and users

I chose to use the following characteristics as features for the prediction:

Offers: distribution channel, difficulty (i.e. investment), reward
Users: gender, age, income, membership duration

Almost all of these features were already present in the data; only the membership duration had to be calculated. This was done by subtracting the date a user became a member from today’s date, yielding the duration in days.
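A minimal sketch of this calculation with pandas is shown here. The column name `became_member_on`, the `YYYYMMDD` date format and the two example dates are assumptions made for illustration. The sketch also uses a fixed reference date instead of today's date so that it gives the same result every time it runs.

```python
import pandas as pd

# Hypothetical user table; the column name and date format are assumptions.
users = pd.DataFrame({"became_member_on": ["20211001", "20201005"]})

# Parse the join date and subtract it from a reference date.
joined = pd.to_datetime(users["became_member_on"], format="%Y%m%d")
reference_date = pd.Timestamp("2021-10-05")  # stand-in for "today"
users["membership_days"] = (reference_date - joined).dt.days

print(users)
```

Using the actual current date works as well, but the feature then changes every day, so a model trained on it should be given durations computed against the same reference date at prediction time.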

Merging features and labels

After creating a label dataframe and a feature dataframe and converting all categorical variables to numeric, I merged these two dataframes. This resulted in the following dataframe that was my basis for the prediction algorithms:

Using the train_test_split implementation from scikit-learn, I split this dataframe into training and test data, with 20% held out as test data.
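As a rough illustration, the split could look like the following. The tiny dataframe, its column names and the `random_state` value are made up for the example and are not taken from the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the merged feature/label dataframe.
data = pd.DataFrame({
    "difficulty": [5, 10, 7, 20, 10, 5, 7, 10, 20, 5],
    "reward": [5, 10, 3, 5, 2, 5, 3, 10, 5, 5],
    "age": [25, 40, 61, 33, 58, 47, 70, 29, 52, 38],
    "outcome": ["success", "waste", "failure", "potential", "success",
                "success", "failure", "waste", "potential", "success"],
})

X = data.drop(columns="outcome")  # features
y = data["outcome"]               # labels

# Hold out 20% of the rows as test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```

With imbalanced labels, passing `stratify=y` is an option worth considering so that the training and test sets keep similar label proportions.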

Prediction

Metrics

I decided to compare different classification algorithms and choose the best one. To this end, I also needed to decide on an evaluation metric. After reading this post by Rahul Agarwal, I chose the f1-score because it balances precision and recall. When using the f1-score implemented in scikit-learn, I decided to use the weighted average because, according to the documentation, this accounts for label imbalance. As the following output shows, there is an imbalance in the labels:
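To make the weighted average concrete, the sketch below uses ten made-up labels and predictions (not the project's data) and checks that `average="weighted"` is simply the per-class f1-score weighted by how often each label occurs.

```python
import numpy as np
from sklearn.metrics import f1_score

labels = ["failure", "potential", "success", "waste"]

# Made-up, imbalanced labels: "success" dominates.
y_true = np.array(["success"] * 6 + ["failure"] * 2 + ["waste"] + ["potential"])
y_pred = np.array(["success"] * 5 + ["waste"] + ["failure", "success"]
                  + ["waste"] + ["potential"])

# Per-class f1-scores and the number of true samples per class (support).
per_class = f1_score(y_true, y_pred, labels=labels, average=None)
support = np.array([(y_true == label).sum() for label in labels])

# Weighted average computed by hand and by scikit-learn.
manual = (per_class * support).sum() / support.sum()
weighted = f1_score(y_true, y_pred, labels=labels, average="weighted")
macro = f1_score(y_true, y_pred, labels=labels, average="macro")

print(dict(zip(labels, support)))
print(f"weighted: {weighted:.3f}  manual: {manual:.3f}  macro: {macro:.3f}")
```

The macro average treats every class equally, while the weighted average lets frequent classes count for more, which is why the two differ on imbalanced data.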

Models

When it came to choosing algorithms, I had a look at the section on multiclass and multioutput algorithms in the scikit-learn documentation. This gives a very good overview of possible models. I chose to use only inherently multiclass classifiers and selected the following:

Decision tree
Gaussian naive bayes
K-neighbors
Logistic regression
Random forest

I did not use support vector machine classification because a first try with default parameters showed no extraordinary performance and it was very time consuming.
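A comparison of the five classifiers could be sketched as follows. The data here is synthetic, generated with `make_classification` purely so that the example runs; the default parameters and the `max_iter` setting are assumptions and do not reproduce the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class data standing in for the real features and labels.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "gaussian naive bayes": GaussianNB(),
    "k-neighbors": KNeighborsClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
}

# Fit each model and score it with the weighted f1-score on the test data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), average="weighted")

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```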

Parameters

To decide on the best parameters I used the grid search implemented in scikit-learn. Depending on the model and its available parameters, I varied the parameters associated with regularization, leaf size and splitting criteria.
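For the random forest, such a grid search might look like the sketch below. The parameter grid, the number of folds and the synthetic data are assumptions chosen to keep the example small; they are not the grid used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic four-class data standing in for the training set.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Illustrative grid over splitting criterion, tree depth and leaf size.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid,
    scoring="f1_weighted",  # same metric as used for model comparison
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refitted on all the data passed to `fit` with the best parameter combination, which is convenient for saving the model afterwards.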

The following table summarizes the results of the prediction algorithms as well as the best parameters as chosen by the grid search:

It is quite apparent that none of the models worked very well. This suggests that the demographic data of users is not very well suited to predicting the outcome of offers. Nevertheless, since the random forest classification worked best, I chose to save this model and use it in a script to predict the outcome of offers for new users.

Prediction script

My last step was to write a short Python script with user input to predict the outcome of all non-informational offers in the offer portfolio when sent to any new customer. The user of the script is prompted to enter the demographic data and is presented with the predicted outcome for all offers in the offer portfolio.
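The core of such a script can be sketched as below. Everything here is a hypothetical stand-in: the feature names, the two-offer portfolio, the numeric gender encoding and the toy model trained on random data only exist so that the example runs. In the actual script the demographic values come from user prompts and the model is the saved random forest.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature order shared by training and prediction.
FEATURES = ["difficulty", "reward", "web", "social",
            "gender", "age", "income", "membership_days"]

# Toy stand-in for the saved model, trained on random data.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.integers(0, 100, size=(80, len(FEATURES))),
                     columns=FEATURES)
y_toy = rng.choice(["success", "waste", "failure", "potential"], size=80)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Made-up portfolio of non-informational offers.
portfolio = pd.DataFrame({
    "offer": ["bogo_a", "discount_b"],
    "difficulty": [10, 7],
    "reward": [10, 3],
    "web": [1, 1],
    "social": [1, 0],
})


def predict_for_user(model, portfolio, user):
    """Predict the outcome of every offer in the portfolio for one user."""
    # Repeat the user's demographics on every offer row.
    rows = portfolio.assign(**user)
    predicted = model.predict(rows[FEATURES])
    return dict(zip(portfolio["offer"], predicted))


# In the script these values would be entered at the prompt.
new_user = {"gender": 1, "age": 35, "income": 60000, "membership_days": 400}
result = predict_for_user(model, portfolio, new_user)
print(result)
```

Keeping the feature order in one shared list helps avoid a subtle error: a model fitted on columns in one order will silently give meaningless predictions if the prediction rows are assembled in a different order.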

Conclusion

In this project I used data on offers, customers and purchase behaviour to predict the outcome of sending certain offers to new users.

As is often the case, the data preprocessing steps were the most time-consuming and challenging part. Categorizing the offer outcomes in the available data was especially difficult.

I then compared five different classification algorithms with optimized parameters. It turned out that none of them worked very well, which was a bit disappointing. It could be that the demographic data on users is not well suited to predicting how they will react to an offer. Testing the prediction script also showed that users react to all offers in the same way, i.e. if one offer has potential for a certain user, the other offers in the portfolio will also have potential. I expected some differences depending on the offer types, but this does not seem to be the case.

Further work on this could involve finding better features for the prediction. Other possible approaches would be to try XGBoost and LightGBM models, or to combine all the analyzed algorithms into a custom ensemble model.