Simpson's Paradox

Connor Coutts
November 24, 2021
Data Science
It is often easy to draw conclusions from a first glance at the data, but the key to interpreting data is treating any conclusion with a degree of scepticism.
Simpson's paradox is a perfect example of why you should be cautious about making data-driven decisions before the underlying data has been fully interrogated.
Simpson's paradox is a statistical phenomenon in which a trend appears in several groups of data but is reversed when the groups are combined.
An example
Suppose we have run an Ibex marketing campaign in a few regions and analysed some conversion data, comparing campaign and control groups.
Evidently, it looks as though this campaign isn’t having a positive impact compared to a ‘do nothing’ baseline.
Now suppose we investigate further and dig into the regions to determine what is driving this trend and see the following conversion performance.
We now see that the campaign is in fact having a positive impact on conversion rate in both regions, which leads to the inevitable question: how is this possible?
Simpson's paradox arises because hidden variables are not taken into account when groups of data are combined.
In this case, the hidden variable is the size of the control groups.
UK Figures
EU Figures
Looking at the raw data from this marketing campaign, we can see that the relative size of the control group differs between the two regions.
In the UK the control group is around 5% of the audience, whereas in the EU it is around 20%.
The difference in the size of control groups (the hidden variable) immediately puts more of a weighting on the EU control conversion rate, driving the overall control conversion rate for this marketing campaign up.
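The weighting effect can be reproduced with a few lines of Python. This is a minimal sketch: every figure below is hypothetical, chosen only so that the control groups are 5% and 20% of their regions, and none of it is the campaign's real data. The names `data`, `rate` and `pooled` are illustrative.

```python
# Hypothetical figures, chosen only to illustrate the effect; they are not
# the campaign's real data. Each entry is (customers, conversions).
data = {
    "UK": {"campaign": (9500, 190), "control": (500, 9)},    # control is 5% of the region
    "EU": {"campaign": (8000, 320), "control": (2000, 76)},  # control is 20% of the region
}


def rate(customers, conversions):
    """Conversion rate as a fraction."""
    return conversions / customers


# Per-region impact: campaign conversion rate minus control conversion rate.
region_impact = {
    region: rate(*groups["campaign"]) - rate(*groups["control"])
    for region, groups in data.items()
}


def pooled(group):
    """Sum customers and conversions for one group across all regions."""
    customers = sum(groups[group][0] for groups in data.values())
    conversions = sum(groups[group][1] for groups in data.values())
    return customers, conversions


# Naive overall impact: pool the raw counts, then compare the two rates.
overall_impact = rate(*pooled("campaign")) - rate(*pooled("control"))

for region, impact in region_impact.items():
    print(f"{region} impact: {impact:+.2%}")
print(f"Overall impact: {overall_impact:+.2%}")
```

With these made-up numbers the campaign beats control in each region, yet the pooled comparison is negative, because four fifths of the pooled control group comes from the higher-converting EU region while less than half of the pooled campaign group does.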
Overall Figures
This leads to a conversion rate impact of -0.25% when the regions are combined.
The following graph shows what this data looks like when we do take into account this hidden variable by rebalancing the control groups.
Rebalancing the control groups removes the effects of Simpson's paradox and allows us to analyse the overall performance of this marketing campaign. As the data from the two separate regions suggest, this marketing campaign has had a positive conversion rate impact of 0.24%.
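The article does not spell out how the control groups were rebalanced, so the following is only one possible approach, shown on hypothetical numbers rather than the campaign's real data. It assumes rebalancing means weighting each region's control conversion rate by that region's share of campaign customers, so that both groups have the same regional mix.

```python
# Hypothetical figures (customers, conversions); not the campaign's real data.
data = {
    "UK": {"campaign": (9500, 190), "control": (500, 9)},
    "EU": {"campaign": (8000, 320), "control": (2000, 76)},
}

# Pooled campaign conversion rate across both regions.
campaign_customers = sum(g["campaign"][0] for g in data.values())
campaign_conversions = sum(g["campaign"][1] for g in data.values())
campaign_rate = campaign_conversions / campaign_customers

# Assumed rebalancing: weight each region's control conversion rate by that
# region's share of campaign customers, so the control group has the same
# regional mix as the campaign group.
rebalanced_control_rate = sum(
    (g["campaign"][0] / campaign_customers) * (g["control"][1] / g["control"][0])
    for g in data.values()
)

rebalanced_impact = campaign_rate - rebalanced_control_rate
print(f"Rebalanced overall impact: {rebalanced_impact:+.2%}")
```

On these made-up figures the rebalanced impact is positive, in line with the per-region results. Other weighting schemes, such as weighting by total regional audience, are equally plausible and would give slightly different numbers.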
This type of scenario shows exactly why conclusions from analysis should be scrutinised and is exactly the type of problem our team of data scientists has vast experience in dealing with.
