Hi, everybody. Today I'm going to be showing you that correlation does not imply causation. This whole area really starts with us as digital marketers needing to embrace correlation and causation.
Why marketers need to understand correlation and causation
Now why is this? Well firstly we have a wealth of data across all of the digital marketing disciplines that we operate in. And, as such, we are all required to analyse data on a regular basis.
Correlation and causation are real centrepieces now in terms of analysing data, so neglecting them ultimately leads to wrong insight and wrong decisions being made.
Three main types of correlation
Let's start with correlation. In effect, there are three main types of correlation to look for when analysing data.
Positive correlation
We have positive correlation. So, let's say we have two variables, X and Y. As X increases Y increases, and you'll get a scatter graph a little bit like this. We then put something like that in, a line of best fit, to understand how strong your correlation is.
Negative correlation
We then have a negative correlation, so in this instance, we've got the value of X increasing, but the value of Y decreasing. Again, to understand how strong that correlation is, we put a line of best fit in across the graph like that.
No correlation
Finally, there will be datasets that we analyse that just have no relationship or no correlation whatsoever. You'll have a true scatter graph like this where you can't put a line of best fit in at all.
Pearson correlation coefficient
If you can imagine, every time we're looking at data and trying to analyse data for correlation, it would actually take quite a long time for us to get the data, plot a scatter graph, and draw a line of best fit. And then when you do that, there's nothing really scientific in saying: "Well, how strong is that correlation?"
This is where something called the "Pearson correlation coefficient" comes in. We can use the Pearson correlation coefficient to statistically measure correlation strength. The great thing with this is, it's really easy to do in Excel and in Google Sheets as well.
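For reference, this is how the coefficient is usually written. The notation is not from the talk: it assumes n paired observations (x_i, y_i), with x-bar and y-bar standing for the mean of each series.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The top line is large and positive when the two series move above and below their means together, and the bottom line scales the result so that it always lands between minus one and plus one.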
Using the Pearson correlation coefficient in Excel and Google Sheets
In terms of what you need to do in Excel or in Google Sheets, you type the formula =CORREL, open up your bracket, select your first dataset, and then a comma, and then your second dataset, close the bracket, and that will return you a figure between minus one and plus one. That figure will tell you how strong the correlation is.
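In a sheet, that would look something like =CORREL(A2:A13, B2:B13), where the cell ranges are just placeholders for wherever your two datasets sit. If you'd rather do the same calculation outside a spreadsheet, here is a minimal Python sketch of the same idea. The function name and the ad spend and visits figures are made up purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of each value from its own series mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    co_variation = sum(a * b for a, b in zip(dx, dy))
    spread_x = sum(a * a for a in dx)
    spread_y = sum(b * b for b in dy)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return co_variation / sqrt(spread_x * spread_y)


# Hypothetical illustration data: weekly ad spend and weekly site visits.
ad_spend = [100, 200, 300, 400, 500]
visits = [1100, 1900, 3200, 3900, 5100]

r = pearson(ad_spend, visits)
print(round(r, 3))
```

The result is a single figure between minus one and plus one, which is read in the same way as the figure the spreadsheet formula returns.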
If you're looking at a positive correlation, you want to see the Pearson coefficient coming between 0.5 and one. The closer the number is to one, the stronger the positive correlation.
For negative correlation, the Pearson coefficient will come between minus 0.5 and minus one. The closer the figure is to minus one, the stronger the negative correlation.
Any values between plus and minus 0.5, getting closer and closer to zero, indicate a weak correlation or, indeed, no correlation like we saw earlier.
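Those bands can be written down as a small helper. This is only a sketch of the rule of thumb from the talk: the function name is hypothetical, and treating a value of exactly plus or minus 0.5 as belonging to the stronger band is an assumption, since the talk doesn't say which side the boundary falls on.

```python
def describe_correlation(r, threshold=0.5):
    """Label a Pearson coefficient using the 0.5 rule of thumb."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a Pearson coefficient must lie between -1 and 1")
    if r >= threshold:
        return "positive"
    if r <= -threshold:
        return "negative"
    # Anything strictly between -threshold and +threshold.
    return "weak or none"


for value in (0.94, -0.7, 0.2):
    print(value, "->", describe_correlation(value))
```

The 0.5 cut-off is a convention rather than a law, so the threshold is left as a parameter you can change.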
When correlation doesn't imply causation
Let's apply the principles of correlation and the Pearson's correlation coefficient to a real-life example. Here's an interesting dataset. We've got two variables trending over time, and we can see from the initial first graph that the trends are pretty similar.
We ran the Pearson correlation coefficient across this dataset and it returned a figure of 0.94. Now, this implies that there is a really strong positive correlation between these two datasets and that there seems to be some kind of relationship between the data. Now, importantly at the moment, what's missing is context. We don't know what these two pieces of data are. If we have a look, we can see that the orange line trending is the number of shark attacks per month, and the blue line is the amount of ice cream consumed, in pounds.
Let's take this situation. If we were only to consider the correlation, and remember we have the Pearson's coefficient of 0.94, which is really, really strong, and omit any form of causation, it's going to lead to wrong decisions being taken and wrong insight. For example, "an increase in ice cream consumption causes an increase in shark attacks". Or how about, "an increase in shark attacks is causing people to consume more ice cream"?
Now, clearly this isn't the case, and this is a really strong example of the principle that correlation does not imply causation. The correlation was very, very strong, but ultimately there is no direct causation between shark attacks and ice cream consumption, so treating the correlation as causation is what led us to that wrong insight and wrong decision.
Indirect causation
In this example, however, there is indirect causation, and that's coming from the sun. Not literally. It's to do with the hours of sunshine. Let's take the data of hours of sunshine per month and plot that against the number of shark attacks, and also plot that against the ice cream consumption. Now you have causation, and you can actually draw a more meaningful conclusion from the data.
For example, "an increase in sunshine hours over the summer months causes more people to swim in the sea and increases the likelihood of shark attacks". Much more sensible, I'm sure you'll agree. Or how about, "an increase in sunshine hours over the summer months is causing an increase in ice cream consumption"? Again, that makes much more sense. It's really, really important to ensure you've got that causation behind the data for it to make sense and to drive, kind of, correct assumptions and correct conclusions from your data.
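To see how a shared driver produces this effect, here is a small Python sketch. All of the monthly figures are invented for illustration and are not the data from the slides. Both series simply rise and fall with sunshine, and neither is built from the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )


# Hypothetical monthly figures, January to December.
sunshine_hours = [2, 3, 4, 6, 8, 9, 10, 9, 7, 5, 3, 2]
ice_cream_lbs = [22, 28, 41, 63, 78, 92, 101, 88, 72, 49, 31, 19]
shark_attacks = [0, 1, 1, 2, 4, 4, 5, 5, 3, 2, 1, 1]

r_ice_vs_sharks = pearson(ice_cream_lbs, shark_attacks)
r_sun_vs_ice = pearson(sunshine_hours, ice_cream_lbs)
r_sun_vs_sharks = pearson(sunshine_hours, shark_attacks)

print("ice cream vs shark attacks:", round(r_ice_vs_sharks, 2))
print("sunshine vs ice cream:     ", round(r_sun_vs_ice, 2))
print("sunshine vs shark attacks: ", round(r_sun_vs_sharks, 2))
```

All three pairs come out strongly correlated, and the numbers alone can't tell you which relationships are causal. That judgement comes from thinking about what could plausibly cause what, not from the coefficient.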
An example for marketers
Let's bring these concepts of correlation and causation back to the world of digital marketing. One of the biggest tips I can share here is to always consider causation before any analysis is done. Now this is a particularly bad example, so don't do it this way. In terms of the data that we've got here, we ignored that point, and we've got age ranges going across the X axis and page load speed going up the Y axis. We can see here, in terms of correlation, that as users get older, the page load speed has decreased.
Of course, if we were to take this data at face value and omit causation, we would come to that conclusion and we'd say: "Right, let's increase traffic, get an older demographic to the site, and it's going to cause a reduction in page load speed".
Again, another example of where we haven't thought about the causation, we've put two variables together and we could potentially come up with something like that, which we know clearly isn't true. What the reality tells us in this example, is that page load speed changes are caused by technical factors, and there's no direct causation between age and page load speed.
In this example, we'd say: "Okay, right. Page load speed, what actually causes a change in page load speed?" There are a number of different things, but here are some examples. It could be the browser the user's on or the device that they're accessing the site from. It could be things like server response time or page download time, or domain lookup time, or redirection time. What we could then do is correlate each of those sources of data against page load speed, and then if we have a strong correlation coefficient coming back from that, again we've got the causation to build some meaningful insights off the back of the data.
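As a rough sketch of that last step, the snippet below correlates a few candidate technical factors against page load time and ranks them by strength. Everything here is hypothetical: the factor names, the millisecond figures, and the choice to measure page load as a time in milliseconds are assumptions made for illustration, not real analytics data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )


# Hypothetical measurements for eight page views (milliseconds).
page_load_ms = [1200, 1500, 1100, 2100, 1800, 2500, 1300, 1700]
factors = {
    "server_response_ms": [200, 350, 180, 700, 520, 900, 240, 430],
    "redirection_ms": [50, 40, 60, 55, 45, 50, 65, 35],
    "domain_lookup_ms": [30, 45, 20, 60, 50, 80, 35, 40],
}

# Rank the factors by the absolute strength of their correlation.
ranked = sorted(
    ((name, pearson(values, page_load_ms)) for name, values in factors.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for name, r in ranked:
    print(f"{name}: {r:+.2f}")
```

Each of these factors already has a plausible causal story behind it, which is what makes a strong coefficient here worth acting on.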
Key takeaways
Really to bring this all together, here are the key takeaways I'd like to share today. First and most importantly, always have causation front of mind when analysing or correlating any dataset. Before you even put a graph together, before you even think about anything to do with correlation coefficients, think about the variables you're analysing and answer the question: "Is there causation between the two variables?" If you're not sure, a top tip, and something that works well for me, is just to write it down.
For example, if I was looking to correlate server response time data and page load speed data, I'd write something like this down: "An increase in server response time is causing an increase in page load speed". Now that seems completely plausible, so I know from that that when I analyse that data I've got that causation there, and that's going to help deliver good analysis for me. If we go back a few slides, and I replace that by saying: "An increase in the user's age is causing an increase in page load speed," it sounds ridiculous just saying it. At that point I'd question the causation and not even go down the route of producing a chart like you saw earlier.
Once you've got your causation solid and front of mind, the next step is then to see if there's any correlation in your data. Use the CORREL formula, as we saw earlier, in Excel or Google Sheets, to statistically analyse the strength of correlation. As a recap, for a positive correlation you're looking for a value between 0.5 and one. The closer to one, the stronger the correlation. For a negative correlation, anything between minus 0.5 and minus one. The closer to minus one, the stronger the negative correlation.
Really to end the talk today, I would say ignore causation at your peril. Why? Because without it, ice cream is actually a highly dangerous food. We saw earlier that there was a strong correlation between ice cream consumption and an increase in shark attacks, but there is also a very, very strong correlation with an increase in violent crime, an increase in cases of polio, and an increase in forest fires.
Thank you very much.