Analytics
Hi, everybody. Today I'm going to be showing you that correlation does not imply causation. This whole area really starts with us as digital marketers needing to embrace correlation and causation.
Why marketers need to understand correlation and causation
Now why is this? Well firstly we have a wealth of data across all of the digital marketing disciplines that we operate in. And, as such, we are all required to analyse data on a regular basis.
Correlation and causation are real centrepieces now in terms of analysing data, so neglecting them ultimately leads to wrong insight and wrong decisions being made.
Three main types of correlation
Let's start with correlation. In effect, there are three main types of correlation to look for when analysing data.
Positive correlation
We have positive correlation. So, let's say we have two variables, X and Y. As X increases Y increases, and you'll get a scatter graph a little bit like this. We then put something like that in, a line of best fit, to understand how strong your correlation is.
Negative correlation
We then have a negative correlation, so in this instance, we've got the value of X increasing, but the value of Y decreasing. Again, to understand how strong that correlation is, we put a line of best fit in across the graph like that.
No correlation
Finally, there will be datasets that we analyse that just have no relationship or no correlation whatsoever. You'll have a true scatter graph like this where you can't put a line of best fit in at all.
Pearson correlation coefficient
If you can imagine, every time we're looking at data and trying to analyse data for correlation, it would actually take quite a long time for us to get the data, plot a scatter graph, and draw a line of best fit. And then when you do that, there's nothing really scientific in saying: "Well, how strong is that correlation?"
This is where something called the "Pearson correlation coefficient" comes in. We can use the Pearson correlation coefficient to statistically measure correlation strength. The great thing with this is, it's really easy to do in Excel and in Google Sheets as well.
Using the Pearson correlation coefficient in Excel and Google Sheets
In terms of what you need to do in Excel or in Google Sheets, you type the formula =CORREL, open up your bracket, select your first dataset, and then a comma, and then your second dataset, close the bracket, and that will return you a figure between minus one and plus one. That figure will tell you how strong the correlation is.
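For anyone working outside a spreadsheet, the calculation behind the formula can be sketched in a few lines of Python. This is a minimal illustration and not something shown in the talk: the function name `pearson` is a label chosen here, and the code simply applies the standard textbook definition (the covariance of the two datasets divided by the product of their spreads).

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length datasets."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length datasets with at least 2 values")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # How the two datasets move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each dataset moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)

    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a dataset never changes")

    return co_movement / math.sqrt(spread_x * spread_y)
```

Calling `pearson(first_dataset, second_dataset)` plays the same role as the spreadsheet formula: it takes the two datasets and gives back a single figure between minus one and plus one.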
If you're looking at a positive correlation, you want to see the Pearson coefficient coming between 0.5 and one. The closer the number is to one, the stronger the positive correlation.
For negative correlation, the Pearson coefficient will come between minus 0.5 and minus one. The closer the figure is to minus one, the stronger the negative correlation.
Any values between minus 0.5 and plus 0.5, getting closer and closer to zero, indicate a weak correlation or, indeed, no correlation like we saw earlier.
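Those bands can be written down as a small helper, sketched here in Python. The function name `describe_correlation` and the wording of the labels are assumptions made for illustration; the 0.5 cut-offs are the rule of thumb used in this talk rather than a universal statistical standard.

```python
def describe_correlation(r):
    """Label a Pearson coefficient using the talk's 0.5 rule of thumb."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a Pearson coefficient must lie between -1 and +1")
    if r >= 0.5:
        return "positive correlation"
    if r <= -0.5:
        return "negative correlation"
    # Anything strictly between -0.5 and +0.5 is treated as weak or absent.
    return "weak or no correlation"
```

The closer the figure is to plus or minus one, the stronger the relationship within the band, so it is still worth reporting the coefficient itself alongside the label.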
When correlation doesn't imply causation
Let's apply the principles of correlation and the Pearson correlation coefficient to a real-life example. Here's an interesting dataset. We've got two variables trending over time, and we can see from the initial first graph that the trends are pretty similar.
We ran the Pearson correlation coefficient across this dataset and it returned a figure of 0.94. Now, this implies that there is a really strong positive correlation between these two datasets and that there seems to be some kind of relationship between the data. Now, importantly at the moment, what's missing is context. We don't know what these two pieces of data are. If we have a look, we can see that the orange line trending is the number of shark attacks per month, and the blue line is the amount of ice cream consumed, in pounds.
Let's take this situation. If we were only to consider the correlation, and remember we have the Pearson coefficient of 0.94, which is really, really strong, and omit any form of causation, it's going to lead to wrong decisions being taken and wrong insight. For example, "an increase in ice cream consumption causes an increase in shark attacks". Or how about, "an increase in shark attacks is causing people to consume more ice cream"?
Now, clearly this isn't the case, and this is a really strong example of the principle that correlation does not imply causation. The correlation was very, very strong, but ultimately there is no direct causation between shark attacks and ice cream consumption, and ignoring that is what led us to that wrong insight and wrong decision.
Indirect causation
In this example, however, there is indirect causation, and that's coming from the sun. Not literally. It's to do with the hours of sunshine. Let's take the data of hours of sunshine per month and plot that against the number of shark attacks, and also plot that against the ice cream consumption. Now you have causation, and you can actually draw a more meaningful conclusion from the data.
For example, "an increase in sunshine hours over the summer months causes more people to swim in the sea and increases the likelihood of shark attacks". Much more sensible, I'm sure you'll agree. Or how about, "an increase in sunshine hours over the summer months is causing an increase in ice cream consumption"? Again, that makes much more sense. It's really, really important to ensure you've got that causation behind the data for it to make sense and to drive correct assumptions and correct conclusions from your data.
An example for marketers
Let's bring these concepts of correlation and causation back to the world of digital marketing. One of the biggest tips I can share here is to always consider causation before any analysis is done. Now this is a particularly bad example, so don't do it this way. In terms of the data that we've got here, we ignored that point, and we've got data looking at the age ranges going across the X axis, and we've got page load speed going across the Y axis. We can see here in terms of correlation, as users get older, the page load speed has decreased.
Of course, if we were to take this data at face value and omit causation, we would come to that conclusion and we'd say: "Right, let's increase traffic, get an older demographic to the site, and it's going to cause a reduction in page load speed".
Again, another example of where we haven't thought about the causation, we've put two variables together and we could potentially come up with something like that, which we know clearly isn't true. What the reality tells us in this example is that page load speed changes are caused by technical factors, and there's no direct causation between age and page load speed.
In this example, we'd say: "Okay, right. Page load speed, what actually causes a change in page load speed?" There's a number of different things, but here are some examples. It could be the browser the user's on or the device that they're accessing the site from. It could be things like server response time or page download time, or domain lookup time, or redirection time. What we could then do is correlate each of those sources of data against page load speed, and then if we have a strong correlation coefficient coming back from that, again we've got the causation to build some meaningful insights off the back of the data.
Key takeaways
Really to bring this all together, here are the key takeaways I'd like to share today. First and most importantly, always have causation front of mind when analysing or correlating any dataset. Before you even put a graph together, before you even think about anything to do with correlation coefficients, think about the variables you're analysing and answer the question: "Is there causation between the two variables?" If you're not sure, a top tip and something that works well for me is just to write it down.
For example, if I was looking to correlate server response time data and page load speed data, I'd write something like this down: "An increase in server response time is causing an increase in page load speed". Now that seems completely plausible, so I know from that that when I analyse that data I've got that causation there, and that's going to help deliver good analysis for me. If we go back a few slides, and I replace that by saying: "An increase in the user's age is causing an increase in page load speed," that just sounds ridiculous saying it. At that point I'd question the causation and not even go down the route of producing a chart like you saw earlier.
Once you've got your causation solid and front of mind, the next step is then to see if there's any correlation in your data. Use the CORREL formula, as we saw earlier, in Excel or Google Sheets, to statistically analyse the strength of correlation. As a recap, for a positive correlation you're looking for a value between 0.5 and one. The closer to one, the stronger the correlation. For a negative correlation, anything between minus 0.5 and minus one. The closer to minus one, the stronger the negative correlation.
Really to end the talk today, I would say ignore causation at your peril. Why? Because without it, ice cream is actually a highly dangerous food. We saw earlier that there was a strong correlation between ice cream consumption and an increase in shark attacks, but there is also a very, very strong correlation with an increase in violent crime, an increase in cases of polio, and an increase in forest fires.
Thank you very much.