Correlation does not imply causation

Hi, everybody. Today I’m going to be showing you that correlation does not imply causation. This whole area really starts with us as digital marketers needing to embrace correlation and causation.

Why marketers need to understand correlation and causation

Now why is this? Well firstly we have a wealth of data across all of the digital marketing disciplines that we operate in. And, as such, we are all required to analyse data on a regular basis.

Correlation and causation are real centrepieces now in terms of analysing data, so neglecting them ultimately leads to wrong insight and wrong decisions being made.

Three main types of correlation

Let’s start with correlation. In effect, there are three main types of correlation to look for when analysing data.

Positive correlation

We have positive correlation. So, let’s say we have two variables, X and Y. As X increases Y increases, and you’ll get a scatter graph a little bit like this. We then put something like that in – a line of best fit – to understand how strong your correlation is.

Negative correlation

We then have a negative correlation, so in this instance, we’ve got the value of X increasing, but the value of Y decreasing. Again, to understand how strong that correlation is, we put a line of best fit in across the graph like that.

No correlation

Finally, there will be datasets that we analyse that just have no relationship or no correlation whatsoever. You’ll have a true scatter graph like this, where you can’t put a line of best fit in at all.
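
The line of best fit mentioned for the positive and negative cases can be computed rather than drawn by eye. The following is a minimal sketch using ordinary least squares in plain Python; the function name `line_of_best_fit` and the sample points are illustrative assumptions, not data from the talk. The sign of the slope shows the direction of the relationship, while how tightly the points cluster around the line reflects its strength.

```python
def line_of_best_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean; zero means every x is the same value.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so no line can be fitted")
    # How x and y move together around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative points (not from the talk).
xs = [1, 2, 3, 4]
rising = [3, 5, 7, 9]    # Y goes up as X goes up
falling = [9, 7, 5, 3]   # Y goes down as X goes up

up_slope, up_intercept = line_of_best_fit(xs, rising)
down_slope, down_intercept = line_of_best_fit(xs, falling)
print("rising slope:", up_slope)
print("falling slope:", down_slope)
```

A positive slope corresponds to the positive correlation case and a negative slope to the negative correlation case.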

Pearson correlation coefficient

If you can imagine, every time we’re looking at data and trying to analyse data for correlation, it would actually take quite a long time for us to get the data, plot a scatter graph, and draw a line of best fit. And then when you do that, there’s nothing really scientific in saying: “Well, how strong is that correlation?”

This is where something called the ‘Pearson correlation coefficient’ comes in. We can use the Pearson correlation coefficient to statistically measure correlation strength. The great thing with this is, it’s really easy to do in Excel and in Google Sheets as well.
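
For reference, the textbook definition of the Pearson coefficient for n paired observations is given below. The notation (x_i and y_i for the paired values, x-bar and y-bar for their means, r for the coefficient) is standard but is not used in the talk itself.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The result always lies between minus one and plus one, and it measures how closely the two variables follow a straight-line relationship.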

Using the Pearson correlation coefficient in Excel and Google Sheets

In terms of what you need to do in Excel or in Google Sheets, you type the formula =CORREL, open up your bracket, select your first dataset, and then a comma, and then your second dataset, close the bracket, and that will return you a figure between minus one and plus one. That figure will tell you how strong the correlation is.
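
Outside a spreadsheet, the same figure can be reproduced in a few lines. This is a sketch of the standard Pearson calculation in plain Python; the `pearson` helper and the sample numbers are assumptions for illustration. For the same two ranges it should agree with what `=CORREL` returns.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative data (not from the talk).
sessions = [10, 20, 30, 40, 50]
conversions = [1, 2, 3, 4, 5]     # rises in step with sessions
bounce_rate = [5, 4, 3, 2, 1]     # falls in step with sessions

r_up = pearson(sessions, conversions)
r_down = pearson(sessions, bounce_rate)
print(r_up, r_down)
```

Because both made-up series move in perfect step with `sessions`, the two results sit at the extremes of the range: plus one and minus one.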

If you’re looking at a positive correlation, you want to see the Pearson coefficient coming between 0.5 and one. The closer the number is to one, the stronger the positive correlation.

For negative correlation, the Pearson coefficient will come between minus 0.5 and minus one. The closer the figure is to minus one, the stronger the negative correlation.

Any values between plus and minus 0.5, getting closer and closer to zero, indicate a weak correlation or, indeed, no correlation like we saw earlier.
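
The reading guide above can be written down as a small helper. The 0.5 cut-off is the rule of thumb used in this talk rather than a universal statistical standard, and the function name `describe_correlation` is hypothetical.

```python
def describe_correlation(r, threshold=0.5):
    """Label a Pearson coefficient using the talk's 0.5 rule of thumb."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a Pearson coefficient must lie between -1 and +1")
    if r >= threshold:
        return "positive correlation"
    if r <= -threshold:
        return "negative correlation"
    return "weak or no correlation"


for value in (0.94, -0.7, 0.1):
    print(value, "->", describe_correlation(value))
```

The `threshold` parameter is exposed so that a stricter or looser cut-off can be tried without rewriting the function.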

When correlation doesn’t imply causation

Let’s apply the principles of correlation and the Pearson’s correlation coefficient to a real-life example. Here’s an interesting dataset. We’ve got two variables trending over time, and we can see from the initial first graph that the trends are pretty similar.

We ran the Pearson correlation coefficient across this dataset and it returned a figure of 0.94. Now, this implies that there is a really strong positive correlation between these two datasets and that there seems to be some kind of relationship between the data. Importantly, what’s missing at the moment is context. We don’t know what these two pieces of data are. If we have a look, we can see that the orange line is the number of shark attacks per month, and the blue line is the amount of ice cream consumed, in pounds.

Let’s take this situation. If we were only to consider the correlation, and remember we have the Pearson’s coefficient of 0.94, which is really, really strong, and omit any form of causation, it’s going to lead to wrong decisions being taken and wrong insight. For example, “an increase in ice cream consumption causes an increase in shark attacks”. Or how about, “an increase in shark attacks is causing people to consume more ice cream”?

Now, clearly this isn’t the case, and this is a really strong example of the principle, correlation does not imply causation. The correlation was very, very strong, but ultimately there is no direct causation between shark attacks and ice cream consumption, which led us to that wrong insight and wrong decision.

Indirect causation

In this example, however, there is indirect causation – that’s coming from the sun. Not literally. It’s to do with the hours of sunshine. Let’s take the data of hours of sunshine per month and plot that against the number of shark attacks, and also plot that against the ice cream consumption. Now you have causation, and you can actually draw a more meaningful conclusion from the data.

For example, “an increase in sunshine hours over the summer months causes more people to swim in the sea and increase the likelihood of shark attacks”. Much more sensible, I’m sure you’ll agree. Or how about, “an increase in sunshine hours over the summer months is causing an increase in ice cream consumption”? Again, that makes much more sense. It’s really, really important to ensure you’ve got that causation behind the data for it to make sense and to drive, kind of, correct assumptions and correct conclusions from your data.
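
The role of sunshine as the hidden driver can be illustrated with a small simulation. The sketch below uses entirely synthetic monthly numbers (none of them come from the talk) in which sunshine hours drive both ice cream consumption and shark attacks, with no direct link between the two. It then computes a partial correlation, a standard statistical technique that is not mentioned in the talk, to show what remains of the ice cream and shark attack relationship once sunshine is accounted for. The helper names `pearson` and `partial_correlation` are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# SYNTHETIC monthly data, invented for illustration only.
sunshine_hours = [50, 70, 110, 150, 200, 240, 260, 240, 200, 140, 80, 60]
# Fixed patterns standing in for everything else that varies month to month.
noise_a = [1, -1] * 6
noise_b = [1, 1, -1, -1] * 3

# Both series depend on sunshine only; neither depends on the other.
ice_cream = [2.0 * s + 30 * a for s, a in zip(sunshine_hours, noise_a)]
shark_attacks = [0.05 * s + 1.5 * b for s, b in zip(sunshine_hours, noise_b)]

r_raw = pearson(ice_cream, shark_attacks)
r_given_sun = partial_correlation(ice_cream, shark_attacks, sunshine_hours)
print(f"ice cream vs shark attacks: {r_raw:.2f}")
print(f"same, controlling for sunshine: {r_given_sun:.2f}")
```

With these synthetic numbers the raw correlation comes out at roughly 0.9, yet it is effectively zero once sunshine is controlled for. The noise patterns were chosen deliberately to make that contrast clean; real data would be messier.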

An example for marketers

Let’s bring these concepts of correlation and causation back to the world of digital marketing. One of the biggest tips I can share here is to always consider causation before any analysis is done. Now, this is a particularly bad example; don’t do it this way. In the data that we’ve got here, we ignored that point. We’ve got age ranges going across the X axis, and we’ve got page load speed on the Y axis. We can see here, in terms of correlation, that as users get older, the page load speed decreases.

Of course, if we were to take this data at face value and omit causation, we would come to that conclusion and we’d say: “Right, let’s increase traffic, get an older demographic to the site, and it’s going to cause a reduction in page load speed”.

Again, this is another example of where we haven’t thought about the causation. We’ve put two variables together and could potentially come up with a conclusion like that, which we know clearly isn’t true. What reality tells us in this example is that page load speed changes are caused by technical factors, and there’s no direct causation between age and page load speed.

In this example, we’d say: “Okay, right. Page load speed, what actually causes a change in page load speed?” There are a number of different things, but here are some examples. It could be the browser the user’s on or the device that they’re accessing the site from. It could be things like server response time, page download time, domain lookup time, or redirection time. What we could then do is correlate each of those sources of data against page load speed, and then, if we have a strong correlation coefficient coming back from that, we’ve got the causation to build some meaningful insights off the back of the data.
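
Correlating each candidate factor against page load in turn can be sketched as a short loop. Everything in the block below is hypothetical: the measurements are invented, the timings are assumed to be in seconds, and the variable names simply mirror the factors listed above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL measurements in seconds, one entry per sampled page view.
page_load_time = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
candidate_factors = {
    "server_response_time": [0.2, 0.35, 0.4, 0.5, 0.65, 0.7],
    "redirection_time": [0.1, 0.1, 0.3, 0.2, 0.3, 0.4],
    "domain_lookup_time": [0.05, 0.04, 0.06, 0.05, 0.04, 0.06],
}

# Rank the factors by the absolute strength of their correlation.
ranking = sorted(
    ((name, pearson(values, page_load_time))
     for name, values in candidate_factors.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for name, r in ranking:
    print(f"{name}: r = {r:.2f}")
```

Sorting by absolute value keeps strong negative relationships near the top as well. A high figure here is only useful because each factor already has a plausible causal link to page load.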

Key takeaways

Really, to bring this all together, here are the key takeaways I’d like to share today. First and most importantly, always have causation front of mind when analysing or correlating any dataset. Before you even put a graph together, before you even think about anything to do with correlation coefficients, think about the variables you’re analysing and answer the question: “Is there causation between the two variables?” If you’re not sure, a top tip, and something that works well for me, is just to write it down.

For example, if I was looking to correlate server response time data and page load speed data, I’d write something like this down: “An increase in server response time is causing an increase in page load speed”. Now that seems completely plausible, so I know that when I analyse that data I’ve got that causation there, and that’s going to help deliver good analysis for me. If we go back a few slides and I replace that by saying: “An increase in the user’s age is causing an increase in page load speed,” that just sounds ridiculous when you say it. At that point I’d question the causation and not even go down the route of producing a chart like you saw earlier.

Once you’ve got your causation sort of solid and front of mind, the next step is then to see if there’s any correlation in your data. Use the CORREL formula, as we saw earlier, in Excel or Google Sheets, to statistically analyse the strength of correlation. As a recap, for a positive correlation you’re looking for a value between 0.5 and one. The closer to one, the stronger the correlation. For a negative correlation, you’re looking for anything between minus 0.5 and minus one. The closer to minus one, the stronger the negative correlation.

Really, to end the talk today, I would say ignore causation at your peril. Why? Because without it, ice cream is actually a highly dangerous food. We saw earlier that there was a strong correlation between ice cream consumption and shark attacks, but there is also a very, very strong correlation with increases in violent crime, cases of polio, and forest fires.

Thank you very much.