There are numerous articles all over the internet explaining Bayes’ theorem. It is often described as the basis for logical reasoning, which makes it worth understanding. Yet most of us find mathematical symbols confronting and the examples used in explanations hard to follow. So, this post is my attempt to explain the theorem as intuitively and plainly as possible.

Mathematically, it is defined as:

$$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}$$

where P(X) is the probability of event X occurring, and P(X|Y) is the probability of event X occurring given that event Y has occurred.
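As a quick numerical check, Bayes’ theorem in this X and Y notation can be written as a small Python function. This is only a sketch: the function name `bayes_posterior` and the input numbers are assumptions made up for illustration.

```python
def bayes_posterior(p_y_given_x: float, p_x: float, p_y: float) -> float:
    """Return P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    if p_y <= 0:
        raise ValueError("P(Y) must be positive")
    return p_y_given_x * p_x / p_y


# Hypothetical inputs, chosen only to show the arithmetic.
p_x_given_y = bayes_posterior(p_y_given_x=0.5, p_x=0.2, p_y=0.4)
print(f"P(X|Y) = {p_x_given_y:.2f}")
```

Here knowing that Y happened raises the probability of X from 0.2 to 0.25, because Y is more likely when X is true (0.5) than it is overall (0.4).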

To put things into perspective, let’s discuss two simple examples:

# Example: percentage data

In a fictional suburb called Livingstone, the census says that an average of 70% of the population suffers from a drinking problem. Now imagine you and I are driving around this suburb. I point to a random person walking on the street and ask you how likely it is that this person has a drinking problem. You will very likely say 70%, or a probability of 0.7 (to get from a percentage to a probability, divide by 100).

Well, if the person was a man aged, say, 18 to 30, that would most likely be right and you’d be much more confident in your answer. But what if I told you it was a 5-year-old child instead? Saying that a 5-year-old child has a 70% likelihood of having a drinking problem is obviously absurd. Now we see the problem. Yes, I deliberately gave you incomplete information so that you would trick yourself into making false assumptions when you read the words ‘average’ and ‘person’ in the description above.

So, what should you have done to get ‘closer’ to a correct estimate? For one, do not assume that one statistic (here, a percentage) is enough to tell the full truth. This is why there is a whole branch of mathematics called statistics. Secondly, when someone quotes an average for a dataset, it usually aggregates several different groups, so we must correct it for our target group, which in the latter case was children. This second correction is exactly what Bayes’ theorem does.

# Example: using evidence to support a hypothesis

Bayes’ theorem is usually used for reasoning about how well a hypothesis is supported by evidence. So, let’s now explore what the theorem implies when we use evidence to support a hypothesis.

You have a cat and a dog as pets. When you return home from work, you see that the milk box is toppled and leaking (observation). You know that cats like milk more (prior information, before looking at the evidence). But then you see some milk on the dog’s face (evidence).

Before we continue, let’s evaluate this. Notice that if we don’t consider the evidence, the cat seems like the likely suspect, but when we do, the dog does. The similarity with the previous example should also be evident. There, before we knew that the ‘person’ was a child, we were confident that the answer was 0.7, but once we knew, we doubted that it was the right answer. This is Bayes’ theorem in a nutshell.
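To put rough numbers on the cat and dog story, here is a minimal sketch in Python. All the probabilities are assumptions invented for illustration, and it assumes exactly one of the two pets toppled the milk. The probability of the evidence is obtained by adding up its probability under each suspect (the law of total probability).

```python
# Hypothetical numbers, chosen only to illustrate the reasoning.
# Prior: belief about the culprit before looking at the evidence.
prior = {"cat": 0.7, "dog": 0.3}

# Likelihood: probability of finding milk on the dog's face
# if the given pet toppled the milk box.
likelihood = {"cat": 0.1, "dog": 0.9}

# P(evidence), summed over both suspects (law of total probability).
p_evidence = sum(likelihood[pet] * prior[pet] for pet in prior)

# Bayes' theorem applied to each suspect.
posterior = {pet: likelihood[pet] * prior[pet] / p_evidence for pet in prior}

for pet, p in posterior.items():
    print(f"P({pet} did it | milk on dog's face) = {p:.3f}")
```

With these made-up numbers the prior favours the cat, but after accounting for the evidence the dog becomes the more likely suspect.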

This whole scenario can be shown in a probability diagram as follows:

From the diagram above, we can see that we want to take the part where the evidence agrees with the hypothesis (the intersection, P(Evidence is true ∩ Hypothesis is true)) and move it from the probability space of the evidence being true to the probability space of the hypothesis being true given that the evidence is true.

How? First, note that the intersection has contributions from both P(Evidence is true | Hypothesis is true) and P(Hypothesis is true). What we need is the contribution from P(Evidence is true | Hypothesis is true) alone. To get it, we divide the intersection probability by P(Hypothesis is true):

$$P(\text{Evidence is true} \mid \text{Hypothesis is true}) = \frac{P(\text{Evidence is true} \cap \text{Hypothesis is true})}{P(\text{Hypothesis is true})}$$

Next, we want to rescale this from the probability space of the evidence being true to that of the hypothesis being true, which is what the next section does.
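For readers who prefer to see the algebra first, the rescaling can be written in two lines. Here E and H are shorthand introduced only for this sketch, standing for ‘Evidence is true’ and ‘Hypothesis is true’. The first line rearranges the definition of conditional probability, and the second divides the same intersection by P(E) instead of P(H):

$$
\begin{aligned}
P(E \cap H) &= P(E \mid H)\, P(H) \\
P(H \mid E) &= \frac{P(E \cap H)}{P(E)} = \frac{P(E \mid H)\, P(H)}{P(E)}
\end{aligned}
$$

The last expression is Bayes’ theorem with X replaced by H and Y replaced by E.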

# Rewriting the equation

We can rewrite the original Bayes’ theorem equation as:

$$P(X \mid Y) = \frac{P(Y \mid X)}{P(Y)} * P(X)$$

Or

$$P(\text{Hypothesis} \mid \text{Evidence}) = \frac{P(\text{Evidence} \mid \text{Hypothesis})}{P(\text{Evidence})} * P(\text{Hypothesis})$$

In the rewrite above, separating the terms with ‘*’ (the multiply operation) is important to make a point, as we will see. If you are not familiar with it, multiplication has an associative property, and division is just multiplication by a reciprocal, which means ‘(A*B)/C’ is equal to ‘A*(B/C)’. This property ‘obscures’ the implied meaning when the formula is written in its original form. This is unfortunate because mathematical statements are supposed to express truths simply.
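The claim that ‘(A*B)/C’ equals ‘A*(B/C)’ is easy to check numerically. The sketch below uses hypothetical values for P(X), P(Y|X) and P(Y), and compares the two groupings with `math.isclose`, since floating-point rounding can make the last digit differ.

```python
import math

# Hypothetical values for P(X), P(Y|X) and P(Y).
p_x, p_y_given_x, p_y = 0.3, 0.9, 0.34

original_form = (p_x * p_y_given_x) / p_y   # (A*B)/C
rewritten_form = p_x * (p_y_given_x / p_y)  # A*(B/C)

print(original_form, rewritten_form)
print(math.isclose(original_form, rewritten_form))
```

Both groupings give the same posterior; only the second makes it visible that the prior P(X) is being multiplied by a scaling ratio.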

First, a minor detour to drive the point home. Remember how we calculate a percentage? If you get 7 out of 10 in a test, your percentage is calculated as:

$$\frac{7}{10} * 100 = 70\%$$

Now, notice the similarity with the rewritten equation above. This similarity is not coincidental. When calculating a percentage, we are moving our scale from ‘out of 10’ to ‘out of 100’.

Similarly, in the first example involving percentages, we are moving from the average rate of drinking problems across all age groups to the rate within one specific age group. And in the second example, we are moving from the prior belief that the cat is the culprit, because cats prefer milk, to an updated belief that accounts for the evidence. We moved away from the prior because the evidence against it was strong.

For the first example, if thinking in terms of probability is difficult, the following might make it easier to understand:
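One way to make this concrete is to count people instead of working with probabilities. The sketch below is only an illustration: the 70% figure comes from the example, while the population size, the number of children and the number of children with a drinking problem are hypothetical values I made up.

```python
from fractions import Fraction

# Hypothetical head counts for Livingstone.
population = 1000
drinkers = 700        # 70% of the population, as in the example
children = 100        # assumed number of children
child_drinkers = 1    # assumed number of children with a drinking problem

p_drinker = Fraction(drinkers, population)                  # P(drinker)
p_child = Fraction(children, population)                    # P(child)
p_child_given_drinker = Fraction(child_drinkers, drinkers)  # P(child | drinker)

# Bayes' theorem: P(drinker | child) = P(child | drinker) * P(drinker) / P(child)
via_bayes = p_child_given_drinker * p_drinker / p_child

# The same quantity obtained by counting within the group of children.
via_counting = Fraction(child_drinkers, children)

print(via_bayes, via_counting)
```

Both routes give 1 in 100 for these made-up counts, far below the suburb-wide 70%: the theorem is doing nothing more than rescaling the overall figure to the group we actually care about.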

# Summary

We start with an initial belief, called the prior. As we collect evidence, how likely that evidence is to be true under our belief lets us shift the belief, either reinforcing the prior or moving away from it.