What is Hypothesis Testing and how do we use it?

Actually, you already use it

Suppose you meet someone for the first time, and they tell you that they can run at 10 km per hour (or a little over 6 miles per hour, for those three nations that still use this system). 

    Hint: average top running speed women ~ 10km/hr, and men ~ 12.8km/hr

You might think, okay, that's close to average; that sounds reasonable. 

Now, if they said they can run as fast as 30km/hr, that's a bit of a stretch.

The average professional sprinter can sustain speeds of around 24km/hr, so 30km/hr is improbable, but not impossible, since Usain Bolt topped out at a whopping 44km/hr during his world record run. 

Some people might say, nahhh, that's too far a stretch, but you might choose to believe them for now. 

But what if they said they could reach more than 50km/hr? Is it possible? Sure, they might be a hidden Olympic record-shattering machine, an athletic monster like nothing the world has seen before, but at this point the chances are so slim that you'd reason they're just screwing with you. 

So, as humans, we have a general idea of how to tell the truth apart from lies. You start with some baseline, like knowing the average running speed, and if the claim is within a reasonable range of that average, you say, okay, that's probable, and I'll believe it. If it's wayyy off, like 50km/hr, then you can pretty confidently deduce that they are not telling the truth. 

But the question is, how far off do they need to be from the average for you to no longer believe them? Like in the story, 30km/hr is much faster than the average, but people have achieved speeds above that. So, in that case, should you still believe them?

Let's say, then, that we managed to collect the top running speeds of everyone in the world, and suppose more than 10% of the population can run at or faster than 30km/hr. Then, technically, if you were to randomly pluck a person out of everyone in the world, there would be at least a 10% chance that this person could, in fact, reach speeds of 30km/hr. 

You might say that's quite a small chance. But as statisticians, we like to be really sure before we start challenging someone's statement, and to us, a 10% chance is still decently high. 

So what is considered too small a chance? Collectively, we've decided on 5%. So, if less than 5% of the world can run at or over 30km/hr, then there would be less than a 5% chance that a person picked at random could reach that pace. That, we say, is too improbable for us to believe that the person you're talking to just so happens to be part of that small slice of people with extraordinary running abilities. It's much more likely, with a 95% chance in fact, that the person cannot reach that speed. 

Can we still be wrong? Yes, but the chances are so low that, at this point, we conclude that they're not telling the truth. 
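If you like seeing this as code, here's a minimal sketch of that reasoning. The speeds below are made up (I'm drawing them from a normal distribution with an arbitrary mean and spread), so treat it as an illustration of the 5% cutoff, not real data.

import numpy as np

# Made-up "population" of top running speeds (km/hr); the mean and spread
# are arbitrary choices for illustration, not real measurements.
rng = np.random.default_rng(0)
speeds = rng.normal(loc=11.4, scale=4.0, size=1_000_000)

claimed_speed = 30  # km/hr

# Fraction of this population that can run at or above the claimed speed
fraction = np.mean(speeds >= claimed_speed)

print(f"Fraction at or above {claimed_speed} km/hr: {fraction:.4%}")
# If fewer than 5% of people can do it, we stop believing the claim
print("Believe them?", fraction >= 0.05)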


Let me tell you about Benjamin

When I was little, I had a friend named Benjamin. I'd go over to his house every day during summer vacation to play his NES. Best time of my life. But every day, after about an hour of gaming, we'd get thirsty, so one of us had to run to a convenience store about 10 minutes away. With the scorching sun outside, none of us were eager to leave the nicely air-conditioned room. So, to decide who would make the run, Benjamin proposed we flip a coin. If Heads it's me; tails it's him. 

Heads. Aww shoot. Alright. So I ran and got us two sodas. 
Next day. 
Heads. Just my luck. I went to get some sodas.
Next day. Heads. Fine. Got some sodas.

Over the span of 10 days, this coin landed on heads nine of the ten times. At this point, I thought, bro, this can't be a fair coin, right? Gotta be a trick coin. Now, without an understanding of the materials used to make the coin, or a way to check whether the coin is weighted in a certain way, how could I possibly prove that the coin was not fair? Perhaps, I thought, it is fair, and I was just unlucky. 

But what I could do was toss a coin 10 times myself and see how probable 9 out of 10 heads really is. 
This is the same idea as getting data on everyone's running speeds and seeing how likely or unlikely a 30km/hr speed is. 

So, what I did was toss a fair coin 10 times, see that heads appeared 6 times, and write that down. 
Then I repeated that for 5 more runs (another 50 tosses), where in each run I would toss the coin 10 times and write down how many times heads appeared. This left me with these results: 6, 5, 6, 4, 3, and 7 heads. 
Then I realized, wait, why do this physically when I could just simulate coin tosses? 

So I did, 10,000 times, with the following snippet, and threw the results into a histogram. 

*Don't worry about the code; it's not core to the story. It's only here if you want to replicate the simulation.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Simulate 10,000 runs of 10 coin tosses each (True = heads)
runs = np.random.rand(10000, 10) > 0.5
heads = np.sum(runs, axis=1)  # number of heads in each run

# Histogram with one bar per possible head count (0 through 10)
ax = sns.histplot(data=heads, discrete=True)

# Label the height (count) of each bar
for patch in ax.patches:
    ax.annotate(str(int(patch.get_height())),
                xy=(patch.get_x() + patch.get_width() / 2,
                    patch.get_height()),
                ha='center',
                va='bottom')

ax.set_xlabel('Number of heads out of 10 tosses')
ax.set_ylabel('Number of runs')
plt.show()


This is what the results look like. 
So, out of our 10,000 runs of 10 tosses, 5 heads was the most frequent outcome, appearing 2525 times. And we got exactly 9 heads a total of 98 times. So, a probability of 98 out of 10,000, or 0.0098. Only 0.98% of the time would this situation ever happen. 
You know what, statisticians are nice, so we'll give our friend some leniency and count not just the exact number of times we get 9 heads, but the times we get 9 heads or more, for a total of 98 + 7 = 105 times,
which results in 105/10,000 = 1.05% of the time. That's only about 1% of the time! Meaning, if it were a fair coin, only about 1% of the time would nine or more heads appear out of ten tosses. That situation is so improbable that it's much more likely that the coin was unfair. 
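If you ran the snippet above, that count-and-divide step is one line on top of the heads array (the exact number will wiggle a bit from run to run, since it's a simulation):

# Fraction of simulated runs with 9 or more heads out of 10 tosses
p_value = np.mean(heads >= 9)
print(f"Chance of 9+ heads with a fair coin: {p_value:.4f}")  # roughly 0.01

# Compare against our 5% cutoff
print("Fair coin still believable?", p_value >= 0.05)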
 
Just like how we gathered data on people's running speeds, we gathered data on how the coin should land. Specifically, we simulated 10,000 rounds of 10 coin flips (like asking 10,000 people for their running speeds) and found that only 105 of those rounds gave nine or more heads, thus only ~1% of the time. 

So, with that baseline in mind, and keeping in mind that we like to say events that happen with less than a 5% chance are probably not normal, then with only about a 1% chance that you could possibly get nine or more heads out of ten tosses, I conclude that my friend Benjamin was not telling the truth when he said the coin was fair (lands on both sides with equal chance). 

To summarize: we started by giving him the benefit of the doubt (trusting that his coin was fair), then we ran lots of trials and put the results in a cool graph that shows how often each outcome occurs (called a distribution). We then counted the number of trials in which this situation or something more extreme occurred (getting 9 heads or more out of 10 tosses), divided that by the total number of trials to get a probability, and showed that it was so rare (about a 1% chance, well below our 5% threshold) that I refuse to believe the coin was fair to begin with. 
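If it helps, here's that whole recipe squeezed into one small function. It's just my sketch of the simulation approach from the story; the function name and defaults are mine, and you could swap in a different number of runs or a different cutoff and nothing would change.

import numpy as np

def simulated_p_value(observed_heads, n_tosses=10, n_runs=10_000, seed=None):
    """Estimate how often a fair coin shows observed_heads or more heads
    in n_tosses tosses, by simulating n_runs runs."""
    rng = np.random.default_rng(seed)
    runs = rng.random((n_runs, n_tosses)) > 0.5   # True = heads
    heads_per_run = runs.sum(axis=1)
    return np.mean(heads_per_run >= observed_heads)

p = simulated_p_value(observed_heads=9, seed=42)
print(f"Estimated chance of 9+ heads from a fair coin: {p:.4f}")
print("Verdict:", "not buying it" if p < 0.05 else "benefit of the doubt")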


And that, is hypothesis testing.


Note from Author

Some of the math-savvy folks in the audience might have realized that we could treat 10 flips as a sequence, e.g. [head, tail, head, head, tail, tail, head, ...] and so on for 10 elements. Then, since each sequence is equally likely to happen, to get the probability of 9 heads, we just need to count the number of sequences that contain 9 heads and divide that by the total number of sequences.

\[ P(\text{exactly 9 heads}) = \frac{\text{sequences with 9 heads}}{\text{total number of sequences}} = \frac{\binom{10}{9}}{2^{10}} = 0.00977 \]

which is equivalent to our simulation result of 0.0098 from before, saving us the need to simulate. 
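And if you'd rather let Python do that arithmetic for you, here's a quick sanity check using the standard library's math.comb; it computes nothing beyond the formula above.

from math import comb

total = 2 ** 10                     # all equally likely sequences of 10 tosses
exactly_9 = comb(10, 9) / total     # 10 / 1024 ≈ 0.0098
nine_or_more = (comb(10, 9) + comb(10, 10)) / total   # 11 / 1024 ≈ 0.0107

print(exactly_9, nine_or_more)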

To that I say, good job, now get the fuck out of here. This post isn't for you. 

