from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
def plot_beta(a, b):
    x = np.linspace(0, 1, num=500)
    px = stats.beta.pdf(x, a, b)
    fig, ax = plt.subplots()
    sns.lineplot(x=x, y=px)
    ax.set_xlabel(r'$\theta$')
    plt.show()
5 Hierarchical generalization
In previous examples, there was always a finite number of hypotheses that we were making inferences about (number of black balls, fair or trick coin, yellow or green taxi). Sometimes, we want to consider an infinite set of hypotheses. For example, after flipping a coin, what is the probability of that coin coming up heads? The answer to this question could be any number in the interval [0,1].
5.1 The Beta-Binomial model 🪙
We can answer this question with a model called the Beta-Binomial model, named for the probability distributions it uses. First, let’s set up the basic assumptions of the model.
Let $\theta$ be the probability that the coin comes up heads. This is the hypothesis we are making inferences about, and it can take any value in the interval $[0, 1]$.
The data $d$ are the outcomes of $N$ coin flips, $k$ of which come up heads.
The notation for the likelihood is $d \mid \theta \sim \text{Binomial}(N, \theta)$, which means that $P(d \mid \theta) = \binom{N}{k}\theta^{k}(1-\theta)^{N-k}$.
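To make the likelihood concrete, here is a quick sketch (the particular counts and biases are just for illustration) of how probable a specific outcome is under a few different values of $\theta$:

# Probability of observing k = 7 heads in N = 10 flips under several possible biases
for theta in [0.3, 0.5, 0.7]:
    print(theta, stats.binom.pmf(7, 10, theta))  # the data are most probable when theta is near 0.7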
We can define the prior, $P(\theta)$, in any way we like. For example, we could assume that every value of $\theta$ is equally likely.
Alternatively, a convenient choice (for reasons explained below) for the prior is the Beta distribution: $\theta \sim \text{Beta}(a, b)$.
The Beta distribution has two parameters: $a$ and $b$.
The plot_beta function defined above takes two arguments: a ($a$) and b ($b$), and plots the density of the corresponding Beta distribution over $\theta$.
Let’s see what it looks like with a few different values.
plot_beta(1, 1)
When $a = 1$ and $b = 1$, the distribution is uniform: every value of $\theta$ is equally likely.
plot_beta(3, 3)
When $a = 3$ and $b = 3$, the distribution peaks at $\theta = 0.5$, reflecting a mild belief that the coin is fair. Increasing both parameters makes the peak sharper:
plot_beta(50, 50)
What about when $a$ and $b$ are not equal?
plot_beta(4, 2)
This allows us to capture skewed priors, perhaps capturing a belief that the coin has a specific bias.
Now, what if $a$ and $b$ are both less than 1?
plot_beta(0.5, 0.5)
This might capture the belief that the coin is strongly biased, but we aren’t sure in which direction.
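One way to summarize these shapes (a small sketch using the same scipy distributions) is with the mean and standard deviation of each Beta distribution: the mean is $a/(a+b)$, and larger values of $a + b$ make the distribution more concentrated.

# Mean and standard deviation of the Beta distributions plotted above
for a, b in [(1, 1), (3, 3), (50, 50), (4, 2), (0.5, 0.5)]:
    mean, sd = stats.beta.mean(a, b), stats.beta.std(a, b)
    print(f"Beta({a}, {b}): mean = {mean:.2f}, sd = {sd:.2f}")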
5.1.1 Conjugate distributions
The Beta distribution is the conjugate distribution for the Binomial distribution. This means that when the likelihood is a Binomial distribution and the prior is a Beta distribution, then the posterior is also a Beta distribution. Specifically, after making these assumptions,
$$\theta \mid d \sim \text{Beta}(a + k,\; b + (N - k)).$$
The parameters of the posterior distribution are (1) the sum of the prior parameter $a$ and the number of heads $k$, and (2) the sum of the prior parameter $b$ and the number of tails $N - k$.
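We can check this numerically (a rough sketch; the grid approximation and the particular counts are just for illustration): multiplying a Beta prior by a Binomial likelihood and normalizing gives the same curve as the Beta posterior from the update rule.

# Numerical check of conjugacy: prior Beta(3, 3), data k = 7 heads in N = 10 flips
a, b, k, N = 3, 3, 7, 10
theta = np.linspace(0.001, 0.999, 999)

prior = stats.beta.pdf(theta, a, b)
likelihood = stats.binom.pmf(k, N, theta)
posterior_grid = prior * likelihood
posterior_grid /= np.trapz(posterior_grid, theta)   # normalize numerically

posterior_conjugate = stats.beta.pdf(theta, a + k, b + (N - k))
print(np.max(np.abs(posterior_grid - posterior_conjugate)))  # should be very small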
5.1.2 Parameter estimation
Because we used a conjugate distribution, we can use our same plot_beta
function to generate posterior probability distributions after some coin flips.
Suppose you start with a fairly strong belief that a coin is fair, represented by this distribution:
plot_beta(30, 30)
Now, suppose you flip a coin 20 times and it comes up heads every time. What should you think about the bias of the coin now? According to our model:
plot_beta(30 + 20, 30)
As you can see, this should cause you to shift your beliefs somewhat.
This wasn’t totally realistic, though. If you picked a coin off the ground, your prior beliefs about it being biased would probably look more like this:
plot_beta(2000, 2000)
What happens if you now flip this coin 20 times and it comes up heads every time?
plot_beta(2000 + 20, 2000)
You might be mildly surprised, but those 20 flips wouldn’t be enough to budge your estimate about the bias of the coin by much.
Finally, let’s imagine a situation in which you had a weak prior belief that a coin was biased:
plot_beta(5, 1)
Now you flip the coin 100 times and it comes up heads 48 times. What should your updated beliefs be?
plot_beta(5 + 48, 1 + 52)
As you can see, the posterior distribution shows that you should think this coin is probably fair now. This illustrates how sufficient evidence can override prior beliefs.
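To put a number on "probably fair" (a quick sketch; the 95% interval is just one possible summary), we can look at the posterior mean and a central 95% credible interval:

# Posterior after a Beta(5, 1) prior and 48 heads / 52 tails
a_post, b_post = 5 + 48, 1 + 52
print(stats.beta.mean(a_post, b_post))            # posterior mean: 0.5
print(stats.beta.interval(0.95, a_post, b_post))  # central 95% credible interval, spanning 0.5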
5.1.3 Hypothesis averaging
In Chapter 3, we solved the generalization problem by summing over all hypotheses, weighted by their posterior probabilities. Here, we can do something similar.
Suppose we want to know the probability of the next flip coming up heads. In other words, we want to know $P(\text{heads} \mid d)$. Just as before, we average over every hypothesis $\theta$, weighting each by its posterior probability. Because the posterior is $\text{Beta}(a + k, b + (N - k))$, this average has a simple closed form:
$$P(\text{heads} \mid d) = \int_0^1 \theta \, P(\theta \mid d)\, d\theta = \frac{a + k}{a + b + N}.$$
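Here is a small sketch (the prior parameters and counts are arbitrary) showing that averaging $\theta$ over the posterior gives the same answer as the closed-form expression:

# Posterior predictive probability of heads after k = 7 heads in N = 10 flips,
# starting from a Beta(3, 3) prior
a, b, k, N = 3, 3, 7, 10
theta = np.linspace(0.001, 0.999, 999)
posterior = stats.beta.pdf(theta, a + k, b + (N - k))

# Average over all hypotheses theta, weighted by their posterior probability
p_heads_numeric = np.trapz(theta * posterior, theta)
p_heads_closed = (a + k) / (a + b + N)
print(p_heads_numeric, p_heads_closed)  # both about 0.625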
5.2 Overhypotheses 🙆
Now consider a slightly different situation. You flip 19 different coins in a row, each one time, and they all come up heads. Now you pick up a 20th coin from the same bag as the previous 19 coins. What do you think is the probability of that 20th coin coming up heads? Is it higher than 0.5?
If you answered yes, it’s probably because you formed an overhypothesis about the bias of the coins. After flipping all those coins, you may have concluded that this particular set of coins is more likely than usual to be biased. As a result, your estimate about the probability of the 20th coin coming up heads was higher than it otherwise would be.
5.2.1 The shape bias
This coin example is pretty artificial, but the concept of overhypotheses is one that you find in language learning. A phenomenon known as the shape bias refers to the fact that even young children are more likely to generalize a new word based on its shape rather than other properties like color or texture.
This makes sense because objects tend to have common shapes and are less likely to have common colors or textures.
5.2.2 Modeling the learning of overhypotheses through hierarchical Bayesian learning
Charles Kemp, Andy Perfors, and Josh Tenenbaum developed a model of this kind of learning. They focused on bags of black and white marbles rather than flipping coins. They imagine a problem in which you have many bags of marbles that you draw from. After drawing from many bags, you draw a single marble from a new bag and make a prediction about the proportion of black and white marbles in that bag.
The details of the model are outside the scope of this book. But the basic idea is that the model learns at two levels simultaneously. At the higher level, the model learns the parameters $\alpha$ and $\beta$, which describe the bags in general: $\beta$ captures the overall proportion of black marbles across bags, and $\alpha$ captures how uniform individual bags tend to be (whether each bag is mostly one color or thoroughly mixed).
At the lower level, the model learns the specific distribution of marbles within a bag. If you draw 20 marbles and 5 of them are black, you may have some uncertainty about the overall proportion in the bag, but your best estimate will be around 5/20 (or 1/4).
Where the model excels is being able to draw inferences across bags. If you see many bags that are full of only black or only white marbles, and then you draw a single black marble out of a new bag, you are likely to be very confident that the rest of the marbles in that bag are black.
But if you see many bags that have mixed proportions of black and white marbles, and then you draw a single black marble out of a new bag, you will be far less confident about the proportion of black marbles in that bag. A model that doesn’t make inferences at multiple levels would struggle to draw this distinction.
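To give a flavor of this, here is a rough grid-approximation sketch. It is not Kemp, Perfors, and Tenenbaum's actual implementation, and the hyperparameter grids, bag counts, and function name below are all illustrative assumptions. The idea is to place a prior over the bag-level parameters $\alpha$ and $\beta$, score them against the observed bags, and then predict the contents of a new bag after drawing a single black marble from it.

# A rough hierarchical sketch: each bag's proportion of black marbles is
# theta_i ~ Beta(alpha * beta, alpha * (1 - beta)), where beta is the overall
# proportion of black marbles and alpha controls how uniform bags tend to be
# (small alpha -> bags are mostly all-black or all-white).

def predict_new_bag(bag_counts, n_per_bag):
    """P(next marble is black | one black marble from a new bag, earlier bags)."""
    alphas = np.logspace(-1, 2, 60)              # illustrative grid over hyperparameters
    betas = np.linspace(0.05, 0.95, 19)
    A, B = np.meshgrid(alphas, betas)

    # Likelihood of the earlier bags under each (alpha, beta), integrating out
    # each bag's own proportion (a Beta-Binomial likelihood per bag)
    log_lik = np.zeros_like(A)
    for k in bag_counts:
        log_lik += stats.betabinom.logpmf(k, n_per_bag, A * B, A * (1 - B))
    post = np.exp(log_lik - log_lik.max())
    post /= post.sum()                           # posterior over (alpha, beta)

    # Reweight by the probability that the first marble from the new bag is black
    # (which is beta), then average the new bag's predicted proportion
    post_given_black = post * B
    post_given_black /= post_given_black.sum()
    predicted_theta = (A * B + 1) / (A + 1)      # E[theta_new | one black draw]
    return np.sum(post_given_black * predicted_theta)

# Ten earlier bags of 20 marbles each: all-or-none vs. thoroughly mixed
print(predict_new_bag([20, 0, 20, 0, 20, 0, 20, 20, 0, 0], 20))    # high, well above 0.5
print(predict_new_bag([10, 9, 11, 10, 12, 8, 10, 11, 9, 10], 20))  # close to 0.5

Even in this simplified form, the two-level structure is what does the work: the same single black marble leads to very different predictions depending on what the earlier bags suggested about bags in general.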