Sunday, December 3, 2017

The bias/variance tradeoff


When building models, you'll typically run into either "overfitting (high variance) or underfitting (high bias)" [1]. The bias/variance tradeoff is the process of balancing the two.

"Variance is how sensitive a prediction is to what training set was used. Ideally, how we choose the training set shouldn’t matter – meaning a lower variance is desired. Bias is the strength of assumptions made about the training dataset. Making too many assumptions might make it hard to generalize, so we prefer low bias as well." [2]

But what are variance and bias?

Random variables are neither random nor variables

First, let's clean up some terminology. From the excellent Count Bayesie: "Random Variables ... are neither random nor variables. The Random Variable is the thing that translates 'H' or 'T' into 1 or 0. But when we hear 'Random Variable' it's very tempting to think 'oh, this must work like a Random Number Generator where each time I look at this variable it has a random value.'" It doesn't: a Random Variable is simply a function that maps each outcome (like 'H' or 'T') to a number.

Great Expectations

Now, "the Expectation of a Random Variable is the sum of its values weighted by their probability... Expectation is just the mean of a Random Variable! But words like mean and average carry a special intuition"

Variance

"If we took a random sample of our data, say, 100 points and generated a linear model, we' d have a set of weights. Say we took another set of random points and generated another linear model. Comparing the amount that the weights change will tell us the variance in our model." [3]

Variance is typically taught in high school as:

σ² = Σ (x - μ)² / n

where μ is the mean and n is the number of samples. In a continuous probability distribution, this is:

σ² = ∫ (x - μ)² f(x) dx

which looks a lot like the definition of an expectation.
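To make the high-school formula concrete (the numbers are arbitrary):

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
    mu = x.mean()

    # The average squared deviation from the mean.
    sigma_squared = np.sum((x - mu) ** 2) / x.size

    print(sigma_squared)   # 4.0
    print(np.var(x))       # the same (np.var divides by n by default)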

[Aside: note that if (x - μ) were raised to the third power, we'd be talking about the skew, and if it were raised to the fourth power, we'd be talking about the kurtosis ("how pointy a distribution is"). These are the (central) moments of a random variable. But that is another story. See here for more.]

Anyway, the variance of a random variable X is:

Var(X) = E[(X - μ)²] = E[(X - E[X])²]

Expanding the square, using the linearity of expectation, and noting that E[X] is just a constant (so the expectation of an expectation is the original expectation):

Var(X) = E[X² - 2X E[X] + E[X]²] = E[X²] - 2E[X]² + E[X]² = E[X²] - E[X]²

Thus the variance of a random variable X is the difference between the expectation of X² and the square of the expectation of X.
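A quick numerical sanity check of that identity (the exponential distribution here is an arbitrary choice; any distribution will do):

    import numpy as np

    x = np.random.default_rng(0).exponential(scale=2.0, size=1_000_000)

    lhs = np.mean((x - x.mean()) ** 2)       # E[(X - E[X])²]
    rhs = np.mean(x ** 2) - x.mean() ** 2    # E[X²] - E[X]²

    print(lhs, rhs)   # the two agree, up to floating-point error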

Note that the Akaike Information Criterion [Wikipedia] can be used in model selection to penalize models that overfit. Note that it only applies to certain types of model (e.g. those fitted by maximum likelihood, such as GLMs).
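As a sketch of the idea (not from any of the books; the helper and the toy data are mine): AIC = 2k - 2 ln L, and for a least-squares fit with Gaussian errors the maximized log-likelihood can be written in terms of the residual sum of squares. Lower AIC is preferred, so the extra parameters of a more flexible model count against it.

    import numpy as np

    def aic_gaussian(y, y_hat, k):
        """AIC = 2k - 2 ln(L) for a least-squares fit with Gaussian errors."""
        n = y.size
        rss = np.sum((y - y_hat) ** 2)
        log_likelihood = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
        return 2 * k - 2 * log_likelihood

    rng = np.random.default_rng(0)
    x = np.linspace(0, 2, 60)
    y = 3 * x + 2 + rng.normal(scale=0.5, size=x.size)

    for deg in (1, 10):
        coeffs = np.polyfit(x, y, deg=deg)
        y_hat = np.polyval(coeffs, x)
        # k counts the fitted parameters: deg + 1 coefficients plus the noise variance.
        print(deg, aic_gaussian(y, y_hat, k=deg + 2))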

Aside: PCA and Variance

In PCA, we're trying to find the vector (line) that minimizes the squared perpendicular distances of all the points from it - the "errors". But minimizing those errors is equivalent to maximizing the spread of the projections of the points onto that line, i.e. the positions on the vector closest to each point (see this StackExchange answer).

That is, the first principal component is the direction of greatest variance across the data points. This fits our intuition that the principal components are the most distinctive elements of the data (in the example given, it is the qualities of the wines - alcohol content, colour, etc. - or, more often, linear combinations of these qualities that distinguish the wines).
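A sketch of that intuition (the two-column "wine-like" data below is made up): the projections onto the first principal component have greater variance than the projections onto any other single direction.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two correlated columns - think alcohol content and colour intensity.
    alcohol = rng.normal(13.0, 1.0, size=500)
    colour = 0.8 * alcohol + rng.normal(0.0, 0.5, size=500)
    X = np.column_stack([alcohol, colour])
    X = X - X.mean(axis=0)   # centre the data

    # First principal component: the eigenvector of the covariance matrix
    # with the largest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    pc1 = eigvecs[:, np.argmax(eigvals)]

    # Variance of the projections onto PC1 vs onto an arbitrary direction.
    print("variance along PC1:   ", np.var(X @ pc1))
    print("variance along [1, 0]:", np.var(X @ np.array([1.0, 0.0])))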

Bias

"Bias measures how far off the predictions are from the correct values in general if we rebuild the model multiple times on different training datasets; bias is the measure of systematic error that is not due to randomness." [1] In other words, E[approximate f(x) - true f(x)].

Why would we introduce bias? Well, one reason is having more features (columns) than samples (rows). For instance, in linear regression we need to invert a matrix built from the inputs (see the last equation in my post on linear regression). This is not possible if that matrix has rank less than its number of columns (viewed as a set of equations, there is no unique solution). So we could reduce the number of columns, making the model less complex and adding bias.
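A small demonstration of that failure mode (the shapes are chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(0)

    # 10 samples (rows) but 20 features (columns).
    X = rng.normal(size=(10, 20))
    gram = X.T @ X   # 20 x 20, but its rank is at most 10

    # Rank 10 < 20, so the matrix can't be inverted and the normal
    # equations have no unique solution.
    print(np.linalg.matrix_rank(gram))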

Another reason is to introduce regularization. "The concept behind regularization is to introduce additional information (bias) to penalize extreme parameter weights." [1]
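A sketch of how an L2 (ridge) penalty plays out in the same situation (λ = 1.0 is an arbitrary choice of mine): adding λI to XᵀX makes it full rank for any λ > 0, so the weights are uniquely determined - and pulled towards zero.

    import numpy as np

    rng = np.random.default_rng(0)

    # Same shape as above: more columns than rows.
    X = rng.normal(size=(10, 20))
    y = rng.normal(size=10)

    lam = 1.0   # regularization strength (a hyperparameter to tune)

    # Closed-form ridge solution: (X^T X + lam * I) w = X^T y.
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(w.shape, np.linalg.norm(w))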

[1] Python Machine Learning
[2] Machine Learning with TensorFlow
[3] Machine Learning in Action
