Whenever we discuss prediction models, it’s important to understand the two main sources of prediction error: bias and variance. A proper understanding of these concepts helps us not only to build accurate models but also to avoid the mistakes of over-fitting and under-fitting.

We quickly explain the two concepts using the following illustration.

Suppose that a man is trying to hit the bull’s eye. **His shooting skill can be considered the prediction model.** The shooting results are the model’s predictions.

## What is bias?

- Bias shows the difference between the **average prediction and the correct value**.
- If the shots land far from the bull’s eye on average, the bias is high; if they land close to it on average, the bias is low.

#### Some causes of high bias:

- An oversimplified model
- Key features left out of the model
- Not enough data
- The wrong choice of model

## What is variance?

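- Variance shows the spread of the model’s predictions (not of the data itself).
- Put differently, it is the variability of the model’s prediction for a given data point when the model is trained on different samples.
- If the shots are scattered widely, the variance is high; if they are tightly grouped, the variance is low.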
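Both quantities can be computed directly for the shooting illustration. The following is a minimal sketch, assuming a one-dimensional target and two made-up shooters whose offsets and scatter are chosen purely for illustration; the helper name `bias_and_variance` is hypothetical.

```python
import random
import statistics

def bias_and_variance(predictions, target):
    """Return (bias, variance) of repeated predictions of one target value."""
    bias = statistics.fmean(predictions) - target   # average prediction minus truth
    variance = statistics.pvariance(predictions)    # spread around the average prediction
    return bias, variance

rng = random.Random(0)
bulls_eye = 0.0  # horizontal position of the target, arbitrary units

# Two made-up shooters: the numbers are illustrative, not measured.
shooters = {
    "steady but off-centre": [rng.gauss(2.0, 0.1) for _ in range(1000)],
    "centred but shaky": [rng.gauss(0.0, 1.5) for _ in range(1000)],
}

for name, shots in shooters.items():
    bias, variance = bias_and_variance(shots, bulls_eye)
    print(f"{name}: bias={bias:+.2f}, variance={variance:.2f}")
```

The first shooter should show a bias near 2 with very little variance, while the second should show a bias near 0 with a large variance.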

#### Some causes of high variance:

- Noisy training dataset
- Sparse dataset
- An algorithm that fails to generalize, fitting the noise instead of the underlying patterns

## Overfitting and Underfitting

- **Under-fitting**: often high bias + low variance
- **Over-fitting**: often low bias + high variance; good on the training dataset, bad on the testing dataset
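The pairing of bias and variance with under- and over-fitting can be checked empirically by retraining a model on many fresh samples and looking at its predictions for one fixed input. The sketch below is only an illustration under assumed conditions: the data come from a noisy sine curve, the "under-fit" model always predicts the training mean, and the "over-fit" model is a 1-nearest-neighbour lookup. Neither model nor the helper names come from the original note; they were picked because they need no external libraries.

```python
import math
import random
import statistics

def make_data(rng, n, noise=0.3):
    """Sample n noisy points from y = sin(x) on [0, 2*pi] (assumed toy data)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, noise)) for x in xs]

def predict_mean(train, x):
    """Very rigid model: ignores x and returns the mean training target."""
    return statistics.fmean([y for _, y in train])

def predict_nearest(train, x):
    """Very flexible model: copies the target of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def bias_variance_at(predict, x0, rng, repeats=300, n=30):
    """Retrain on `repeats` fresh samples and summarise the predictions at x0."""
    predictions = [predict(make_data(rng, n), x0) for _ in range(repeats)]
    bias = statistics.fmean(predictions) - math.sin(x0)  # true value is sin(x0)
    return bias, statistics.pvariance(predictions)

rng = random.Random(42)
x0 = math.pi / 2  # the true value here is 1.0
for name, predict in [("mean model", predict_mean), ("1-nearest neighbour", predict_nearest)]:
    bias, variance = bias_variance_at(predict, x0, rng)
    print(f"{name}: bias={bias:+.2f}, variance={variance:.3f}")
```

Under these assumptions the mean model should show a large bias with a small variance, and the nearest-neighbour model the reverse.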

### Tung Nguyen

PhD/Researcher/Programmer at Up Education - YooBee Colleges

I received the B.Eng. Degree in telecommunication from Shanghai University, MSc of the same major from Paris-Sud University and a PhD degree in Computer Science from Auckland University of Technology. My research interests include machine learning, game theory, computational trust, multi-agent systems and software engineering.

1. Under-fit (high bias): more training data is unlikely to help, so don’t spend time collecting it.

2. Over-fit (high variance): getting more training data is likely to help.

Choosing a reasonable number of features, polynomial degree, and regularization parameter (lambda) is the key to keeping the balance between over-fitting and under-fitting.

Splitting the data into a training set (60%), a cross-validation set (20%), and a test set (20%) is helpful in choosing the best polynomial degree and regularization parameter.
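A 60/20/20 split of this kind might be used as follows. The sketch assumes synthetic data (a cubic plus noise), a small hand-picked grid of degrees and lambda values, and a plain-Python ridge fit via the normal equations; the helpers `solve`, `fit_poly`, and `mse` are hypothetical names written for this example, and a numerical library would normally replace them.

```python
import random

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_poly(data, degree, lam):
    """Ridge-regularized least-squares fit of a polynomial; returns coefficients, lowest power first."""
    size = degree + 1
    # Normal equations (X^T X + lam * I) w = X^T y; the intercept is left unpenalized.
    a = [[sum(x ** (i + j) for x, _ in data) + (lam if i == j and i > 0 else 0.0)
          for j in range(size)] for i in range(size)]
    b = [sum(y * x ** i for x, y in data) for i in range(size)]
    return solve(a, b)

def mse(weights, data):
    """Mean squared error of the polynomial with the given coefficients."""
    return sum((sum(w * x ** i for i, w in enumerate(weights)) - y) ** 2
               for x, y in data) / len(data)

rng = random.Random(7)
# Synthetic data, assumed for illustration: a cubic plus Gaussian noise.
points = []
for _ in range(100):
    x = rng.uniform(-1.0, 1.0)
    points.append((x, x ** 3 - 0.5 * x + rng.gauss(0.0, 0.1)))

rng.shuffle(points)
train, validation, test = points[:60], points[60:80], points[80:]  # 60% / 20% / 20%

best = None
for degree in range(1, 7):
    for lam in (0.0, 0.01, 0.1, 1.0):
        weights = fit_poly(train, degree, lam)   # fit on the training set only
        score = mse(weights, validation)         # compare candidates on the validation set
        if best is None or score < best[0]:
            best = (score, degree, lam, weights)

best_score, best_degree, best_lam, best_weights = best
print(f"chosen degree={best_degree}, lambda={best_lam}, validation MSE={best_score:.4f}")
print(f"test MSE of the chosen model={mse(best_weights, test):.4f}")
```

The test set is touched only once, after the degree and lambda have been chosen, so its error remains an honest estimate of how the selected model behaves on unseen data.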