
Overfitting in real life


Written by Kevin van Kalkeren – September 13, 2017

Recently, I have been dusting off some of my math skills to enhance my understanding of machine learning algorithms. As I worked through different kinds of cost functions, probability equations and other concepts that make you the light of the party, memories of papers read, classes taken and webinars watched resurfaced. The biggest warning in all of them: beware of overfitting. Which, weirdly enough, is something we often leave mostly to the algorithm itself.

Barking up the right tree

First, perhaps a brief note on the concept of overfitting. The ultimate goal of creating a model through machine learning algorithms is enabling you to predict some quality about new instances of data, based on patterns found in prior ones. Up to a certain extent, this is something we do as humans on a daily basis. For instance, when a friend of yours buys a new dog, your brain automatically labels it ‘dog!’, learned from prior exposure to the canine species. Actually, our brain is rather good at this, seeing that I rarely mislabel dogs. What’s more, I am often able to classify an animal as a dog correctly, even if I was not yet familiar with the particular breed the dog belongs to. Apparently, my brain takes some kind of generalization margin into account when learning the ‘rules’ to classify animals, which helps in new encounters. At the same time, it makes sure that the rules are just specific enough to make a guess somewhere in the right ballpark (i.e., that you don’t underfit). If it didn’t keep that margin, I could have labelled all newly encountered dog breeds as ‘goldfish’, because my rules were too tightly tuned to prior experience, leaving no room for new ones. That state is called overfitting (although in a pet shop, this may just be referred to as ‘confusing’).
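
To make the dog-versus-goldfish story a bit more concrete, here is a minimal sketch in Python (assuming scikit-learn and NumPy; the data is synthetic and made up for the illustration). An unconstrained decision tree memorizes its training sample almost perfectly but tends to do worse on animals it has never seen, while a depth-limited tree keeps some generalization margin.

    # Illustration only: synthetic data, not from the article.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # 200 "animals" described by two features; the label depends only loosely
    # on them, so fitting the sample perfectly means memorizing noise.
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # An unconstrained tree keeps splitting until it reproduces the training
    # labels exactly: rules tuned to prior experience, with no margin left.
    overfit = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("unconstrained tree:",
          overfit.score(X_train, y_train),   # typically close to 1.0
          overfit.score(X_test, y_test))     # noticeably lower on new data

    # A depth-limited tree keeps its rules more general and usually
    # transfers better to instances it has never seen.
    general = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
    print("depth-limited tree:",
          general.score(X_train, y_train),
          general.score(X_test, y_test))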

Something quite similar happens when we apply artificial intelligence to find patterns in data that are just too complex to work out in our heads. You can parametrize an algorithm by (among other things) telling it to apply a certain amount of this generalization margin. As a result, the rules by which the model will then predict labels for new instances will not be overly adjusted to the training data. Mathematicians and computer scientists with far more insight into this than I have, have come up with very elegant ways to minimize these errors as much as possible, to make sure that a model is neither too underfitted nor too overfitted.
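
As a rough illustration of such a ‘generalization knob’, here is a hedged sketch, again assuming scikit-learn and synthetic data invented for the example. Ridge regression’s alpha parameter penalizes overly complex fits; cross-validation shows how the balance between under- and overfitting shifts as you turn the knob.

    # Illustration only: the parameter values below are arbitrary examples.
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
    y = np.sin(X).ravel() + 0.3 * rng.normal(size=60)

    for alpha in (1e-4, 0.1, 10.0):
        # A degree-10 polynomial can memorize the noise; alpha reins it in.
        # Too little regularization leaves room to overfit, too much flattens the fit.
        model = make_pipeline(PolynomialFeatures(degree=10),
                              StandardScaler(),
                              Ridge(alpha=alpha))
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"alpha={alpha:>6}: mean cross-validated R^2 = {score:.2f}")

Picking that value by cross-validation rather than by gut feeling is exactly one of the elegant safeguards mentioned above.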

Bias

So while machine learning tempts us to leave the thinking to the machine, it also conceals a major risk. As humans, we’ve evolved successfully by creating heuristics: rules of thumb that allow for quick decisions and generalizations. Depending on the situation, that may cause either under- or overfitting in our judgement. While this is dangerous for anyone, there are two fields that I feel might be particularly prone to it: psychology and marketing. I happen to tick off both.

Humans do not easily fall into exclusive categories (not even mentioning their reluctance to be put into them). So, when ‘classifying’ people, be it on a psychological profile or as a sales prospect, we need to consider what we are dealing with: models. Simplified and abstracted versions of reality by definition. This inevitably introduces a margin of error, which we as humans must weigh against its consequences.

Just because it barks and wags its tail does not mean it’s a goldfish

New techniques make it easier by the day to develop and train simple models, even without having to fully grasp the programming or the mathematics behind them. This makes working with data accessible to almost everyone, a development I can only encourage. As a result, I foresee a shift in focus from the model itself to its input and output. That is where the upcoming challenges of our field lie. Our predictions are only ever as good as the data we feed our models, which is something to keep in mind as we rely more and more on automation tools. If we don’t think about our own biases in perception, they will most likely find their way into our models as well. Machine learning is meant to find patterns which we cannot detect on our own, not to tell us what we already think we know. Beware of overfitting, human.
