All Articles

Fighting Archetypal Overfitting of Data

Author’s note: the rest of this post applies predominantly to processing raw, numerical data, especially time series data. While this may be relevant for computer vision or NLP problems, it is likely less interesting at the level of the raw data.

I find the archetypes by which we work with data fascinating. When we work with data, we see it nearly exclusively in tabular format or in some form of graph. We engage only one sense (sight), or perhaps taste or hearing in the rare cases we are dealing with a food or audio dataset. On the other hand, when we want to draw conclusions from our data, we quite often try to explain it as a story. In many ways, this makes sense: data science is about understanding, and frequently, we are trying to understand the passage of time, or other context that can be well represented by stories.

However, when we take data from any arbitrary context and represent it either as a story or, at most, in three visual dimensions, we not only lose nuance in our data, we can also fall into a trap. At an underlying level, we frequently assume that the data falls into an archetype that we can reason about well. Even if we can make the data reflect an archetype, however, we may be drawing the wrong conclusions. I call this archetypal overfitting.

Archetypal overfitting is when we focus, consciously or subconsciously, on fitting the data into a pre-established archetype, often during our initial data analysis. Indeed, if we accept the premise that we data scientists often seek to match an archetype to our data, we find that the search space of potential archetypes is smaller than expected. Take, for a moment, two-dimensional scatterplots, and let’s sample a few common archetypes:

  • Random noise;
  • Multiple, evenly-weighted and distributed clusters;
  • Single main cluster, with several smaller peripheral clusters;
  • Logistic curve;
  • Linear.

The list goes on, and across a few different types of graphs to be sure, but by my estimate, this list totals only ~100 different archetypes for two-dimensional visualizations. Combine this with the seven core archetypes for stories as defined by my classical English literature courses (they work surprisingly well even for customer journeys), and we end up with, at the very most, around 2^7 archetypes for our data. This is shockingly small, and even if we include a few other common visualization approaches (3D, etc.), we still do not cross 2^8.

This dearth of archetypes impacts every step of the data science process: from what we capture, to how we perform an exploratory data analysis, to our metric development, modeling, and of course, visualization. Indeed, by limiting ourselves to these archetypes, we may produce worse results, as we may be ignoring or manipulating parts of the data to hide its natural qualities; hence the overfitting component of archetypal overfitting. But we aren’t without reasons to do it. By harnessing these archetypes, we simplify the modeling and learning process, to the point that most data scientists can look at a visualized relationship and make a reasonable first guess at a model based on the archetype. In fact, I’m not suggesting that we get rid of our data archetypes. Instead, my recommendation is that we broaden the archetypes to better handle more than three dimensions, and to give ourselves a far wider range than we have previously been using.

So how do we do that? We add new mediums for experiencing our data, especially at the critical phase of our EDAs.

Take one example: music. We can listen to our data, and it is a context in which we already have natural archetypes built in. This domain is already beginning to be explored, and it’s one I’m actively diving into more and more (although the state of the tooling is still quite limited). For one great example of this, check out Mary Hogan and Flavio Esposito’s paper Music Defined Networking. It’s well worth a read.

Why is music in particular so good? Well, it spans a greater range of dimensions than visualizations can: you can easily have several different instruments or sounds in harmony. It brings existing archetypes, but quite a broad range of them, which means that it generally imposes less structure. So too does music bring fewer requirements regarding outliers. While visualizing outliers can be difficult on a scatterplot, music handles them well (so long as they’re in an audible spectrum). A few outliers can even be good for the music, as disharmony can be pleasant so long as it is resolved. However, if outliers dominate the dataset, well, just as that would pose trouble in modeling, so too is mashing a piano keyboard unpleasant to our ears.
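
To make this concrete, here is a minimal sketch of sonification, not the method from the paper above: it maps each data point to a note on a scale and renders sine tones to a WAV file using only the Python standard library. The pentatonic scale, the note length, and the fade envelope are all my own assumptions for illustration.

```python
import math
import struct
import wave

def sonify(values, filename="data.wav", rate=44100, note_sec=0.25):
    """Map each data point to a pitch on a scale and render sine tones.

    Outliers simply land on the highest or lowest notes, so they stay
    audible rather than being clipped away as an axis limit might do.
    """
    # Roughly pentatonic frequency ratios relative to A3 = 220 Hz (an assumption).
    scale = [220.0 * r for r in (1, 9/8, 5/4, 3/2, 5/3, 2, 9/4, 5/2, 3, 10/3)]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    frames = bytearray()
    for v in values:
        # Normalize the value into an index on the scale.
        idx = round((v - lo) / span * (len(scale) - 1))
        freq = scale[idx]
        n = int(rate * note_sec)
        for i in range(n):
            # Short linear fade in/out to avoid clicks between notes.
            env = min(1.0, i / 500, (n - i) / 500)
            sample = int(32767 * 0.5 * env * math.sin(2 * math.pi * freq * i / rate))
            frames += struct.pack("<h", sample)
    with wave.open(filename, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(rate)
        f.writeframes(bytes(frames))
    return filename

# An upward trend with one outlier becomes a rising melody
# interrupted by a single jarring, but still audible, high note.
sonify([1, 2, 3, 4, 20, 5, 6], "trend_with_outlier.wav")
```

Even a toy mapping like this shows the point: the trend and the outlier both survive the translation, and you hear them against musical archetypes rather than visual ones.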

Other mediums are harder to incorporate into our process: we don’t yet have computers that can generate tastes or smells, nor is it usually cost-effective to 3D print our data into patterns that we can manipulate by touch. But with time, these may become available, and as they do, I’m looking forward to seeing data in ever new contexts, and drawing new conclusions that our archetypes might previously have caused us to overlook.