I’ve observed that there are two broad approaches that people take to getting information out of data. One approach is to simply throw a kitchen sink full of analytical techniques at the data. Without really trying to understand what the data looks like, and what the relationships may be, the analyst simply uses one method after another to try and get insight from the data. Along the way, a “model” will get built.
The other approach (which I’m partial to) involves understanding each variable, and the relationships between variables, as a first step to getting insight from the data. Here, too, a model might get built, but it will be conditional on the analyst’s view, formed after looking at the data, of what kind of model might suit it.
Considering that both these approaches are used by large numbers of analysts, it is highly likely that both are legitimate. Then what explains the fact that some analysts use one approach, and others use another? Having thought about it for a long time, I have a hypothesis – it depends on the kind of data being analysed. More precisely, it has to do with the dimensionality of the data.
The first approach (which one might classify as “machine learning”) works well when the data is of high dimension – where the number of predictors that can be used for prediction is really large, of the order of thousands or more. For example, even a seemingly low-resolution 32 by 32 pixel image, looked at as a data point, has 1024 dimensions (the value at each of the 1024 pixels is a separate dimension; a colour image would have three times as many).
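To make the arithmetic concrete, here is a minimal sketch (in Python, with made-up data) of how a single 32 by 32 greyscale image becomes a 1024-dimensional data point:

```python
# Minimal sketch: a 32x32 greyscale image flattened into a 1024-dimensional vector.
# The "image" here is random noise standing in for a real picture.
import numpy as np

image = np.random.rand(32, 32)        # stand-in for a 32x32 greyscale image
feature_vector = image.reshape(-1)    # one long vector, one dimension per pixel

print(feature_vector.shape)           # (1024,); an RGB image would give (3072,)
```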
Moreover, in such situations, it is likely that the signal in the data doesn’t come from one, two, or a handful of predictors. In high-dimension data science, the signal usually comes from a complex interplay of the data along various dimensions. And this kind of search is not something humans are well suited for – it is best that the machines are left to “learn” the model by themselves, and so you get machine learning.
On the other hand, when the dimensionality of the dataset is low, it is possible (and “easy”) for an analyst to look at the interplay of factors in detail, and understand the data before going on to build the model. Doing so can help the analyst identify patterns in the data that may not be apparent to a machine. It is also likely that in such datasets the signal lies along a small number of dimensions, where relatively simple manipulation will suffice. The low dimensionality also means that complex machine learning techniques are unlikely to contribute much in such cases.
As you might expect, from an organisational perspective, the solution is quite simple – to deploy high-dimension data scientists on high-dimension problems, and likewise with low-dimension data scientists. Since this distinction between high-dimension and low-dimension data scientists isn’t very well known, it’s quite possible that the scientists might be made to work on a problem of dimensionality that is outside of their comfort zone.
So when you have low-dimension data scientists faced with data of a large number of dimensions, you will see them use brute force to try and find signals in bivariate relationships in the data – an approach that will never work, since the signal lies in a more complex interplay of dimensions.
On the other hand, when you put high-dimension data scientists on a low-dimension problem, you will either see them miss associations that a human could easily spot but a machine might struggle to find, or you will see them unnecessarily “reduce the problem to a known problem” by generating and importing large amounts of data in order to turn it into a high-dimension problem!
PS: I wanted to tweet this today but forgot. Basically, you use logistic regression when you think the signal is an “or” of conditions on some of the underlying variables. On the other hand, if you think the signal is more likely an “and” of conditions on certain variables, then you should use decision trees!
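For what it’s worth, here is a toy sketch of that intuition – a logistic regression and a shallow decision tree fit to synthetic labels built from an “or” and an “and” of two threshold conditions. The data, thresholds and model settings are all made up for illustration; this is not a rigorous claim about either method.

```python
# Toy illustration: compare logistic regression and a shallow decision tree on
# labels defined by an "or" versus an "and" of two threshold conditions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 2))

y_or = ((X[:, 0] > 0.7) | (X[:, 1] > 0.7)).astype(int)   # "or" of conditions
y_and = ((X[:, 0] > 0.7) & (X[:, 1] > 0.7)).astype(int)  # "and" of conditions

for label, y in [("or", y_or), ("and", y_and)]:
    for model in (LogisticRegression(), DecisionTreeClassifier(max_depth=3)):
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"{label:>3} signal, {type(model).__name__:>22}: {score:.3f}")
```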
Interesting observations! But viewed as a general comment, this expert vs. high-dimensional framing is a bit of a false dichotomy.
Some thoughts:
Solutions to a lot of data-science problems are a combination of expert-knowledge-based inference (let’s call it Type 1) and purely data-driven inference (Type 2).
The major use case for low-dimensional, expert-based, careful inference is when you are dealing with ‘customers’ from a discipline that is not highly mathematical. E.g. epidemiologists or clinicians prefer simple decision-tree models even where more sophisticated models can yield much better results for disease prediction, etc. Another example (I’m guessing this is your perspective) is business folk who may not like it when the reasoning process is obscured by high-dimensional math.
For many problems, it is often hard to know which domain a given problem lies in. It’d be dogmatic to be partial to one or the other without a good reason. An example where such dogma can be detrimental is image recognition: for many years, a lot of work was focused on using domain knowledge from computer vision and image processing to build good general image recognition algorithms. Then deep convolutional neural network algorithms were proposed and became feasible around 2012, and it became possible to train a general recognition algorithm on yuuuge corpora, essentially changing image recognition overnight from (Type 1 + Type 2) to Type 2.
High dimensionality becomes a challenge to overcome when: (1) your data are too few, (2) they are not representative of the diversity inherent to the phenomenon you’re predicting, or (3) there are algorithm issues (e.g. decision trees may overfit if you have too many dimensions and too few data points, but ‘regularized’ approaches like support vector machines don’t). In many situations, you can transform the data into lower-dimensional spaces without loss of information using sparsifying transforms/dictionaries, PCA, feature selection, etc. (as is common in areas like genomics, where you have high-dimensional data).
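As a small illustration of that last point (using PCA, one of the transforms mentioned above, on synthetic data rather than a real genomics matrix):

```python
# Sketch: 1000-dimensional data that actually lives on a ~10-dimensional subspace
# can be reduced with PCA while retaining almost all of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 10))     # the "true" low-dimensional signal
mixing = rng.normal(size=(10, 1000))    # embeds it in 1000 dimensions
X = latent @ mixing + 0.01 * rng.normal(size=(200, 1000))

pca = PCA(n_components=10)
X_low = pca.fit_transform(X)

print(X.shape, "->", X_low.shape)       # (200, 1000) -> (200, 10)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 4))
```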
Image and speech recognition today mostly belong to Type 2, being treated as raw high-dimensional problems for two reasons: (1) there is an enormous amount of diverse/non-uniform data to learn high-dimensional models from – Google’s speech voicemails, image repositories, etc. (after all, as a rule of thumb, you can’t reliably learn high-dimensional models from small amounts of data, or from uniform data); (2) the developments in deep learning that obviate the need to do heavy ‘feature engineering’ using expert knowledge.
As such, images are not super high-dimensional in terms of the amount of information they contain (while they are high-dimensional in the raw). They are smooth and compressible. You can, after all, compress a 10-megapixel raw image by a factor of tens or hundreds and still not lose much in terms of perceptual quality. There isn’t much of a need to do this using expert knowledge today because of the effectiveness and computational efficiency of deep convolutional neural networks.
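One way to see the “smooth and compressible” point is a low-rank approximation: keep only the leading singular values of an image matrix and check how little is lost. The sketch below uses a smooth synthetic surrogate for a photograph; real images are similarly compressible, random noise is not.

```python
# Sketch: truncated SVD of a smooth "image" -- keeping 20 of 512 components
# (roughly a 13x reduction in stored numbers) loses very little.
import numpy as np

rng = np.random.default_rng(0)
# cumulative sums of noise give a smooth surface, a crude stand-in for a photo
image = np.cumsum(np.cumsum(rng.normal(size=(512, 512)), axis=0), axis=1)

U, s, Vt = np.linalg.svd(image, full_matrices=False)
k = 20
approx = (U[:, :k] * s[:k]) @ Vt[:k]

rel_error = np.linalg.norm(image - approx) / np.linalg.norm(image)
print(f"relative error with {k}/{len(s)} components: {rel_error:.4f}")
```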
Most non-image non-speech problems are some combination of Type 1 and Type 2.
Further, there are also hierarchies of expert knowledge. For instance, if you take time-series classification, few applications would use each time sample as a feature. Rather, you’d represent the time series using ARMA coefficients, spectral coefficients, statistical parameters, etc. You’d do this whether you’re a financial time-series guy, a speech-recognition guy, or a climate scientist. So this is one level of expert knowledge (knowledge of time-series methods). The next level in the hierarchy is domain expertise (speech/language, finance, etc.): e.g. if you know that your data consist of speech, then you’d transform it further into speech-relevant features (linear predictive coding, spectrograms, language models, etc.).
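A small sketch of that first level of expert knowledge – summarising a series with a few statistical and spectral parameters instead of treating every time sample as a feature (the specific features here are illustrative, not a recommendation for any particular domain):

```python
# Sketch: replace 1000 raw time samples with a handful of expert-chosen features.
import numpy as np

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.3 * rng.normal(size=1000)

def summarise(x):
    spectrum = np.abs(np.fft.rfft(x))
    return {
        "mean": float(x.mean()),
        "std": float(x.std()),
        "dominant_freq_bin": int(spectrum[1:].argmax()) + 1,  # skip the DC term
        "spectral_energy": float((spectrum ** 2).sum()),
    }

print(summarise(series))   # 1000 samples -> 4 features
```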