Attractive graphics without chart junk

A picture is worth a thousand words, but ten pictures are worth much less than ten thousand words One of the most common problems with visualisation, especially in the media, is that of “chart junk”. Graphics designers working for newspapers and television channels like to decorate their graphs, to make it more visually appealing. And…

Meaningful and meaningless variables (and correlations)

A number of data scientists I know like to go about their business in a domain-free manner. They make a conscious choice to not know anything about the domain in which they are solving the problem, and instead treat a dataset as just a set of anonymised data, and attack it with the usual methods.…

Yet another way of classifying data scientists

There are many axes along which we can classify data scientists. We can classify based on the primary specialty, in terms “analytics”, “business intelligence” and “machine learning”. We can classify based on domain, into “financial data scientists” and “retail data scientists” and “industrial data scientists”. We can classify by the choice of primary software tool,…

Beer and diapers: Netflix edition

When we started using Netflix last May, we created three personas for the three of us in the family – “Karthik”, “Priyanka” and “Berry”. At that time we didn’t realise that there was already a pre-created “kids” (subsequently renamed “children” – don’t know why that happened) persona there. So while Priyanka and I mostly use…

Stirring the pile efficiently

Warning: This is a technical post, and involves some code, etc.  As I’ve ranted a fair bit on this blog over the last year, a lot of “machine learning” in the industry can be described as “stirring the pile”. Regular readers of this blog will be familiar with this image from XKCD by now: Basically…