For a long time now, I’ve been sceptical of the practice of finding out the average income in a country or state or city or locality by doing a random survey. The argument I’ve made is “whether you keep Mukesh Ambani in the sample or not makes a huge difference in your estimate”. So far,…
Vlogging!
The first seed was sown in my head by Harish “the Psycho” J, who told me a few months back that nobody reads blogs any more, and I should start making “analytics videos” to increase my reach and hopefully hit a new kind of audience with my work. While the idea was great, I wasn’t…
Premier League Points Efficiency
It would be tautological to say that you win in football by scoring more goals than your opponent. What is interesting is that scoring more goals and letting in fewer works across games in a season as well, as data from the English Premier League shows. We had seen an inkling of this last year,…
Built by Shanks
This morning, I found this tweet by John Burn-Murdoch, a statistician at the Financial Times, about a graphic he had made for a Simon Kuper (of Soccernomics fame) piece on Jose Mourinho. Burn-Murdoch also helpfully shared the code he had written to produce this graphic, through which I discovered ClubElo, a website that produces chess-style…
Bangalore names are getting shorter
The Bangalore Names Dataset, derived from the Bangalore Voter Rolls (cleaned version here), validates a hypothesis that a lot of people had – that given names in Bangalore are becoming shorter. From an average of 9 letters in the name for a male aged around 80, the length of the name comes down to 6.5…
Single Malt Recommendation App
Life is too short to drink whisky you don’t like. How often have you found yourself in a duty free shop in an airport, wondering which whisky to take back home? Unless you are a pro at this already, you might want something you haven’t tried before, but don’t want to end up buying something…
Book challenge update
At the beginning of this year, I took a break from Twitter (which lasted three months), and set myself a target to read at least 50 books during the calendar year. As things stand now, the number stands at 28, and it’s unlikely that I’ll hit my target, unless I count Berry’s story books in…
Taking your audience through your graphics
A few weeks back, I got involved in a Twitter flamewar with Shamika Ravi, a member of the Indian Prime Minister’s Economic Advisory Council. The object of the argument was a set of gifs she had released to show different aspects of the Indian economy. Admittedly I started the flamewar. Guilty as charged. Thinking about…
Analytics for general managers
While good managers have always been required to be analytical, the level of analytical ability being asked of managers has been going up over the years, with the increase in availability of data. Now, this post is once again based on that one single and familiar data point – my wife. In fact, if you…
The missing middle in data science
Over a year back, when I had just moved to London and was job-hunting, I was getting frustrated by the fact that potential employers didn’t recognise my combination of skills of wrangling data and analysing businesses. A few saw me purely as a business guy, and most saw me purely as a data guy, trying…
Statistics and machine learning approaches
A couple of years back, I was part of a team that delivered a workshop in machine learning. Given my background, I had been asked to do a half-day session on Regression, and was told that the standard software package being used was the scikit-learn package in python. Both the programming language and the package…
Why data scientists should be comfortable with MS Excel
Most people who call themselves “data scientists” aren’t usually fond of MS Excel. It is slow and clunky, can only handle a million rows of data (and nearly crash your computer if you go anywhere close to that), and despite the best efforts of Visual Basic, is not very easy to program for doing repeatable…
Meaningful and meaningless variables (and correlations)
A number of data scientists I know like to go about their business in a domain-free manner. They make a conscious choice to not know anything about the domain in which they are solving the problem, and instead treat a dataset as just a set of anonymised data, and attack it with the usual methods.…
Yet another way of classifying data scientists
There are many axes along which we can classify data scientists. We can classify based on the primary specialty, in terms “analytics”, “business intelligence” and “machine learning”. We can classify based on domain, into “financial data scientists” and “retail data scientists” and “industrial data scientists”. We can classify by the choice of primary software tool,…
Stocks and flows
One common mistake even a lot of experienced analysts make is comparing stocks to flows. Recently, for example, Apple’s trillion dollar valuation was compared to countries’ GDP. A few years back, an article compared the quantum of bad loans in Indian banks to the country’s GDP. Following an IPL auction a few years back, a…
New blog on visualisations
For a while now I’ve been commenting on visualisations on Twitter, pointing out the good (and especially bad) graphs. I also have a “chart of the edition” section in my newsletter. Recently, the legendary Krish Ashok suggested that I collect all these bad visualisations in a Tumblr, and I decided to oblige. You can follow…
Beer and diapers: Netflix edition
When we started using Netflix last May, we created three personas for the three of us in the family – “Karthik”, “Priyanka” and “Berry”. At that time we didn’t realise that there was already a pre-created “kids” (subsequently renamed “children” – don’t know why that happened) persona there. So while Priyanka and I mostly use…
More on interactive graphics
So for a while now I’ve been building this cricket visualisation thingy. Basically it’s what I think is a pseudo-innovative way of describing a cricket match, by showing how the game ebbs and flows, and marking off the key events. Here’s a sample, from the ongoing game between Chennai Super Kings and Kolkata Knight Riders.…
A banker’s apology
Whenever there is a massive stock market crash, like the one in 1987, or the crisis in 2008, it is common for investment banking quants to talk about how it was a “1 in zillion years” event. This is on account of their models that typically assume that stock prices are lognormal, and that stock…
English Premier League: Goal Difference to points correlation
So I was just looking down the English Premier League Table for the season, and I found that as I went down the list, the goal difference went lower. There’s nothing counterintuitive in this, but the degree of correlation seemed eerie. So I downloaded the data and plotted a scatter-plot. And what do you have?…
Stirring the pile efficiently
Warning: This is a technical post, and involves some code, etc. As I’ve ranted a fair bit on this blog over the last year, a lot of “machine learning” in the industry can be described as “stirring the pile”. Regular readers of this blog will be familiar with this image from XKCD by now: Basically…