Worth a thousand words: 2016

Friday, 21 October 2016

Is there a good alternative to the boxplot for comparing distributions?

I think boxplots can be incredibly useful. Many charts in analytics presentations focus on measures of central tendency, which can be misleading. Two groups may have similar means but very different standard deviations.

In market research, for example, it could be important to understand that one group of respondents has no strong view either positively or negatively, whilst another group shows a wide range of feeling on the same question. This sort of distinction is made immediately clear when a boxplot is used:

Boxplots have one major drawback: they are not readily understood by non-technical audiences. Even attempting to explain what is shown is problematic, as the audience would still need to be comfortable with the concept of quartiles.
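To make the quartile idea concrete, here is a minimal sketch of the five numbers a boxplot typically draws. The two groups of survey scores are invented for illustration, and the "inclusive" quartile method is just one of several conventions. Both groups have the same mean, yet their quartiles are very different.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical survey scores on a 1-9 scale, invented for illustration.
neutral_group = [4, 5, 5, 5, 5, 5, 6]   # no strong view either way
divided_group = [1, 2, 3, 5, 7, 8, 9]   # wide range of feeling

for name, scores in [("neutral", neutral_group), ("divided", divided_group)]:
    print(name, "mean:", statistics.mean(scores),
          "summary:", five_number_summary(scores))
# neutral mean: 5 summary: (4, 5.0, 5.0, 5.0, 6)
# divided mean: 5 summary: (1, 2.5, 5.0, 7.5, 9)
```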

I'm going to look at three alternatives that I think could work better in a non-technical situation. As is nearly always the case, each has its drawbacks but I think each could be useful.

Option 1 - Plot individual data points in columns

The chart below plots all of the data points in my sample. For each group I've randomly perturbed the points along the x-axis to create columns.

This has the advantage of enabling intuitive understanding of the distribution. However, it is only suitable for small numbers of observations. It would be possible to sample from larger sets of data but this would then be describing the sample rather than the population. Depending on the circumstances, this may or may not matter.
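The random perturbation (often called jitter) can be sketched as follows. This is an illustration rather than the code behind the chart: the function name, the width of 0.3 and the sample values are all assumptions. Each group gets an integer column position, and every point is nudged sideways by a small random amount so that overlapping values stay visible.

```python
import random

def jitter_columns(groups, width=0.3, seed=0):
    """Return (x, y) points, spreading each group's x values around its column index."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    points = []
    for column, values in enumerate(groups):
        for y in values:
            x = column + rng.uniform(-width, width)  # sideways nudge only
            points.append((x, y))
    return points

# Hypothetical data: two groups of scores, invented for illustration.
groups = [[4, 5, 5, 6], [1, 3, 7, 9]]
points = jitter_columns(groups)
```

The y values are untouched, so the vertical position of every point still reads directly off the axis. The resulting pairs can be passed to any scatter plot function.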

Option 2 - Heat map

This approach gives a clear view of the general distribution but I feel it makes the specific statistics such as the median harder to estimate. From a purely practical point of view, I also found it hard to create a succinct way to convey the encoding of values to colour.
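One way to think about the encoding is that each cell's colour stands for the share of a group's observations falling in a range of values. The sketch below assumes equal-width bins and an invented set of scores; it is not necessarily how the chart above was built.

```python
from collections import Counter

def bin_shares(values, bin_width, low, high):
    """Share of observations in each equal-width bin between low and high.

    The top edge is included in the last bin.
    """
    n_bins = int((high - low) / bin_width)
    counts = Counter(
        min(int((v - low) / bin_width), n_bins - 1)  # clamp the maximum value
        for v in values
    )
    return [counts.get(i, 0) / len(values) for i in range(n_bins)]

# Hypothetical scores on a 1-9 scale, invented for illustration.
divided_group = [1, 2, 3, 5, 7, 8, 9]
shares = bin_shares(divided_group, bin_width=2, low=1, high=9)
# Bins [1,3), [3,5), [5,7), [7,9] hold 2, 1, 1 and 3 of the 7 scores.
```

Each share would then be mapped to a colour intensity, which is exactly the step that is hard to explain in a compact legend.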

Option 3 - Stacked density plots

With a single distribution to convey, a histogram or density plot would be the obvious alternative to a boxplot. However, with multiple groups these become harder to convey on a single chart. Stacking the density plots can provide a solution to this, although I feel it works better for the tighter distributions than for the dispersed groups.

Are there other methods that work well for comparing distributions? Should I consider violin and bean plots? Let me know in the comments below.

Saturday, 15 October 2016

Doughnut Pie

I was mortified at a recent conference organised by a big player in visualisation software (who shall remain nameless) to see an example dashboard sporting a hybrid doughnut-pie chart. The chart in question looked something like this:

In fact it was worse than this. I've tried to make the best of a bad chart by adding in some better practice around titles and labelling.

I think that there are four main things wrong with this chart:

The linear distances for the pie chart are much shorter than those for the doughnut. For instance, Segment B has a lower value in 2015 but a longer line when compared to 2014.
Only the first and last segments can be compared directly between years as they have a commonly aligned point - the start of Segment A or the end of Segment E.
There's no easy way to denote what the inner and outer sections represent. I've added this to the title as the best way that I could think to do it.
As with all pie charts, the segments have to be coloured differently, preventing the use of colour to draw attention to key points.

I set about looking at alternatives to this chart, only to discover that Cole Nussbaumer Knaflic has been there before me looking at a similar problem on her excellent blog. Rather than create a poor reproduction of her thinking, I'd suggest having a read there.

Suffice it to say here that I decided upon a slope chart as the best way to convey the particular point that I chose to make with the data. With a slope chart every segment is easily comparable across the two years and it's possible to be sparing with colour to highlight a key point.

Incidentally, I did come across a doughnut pie that I feel I could get behind.

Monday, 3 October 2016

Avoiding y-axis issues

The issues with a y-axis that does not start at zero are well known so I won't go into detail on them here. To summarise:

A y-axis that doesn't start at zero is problematic with any chart type where the height above the axis encodes data, i.e. bar, column and area charts.
Line charts can be used with a non-zero axis, the key being to choose a scale that highlights the variation in the data. The height above the axis is not being read.

Sometimes we will want to plot data that breaks this rule. What do we do when we run into a y-axis problem? My answer is that it is often possible to reshape the data to work around the issue. Let me explain by way of a (very) made up example.

The Haggis is a noble beast*. It spends its life walking around the mountain it was born on in one direction. For that reason the legs on one side of its body are shorter than on the other - perfect for life on a slope. The Clockwise Haggis, with short right legs and long left legs, is native to Scotland and until recently has been thriving. However, the introduction of the continental Anti-clockwise Haggis has started to lead to a decline in numbers.

To illustrate the threat, naturalists produced the following chart:

Whilst the trend is stark - the native species is being replaced by the invader - the scale, with a non-zero minimum, is misleading, suggesting a much greater proportion of Anti-clockwise Haggis than is really the case.

Correcting the scale, however, gives us a problem. The trend is all but lost in the greater population size of the Clockwise Haggis.

My solution to this problem is to change the metric that we use to tell the story. What we're really interested in is the relative change in the population of the two species. What happens if we chart that?

Taking this approach makes the worrying trend clear. Each quarter since Q2 2014 has seen the native population decline roughly in proportion to the growth of the introduced species. By changing the metric we chart we're able to tell a clear story without being open to any accusations of creating a misleading picture.
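The reshaping can be sketched in a few lines. The exact metric is not spelled out above, so this assumes it is the change in each species' count since the first quarter shown. The quarterly counts are invented for illustration and are not the figures behind the chart.

```python
def change_from_baseline(counts):
    """Change in each period's count relative to the first period."""
    baseline = counts[0]
    return [count - baseline for count in counts]

# Hypothetical quarterly counts, invented for illustration.
clockwise = [10000, 9950, 9880, 9800]
anticlockwise = [100, 160, 220, 310]

print(change_from_baseline(clockwise))      # [0, -50, -120, -200]
print(change_from_baseline(anticlockwise))  # [0, 60, 120, 210]
```

Both series now start at zero, so a zero-based axis shows the two trends at a comparable scale even though the underlying populations differ by two orders of magnitude.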

Inevitably, changing the metric will come at the cost of some information, in this case the size of the decline relative to the total population. This is unavoidable. If we feel that the additional information is important then we can consider using a second line chart to show this information.

*Not really

Saturday, 1 October 2016

Visualising cricket - an alternative to the worm?

Cricket is a great game for statisticians. I think this is because of the structured nature of the game. Like baseball, well known for its stats, the play is broken into discrete units, each with a finite number of outcomes. Balls and overs act as natural denominators for many metrics such as:

Run rate - the mean runs per over of the batting side
Strike rate - the mean number of balls a bowler needs to take a wicket
Economy - the mean number of runs given away by a bowler per over.
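The three metrics above reduce to simple ratios. Here is a minimal sketch that counts everything in balls and assumes six legal balls per over; the function names and the example figures are invented for illustration.

```python
BALLS_PER_OVER = 6

def run_rate(runs, balls):
    """Mean runs per over scored by the batting side."""
    return runs / (balls / BALLS_PER_OVER)

def bowling_strike_rate(balls, wickets):
    """Mean balls bowled per wicket taken; None if no wicket has fallen."""
    if wickets == 0:
        return None  # undefined until the bowler takes a wicket
    return balls / wickets

def economy(runs_conceded, balls):
    """Mean runs conceded by a bowler per over."""
    return runs_conceded / (balls / BALLS_PER_OVER)

# Hypothetical figures: a side scoring 240 from 300 balls, and a bowler
# taking 2 wickets for 45 runs from 60 balls.
print(run_rate(240, 300))           # 4.8
print(bowling_strike_rate(60, 2))   # 30.0
print(economy(45, 60))              # 4.5
```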

Unlike baseball, however, some would say that the statistics used in cricket are not as helpful as they could be. A bowler's strike rate, for instance, very much depends on the quality of the batters that they are up against. With this in mind I set myself the challenge of seeing if I could improve upon one of the key visualisations used to describe a cricket match - the worm.

The worm shows the cumulative runs scored by each team across the overs in the game. In a recent One Day International (50 over game) between England and Pakistan, the worm looked like this:

Pakistan batted first and scored 247, by today's standards a fairly low score. The worm reveals that England were able to stay ahead of Pakistan's position at the same point in the game throughout their innings, winning with around three overs to spare.
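Under the hood the worm is just a running total per over. The sketch below uses invented runs-per-over figures for five overs, not the scorecard from this match, and also derives the gap between the two sides at the same point in their innings.

```python
from itertools import accumulate

def worm(runs_per_over):
    """Cumulative runs at the end of each over."""
    return list(accumulate(runs_per_over))

# Hypothetical runs per over, invented for illustration.
batting_first = [4, 7, 2, 9, 5]
batting_second = [6, 3, 8, 4, 10]

first_worm = worm(batting_first)     # [4, 11, 13, 22, 27]
second_worm = worm(batting_second)   # [6, 9, 17, 21, 31]

# Positive means the side batting second is ahead at the same stage.
lead = [b - a for a, b in zip(first_worm, second_worm)]
print(lead)  # [2, -2, 4, -1, 4]
```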

But is that the whole story? The worm doesn't show a crucial part of the equation, which is wickets in hand (the number of batters still available to the batting team minus one). After a given number of overs, a team could be well ahead in terms of the number of runs they have relative to the other team at the same point of their innings. However, if they've lost nine wickets, you would not expect them to score many more. A quick 100 all out from 10 overs does not beat a slow 200 from all 50 overs.

In my attempt to improve on the worm, I've focused on the second innings. Is there a better way to show how the team batting second is doing in their run chase? Are they looking like they are going to win or lose? To do this I've enlisted the help of the famed Duckworth Lewis Method.

Duckworth Lewis is a calculation that is used for interrupted cricket matches to determine whether the score the team batting second has reached is a winning score at that point in the game. Crucially, it takes into account how many wickets the batting team have remaining.

I've plotted the net position of England against the Duckworth Lewis target at the end of each over of their innings. If England are ahead of that target then they're on track to win. If they're behind then they need to turn things around. This approach reveals a very different interpretation of the game:

For the first 15 overs England were not doing well. They lost too many wickets to be in a winning position, even though they were just ahead of Pakistan's score at the same point in the innings. A strong partnership between Ben Stokes and Jonny Bairstow was key to turning the game around, allowing England to cruise for the last 15 overs.
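For anyone wanting to try something similar, here is a rough sketch of the net position calculation. It follows the Standard Edition of Duckworth Lewis as I understand it and assumes the side batting first had its full allocation of overs. The percentage of resources remaining (which depends on overs left and wickets lost) comes from the published tables; the percentages below are placeholders for illustration, not real table values, and the chasing score of 130 is invented.

```python
def par_score(first_innings_total, resources_remaining_pct):
    """Score the chasing side needs at this point to be level.

    Assumes the side batting first used 100% of its resources.
    """
    resources_used_pct = 100.0 - resources_remaining_pct
    return first_innings_total * resources_used_pct / 100.0

def net_position(actual_runs, first_innings_total, resources_remaining_pct):
    """Runs ahead of (positive) or behind (negative) the par score."""
    return actual_runs - par_score(first_innings_total, resources_remaining_pct)

# Chasing 247, with a hypothetical 130 on the board at the same over.
# The resource percentages are placeholders, NOT values from the real tables.
few_wickets_lost = net_position(130, 247, resources_remaining_pct=50.0)
many_wickets_lost = net_position(130, 247, resources_remaining_pct=35.0)
print(few_wickets_lost)    # 6.5 (ahead of par)
print(many_wickets_lost)   # roughly -30.55 (behind par)
```

The same number of runs from the same number of overs can be ahead of or behind par depending on wickets lost, which is the information the worm leaves out.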

Personally, I think this second chart tells a much more interesting story about the match. The feeling of momentum swinging from one team to the other is key to enjoying cricket and I think this comes across strongly.

I'm interested to hear your thoughts. If you're a cricket fan (or a fan of another sport), is there a visualisation you love? Are there any you hate? What do you think of mine?