Friday, 21 October 2016

Is there a good alternative to the boxplot for comparing distributions?

I think boxplots can be incredibly useful. Many charts in analytics presentations focus on measures of central tendency, which can be misleading: two groups may have similar means but very different standard deviations.
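To make that concrete, here is a minimal sketch using made-up data (the group names, the 5.5 centre and the two spreads are assumptions chosen purely for illustration). It draws two groups with the same underlying mean but very different spread, then prints the mean and standard deviation of each.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical survey scores: both groups are centred on 5.5,
# but group B is far more spread out than group A.
group_a = [random.gauss(5.5, 0.5) for _ in range(500)]
group_b = [random.gauss(5.5, 2.5) for _ in range(500)]

for name, scores in (("A", group_a), ("B", group_b)):
    print(f"group {name}: mean={statistics.mean(scores):.2f}, "
          f"sd={statistics.stdev(scores):.2f}")
```

A chart showing only the two means would make these groups look almost identical.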

In market research, for example, it could be important to understand that one group of respondents has no strong view either positively or negatively, whilst another group shows a wide range of feeling on the same question. This sort of distinction is made immediately clear when a boxplot is used:

Boxplots have one major drawback: they are not readily understood by non-technical audiences. Even explaining what is shown is problematic, as the audience would still need to be comfortable with the concept of quartiles.
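For reference, the quartiles behind a boxplot can be computed in a few lines. This is only a sketch: the function name five_number_summary and the sample scores are made up, and plotting tools differ in how they interpolate quartiles and where they draw whiskers, so the numbers may not match a given chart exactly.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # "inclusive" treats the data as the whole population of interest;
    # other tools may use a different interpolation rule.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

scores = [2, 4, 4, 5, 5, 5, 6, 6, 7, 9]  # made-up responses
summary = five_number_summary(scores)
print(summary)
```

The box spans Q1 to Q3, so half of the responses sit inside it. That is the idea a non-technical audience has to take on board before the chart makes sense.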

I'm going to look at three alternatives that I think could work better in a non-technical situation. As is nearly always the case, each has its drawbacks, but I think each could be useful.

Option 1 - Plot individual data points in columns

The chart below plots all of the data points in my sample. For each group I've randomly perturbed the points along the x-axis to create columns.
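A rough sketch of the perturbation step, not necessarily how the original chart was built: each group gets an integer column position and every point is shifted by a small random amount either side of it. The names jitter_columns and plot_columns, the width of 0.2 and the sample scores are all assumptions. The plotting helper assumes matplotlib is installed.

```python
import random

def jitter_columns(groups, width=0.2, seed=0):
    """Return {name: [(x, y), ...]} with x randomly perturbed around the column centre."""
    rng = random.Random(seed)
    points = {}
    for column, (name, values) in enumerate(groups.items()):
        # y is the observed value; x is the column index plus random noise.
        points[name] = [(column + rng.uniform(-width, width), y) for y in values]
    return points

def plot_columns(points):
    """Scatter the jittered points, one column per group (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so jitter_columns works without it
    fig, ax = plt.subplots()
    for name, xy in points.items():
        xs, ys = zip(*xy)
        ax.scatter(xs, ys, alpha=0.6, label=name)
    ax.set_xticks(range(len(points)))
    ax.set_xticklabels(list(points))
    ax.set_ylabel("score")
    return fig

# Made-up scores for two groups
groups = {"A": [5, 5, 6, 5, 6, 5], "B": [1, 3, 5, 8, 10, 6]}
points = jitter_columns(groups)
```

The horizontal position carries no meaning; it only stops identical values from sitting on top of one another.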

This has the advantage of enabling intuitive understanding of the distribution. However, it is only suitable for small numbers of observations. It would be possible to sample from larger sets of data but this would then be describing the sample rather than the population. Depending on the circumstances, this may or may not matter.

Option 2 - Heat map

This approach gives a clear view of the general distribution, but I feel it makes specific statistics such as the median harder to estimate. From a purely practical point of view, I also found it hard to find a succinct way to explain how values are encoded as colour.
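One way to build such a heat map, sketched under the assumption that each group's values are counted into equal-width bins and the share of the group in each bin is mapped to colour. The function names, the bin count and the sample scores are made up, and the plotting helper assumes matplotlib is installed.

```python
def bin_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [low, high]."""
    counts = [0] * n_bins
    bin_width = (high - low) / n_bins
    for v in values:
        if low <= v <= high:
            # min() puts a value equal to `high` into the last bin
            index = min(int((v - low) / bin_width), n_bins - 1)
            counts[index] += 1
    return counts

def heat_map_matrix(groups, low, high, n_bins):
    """One row per group, holding the share of that group's values in each bin."""
    matrix = []
    for values in groups.values():
        counts = bin_counts(values, low, high, n_bins)
        total = sum(counts) or 1  # avoid dividing by zero for an empty group
        matrix.append([c / total for c in counts])
    return matrix

def plot_heat_map(groups, matrix, low, high):
    """Draw groups along the x-axis and score bins up the y-axis (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so the binning works without it
    fig, ax = plt.subplots()
    columns = [list(col) for col in zip(*matrix)]  # transpose: rows become bins
    image = ax.imshow(columns, origin="lower", aspect="auto", cmap="Blues",
                      extent=(-0.5, len(groups) - 0.5, low, high))
    ax.set_xticks(range(len(groups)))
    ax.set_xticklabels(list(groups))
    fig.colorbar(image, ax=ax, label="share of group")
    return fig

# Made-up scores for two groups, binned into three bands
groups = {"A": [5, 5, 6, 5, 6, 5], "B": [1, 3, 5, 8, 10, 6]}
matrix = heat_map_matrix(groups, low=1, high=10, n_bins=3)
```

A labelled colour bar is probably the most compact way to state the encoding, though it still asks the audience to read a legend.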

Option 3 - Stacked density plots

With a single distribution to convey, a histogram or density plot would be the obvious alternative to a boxplot. However, with multiple groups these become harder to convey on a single chart. Stacking the density plots can provide a solution to this, although I feel it works better for the tighter distributions than for the dispersed groups.
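A minimal sketch of the idea, assuming "stacked" means one density curve per group, each in its own row, sharing the same x-axis. The density estimate is a plain Gaussian kernel estimate with a rule-of-thumb bandwidth; the function names, the bandwidth rule and the sample scores are assumptions, and the plotting helper assumes matplotlib is installed.

```python
import math
import statistics

def density(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `values` at each grid point."""
    n = len(values)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumption, tune to taste.
        bandwidth = 1.06 * statistics.stdev(values) * n ** -0.2
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
            for x in grid]

def plot_stacked_densities(groups, grid):
    """Draw one filled density per group in stacked rows (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so density() works without it
    fig, axes = plt.subplots(len(groups), 1, sharex=True, sharey=True, squeeze=False)
    for ax, (name, values) in zip(axes[:, 0], groups.items()):
        ax.fill_between(grid, density(values, grid), alpha=0.6)
        ax.set_ylabel(name)
    return fig

# Made-up scores; the grid runs well past the data so the tails are included
groups = {"A": [4, 5, 5, 6, 7], "B": [1, 3, 5, 8, 10]}
grid = [i * 0.1 - 5 for i in range(251)]
curves = {name: density(values, grid) for name, values in groups.items()}
```

Sharing the y-axis keeps the rows comparable, but it also flattens the dispersed groups, which may be why this works better for the tighter distributions.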

Are there other methods that work well for comparing distributions? Should I consider violin and bean plots? Let me know in the comments below.
