Ease comparisons
Rafael Irizarry
Use common axes
Since there are so many points, it is more effective to show distributions rather than individual points. We therefore show histograms for each group:
[Figure: Histograms showing the average heights between females and males.]
However, from this plot it is not immediately obvious that males are, on average, taller than females. We have to look carefully to notice that the x-axis has a higher range of values in the male histogram. An important principle here is to keep the axes the same when comparing data across two plots. Below we see how the comparison becomes easier:
[Figure: Histograms showing the average heights between sexes, but both now have the same axes, showing the data with more accuracy.]
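One way to guarantee a common axis is to compute the histogram bins once, from all groups together, and reuse them for every panel. The following is a minimal sketch using only the Python standard library. The height values and the helper names `shared_bin_edges` and `histogram_counts` are hypothetical, chosen for illustration, and are not the data or code behind the figures.

```python
import math

def shared_bin_edges(groups, width=1.0):
    """Return bin edges that cover every value in every group."""
    lo = math.floor(min(min(g) for g in groups) / width) * width
    hi = math.ceil(max(max(g) for g in groups) / width) * width
    n_bins = max(1, round((hi - lo) / width))
    return [lo + i * width for i in range(n_bins + 1)]

def histogram_counts(values, edges):
    """Count values per bin; the last bin also includes its right edge."""
    width = edges[1] - edges[0]
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = min(int((v - edges[0]) // width), len(counts) - 1)
        counts[i] += 1
    return counts

# Hypothetical heights in inches, invented for illustration only.
female = [60, 62, 63, 64, 64, 65, 66, 68]
male = [64, 66, 68, 69, 69, 70, 72, 75]

# Both groups are binned on the same edges, so bin i means the same
# height range in both histograms.
edges = shared_bin_edges([female, male], width=5)
female_counts = histogram_counts(female, edges)
male_counts = histogram_counts(male, edges)

for name, counts in (("Female", female_counts), ("Male", male_counts)):
    print(name.ljust(6), counts)
```

Because the bins line up, the shift of the male counts toward the higher bins can be read directly from the two rows.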
Align plots vertically to see horizontal changes and horizontally to see vertical changes
In these histograms, the visual cues related to decreases or increases in height are shifts to the left or right, respectively: horizontal changes. Aligning the plots vertically helps us see this change when the axes are fixed:
[Figure: Two histograms titled "Female" and "Male" compare the distribution of heights. Both distributions show heights concentrated in the middle with fewer people at the extremes.]
This plot makes it much easier to notice that men are, on average, taller.
If we want the more compact summary provided by boxplots, we align them horizontally since, by default, boxplots move up and down with changes in height. Following our "show the data" principle, we then overlay all the data points:
[Figure: A dot plot overlaid with boxplots to show the most common heights in inches for both sexes, falling commonly near 65 inches for females, and 69 for males.]
Now contrast and compare these three plots, based on exactly the same data:
[Figure: Several plots comparing height distribution between the sexes. From left to right: the barplot with two averaged heights; the histograms showing distribution density between heights; and the dot plot with a boxplot overlaid.]
Notice how much more we learn from the two plots on the right. Barplots are useful for showing one number, but not very useful when we want to describe distributions.
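To see why a single bar can hide so much, consider two groups with the same mean but very different spreads. The sketch below uses made-up values (not the height data in the figures) and a hypothetical helper, `five_number_summary`, that returns the numbers a boxplot draws.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical measurements, invented for illustration only.
narrow = [66, 67, 68, 69, 70]
wide = [58, 63, 68, 73, 78]

# A barplot of the means would draw two bars of identical height...
print("means:", statistics.mean(narrow), statistics.mean(wide))

# ...while the five-number summaries reveal very different distributions.
print("narrow:", five_number_summary(narrow))
print("wide:  ", five_number_summary(wide))
```

The two means agree, so a barplot could not distinguish these groups, whereas the quartiles and extremes differ considerably.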
Consider transformations
We have motivated the use of the log transformation in cases where the changes are multiplicative. Population size was an example in which we found a log transformation to yield a more informative plot.
The combination of an incorrectly chosen barplot and a failure to use a log transformation when one is merited can be particularly distorting. As an example, consider this barplot showing the average population sizes for each continent in 2015:
[Figure: Bar chart titled "Population in Millions" shows countries grouped by continent. Asia has the highest population, followed by the Americas.]
From this plot, one would conclude that countries in Asia are much more populous than those in other continents. Following the "show the data" principle, we quickly notice that this is due to two very large countries, which we assume are India and China:
[Figure: A dotplot. It uses jitter and alpha blending, so some of the data points appear next to each other, or may be lighter to darker to show distribution.]
Using a log transformation here provides a much more informative plot. We compare the original barplot to a boxplot using the log scale transformation for the y-axis:
[Figure: A barplot and dotplot. On the left is a barplot in which Asia still appears as the most populated continent. The other graph, a dotplot, contains the same labels, but now using log transformation, appears less skewed towards Asia.]
With the new plot, we realize that countries in Africa actually have a larger median population size than those in Asia.
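The reversal between averages and medians is easy to reproduce with a toy example. The numbers below are invented for illustration and are not the population data shown in the figures: one group contains a single very large value, the other contains moderate values only.

```python
import math
import statistics

# Hypothetical country populations in millions, invented for illustration only.
group_a = [1, 2, 4, 8, 1300]    # one very large country dominates the mean
group_b = [10, 15, 20, 25, 30]  # moderate sizes throughout

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
median_a, median_b = statistics.median(group_a), statistics.median(group_b)

# On the log10 scale, equal distances correspond to equal fold changes,
# so a single huge value no longer stretches the whole axis.
log_a = [math.log10(x) for x in group_a]
log_b = [math.log10(x) for x in group_b]

print(f"mean:            a={mean_a}, b={mean_b}")
print(f"median:          a={median_a}, b={median_b}")
print(f"median of log10: a={statistics.median(log_a):.2f}, "
      f"b={statistics.median(log_b):.2f}")
```

In this sketch the group with the larger mean has the smaller median, which is the kind of pattern a barplot of averages conceals and a log-scale boxplot exposes.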
Other transformations you should consider are the logistic transformation (logit), useful to better see fold changes in odds, and the square root transformation (sqrt), useful for count data.
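As a rough illustration of why the logit helps with fold changes in odds, the sketch below defines a small `logit` helper (a hypothetical function written for this example, using the standard definition log(p / (1 - p))) and shows that each doubling of the odds moves the logit by the same amount, even though the steps on the probability scale differ. The probabilities and counts are arbitrary illustrative values.

```python
import math

def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1 - p))

# Probabilities whose odds are 1, 2 and 4: each step doubles the odds.
p1, p2, p3 = 1 / 2, 2 / 3, 4 / 5

step_prob_1, step_prob_2 = p2 - p1, p3 - p2                       # unequal steps
step_logit_1, step_logit_2 = logit(p2) - logit(p1), logit(p3) - logit(p2)

print(f"probability steps: {step_prob_1:.3f}, {step_prob_2:.3f}")
print(f"logit steps:       {step_logit_1:.3f}, {step_logit_2:.3f}")

# Square root transformation applied to some illustrative counts.
counts = [0, 1, 4, 9, 100]
sqrt_counts = [math.sqrt(c) for c in counts]
print("sqrt of counts:", sqrt_counts)
```

Both logit steps equal log(2), so a twofold change in odds looks the same wherever it occurs on the transformed axis.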
Visual cues to be compared should be adjacent
For each continent, let's compare income in 1970 versus 2010. When we compared income across regions between 1970 and 2010, we made a figure similar to the one below; this time we look at continents rather than regions.
[Figure: A boxplot graph titled "Income in dollars per day" compares income in 1970 and 2010 across five continents.]
The default is to order labels alphabetically, so the labels with 1970 come before the labels with 2010. This makes the comparisons challenging because a continent's distribution in 1970 is visually far from its distribution in 2010. It is much easier to make the comparison between 1970 and 2010 for each continent when the boxplots for that continent are next to each other:
[Figure: Boxplots titled "Income in dollars per day" compare income in 1970 and 2010 across five continents. The x-axis now shows the 1970 and 2010 incomes next to each other.]
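The effect of label ordering can be sketched without any plotting library. The example below assumes labels of the form "year continent" (an assumption about how the axis labels are built) and uses only three continents to keep the output short. It compares the default alphabetical order with an order sorted by continent first and year second.

```python
# Assumed label format: "<year> <continent>"; three continents for brevity.
pairs = [(year, continent)
         for year in (2010, 1970)
         for continent in ("Asia", "Africa", "Americas")]

def label(pair):
    year, continent = pair
    return f"{year} {continent}"

# Default behaviour: plain alphabetical order puts all 1970 labels first.
alphabetical = sorted(label(p) for p in pairs)

# Alternative: sort by continent, then year, so each pair sits side by side.
adjacent = [label(p) for p in sorted(pairs, key=lambda p: (p[1], p[0]))]

def gap(order, continent):
    """Number of axis positions between a continent's 1970 and 2010 labels."""
    return abs(order.index(f"1970 {continent}") - order.index(f"2010 {continent}"))

print(alphabetical)
print(adjacent)
print("gap for Asia:", gap(alphabetical, "Asia"), "vs", gap(adjacent, "Asia"))
```

With the second ordering, every continent's two boxplots are one position apart, which is what makes the year-to-year comparison easy to read.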
Use color
The comparison becomes even easier to make if we use color to denote the two things we want to compare:
[Figure: A boxplot with its data color coordinated, with income data from 1970 colored pink, and income data from 2010 colored blue.]