Box Plot Basics

The Box Plot, also known as the box-and-whisker plot, is a graphical method of displaying five descriptive statistics: the median, the upper and lower quartiles, and the minimum and maximum data values. First created by John Tukey in a 1977 publication, Box Plots have become a familiar and useful standard in data interpretation. Interpretation of a Box Plot is relatively straightforward.

The “box” itself represents the middle 50 percent of the data. The boundaries of the box are known as “hinges”: the upper hinge locates the 75th percentile of the data set, while the lower hinge locates the 25th percentile. Quite simply, the 25th percentile is the value that 25 percent of the data falls below, and likewise, the 75th percentile is the value that 75 percent of the data falls below. The distance between these two boundaries is known as the “inter-quartile range”, and it gives a useful indication of the “spread” of the middle 50 percent of the data. This is a more robust measure than the full range because the middle 50 percent is not affected by outliers or extreme values, so it gives a less distorted picture of the data spread.
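As a rough illustration of the quartiles and the inter-quartile range, the sketch below uses Python's standard `statistics.quantiles`. The helper name `quartiles_and_iqr` and the sample values are made up for this example, and the "inclusive" quartile method is only one of several conventions; other software may report slightly different hinges for small samples.

```python
from statistics import quantiles

def quartiles_and_iqr(values):
    """Return (q1, median, q3, iqr) for a sequence of numbers."""
    # Assumption: the "inclusive" method; other quartile conventions can
    # give slightly different results on small samples.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3, q3 - q1

# Made-up sample values, and the same values with the largest one
# replaced by an extreme reading.
clean = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 13]
with_extreme = clean[:-1] + [130]

iqr_clean = quartiles_and_iqr(clean)[3]
iqr_extreme = quartiles_and_iqr(with_extreme)[3]
range_clean = max(clean) - min(clean)
range_extreme = max(with_extreme) - min(with_extreme)

print(iqr_clean, iqr_extreme)      # the IQR is identical for both lists
print(range_clean, range_extreme)  # the full range is not
```

The extreme value changes the full range dramatically but leaves the box untouched, which is the robustness described above.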

There is also a line in the box that indicates the “median” (or centermost value) of the data. Not to be confused with the “mean”, the median is the middle value of the data set when the values are ranked in order, so that the same number of values lie above it as below it. This is a measure of “central tendency”, or in layman’s terms, where the center of the data is. Knowing this is important for estimating the type of data distribution you have.
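A small sketch of the difference between the median and the mean, using made-up values with one extreme observation:

```python
from statistics import mean, median

# Made-up values; the last one is an extreme observation.
values = [1, 2, 3, 4, 100]

print(mean(values))    # 22: pulled upward by the single extreme value
print(median(values))  # 3: the middle value of the ranked data
```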

The “whiskers” of the Box Plot are the lines extending from each end of the box (vertical lines when the plot is drawn upright). When there are no “outliers” in the data, the whiskers reach the minimum and maximum values in the dataset. When there are outliers, each whisker stops at the most extreme value that lies within 1.5 times the inter-quartile range of its hinge, and any points beyond that limit are plotted individually. Now that the pieces of the Box Plot have been identified, it is useful to understand that the box, the whiskers, and even the median can reveal much information about a dataset by virtue of their position, length, or size.
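The whisker rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the common 1.5 times inter-quartile range convention and the "inclusive" quartile method; the function name `whiskers_and_outliers`, the parameter `k`, and the data are hypothetical.

```python
from statistics import quantiles

def whiskers_and_outliers(values, k=1.5):
    """Return (lower_whisker, upper_whisker, outliers) using k * IQR limits."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit, high_limit = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in values if low_limit <= v <= high_limit]
    outliers = sorted(v for v in values if v < low_limit or v > high_limit)
    # Each whisker stops at the most extreme observation still inside its limit.
    return min(inside), max(inside), outliers

# Made-up data with one suspiciously large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 40]
low, high, outliers = whiskers_and_outliers(data)
print(low, high, outliers)
```

Here the upper whisker ends at 12 rather than at the maximum, and 40 is reported separately as an outlier.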

The strong point of the Box Plot is its ability to compare two populations without knowing anything about the underlying statistical distributions of those populations. The distribution that defines a population also determines the type of statistical analyses that can be properly applied, so the Box Plot actually allows you to compare “apples and oranges” graphically that might not be directly comparable statistically. Other strong points of the Box Plot include its ability to display data spread at a glance and to reveal data symmetry and skewness as well as the presence of outliers. With a software package that not only creates the Box Plot but also displays the data points, one can readily locate and identify “outliers” (i.e. values that may belong to an entirely different population or may be the result of measurement error). Box Plots can also be displayed side by side, allowing direct visual comparison of a variable across categories or groups.
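The numbers behind a side-by-side display can be sketched as one five-number summary per group. The group names and measurements below are invented for illustration, and the quartiles again assume the "inclusive" method.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return the five statistics a Box Plot displays, as a dict."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": med,
            "q3": q3, "max": max(values)}

# Hypothetical measurements for two groups.
groups = {
    "A": [12, 15, 14, 10, 13, 16, 14, 15, 11],
    "B": [20, 35, 22, 19, 50, 24, 21, 23, 26],
}

summaries = {name: five_number_summary(vals) for name, vals in groups.items()}
for name, s in summaries.items():
    print(f"{name}: " + "  ".join(f"{k}={v:g}" for k, v in s.items()))
```

Reading the two rows together shows what the paired plots would show: group B sits higher than group A, and its long upper tail hints at possible outliers.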

Box Plots allow for the early evaluation of data before conducting time-consuming, in-depth statistical analysis. Some useful things one can do with a Box Plot include:

1) Discerning skewed data by noting the median in the box is not equidistant from the hinges.

2) Identifying outliers and extreme values by their position with regard to the whiskers.

3) Selecting appropriate statistical analyses. Visually estimating the degree of skewness and the presence of outliers allows the analyst to choose methods that can robustly handle such data, rather than spending time on analyses that may be adversely affected by these factors.

4) Visually estimating the expected range of the data by noting the length of the “box” as compared to other variables or groups of data.

5) Monitoring the data collection process. Data from a stable process tend to cluster around a central value. If one notices the “box” is getting wider (rather than narrower) after the inclusion of more data, perhaps the data collection methodology requires more scrutiny.

6) If the whiskers are getting longer with additional data, perhaps there is a measurement or instrument calibration issue producing more extreme values.

7) If the median consistently drifts farther from the center of the box with additional data, perhaps there is a systematic bias in the data collection or “drift” in the data itself that merits attention.
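The check in item 1, whether the median is equidistant from the hinges, can be put into a single number. The sketch below uses the quartile skewness coefficient (often attributed to Bowley), which is not named in the text above and is offered only as one possible way to quantify the idea. The function name and the data are hypothetical.

```python
from statistics import quantiles

def quartile_skewness(values):
    """Quartile skewness: 0 when the median is centered in the box,
    positive when it sits nearer the lower hinge, negative when nearer
    the upper hinge."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the box has no length to compare against")
    return ((q3 - med) - (med - q1)) / iqr

# Made-up samples: one symmetric, one with a long upper tail.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 2, 2, 3, 4, 6, 9, 15]

print(quartile_skewness(symmetric))
print(quartile_skewness(right_skewed))
```

Recomputing this value as new data arrive is one way to watch for the kind of drift described in item 7.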

The bottom line is that the Box Plot is a useful graphical method of data evaluation and estimation. Proper use of the Box Plot allows the analyst to note issues that may adversely affect statistical modeling before investing time in analyses that may lead to biased results.