A short and simple explanation of boxplots and violin plots, easily explained with an example!

Violin plots and boxplots are a great way to visualise and compare a continuous variable across different groups or categories.

For example, you might want to find out which group of mice is heavier, and whether the range of weights differs between groups. Or you might want to compare the mitochondrial gene expression percentage across different cell types when analysing scRNAseq data, or check whether the expression of a particular gene is higher in one population of your sample than in another. Violin plots and boxplots help you answer these questions by showing how your data points are distributed across different groups.

In this post, you will find out how to interpret violin plots and boxplots.

    So if you are ready… let’s dive in!

    Let’s start with some mice.

    In biology, we are often interested in the distribution of our data. In our previous post, we talked about density plots. Another great way to visualise and summarise the distribution of a continuous numeric variable is through violin plots and boxplots.

    Let’s begin with an example.

    We have two samples of mice: a group of diabetic mice and our control group. Let’s have a look at the weights of our 20 diabetic mice. We can already get a few descriptive statistics from our data, like the minimum and maximum values, the median and the mean.
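Here is a minimal sketch of how you could compute these descriptive statistics yourself. The weights below are made-up values for illustration (the post does not list the actual measurements), and the variable names are just my own choices.

```python
import statistics

# Hypothetical weights (in grams) for 20 diabetic mice; illustrative values only.
diabetic_weights = [
    38.2, 41.5, 35.9, 44.1, 39.7, 42.3, 37.4, 40.8, 43.6, 36.5,
    45.2, 39.1, 41.0, 38.8, 42.9, 40.2, 37.9, 44.7, 39.4, 41.8,
]

# The four descriptive statistics mentioned in the text.
summary = {
    "minimum": min(diabetic_weights),
    "maximum": max(diabetic_weights),
    "median": statistics.median(diabetic_weights),
    "mean": statistics.mean(diabetic_weights),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```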

     

    The key to boxplots: quartiles

    Other common descriptive statistics are the quartiles. There are 3 quartiles that divide a dataset into four parts.

    The first quartile, Q1 or lower quartile, is the value below which 25% of the data falls. In other words, 25% of the data points in the dataset are less than or equal to Q1.

    The second quartile, Q2, is the middle value of the dataset when ordered from least to greatest. It divides the dataset into two equal parts, with 50% of the data falling below Q2 and 50% above it. You probably know another word for this: the median.

    The third quartile, Q3, or upper quartile, is the value below which 75% of the data falls. In other words, 75% of the data points in the dataset are less than or equal to Q3.

    Squidtip

    To interpret a boxplot, think of dividing your dataset into four equal parts using Q1, Q2 (median) and Q3.
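As a rough sketch, the three quartiles can be computed with Python's built-in `statistics.quantiles`. The nine weights are hypothetical and chosen so that the quartiles land exactly on data points. Keep in mind that there are several conventions for calculating quartiles, so other tools may give slightly different numbers for the same data.

```python
import statistics

# Hypothetical mouse weights in grams (illustrative values, not from the post).
weights = [41, 36, 45, 39, 43, 38, 47, 40, 42]

# n=4 asks for the three cut points that split the data into four parts.
# method="inclusive" is one of two conventions offered by the statistics
# module; the default ("exclusive") can give slightly different cut points.
q1, q2, q3 = statistics.quantiles(weights, n=4, method="inclusive")

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
```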

    The interquartile range (IQR) is the difference between the third and the first quartile. It represents the spread of the middle 50% of the data and is often used as a robust measure of variability. Why robust? The IQR is less sensitive to outliers than the range, which goes from the minimum to the maximum. So if you have a really high value in your dataset, like one unusually heavy mouse, the range will change a lot, but the interquartile range will barely move, because it only measures the variability of the middle 50% of the data.

    Squidtip

    The IQR is less sensitive to outliers compared to the range.
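To see this robustness in numbers, here is a small sketch that compares the range and the IQR before and after adding one very heavy mouse. The weights, the 80 g mouse and the helper names `value_range` and `iqr` are all assumptions made for this illustration.

```python
import statistics

def value_range(data):
    """Range: distance from the minimum to the maximum."""
    return max(data) - min(data)

def iqr(data):
    """Interquartile range: Q3 minus Q1 (inclusive quartile method)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

# Hypothetical weights in grams, then the same sample plus one heavy mouse.
weights = [36, 38, 39, 40, 41, 42, 43, 45, 47]
with_heavy_mouse = weights + [80]

for label, data in [("original", weights), ("with heavy mouse", with_heavy_mouse)]:
    print(f"{label}: range = {value_range(data)}, IQR = {iqr(data)}")
```

In this made-up sample, the range jumps from 11 to 44 grams, while the IQR only moves from 4 to 5.25 grams.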

    Plot the quartiles to get a boxplot

    Let’s put all this into a plot.

    If we mark our minimum, maximum, Q1 and Q3, and the median…

    we get a boxplot!

    Squidtip

    The length of the box is the IQR.

    We can flip it over and see it vertically, which is how boxplots are often shown.

    How to interpret a boxplot

    So now, we can easily get a visual summary of our data. With a boxplot, we can quickly identify:

    MEDIAN

    The median is shown by the line that divides the box into two parts. Half the values are greater than or equal to the median, and half are less.

    DISPERSION

    The dispersion of the dataset is basically how stretched or squeezed our distribution is. We can observe both the range and the interquartile range.

    SIGNS OF SKEWNESS

    Skewness describes whether the data is symmetrical, or whether most values are concentrated towards one end.

    • When the median is in the middle of the box, and the whiskers are about the same on both sides of the box, then the distribution is symmetric.
    • When the median is closer to the bottom of the box, and if the whisker is shorter on the lower end of the box, then the distribution is positively skewed (skewed right).
    • When the median is closer to the top of the box, and if the whisker is shorter on the upper end of the box, then the distribution is negatively skewed (skewed left).
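The "where does the median sit inside the box" part of these rules can be turned into a number. The sketch below uses the quartile (Bowley) skewness coefficient, which is my own addition and not something used elsewhere in this post. It only looks at the box and ignores the whiskers, so treat it as a rough hint. The three datasets are made up.

```python
import statistics

def quartile_skewness(data):
    """Quartile (Bowley) skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1).

    0 means the median is centred in the box, positive values mean the
    median sits nearer Q1 (right skew), negative values nearer Q3 (left skew).
    """
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    if q3 == q1:  # box of zero length: the coefficient is undefined
        return 0.0
    return (q3 + q1 - 2 * q2) / (q3 - q1)

# Hypothetical datasets, one per bullet point above.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 2, 3, 4, 5, 8, 12, 20, 35]
left_skewed = [1, 16, 24, 28, 31, 32, 33, 34, 35]

for label, data in [("symmetric", symmetric),
                    ("right skewed", right_skewed),
                    ("left skewed", left_skewed)]:
    print(f"{label}: {quartile_skewness(data):+.2f}")
```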

    OUTLIERS

    An outlier is an observation that is numerically distant from the rest of the data. Sometimes, the whiskers of a boxplot don’t go all the way to the minimum and maximum. Instead, they stop at the most extreme data points that lie within 1.5 times the IQR of the box, that is, no lower than Q1 − 1.5 × IQR and no higher than Q3 + 1.5 × IQR. In other words, each whisker can be at most one and a half times the length of the box. The values that fall outside this range are the outliers, and they are shown as dots beyond the whiskers.
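The 1.5 × IQR rule can be sketched in a few lines. This assumes the common convention that the limits (often called Tukey fences) are measured from the edges of the box, and that each whisker ends at the most extreme data point inside them. The weights and function names are hypothetical.

```python
import statistics

def tukey_fences(data, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def whiskers_and_outliers(data, k=1.5):
    """Whisker ends are the most extreme points inside the fences;
    everything outside the fences is reported as an outlier."""
    low, high = tukey_fences(data, k)
    inside = [x for x in data if low <= x <= high]
    outliers = [x for x in data if x < low or x > high]
    return min(inside), max(inside), outliers

# Hypothetical weights in grams, including one unusually heavy mouse.
weights = [36, 38, 39, 40, 41, 42, 43, 45, 47, 80]

lower_whisker, upper_whisker, outliers = whiskers_and_outliers(weights)
print(f"whiskers: {lower_whisker} to {upper_whisker}, outliers: {outliers}")
```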

    Boxplots to compare groups in 4 simple steps

    Boxplots are especially useful to compare a continuous variable (for example, weight) across different groups.

    Let’s go back to the weights of the diabetic versus control mice.

    STEP 1

    The first thing we should check is the median, which is a good way to compare both groups. If the median line of a box plot lies outside of the box of a comparison box plot, then there is likely to be a difference between the two groups. The median weight is higher for diabetic mice.

    Note that to be able to say "significantly higher" we would have to use a statistical test. A boxplot is just a way to visualise the data; it does not test anything.

    STEP 2

    We can then compare the interquartile ranges (that is, the box lengths) to examine how dispersed the data is in each sample. The longer the box, the more dispersed the data. The shorter the box, the less dispersed the data. In this case, the spread of the data is higher in control mice, as we can see from the size of the box, which is the interquartile range.

    Next, look at the overall spread as shown by the extreme values at the ends of the two whiskers. This shows the range of values (another type of dispersion). Larger ranges indicate a wider distribution, that is, more scattered data.

    Squidtip

    Sometimes, the whiskers of a boxplot don’t go all the way to the minimum and maximum, but stop at the most extreme data points within 1.5 times the IQR below Q1 and above Q3.

    STEP 3

    Look for potential outliers. When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the boxplot. We can see a few outliers in the control group.

    STEP 4

    Look for signs of skewness – if the data does not appear symmetric. Do you expect a normal distribution, or should most of the values be concentrated towards one end?

    Nice! As you can see, boxplots are great to compare different groups or categories.
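If you prefer numbers to pictures, steps 1 and 2 can be sketched like this. The two groups are made-up weights (not the data behind the plots in this post), and `box_summary` and `median_outside_box` are hypothetical helper names. As with the boxplot itself, this is only a description of the data, not a statistical test.

```python
import statistics

def box_summary(data):
    """Collect the numbers a boxplot displays for one group."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": max(data) - min(data),
    }

def median_outside_box(group, other):
    """True if the median of `group` lies outside the box (Q1 to Q3) of `other`."""
    return not (other["q1"] <= group["median"] <= other["q3"])

# Hypothetical weights in grams for each group.
diabetic = box_summary([38, 40, 41, 42, 43, 44, 45, 46, 48])
control = box_summary([22, 25, 27, 29, 30, 32, 35, 38, 41])

# Step 1: compare the medians.
print("medians:", diabetic["median"], "vs", control["median"])
print("diabetic median outside control box:", median_outside_box(diabetic, control))

# Step 2: compare the box lengths (IQR) and the overall ranges.
print("IQRs:", diabetic["iqr"], "vs", control["iqr"])
print("ranges:", diabetic["range"], "vs", control["range"])
```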

    So… what about violin plots?

    We’ve seen how to visualise the distribution of a continuous variable (such as weight) with density plots (check out this post for more) and with boxplots.

    What if we could combine both?

    This will give us a violin plot. A violin plot is basically a boxplot with a rotated density plot on each side.

    With a violin plot, not only can we easily visualise stats like the median and IQR, but we also get the advantage of seeing the probability density of the data, so we can quickly see if our data looks normally distributed, or if it is skewed or multimodal. Wider sections of the violin plot indicate a higher density of data points, while narrower sections indicate lower density.
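To give a feel for where the width of the violin comes from, here is a minimal sketch of a kernel density estimate written by hand. It assumes a Gaussian kernel and a bandwidth of 1.5 that I picked by eye; plotting libraries choose the bandwidth automatically and may use different defaults. The weights are hypothetical and deliberately bimodal.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density of `data` at a point x
    by averaging one Gaussian bump (of width `bandwidth`) per data point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Hypothetical weights in grams with two clusters (around 25 g and 42 g).
weights = [24, 25, 25, 26, 27, 40, 41, 42, 42, 43]
density = gaussian_kde(weights, bandwidth=1.5)

# Text version of one side of the violin: longer bars mean higher density.
grid = range(20, 51, 2)
peak = max(density(x) for x in grid)
for x in grid:
    bar = "#" * round(30 * density(x) / peak)
    print(f"{x:>3} g | {bar}")
```

The two bulges in the printed bars correspond to the two clusters of weights, which is exactly the kind of multimodality a boxplot alone would hide.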

    Final notes on violin plots and boxplots

    In summary, violin plots and boxplots are particularly useful for comparing the distributions of a continuous variable across different groups or categories, allowing for easy identification of differences in central tendency, spread, and skewness.

    In this other post I explain how to create your own violin plots in R.

      Ending notes

      Wohoo! You made it ’til the end!

      In this post, I shared some insights on violin plots and boxplots.

      Hopefully you found some of my notes and resources useful! Don’t hesitate to leave a comment if anything is unclear, if you would like something explained differently or in more depth, or if you’re looking for more resources on biostatistics! Your feedback is really appreciated and it helps me create more useful content :)
