You have run an experiment or a survey, or you have collected data, and you are ready to analyze it. But what is the first step? Everyone will tell you that you need to visualize your data with graphs. Yes, but how? And why? What are the most common pitfalls to avoid?
We are very grateful to Marion Louveaux, bio-image data analyst, for this combination, adaptation and translation of the two French articles "Les pièges de la représentation de données" and "Comment se passer d'un barbarplot ?".
Imagine a data set that contains the sizes of individuals from two groups, A and B. You want to know whether the size of individuals differs between the two groups. In this article, we'll first show you an example of a wrong approach to visualizing data, and then we'll give you advice on how to proceed the right way.
The script used to generate this dataset is available at the end of this article.
To compare two populations, you've probably learned to look at the mean and the standard deviation. Your first idea is to make a barplot of the means, with the standard deviation as an error bar. This type of chart is commonly called a "barbarplot". To calculate summary variables (here, the mean and the standard deviation) separately for each group, we use group_by + summarize.
library(tidyverse)
# mean and standard deviation of Size for each group
information <- data %>%
  group_by(Group) %>%
  summarize(Mean = mean(Size), SD = sd(Size))
ggplot(information, aes(x = Group, y = Mean, fill = Group)) +
  geom_bar(stat = "identity", position = position_dodge(), color = "grey30") +
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = .2, position = position_dodge(.9), size = 1) +
  scale_fill_viridis_d()
Well, it seems there is no difference here!
To be safe, you decide to take a look at the distribution of the data. You probably learned that a boxplot (geom_boxplot) is usually used to display a distribution. So here we go!
ggplot(data, aes(x = Group, y = Size, fill = Group)) +
  geom_boxplot(color = "grey30") +
  scale_fill_viridis_d() +
  scale_y_continuous(limits = c(0, NA), expand = c(0, 0)) +
  guides(fill = "none")
Well, with regard to size distributions, the two populations still seem quite similar.
The histogram of raw data
You are very tempted to stop here, but all of a sudden you remember that distributions can also be viewed with a histogram, even if you find it harder to read. So let's make a histogram (geom_histogram) with the raw data.
ggplot(data) +
  geom_histogram(aes(Size, fill = Group), position = "identity", alpha = 0.60, bins = 10) +
  scale_fill_viridis_d() +
  scale_y_continuous(limits = c(0, NA), expand = c(0, 0))
Oops! What is this mess?! The distribution in group B has two peaks!
Visualizing data before analyzing it is important. It helps to identify trends and potential effects. It also helps to spot extreme or anomalous values and to decide whether to remove or keep them, depending on the question. And it helps to choose the most appropriate statistical tests. Indeed, depending on the type of data (binary, linear, quadratic, bimodal, …), you will not run the same analysis. At this point, you're probably wondering why we're writing a new article today, when there are already many blog articles on data visualization, starting with our own "ggplot2: welcome viridis". Well, it comes from a simple observation: yes, displaying data with beautiful graphics is a good start, but it is even better to choose a display that is relevant to the data …
A bad choice of graphical representation can lead to charts that are difficult to read, but above all it can influence interpretation. You may see a difference where there is none or, on the contrary, miss a difference that is very important! Or you may have the impression that the measured variables are related when they are not, or the opposite. You do not want to misunderstand your own data, and you certainly do not want to present misleading charts to your readers and run the risk of someone thinking you are trying to lie about your data, right?
Speaking of misleading charts, here are some comments on the barbarplot shown above:
- The group legend should be removed, as was done for the boxplot. It makes no sense to have the same information in two different places.
- The X axis must be moved to the zero level of the Y axis, because the size of individuals can only be positive. We do not want to lie about the scale of the differences …
- The so-called error bar is in no way a visual means of representing the result of a possible statistical test comparing means: (i) the statistical difference between the means of two groups cannot be detected by comparing the standard deviations of the distributions; (ii) the statistical difference can be represented through the confidence interval of the mean estimate, not through this error bar.
- If you calculate the standard deviation of the distribution of the mean estimate (the standard error of the mean), you should display a bar of about twice that value on each side of the mean to represent the 95% confidence interval.
By the way, displaying a "p-value" has never been a guarantee of your sincerity, and it does not give you more confidence in a poor representation.
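The last two points of the list can be made concrete with a short computation. The sketch below is only an illustration: toy is a hypothetical stand-in dataset (not the data used in this article), and the column names N, SE, CI_low and CI_high are our own. It derives, for each group, the standard error of the mean and a 95% confidence interval based on the Student t distribution.

```r
library(dplyr)

# Hypothetical stand-in data, NOT the dataset used in this article
set.seed(1)
toy <- data.frame(
  Group = rep(c("A", "B"), times = c(50, 80)),
  Size  = c(rnorm(50, mean = 10, sd = 2), rnorm(80, mean = 10, sd = 2))
)

ci_summary <- toy %>%
  group_by(Group) %>%
  summarize(
    N       = n(),
    Mean    = mean(Size),
    SD      = sd(Size),          # spread of the individuals
    SE      = SD / sqrt(N),      # standard error of the mean
    CI_low  = Mean - qt(0.975, df = N - 1) * SE,
    CI_high = Mean + qt(0.975, df = N - 1) * SE
  )

ci_summary
```

If an error bar is meant to support a comparison of means, CI_low and CI_high could be passed to geom_errorbar() as ymin and ymax instead of Mean - SD and Mean + SD. For groups of this size, qt(0.975, df) is close to 2, which is where the "about twice" rule of thumb comes from.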
Remember that a graphical view is a simplification of the raw data. You may lose information. It is therefore important to choose an appropriate graphical display. In addition, representations such as barplots only use summary statistics of the data, namely the mean or median and the standard deviation. The boxplot shows additional statistics, but to know the distribution of the data, nothing beats a histogram or a violin plot for looking at the raw data.
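To see how much a summary can hide, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: toy, size_a and size_b are hypothetical and are not the dataset of this article. Group A has one peak, group B has two, and the parameters are chosen so that both groups have the same theoretical mean and standard deviation.

```r
library(dplyr)

# Hypothetical simulated data, NOT the dataset used in this article
set.seed(42)
n <- 200
size_a <- rnorm(n, mean = 10, sd = sqrt(10))   # one peak around 10
size_b <- c(rnorm(n / 2, mean = 7,  sd = 1),   # two peaks,
            rnorm(n / 2, mean = 13, sd = 1))   # around 7 and 13

toy <- data.frame(
  Group = rep(c("A", "B"), each = n),
  Size  = c(size_a, size_b)
)

# Summary statistics: this is all a barplot with error bars would use
summary_stats <- toy %>%
  group_by(Group) %>%
  summarize(Mean = mean(Size), SD = sd(Size))
summary_stats

# Counts per size class: this is what a histogram draws
breaks <- seq(floor(min(toy$Size)), ceiling(max(toy$Size)), by = 1)
counts <- table(toy$Group, cut(toy$Size, breaks = breaks, include.lowest = TRUE))
counts
```

The two rows of summary_stats should be close to each other, while the table of counts shows group B nearly empty in the middle classes where group A is densest.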
Exploration of raw data: selection of the appropriate graph
The first and most important (and usually forgotten) step is to take a look at the distribution of raw data, with one variable at a time.
Use a barplot (geom_bar) for count data in categories
The area of each bar represents a quantity: the barplot is like a side view of big potato bags in which you would have put all the items you wanted to count. If you cannot do this for real, the barplot is not the correct graphical representation. Here, we can use a barplot to compare the number of individuals in each group, because it would be possible to actually put all the individuals into two potato bags.
ggplot(data) +
  geom_bar(aes(Group, fill = Group), color = "grey30") +
  scale_fill_viridis_d() +
  scale_y_continuous(limits = c(0, NA), expand = c(0, 0)) +
  guides(fill = "none")
With this graph, we see that we do not have the same number of individuals in group A and B.
And yes, we cheated a little here to put all the individuals in the "bags", but this is just to give you an example …
Use a histogram (geom_histogram) to represent numerical data
While you cannot put the size values of all the individuals into potato bags, you can put all the individuals of the same size into the same bag: these bags are the bars of the histogram.
ggplot(data) +
  geom_histogram(aes(Size), bins = 30)
Here the size distribution does not correspond to a known simple distribution, but it seems there are at least two groups.
Exploring data using multiple variables
Once you know the distribution of each variable in the data set, you can take a look at their behavior in combination with another variable. This is where we can separate the histogram based on the groups.
For a final graph, the histogram is not the prettiest representation. For a start, we can smooth it a little with geom_density. Note that you can choose the degree of smoothing, just like the number of classes in a histogram.
ggplot(data) +
  geom_density(aes(Size, fill = Group), position = "identity", alpha = 0.60) +
  scale_fill_viridis_d() +
  scale_y_continuous(limits = c(0, NA), expand = c(0, 0))
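As a sketch of what "choosing the degree of smoothing" can look like, the example below uses the adjust argument of geom_density, which multiplies the default bandwidth. The dataset toy is hypothetical (simulated here, not the data of this article), and the values 0.5 and 3 are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical simulated data, NOT the dataset used in this article
set.seed(42)
toy <- data.frame(
  Group = rep(c("A", "B"), each = 200),
  Size  = c(rnorm(200, mean = 10, sd = 3),
            rnorm(100, mean = 7, sd = 1), rnorm(100, mean = 13, sd = 1))
)

# adjust < 1: narrower bandwidth, the curve follows the data more closely
p_rough <- ggplot(toy) +
  geom_density(aes(Size, fill = Group), position = "identity", alpha = 0.60, adjust = 0.5) +
  scale_fill_viridis_d()

# adjust > 1: wider bandwidth, smoother curve that can blur the two peaks of group B
p_smooth <- ggplot(toy) +
  geom_density(aes(Size, fill = Group), position = "identity", alpha = 0.60, adjust = 3) +
  scale_fill_viridis_d()

p_rough
p_smooth
```

Just as too few bins can hide a bimodal distribution in a histogram, too much smoothing can hide it in a density plot, so it is worth trying more than one value.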
The violin plot
The violin plot offers a nice alternative to the histogram. We can display the previous density distributions vertically for easier comparison. And we add the median as a bonus.
ggplot(data) +
  geom_violin(aes(Group, Size, fill = Group), position = "identity", draw_quantiles = c(0.5), color = "grey30") +
  scale_fill_viridis_d() +
  scale_y_continuous(limits = c(0, NA), expand = c(0, 0)) +
  guides(fill = "none")
It is no longer possible to hide the bimodal appearance of group B. However, in this figure, we cannot see how many individuals there are in each group. And we do not want to lie to our readers, do we? So we will now overlay the individuals on the violin plot.
Note that if you have few individuals in each group, the violin plot does not always look very good. In this case, you can choose to use only geom_dotplot, with a binwidth adapted to your data.
library(cowplot)
# sample size
sample_size <- data %>% count(Group)
data_size <- data %>%
  left_join(sample_size, by = "Group") %>%
  mutate(myaxis = paste0(Group, "\n", "n=", n))
# violin with median
g1
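The rest of the plotting code is cut off above. As a rough sketch of the idea only, and not a reconstruction of the original code, here is one way to overlay the individuals on a violin plot with geom_dotplot. The dataset toy, the object names toy_size and p, and the binwidth value are all assumptions made for this illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical small dataset, NOT the dataset used in this article
set.seed(42)
toy <- data.frame(
  Group = rep(c("A", "B"), times = c(15, 25)),
  Size  = c(rnorm(15, mean = 10, sd = 2), rnorm(25, mean = 12, sd = 3))
)

# Add the sample size of each group to its axis label
toy_size <- toy %>%
  add_count(Group) %>%
  mutate(myaxis = paste0(Group, "\n", "n=", n))

p <- ggplot(toy_size, aes(myaxis, Size, fill = Group)) +
  geom_violin(color = "grey30", alpha = 0.5) +
  # dots are binned along the y axis and stacked around the centre of each violin
  geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 0.3) +
  scale_fill_viridis_d() +
  labs(x = "Group") +
  guides(fill = "none")

p
```

The binwidth sets both the size of the dots and how finely the sizes are grouped, so it usually needs to be tuned to the range of the data.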
With this type of data, we now know that we cannot go further in the statistical analysis without violating the assumptions on which some tests are built …
- always look first at the distribution of raw data
- check the variables one by one, then take a look at the relationships between variables
- never run a statistical test before the exploration steps above
- for a small number of individuals, superimpose the points on the chart (for example, use geom_dotplot in combination with a violin plot, as above)
- choosing the wrong graphical representation can make people think that you are trying to lie about the data. Take a look at the data-to-viz.com website to choose an appropriate graphical representation that will leave no doubt about your intellectual honesty.
Do not hesitate to leave your comments and questions at the end of the article. We will read them carefully.
- To create the same fake data set:
set.seed(4321) data
- For the charts in this article, we have defined a default ggplot2 theme using theme_set(theme_classic()).
Translated and adapted by Marion Louveaux, bio image data analyst
The post "The pitfalls in data visualization: how to avoid barbarplots?" appeared first on (en) The R Task Force.