Differential species analysis

Dr.Tom

Difference analysis

The differences observed in experimental results are made up of real differences and experimental error. To determine whether an observed difference reflects a real difference rather than error, the difference must be tested statistically (also known as hypothesis testing). The appropriate test depends on whether the samples follow a specific distribution:

  • Parametric test: used when the data follow a specific distribution (usually the normal distribution). The T-test and ANOVA are commonly used. Parametric tests can be applied to the Alpha diversity and Beta diversity of metagenomic data.
  • Nonparametric test: used when it cannot be determined whether the data follow a particular distribution. The Wilcoxon and Kruskal-Wallis tests are commonly used. Abundance data for microbial community composition are usually analyzed with nonparametric tests.

According to the type of samples being compared, tests can be divided into:

  • Single sample: the experimental data are compared with a specific value. For example, in biological fermentation, whether the yield of fermentation products in a fermenter reaches the preset value.
  • Paired samples: the same subjects are compared under different treatments or at different time points, so the comparison groups must contain the same number of samples. For example, whether fermentation raw materials with different C/N ratios affect the fermentation ability of the same strain.
  • Independent samples: data from different batches of experiments are compared, and the number of samples need not be the same. For example, to determine whether an improved rice variety has a higher yield per mu, its yield per mu is compared with that of the variety before improvement.

Comparisons of microbial abundance between groups may involve either paired samples or independent samples.
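
The three sample types lead to three different test statistics. The following is a minimal sketch using only the Python standard library; the function names and the yield values are hypothetical and chosen for illustration, and it computes t statistics only (a statistics package would also supply the p value).

```python
from math import sqrt
from statistics import mean, stdev, variance

def one_sample_t(values, target):
    """t statistic for comparing a sample mean with a fixed value."""
    return (mean(values) - target) / (stdev(values) / sqrt(len(values)))

def paired_t(first, second):
    """t statistic for paired samples: a one-sample test on the differences."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for a, b in zip(first, second)]
    return one_sample_t(diffs, 0.0)

def welch_t(a, b):
    """t statistic for independent samples (sizes and variances may differ)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error

# Hypothetical yields of the same five strains before and after a treatment
before = [5.0, 6.2, 7.1, 8.3, 9.0]
after = [5.6, 6.9, 7.6, 9.0, 9.5]

t_single = one_sample_t(after, 7.0)      # does the yield differ from a preset value?
t_paired = paired_t(before, after)       # same strains measured twice
t_independent = welch_t(before, after)   # treats the two lists as unrelated batches

print(f"one-sample t = {t_single:.3f}")
print(f"paired t = {t_paired:.3f}")
print(f"independent t = {t_independent:.3f}")
```

In this made-up data every strain improves by a similar amount, so the paired statistic is much larger than the independent one. This is why the choice of sample type matters when the same subjects are measured under each treatment.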

T-test / ANOVA

Parametric tests assume that the data are well described by a distribution defined by one or more parameters, in most cases the normal distribution, so it is necessary to check whether the samples follow a normal distribution before applying a parametric test. Parametric tests make inferences about population parameters such as the mean and variance. The T-test is used when there are 2 groups, and ANOVA is used when there are more than 2 groups.
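
To illustrate how the two tests relate, here is a small sketch of the pooled-variance t statistic and the one-way ANOVA F statistic in plain Python. The function names and the diversity values are assumptions made for this example; it stops at the test statistics and does not compute p values.

```python
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def anova_f_statistic(*groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical Alpha diversity (Shannon index) values for three groups
group_a = [3.1, 3.4, 2.9, 3.3]
group_b = [3.8, 4.0, 3.7, 4.1]
group_c = [3.0, 3.2, 3.5, 3.1]

t_ab = pooled_t_statistic(group_a, group_b)
f_ab = anova_f_statistic(group_a, group_b)
f_abc = anova_f_statistic(group_a, group_b, group_c)

print(f"t(A, B) = {t_ab:.3f}")
print(f"F(A, B) = {f_ab:.3f}  (equals t squared for two groups)")
print(f"F(A, B, C) = {f_abc:.3f}")
```

With exactly two groups the F statistic equals the square of the pooled t statistic, so ANOVA can be read as the extension of the T-test to more than two groups.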

Wilcoxon / Kruskal-Wallis

The Wilcoxon rank-sum test is a nonparametric test for two groups of independent samples. The null hypothesis is that the overall distribution of the species does not differ between the two groups. The test ranks all observations together and compares the average ranks of the two groups to determine whether the two populations differ in distribution. In this analysis it is used to test for significant differences in species abundance between two groups of samples, and the resulting p values are corrected for multiple testing to give FDR values.
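
The ranking idea can be sketched in a few lines of plain Python. This is an illustrative version that uses the large-sample normal approximation without tie or continuity correction, so its p value may differ slightly from what a statistics package reports; the function names and abundance values are hypothetical.

```python
from statistics import NormalDist

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    sorted_vals = sorted(values)
    # first position of v (0-based) plus the midpoint of its tied block
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]

def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:n1])                      # rank sum of the first group
    mu = n1 * (n1 + n2 + 1) / 2              # expected rank sum under H0
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p

# Hypothetical relative abundances of one species in two groups
control = [0.12, 0.15, 0.11, 0.18, 0.14]
treated = [0.25, 0.22, 0.30, 0.27, 0.24]

w, z, p = rank_sum_test(control, treated)
print(f"W = {w}, z = {z:.3f}, p = {p:.4f}")
```

Because only the ranks enter the calculation, replacing the largest value with an even larger one leaves the result unchanged.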

Kruskal-Wallis (KW) is a nonparametric test for three or more groups of data. It generalizes the Mann-Whitney U test for two independent samples (equivalent to the Wilcoxon rank-sum test) to multiple groups (three or more), and it is used to test whether the distributions of multiple populations differ significantly.
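
As a rough sketch, the H statistic can be computed from the pooled ranks as shown below. The code omits the tie correction, and the p value line is valid only for exactly three groups, where the chi-square approximation has 2 degrees of freedom; all names and data are made up for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]

def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted_sum, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted_sum += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * weighted_sum - 3 * (n_total + 1)

# Hypothetical abundances of one species in three groups
g1 = [1.2, 1.5, 1.1, 1.4]
g2 = [2.3, 2.1, 2.6, 2.4]
g3 = [3.5, 3.2, 3.8, 3.6]

h_stat = kruskal_wallis_h(g1, g2, g3)
# Chi-square survival function with 2 degrees of freedom (three groups only)
p_value = math.exp(-h_stat / 2)
print(f"H = {h_stat:.3f}, approximate p = {p_value:.4f}")
```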

LEfSe analysis

LEfSe (LDA Effect Size)[1] is an analytical tool for discovering and interpreting biomarkers (genes, pathways, taxa, etc.) in high-dimensional data, and it can compare two or more groups. LEfSe emphasizes both statistical significance and biological relevance, and it identifies biomarkers that differ significantly between groups.

There are three steps to LEfSe analysis:

  1. First, the Kruskal-Wallis rank-sum test is used to detect differences in species abundance between groups and to obtain the significantly different features (taxa at the specified taxonomic level).
  2. For the significantly different features from the previous step, the groups are then compared with one another using the Wilcoxon rank-sum test.
  3. Finally, LDA (linear discriminant analysis) is used to estimate how much each differential species contributes to the difference between groups, which gives the LDA score.

LDA [2] is a classical and widely used algorithm in machine learning and data mining. It is a supervised dimensionality reduction technique, meaning that every sample in the data set carries a class label. PCA, by contrast, is an unsupervised dimensionality reduction technique that ignores class labels. Compared with PCA, LDA makes use of the sample grouping information, so it is better suited to finding the features that separate the groups.
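
To show how LDA uses the group labels, here is a minimal sketch of Fisher's two-class discriminant for samples with two features. It is not the LEfSe implementation and does not produce LEfSe's LDA score; the function names and abundance values are assumptions made for this example.

```python
def mean_vector(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter_matrix(points, center):
    """2x2 scatter matrix: sum of outer products of the centered points."""
    sxx = sum((p[0] - center[0]) ** 2 for p in points)
    syy = sum((p[1] - center[1]) ** 2 for p in points)
    sxy = sum((p[0] - center[0]) * (p[1] - center[1]) for p in points)
    return [[sxx, sxy], [sxy, syy]]

def fisher_direction(class_a, class_b):
    """Return w = Sw^-1 (mean_b - mean_a), the direction that best separates the classes."""
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter_matrix(class_a, ma), scatter_matrix(class_b, mb)
    s = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]  # within-class scatter
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    d = [mb[0] - ma[0], mb[1] - ma[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1], inv[1][0] * d[0] + inv[1][1] * d[1]]

def project(points, w):
    return [p[0] * w[0] + p[1] * w[1] for p in points]

# Hypothetical abundances of two taxa (taxon 1, taxon 2) in two sample groups
group_a = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.6), (1.2, 2.2)]
group_b = [(3.0, 1.0), (3.5, 1.4), (4.0, 0.9), (3.2, 1.1)]

w = fisher_direction(group_a, group_b)
proj_a, proj_b = project(group_a, w), project(group_b, w)
print("direction:", [round(x, 3) for x in w])
print("group A projections:", [round(x, 2) for x in proj_a])
print("group B projections:", [round(x, 2) for x in proj_b])
```

The direction is computed from the class means and the within-class scatter, both of which depend on the labels. PCA would instead look only at the overall variance of the pooled samples.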

FAQ

Q:How do I choose between a Venn diagram and UpSetR?

A:Choose according to the number of groups. When there are fewer than 5 groups, either a Venn diagram or UpSetR can be used; when there are more than 5 groups, a Venn diagram cannot display that many intersections, so only UpSetR can be used.

Q:What are the principles of using parametric test and nonparametric test?

A:If the data follow a normal distribution, a parametric test is preferred over a nonparametric test, because nonparametric tests are less sensitive than parametric tests. If the data contain extreme values that cannot be removed for some reason, a nonparametric test can be used: it works on the ranks of the data, so extreme values have little effect on the result.

Q:How to interpret the results of the analysis of significant differences between groups?

A:Between-group difference analysis tests whether the differences between groups are significant at each taxonomic level. A hypothesis test is used to obtain a p value, which is the probability of observing a difference in the summary statistic (such as the mean) at least as large as the one observed if the groups were in fact the same. The p values are then corrected for multiple testing to obtain q values, and the p value or q value is used to judge whether there is a significant difference between groups.
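
The correction from p values to q values can be sketched as follows. This example assumes the Benjamini-Hochberg procedure, which is a common choice for FDR control; the text above does not state which correction the system applies, and the p values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so q values never decrease with rank
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

# Hypothetical p values from testing five species
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = benjamini_hochberg(p_values)

for p, q in zip(p_values, q_values):
    flag = "significant" if q < 0.05 else "not significant"
    print(f"p = {p:.3f}  q = {q:.5f}  {flag}")
```

In this example four species have p < 0.05, but only two remain significant at q < 0.05, which shows why the choice between p and q matters when many species are tested at once.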

Q:What does a statistical test p value mean?

A:The p value is the probability, assuming that $H_0$ is true, of obtaining a result at least as extreme as the one observed. A statistical test sets up two mutually exclusive hypotheses: $H_0$ is called the null hypothesis and $H_1$ is called the alternative hypothesis. The purpose of the significance test is to decide whether to retain $H_0$, or to reject $H_0$ in favor of $H_1$. When p < 0.05, the observed data would be unlikely if $H_0$ were true, so $H_0$ is rejected and $H_1$ is accepted.

For example, to test whether the abundance of a certain species differs between groups, first state the hypotheses:

  • $H_0$: The species does not differ between the two groups
  • $H_1$: The species differs between the two groups

Then use the Wilcoxon test or T-test to calculate the p value, which is the probability of seeing a difference at least this large if $H_0$ holds. If p < 0.05, such a result would be very unlikely under $H_0$, so we reject $H_0$ and conclude that the species differs between the two groups.
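
One way to make the meaning of the p value concrete is an exact permutation test, sketched below. It is used here only as an illustration and is not one of the tests named above; the function name and abundance values are hypothetical. If the null hypothesis holds, the group labels are interchangeable, so the p value is the fraction of all possible relabelings that give a difference at least as large as the observed one.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(a, b):
    """Exact two-sided permutation p value for the difference in group means."""
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    extreme = total = 0
    # Enumerate every way of assigning len(a) of the pooled values to the first group
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        perm_a = [pooled[i] for i in idx]
        perm_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding
        if abs(mean(perm_a) - mean(perm_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical relative abundances of one species in two groups
group_1 = [0.12, 0.15, 0.11, 0.18]
group_2 = [0.25, 0.22, 0.30, 0.27]

p = permutation_p_value(group_1, group_2)
print(f"p = {p:.4f}")
```

Here the two groups do not overlap, so only 2 of the 70 possible relabelings are as extreme as the observed data. The p value describes how surprising the data are under the null hypothesis, not the probability that the null hypothesis is true.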

Q:Why are there no difference analysis results in some comparison groups?

A:Differential analysis requires at least 2 groups, and there should be ≥ 3 biological replicates in each group. For groups that do not meet the sample requirement, no difference analysis will be performed.

Q:How to modify LEfSe group colors?

A:Currently, the system does not support modifying the LEfSe group color.

Q:What is the LEfSe analysis process?

A:The LEfSe analysis process is as follows:

  1. First, use the Kruskal-Wallis rank sum test to detect the difference in species abundance between different groups, and obtain significant differences (specified taxonomic level);
  2. For the significant differences in the previous step, compare the groups participating in the test, and use the grouped Wilcoxon rank sum test;
  3. Finally, use LDA (Linear discriminant analysis) to estimate the impact of these different species on the difference between groups, that is, to obtain the LDA score.

Q:What does the LDA score represent?

A:The LDA score represents the contribution of the species to the difference between groups.

Q:At which taxonomic level does LEfSe analysis screen for differences?

A:LEfSe analysis reports taxa with significant differences at all taxonomic levels.

References


  1. Segata, N., Izard, J., Waldron, L. et al. Metagenomic biomarker discovery and explanation. Genome Biol 12, R60 (2011). https://doi.org/10.1186/gb-2011-12-6-r60

  2. Fisher, R. A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(2), 179–188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x