Differential function analysis
Difference Analysis
Differences observed in experimental results are made up of real differences and experimental error. To determine whether an observed difference reflects a real difference rather than error, the difference must be tested statistically (hypothesis testing). The test method is chosen according to whether the sample follows a specific distribution:
- Parametric test: used when the data follow a specific distribution (usually the normal distribution). The T-test and ANOVA are commonly used. Parametric tests can be applied to the Alpha diversity and Beta diversity of metagenomic data.
- Nonparametric test: used when it cannot be determined whether the data follow a particular distribution. The Wilcoxon and Kruskal-Wallis tests are commonly used. Nonparametric tests are usually applied to the abundance data of microbial community composition.
According to the type of samples being compared, tests can be divided into:
- Single sample: compare the experimental data with a specific value. For example, in biological fermentation, whether the yield of fermentation products in a fermentor reaches the preset value.
- Paired samples: test whether different treatments, or different time points, of the same subjects differ; the comparison groups must contain the same number of samples. For example, whether fermentation raw materials with different C/N ratios affect the fermentation ability of the same strain.
- Independent samples: compare data obtained from different batches of experiments; the number of samples need not be the same. For example, to determine whether an improved rice variety has a higher yield per mu, its yield per mu is compared with that of the variety before improvement.
Comparisons of microbial abundance between groups can involve either paired samples or independent samples.
T-test / ANOVA
Data suited to a parametric test can be described well by a distribution defined by one or more parameters, in most cases the normal distribution, so it is necessary to check that the sample conforms to the normal distribution before running a parametric test. Parametric tests test hypotheses about parameters such as the mean and variance. The T-test is used when there are 2 groups, and ANOVA is used when there are more than 2 groups.
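As a rough illustration of the two statistics named above, the following sketch computes a pooled-variance t statistic and a one-way ANOVA F statistic using only the Python standard library. The function names and the sample values are hypothetical, equal group variances are assumed, and converting the statistics to p values (which requires the t and F distributions) is left out.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's t statistic for two independent groups (equal variances assumed)."""
    na, nb = len(a), len(b)
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def anova_f_statistic(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical alpha-diversity values for three groups.
control = [5.1, 4.9, 5.6, 5.3]
treated = [6.0, 6.4, 5.9, 6.3]
recovered = [5.4, 5.8, 5.5, 5.7]

t_two_groups = pooled_t_statistic(control, treated)               # 2 groups: T-test
f_three_groups = anova_f_statistic(control, treated, recovered)   # >2 groups: ANOVA
```

With exactly two groups the F statistic equals the square of the pooled t statistic, which is one way to see ANOVA as the multi-group extension of the T-test.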
Wilcoxon / Kruskal-Wallis
The Wilcoxon rank-sum test is a nonparametric test for two groups of independent samples. The null hypothesis is that there is no significant difference in the distribution of a species between the two groups. The average rank of each group is calculated to determine whether the distributions of the two populations differ. This analysis tests for significant differences in species between the two groups and corrects the p values to obtain FDR values.
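The rank-based comparison described above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: it assumes the large-sample normal approximation without tie or continuity correction, and the function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    positions = defaultdict(list)
    for pos, v in enumerate(sorted(values), start=1):
        positions[v].append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p). No tie or continuity correction is applied.
    """
    na, nb = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:na])                           # rank sum of group a
    mu = na * (na + nb + 1) / 2                   # expected rank sum under H0
    sigma = sqrt(na * nb * (na + nb + 1) / 12)    # its standard deviation under H0
    z = (w - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical relative abundances of one species in two groups.
group_a = [0.01, 0.02, 0.03, 0.04, 0.05]
group_b = [0.06, 0.07, 0.08, 0.09, 0.10]
z, p = rank_sum_test(group_a, group_b)
```

For small samples an exact test is usually preferable to the normal approximation used here.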
The Kruskal-Wallis (KW) test is a nonparametric test for three or more groups of data. It generalizes the Mann-Whitney U test for two independent samples (equivalent to the Wilcoxon rank-sum test) to three or more groups, and tests whether the distributions of multiple populations differ significantly.
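A minimal sketch of the Kruskal-Wallis H statistic is shown below, assuming no tie correction; the function names and data are hypothetical. H is compared with a chi-square distribution with (number of groups - 1) degrees of freedom. The standard library has no chi-square function, so the example only converts H to a p value for the three-group case, where the chi-square survival function with 2 degrees of freedom is exp(-H/2).

```python
from collections import defaultdict
from math import exp


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    positions = defaultdict(list)
    for pos, v in enumerate(sorted(values), start=1):
        positions[v].append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (without tie correction)."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])  # ranks belonging to this group
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical abundances of one species in three groups.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
p_three_groups = exp(-h / 2)  # valid only for exactly three groups (2 degrees of freedom)
```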
KEGG pathway enrichment analysis
Differences in individual KOs are microscopic and make it difficult to see how a pathway changes as a whole. The Reporter Score [1] method applies a statistical test to all KOs involved in a pathway, so that the change of the pathway is reflected by the overall cumulative trend, linking the micro level to the macro level.
Reporter Score is calculated as follows:
The p value of each KO difference is obtained by the rank-sum test, and the Z value corresponding to each p value is obtained from the inverse normal distribution. The calculation is as follows:
$$Z_{KO_i} = \theta^{-1}(1 - P_{KO_i})$$
$\theta^{-1}$: the inverse of the standard normal cumulative distribution function
$KO_i$: the i-th KO in a pathway
$P_{KO_i}$: the p value obtained by the rank-sum test of the i-th KO between groups
Based on the Z values of its KOs, the Z value of the pathway is calculated, which "raises" the KO-level signal to the pathway level. The formula is as follows:
$$Z_{pathway} = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} Z_{KO_i}$$
$Z_{pathway}$: the Z value of a pathway
$k$: the total number of KOs annotated to the pathway
To evaluate significance, a random distribution of Z values is obtained by 1000 permutations for a given pathway and is used to correct the pathway Z value $Z_{pathway}$. The correction formula is as follows:
$$Z_{adjusted} = \frac{Z_{pathway} - \mu_k}{\sigma_k}$$
$\mu_k$: the mean of the Z values of the 1000 random pathways
$\sigma_k$: the standard deviation of the Z values of the 1000 random pathways
The correction makes the Z value follow the standard normal distribution N(0, 1). The corrected Z value is the Reporter Score. Z < -1.65 or Z > 1.65 corresponds to p < 0.05 (one-tailed).
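The three steps above can be sketched in Python as follows. This is an illustrative sketch rather than the analysis pipeline's code: it assumes that each random pathway is built by sampling k KOs without replacement from all tested KOs, clamps p values away from 0 and 1 so the inverse normal is defined, and uses hypothetical function names and KO identifiers.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def ko_z_scores(p_values, eps=1e-12):
    """Step 1: Z_i = inverse standard normal CDF of (1 - p_i)."""
    nd = NormalDist()
    # Clamp p into (0, 1); inv_cdf is undefined at exactly 0 or 1.
    return [nd.inv_cdf(1 - min(max(p, eps), 1 - eps)) for p in p_values]


def pathway_z(z_scores):
    """Step 2: aggregate k KO-level Z values into one pathway Z value."""
    return sum(z_scores) / sqrt(len(z_scores))


def reporter_score(pathway_kos, ko_p_values, n_perm=1000, seed=0):
    """Step 3: correct the pathway Z value against random pathways of size k.

    pathway_kos: KO ids annotated to the pathway.
    ko_p_values: dict mapping every tested KO id to its rank-sum p value.
    """
    rng = random.Random(seed)
    ids = list(ko_p_values)
    z_by_ko = dict(zip(ids, ko_z_scores([ko_p_values[i] for i in ids])))
    members = [ko for ko in pathway_kos if ko in z_by_ko]
    if not members:
        raise ValueError("no KO of this pathway has a p value")
    k = len(members)
    z_path = pathway_z([z_by_ko[ko] for ko in members])
    background = list(z_by_ko.values())
    # Null distribution: pathway Z values of n_perm random KO sets of size k.
    null = [pathway_z(rng.sample(background, k)) for _ in range(n_perm)]
    return (z_path - mean(null)) / stdev(null)


# Hypothetical data: 200 KOs with arbitrary p values, 10 of them strongly different.
rng = random.Random(1)
p_by_ko = {f"K{i:05d}": rng.uniform(0.01, 0.99) for i in range(200)}
enriched = [f"K{i:05d}" for i in range(10)]
for ko in enriched:
    p_by_ko[ko] = 0.001
score = reporter_score(enriched, p_by_ko)
```

Fixing `seed` makes the score reproducible; with a different seed the random pathways, and therefore the corrected value, change slightly.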
FAQ
Q:What is the basis for choosing Venn diagram and UpSetR?
A:Choose according to the number of groups. When there are fewer than 5 groups, either a Venn diagram or UpSetR can be used; when there are more than 5 groups, a Venn diagram cannot display that much data, so only UpSetR can be used.
Q:What are the principles of using parametric test and nonparametric test?
A:If the data conform to the normal distribution, prefer a parametric test over a nonparametric test, because nonparametric tests are less sensitive than parametric tests. If the data contain extreme values that cannot be removed, a nonparametric test can be used: it tests the ranks of the data, so extreme values have little impact on the results.
Q:How to interpret the results of the analysis of significant differences between groups?
A:Between-group difference analysis tests whether the difference between groups is significant at each taxonomic level. A hypothesis test is usually used. It yields a p value: the probability of observing a difference at least as large as the one in the data if the quantity being compared (such as the mean) were actually equal between groups. The p value is then corrected for multiple testing to obtain the q value, and the p value or q value is used to judge whether there is a significant difference between groups.
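The step from p values to q values can be illustrated with the Benjamini-Hochberg procedure. The text does not say which correction is used, so Benjamini-Hochberg is an assumption here, and the function name and p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p to the smallest so that each q value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical p values for four taxa tested between two groups.
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in q_values]
```

A q value is never smaller than the p value it was derived from, so fewer features pass a 0.05 cutoff after correction than before.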
Q:What does a statistical test p value mean?
A:The p value is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. A statistical test proposes two mutually exclusive hypotheses: $H_0$, called the null hypothesis, and $H_1$, called the alternative hypothesis. The purpose of the significance test is to decide whether to accept $H_0$ and reject $H_1$, or to reject $H_0$ and accept $H_1$. When p < 0.05, the observed data would be unlikely if $H_0$ were true, so $H_0$ is rejected and $H_1$ is accepted.
For example, to test whether the abundance of a certain species differs between groups, first state the hypotheses:
- $H_0$: The species does not differ between the two groups
- $H_1$: The species differs between the two groups
Then use the Wilcoxon test or T-test to calculate the p value, which is the probability of observing a difference at least this large if $H_0$ holds. If p < 0.05, the observed data would be very unlikely under $H_0$; since such a low-probability event is not expected to happen, we reject $H_0$ and conclude that the species differs between the two groups.
Q:Why are there no difference analysis results in some comparison groups?
A:Differential analysis requires at least 2 groups, and there should be ≥ 3 biological replicates in each group. For groups that do not meet the sample requirement, no difference analysis will be performed.
Q:In the enrichment analysis, the enriched Modules or Pathways and their p values are not completely consistent between two custom mapping results, or between a custom mapping result and the existing results; even the number of enriched items differs. Why is this?
A:This is due to the algorithm used in the analysis. The enrichment analysis performs permutation tests, and each permutation test draws data randomly, which leads to slightly different results. Because the criteria for judging enrichment are fixed, the number of enriched modules/pathways may vary. It is therefore recommended to focus on the results with large differences.
Q:What is a permutation test?
A:In statistics, a sample of the research object is collected to describe the whole object. The more samples are collected, the more accurate the description of the whole, but in practice the number of samples is often limited. Permutation tests are therefore used when the number of samples is small and the distribution is unknown.
The permutation test was originally proposed by Fisher in the 1930s and is essentially a resampling method. It randomly reshuffles all (or part) of the sample data and compares the statistic computed from each reshuffled sample with the statistic actually observed. Over a large number of permutations (999 by default in many R functions), it calculates the probability that the permuted statistic is greater than or equal to the observed statistic, which is the p value of the permutation test. Statistical inferences are then made based on the p value.
Note
Because the permutation test relies on random sampling, the results of repeated permutation tests are not completely identical.
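A minimal sketch of a two-group permutation test is shown below. The choice of statistic (absolute difference of group means), the function name, and the data are assumptions made for illustration. The observed arrangement is counted as one permutation, so the smallest possible p value with 999 permutations is 1/1000.

```python
import random
from statistics import mean


def permutation_test(a, b, n_perm=999, seed=None):
    """Permutation p value for the absolute difference of two group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add 1 to both counts so the observed data counts as one permutation.
    return (hits + 1) / (n_perm + 1)


# Hypothetical measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.3, 5.0, 5.2]
group_b = [6.0, 6.4, 5.9, 6.3, 6.1, 6.2]
p_value = permutation_test(group_a, group_b, seed=42)
```

Passing a fixed `seed` reproduces the same p value on every run; leaving it as `None` gives slightly different results each time, which is the behavior described in the note above.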
Reference
[1] Patil, K. R., & Nielsen, J. (2005). Uncovering Transcriptional Regulation of Metabolism by Using Metabolic Network Topology. Proceedings of the National Academy of Sciences, 102(8), 2685–2689. https://doi.org/10.1073/pnas.0406811102