Species association analysis

Dr.Tom

Species Environmental Factor Association Analysis

Correlation

The variables involved in a study are related to one another, some tightly and some loosely. There are two ways to express the relationship between interrelated variables: correlation and regression (a functional relationship). Correlation is not a deterministic relationship: when the value of one or several variables changes, the value of the associated variable also changes, but not by a fixed amount. Regression describes a deterministic relationship: the value of one variable can be obtained from the values of one or several others through a regression equation (functional equation). In correlation, the two variables have equal status, and a change in one is not assumed to cause the change in the other, so correlation does not imply causation. Regression, by contrast, designates independent and dependent variables, chosen according to the real-world relationship between the quantities, so a regression relationship usually carries a causal interpretation. Because of these differences, data analysis generally examines correlation first and then determines the functional (regression) relationship between variables once the correlation is clear.

In practice, correlation is usually calculated from sample data obtained in experiments. Extending a result from the sample to the population (for example, deciding whether a difference in the population is related to variation in the relevant data) requires statistical inference.

Sample: a portion of the population, usually the experimental samples; this is the data actually available
Population: the complete set of all possible observations, which is often hard to obtain

Correlation classification

Correlation can be classified from several perspectives. By strength, it is divided into complete correlation, weak correlation and no correlation. By direction, it is divided into positive correlation and negative correlation. These two classifications are the most commonly used. Two further classifications are worth highlighting.

  • By form, correlation can be divided into linear correlation and nonlinear correlation. If, when the value of one variable changes, the other variable changes at a roughly constant rate, the observed values of the two variables lie roughly on a straight line in a rectangular coordinate system, and the correlation between the two variables is linear. If the observed values of the two variables follow a curve in the rectangular coordinate system, the correlation between them is nonlinear.
  • By the number of variables, correlation can be divided into single correlation, complex (multiple) correlation and partial correlation. Single correlation is the relationship between two variables, one dependent and one independent; its analysis is also called two-variable (bivariate) correlation analysis. Complex correlation refers to the relationship among three or more variables, that is, the correlation between one dependent variable and two or more independent variables. Partial correlation combines features of both: when a variable is correlated with several other variables but only its relationship with one of them is of interest, the influence of the remaining variables must be held constant, and the resulting correlation is called partial correlation.
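As a rough illustration of partial correlation, the sketch below uses the standard first-order formula, which removes the influence of a single third variable z from the correlation between x and y. The function name and the example coefficients are hypothetical, chosen only for illustration.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("z is perfectly correlated with x or y; result undefined")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical pairwise coefficients: abundance (x), pH (y), temperature (z).
# Here the x-y correlation is fully accounted for by z, so the result is close to 0.
print(partial_correlation(r_xy=0.6, r_xz=0.8, r_yz=0.75))
```

When z is uncorrelated with both x and y, the partial correlation equals the ordinary correlation between x and y.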

Two-variable correlation analysis

Two-variable correlation analysis examines the degree of correlation between two variables measured on the same objects (samples), that is, how closely the change in one variable follows or opposes the change in the other. Examples include whether the abundance of a certain microorganism in environmental samples is correlated with environmental pH, and whether pH is correlated with temperature. Note that regardless of its strength, correlation does not indicate causation.

Correlation coefficient

The correlation between two variables is described by a correlation coefficient. Commonly used coefficients include the Pearson correlation coefficient (suitable for quantitative data that are normally distributed and linearly correlated), the Spearman correlation coefficient (quantitative or rank data, any distribution, nonlinear correlation), and the Kendall correlation coefficient (ordered categorical variables, nonlinear correlation). Correlation coefficients are unit-free, so correlations computed from different data types can be compared directly. Only the Pearson and Spearman correlation coefficients, which are commonly used in metagenomic analysis, are introduced here.

Calculation of Pearson correlation coefficient

The Pearson correlation coefficient measures the degree of linear correlation between two continuous variables. It is denoted by the letter $r$ for a sample (the experimental sample data) or by the Greek letter $\rho$ for the population (all samples of a group). Its value lies in $[-1, +1]$: 1 indicates a perfect positive linear correlation, 0 indicates no linear correlation, and -1 indicates a perfect negative linear correlation. When using the Pearson correlation coefficient, the following conditions should be noted:

1. The two variables need to be linearly correlated, because for a nonlinear relationship the Pearson correlation coefficient cannot indicate the strength of the correlation.
2. The Pearson correlation coefficient requires the two variables to follow a bivariate normal distribution. This is not simply a requirement that x and y each follow a normal distribution, but that they follow a joint bivariate normal distribution.

Pearson correlation coefficient is calculated as follows:

$$r_{x, y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}$$

$\mathrm{cov}(x, y)$: the covariance of the variables x and y

$\sigma_x$: the standard deviation of variable x

$\sigma_y$: the standard deviation of variable y
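The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation; it assumes sample covariance and sample standard deviations (divisor n - 1), and the function name `pearson_r` is chosen here for illustration.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation: cov(x, y) / (sigma_x * sigma_y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and sample standard deviations (divisor n - 1).
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant variable has zero standard deviation")
    return cov_xy / (sd_x * sd_y)


# Made-up example: pH values and the relative abundance of one taxon.
ph = [5.5, 6.0, 6.5, 7.0, 7.5]
abundance = [0.10, 0.14, 0.19, 0.21, 0.27]
print(pearson_r(ph, abundance))
```

Because the divisor n - 1 appears in both the numerator and the denominator, using the population versions (divisor n) would give the same coefficient.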

Also note:

  1. Extreme values in the sample can strongly influence the Pearson correlation coefficient and should be examined and handled carefully. If necessary, they can be removed, or the variable can be transformed, to avoid wrong conclusions caused by outliers
  2. Neither variable may contain missing values; the coefficient cannot be calculated for a variable with missing values
  3. If a variable does not change, its standard deviation is 0, the denominator of the formula is 0, and the result cannot be calculated, so neither variable may be constant
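The second and third notes can be checked before any calculation. The helper below is a hypothetical sketch (its name and the choice to encode missing values as `None` or NaN are assumptions); outlier handling is left out because it depends on the data.

```python
import math


def check_pearson_inputs(x, y):
    """Raise ValueError if the paired data cannot be used for Pearson's r."""
    if len(x) != len(y):
        raise ValueError("x and y must be paired (same length)")
    for name, values in (("x", x), ("y", y)):
        # Missing values are assumed to be encoded as None or NaN.
        if any(v is None or math.isnan(v) for v in values):
            raise ValueError(f"{name} contains a missing value")
        # A constant variable has standard deviation 0, so r is undefined.
        if len(set(values)) <= 1:
            raise ValueError(f"{name} is constant, so its standard deviation is 0")


check_pearson_inputs([5.5, 6.0, 6.5], [0.10, 0.14, 0.19])  # passes silently
```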

Spearman correlation coefficient

The Spearman correlation coefficient [1] is calculated from the ranks of the two variables, so it places no requirement on the distribution of the original variables. It can be calculated even when the original data are ordinal (e.g., the level of efficacy of two drugs for a disease symptom: ineffective, improved, markedly effective). Spearman's correlation coefficient can also be calculated for data that meet the requirements of Pearson's correlation coefficient, but its statistical power is lower than Pearson's (a real correlation between the two variables is less likely to be detected). If there are no tied values in the data and the two variables are perfectly monotonically related, Spearman's correlation coefficient is +1 or −1. Even if the data contain outliers, their ranks usually do not change much (for example, a value that is far too large or too small is simply ranked first or last), so their influence on Spearman's correlation coefficient is very small. The Spearman correlation coefficient is denoted by the Greek letter $\rho$ (rho).

Calculation formula:

$$\rho_{X, Y} = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)}$$

$d_i$: the difference between the ranks of X and Y for observation i

$n$: the number of observations in each variable
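A minimal sketch of the rank-difference formula is shown below. It assumes there are no tied values (ties require average ranks, which are omitted here), and the function names are chosen for illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; ties are not handled in this sketch."""
    if len(set(values)) != len(values):
        raise ValueError("tied values need average ranks, which this sketch omits")
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman's rho from rank differences: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# A monotonic but nonlinear relationship still gives rho = 1,
# and replacing the largest value with an extreme one leaves the ranks unchanged.
x = [10, 20, 30, 40, 50]
print(spearman_rho(x, [1, 4, 9, 16, 25]))
print(spearman_rho(x, [1, 4, 9, 16, 2500]))
```

Both calls return the same coefficient, which reflects the robustness to outliers described above.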

Correlation coefficient statistical inference

To generalize the sample correlation $r$ to the population correlation $\rho$, statistical inference is required.

  • Pearson coefficient: when the sample size is large ($n > 30$), the correlation coefficient $r$ approximately follows a normal distribution; otherwise it is treated with the t distribution, so the statistic used for statistical inference is t.
  • Spearman coefficient:

The null hypothesis $H_0$ is that the two variables are not correlated, that is, $\rho = 0$; the alternative hypothesis $H_1$ is that the two variables are correlated. t can be calculated by the following formula:

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

$r$: the correlation coefficient

$n$: the sample size

t follows the t distribution with $n - 2$ degrees of freedom, and the corresponding $P$ value can be looked up in a t distribution table from the t value and the degrees of freedom. In general, when $P < 0.05$ the two variables are considered significantly correlated.
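Instead of a lookup table, the P value can be computed numerically. The sketch below is one way to do it using only the Python standard library: it integrates the t density with Simpson's rule to obtain a two-tailed P value. The function names and the number of integration steps are assumptions made for illustration; statistical packages use more refined routines.

```python
import math


def t_statistic(r: float, n: int) -> float:
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(r: float, n: int, steps: int = 2000) -> float:
    """Two-tailed P value, integrating the t density with Simpson's rule."""
    df = n - 2
    upper = abs(t_statistic(r, n))
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1 - 2 * central_area)


# Hypothetical result: r = 0.45 observed in 25 samples.
print(t_statistic(0.45, 25), two_tailed_p(0.45, 25))
```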

Please pay attention:

  1. A strong correlation ($|r| > 0.5$) between two variables is not necessarily significant; likewise, a weak correlation is not necessarily non-significant. When the sample size is small, a large coefficient can arise by chance, and rejecting the null hypothesis on that basis risks a type I error. Conclusions of significant correlation should therefore be drawn with caution when the sample size is small.
  2. It is also important to note whether the significance test is one-tailed or two-tailed. The two-tailed test is commonly used in microbial analysis.
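To illustrate the first point with the t formula, consider two hypothetical cases (the values of r and n are made up for illustration):

$$
t = 0.6\sqrt{\frac{5 - 2}{1 - 0.6^2}} \approx 1.30 \quad (r = 0.6,\ n = 5), \qquad
t = 0.2\sqrt{\frac{200 - 2}{1 - 0.2^2}} \approx 2.87 \quad (r = 0.2,\ n = 200)
$$

With 3 degrees of freedom, the two-tailed critical value at the 0.05 level is about 3.18, so the strong correlation in the first case is not significant. With 198 degrees of freedom, the critical value is about 1.97, so the weak correlation in the second case is significant.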

Linear regression

In data analysis, correlation analysis is generally carried out first, and the functional (regression) relationship between variables is determined once the correlation is clear. Linear regression is a statistical method that uses regression analysis to determine the relationship between one or more independent variables and a dependent variable. In the regression analysis of environmental factors against ordination results, the value of the environmental factor is taken as the X-axis, and the sample score on the first ordination axis (from an ordination analysis such as PCA) or the Alpha diversity index is taken as the Y-axis. A linear regression is then fitted, shown on a scatter plot and labeled with $R^2$, which is used to evaluate the relationship between the two. $R^2$ is the coefficient of determination, the proportion of variation explained by the regression line. For a more reliable result, more than 10 samples should be used.
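The fit and the coefficient of determination can be sketched as follows. This is a minimal ordinary least-squares example with one independent variable; the function name and the example values are hypothetical.

```python
def linear_regression(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or ss_tot == 0:
        raise ValueError("x and y must each vary")
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up example: environmental factor (X) against an Alpha diversity index (Y).
factor = [5.5, 6.0, 6.5, 7.0, 7.5, 8.0]
diversity = [2.1, 2.4, 2.6, 3.1, 3.0, 3.6]
slope, intercept, r_squared = linear_regression(factor, diversity)
print(slope, intercept, r_squared)
```

For a single independent variable, $R^2$ equals the square of the Pearson correlation coefficient between the two variables.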

Two matrix correlation analysis

CCA / RDA

CCA is an ordination method developed from correspondence analysis, combining correspondence analysis with multiple regression analysis; RDA is its linear counterpart, combining principal component analysis with multiple regression analysis. Each step of the calculation involves regression on the environmental factors, so the approach is also known as multivariate direct gradient analysis. This analysis is mainly used to reflect the relationship between the microbial community and environmental factors. RDA is based on a linear model and CCA on a unimodal model. The analysis can detect the relationships among environmental factors, samples and the microbial community, or the relationship between any two of them. The resulting plots for CCA and RDA use dots to represent samples and arrows from the origin to represent environmental factors. The length of an arrow represents the strength of the influence of that environmental factor on community change: the longer the arrow, the greater the influence. The angle between an arrow and an axis represents the correlation between the environmental factor and that axis: the smaller the angle, the higher the correlation.

Selection principle for the RDA or CCA model: first run DCA on the species abundance table and look at the gradient length of the first axis in the results. If it is greater than 4.0, CCA should be selected; if it is between 3.0 and 4.0, either RDA or CCA can be selected (in this analysis, CCA is used when the value is ≥ 3.5); if it is less than 3.0, RDA gives better results than CCA.
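The selection rule can be written as a small decision function. This is only a sketch of the thresholds stated above; the function name is hypothetical, and the 3.5 cutoff inside the 3.0 to 4.0 range follows the convention noted for this analysis.

```python
def choose_constrained_ordination(first_axis_length: float, cutoff: float = 3.5) -> str:
    """Pick CCA or RDA from the DCA gradient length of the first axis."""
    if first_axis_length > 4.0:
        return "CCA"  # long gradient: unimodal model
    if first_axis_length < 3.0:
        return "RDA"  # short gradient: linear model
    # Between 3.0 and 4.0 either model is acceptable; apply the cutoff.
    return "CCA" if first_axis_length >= cutoff else "RDA"


for length in (2.4, 3.2, 3.5, 4.6):
    print(length, choose_constrained_ordination(length))
```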

(1) The bioenv function is used to find the subset of environmental factors whose distance matrix has the maximum Pearson correlation with the species-based distances between samples.
(2) CCA or RDA is performed on the sample species abundance table with either all environmental factors or the selected subset.
(3) The significance of the CCA or RDA result is assessed with permutest, a permutation test analogous to ANOVA.

Mantel Test

The Mantel test, named after Nathan Mantel, is a statistical test first published in 1967 [2]. It tests the correlation between two matrices of the same dimensions and is widely used for ecological data. It is often used to analyze the correlation between a sample abundance distance matrix and an environmental factor distance matrix. In that case, the question being tested is whether the variation between samples is related to the variation in environmental factors.
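The idea can be sketched in plain Python: correlate the upper triangles of the two distance matrices, then permute the sample labels of one matrix to build a null distribution. This is a simplified, one-sided illustration under stated assumptions (Pearson correlation, 999 permutations, a fixed random seed, hypothetical function names), not the implementation used by any particular package.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    s_ab = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    s_aa = sum((p - mean_a) ** 2 for p in a)
    s_bb = sum((q - mean_b) ** 2 for q in b)
    if s_aa == 0 or s_bb == 0:
        raise ValueError("distances must vary in both matrices")
    return s_ab / math.sqrt(s_aa * s_bb)


def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """Return (Mantel r, one-sided permutation P value)."""
    n = len(dist_a)
    flat_b = upper_triangle(dist_b)
    observed = pearson(upper_triangle(dist_a), flat_b)
    rng = random.Random(seed)
    labels = list(range(n))
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(labels)
        # Permute rows and columns of dist_a together (relabel the samples).
        permuted = [[dist_a[labels[i]][labels[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), flat_b) >= observed - 1e-12:
            at_least_as_large += 1
    # The observed arrangement counts as one of the permutations.
    return observed, (at_least_as_large + 1) / (permutations + 1)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations; shuffling individual entries would break the matrix structure.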

FAQ

Q: How to choose CCA or RDA to display the impact of environmental factors on community structure?

A: Model selection principles: RDA is based on a linear model, and CCA is based on a unimodal model. CCA is generally chosen for direct gradient analysis, but if the CCA ordination does not perform well, RDA can be considered instead.

First, run DCA (Detrended Correspondence Analysis) on the species abundance data and check the first-axis value of "Lengths of Gradient" in the results. If it is greater than 4.0, choose CCA; if it is between 3.0 and 4.0, either RDA or CCA can be chosen; if it is less than 3.0, RDA gives better results than CCA.

Q: Why must the number of environmental factors for CCA and RDA be smaller than the number of samples?

A: Both are constrained ordination methods: CCA is based on a unimodal (nonlinear) model and RDA on a linear model. PCA is an unconstrained ordination: it has no environmental factor restrictions, and its principal components can be regarded as unknown environmental variables. Constrained ordination orders the species along specific environmental variables (environmental factors), that is, along prescribed coordinate directions. If there are n samples and n coordinate directions are supplied, the environmental factors no longer impose any real constraint on the ordination.

Reference:
Legendre, P. and L. Legendre (2012). Numerical Ecology, 3rd English ed. Amsterdam: Elsevier Science BV.

Q: Why must the environmental variables used in CCA/RDA be specific numeric values?

A: CCA and RDA are correlation analyses of environmental factors. The underlying model requires the environmental factor data to be continuous (numbers with magnitude and order, such as pH or temperature), and non-continuous variables cannot be used as environmental factor data. For example, when comparing a treatment group with a control group, treatment may be recorded as 1 and non-treatment as 0. Such variables are non-continuous (discrete) variables and can be used as grouping variables instead.

Q: What is the difference between Pearson's correlation coefficient and Spearman's correlation coefficient?

A: The Pearson correlation coefficient applies to quantitative data that are normally distributed and linearly correlated. The Spearman correlation coefficient applies to quantitative or ordinal data with any distribution, and the relationship between the variables need not be linear.

Reference


  1. Best, D. J., & Roberts, D. E. (1975). Algorithm AS 89: The Upper Tail Probabilities of Spearman's Rho. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24(3), 377–379. https://doi.org/10.2307/2347111

  2. Mantel, N. (1967). The Detection of Disease Clustering and a Generalized Regression Approach. Cancer Research, 27(2), 209–220.