Gene diversity

Dr.Tom

Gene diversity and difference analysis

Gene Alpha Diversity

Description of Gene diversity within a community

Alpha diversity [1] (α diversity, α-diversity) describes the composition of Genes within a habitat (within-habitat diversity) or within a sample (within-sample diversity), and is one of the most important parts of microbial ecology analysis. Alpha diversity analysis calculates a series of diversity indices that reflect how many Genes a microbial community contains and how evenly they are distributed. A significance test can then be used to check whether the differences between samples from different habitats are significant.

  • Community richness describes the number of Gene types in the environmental sample.
  • Community evenness describes whether the Genes of the microbial community are evenly distributed, i.e. how similar their relative abundances are.
  • Community diversity considers both the richness and the evenness of the Genes in the community.

Alpha diversity can be calculated on either genes or species.

Richness

Assume that three communities A, B and C contain up to three Genes (Gene1, Gene2 and Gene3), distributed as follows:

| | Gene1 | Gene2 | Gene3 |
|---|---|---|---|
| Community A | Yes | - | Yes |
| Community B | Yes | Yes | - |
| Community C | Yes | Yes | Yes |

“-” means that the Gene is absent from the community, so $richness_C > richness_B = richness_A$. The indices that describe community richness include the Chao1 index [2] and the ACE index [3]. A larger index indicates higher richness.

The Chao1 index is suitable for abundance data such as metagenomic gene abundance and species abundance. However, the Chao1 algorithm is sensitive to low-abundance data: Genes observed only once or twice influence the index the most. A larger Chao1 index indicates a larger estimated total number of Genes.

$$S_{chao1} = S_{obs} + \frac{n_1(n_1-1)}{2(n_2+1)}$$

$S_{obs}$: the number of Genes actually observed

$n_1$: the number of Genes observed exactly once (singletons)

$n_2$: the number of Genes observed exactly twice (doubletons)
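The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than the Dr.Tom implementation; the function name `chao1` and the example counts are hypothetical.

```python
def chao1(counts):
    """Chao1 richness estimate from a list of per-Gene counts."""
    observed = [c for c in counts if c > 0]   # Genes actually seen (S_obs)
    s_obs = len(observed)
    n1 = sum(1 for c in observed if c == 1)   # singletons
    n2 = sum(1 for c in observed if c == 2)   # doubletons
    return s_obs + n1 * (n1 - 1) / (2 * (n2 + 1))


# Hypothetical sample: five Genes, two of them observed only once.
print(chao1([1, 1, 2, 3, 20]))  # 5 + (2 * 1) / (2 * 2) = 5.5
```

With no singletons the correction term vanishes and the estimate equals the observed richness.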

The ACE index is another estimator of Gene richness. Its formula is:

$$S_{ace} = S_{abund} + \frac{S_{rare}}{C_{ace}} + \frac{F_1}{C_{ace}}\gamma^2_{ace}$$

in which

$$\gamma^2_{ace} = \max\left[\frac{S_{rare}}{C_{ace}} \frac{\sum_{i=1}^{10} i(i-1)F_i}{N_{rare}(N_{rare}-1)} - 1,\ 0\right]$$

$$N_{rare} = \sum_{i=1}^{10} i F_i$$

$$C_{ace} = 1 - \frac{F_1}{N_{rare}}$$

$S_{abund}$: the number of high-abundance Genes (above the low-abundance threshold, which is 10 in the formulas above)

$S_{rare}$: the number of low-abundance Genes (less than or equal to the low-abundance threshold)

$i$: an abundance value, i.e. the number of times a Gene is observed

$F_1$: the number of Genes observed exactly once (singletons)

$F_i$: the number of Genes observed exactly $i$ times
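As an illustration, the ACE formulas can be followed step by step in Python. This is a sketch under stated assumptions: the low-abundance threshold is 10 as in the formula above, the function name `ace` is hypothetical, and the case where every rare Gene is a singleton (which makes $C_{ace} = 0$) is simply rejected.

```python
from collections import Counter


def ace(counts, rare_threshold=10):
    """ACE richness estimate from a list of per-Gene counts."""
    observed = [c for c in counts if c > 0]
    rare = [c for c in observed if c <= rare_threshold]
    s_rare = len(rare)                      # low-abundance Genes
    s_abund = len(observed) - s_rare        # high-abundance Genes
    if s_rare == 0:
        return float(s_abund)

    f = Counter(rare)                       # f[i] = number of Genes seen i times
    f1 = f.get(1, 0)
    n_rare = sum(rare)                      # equals sum of i * F_i
    if n_rare == f1:
        raise ValueError("all rare Genes are singletons; C_ace would be 0")

    c_ace = 1 - f1 / n_rare
    numerator = sum(i * (i - 1) * f_i for i, f_i in f.items())
    gamma2 = max(s_rare / c_ace * numerator / (n_rare * (n_rare - 1)) - 1, 0.0)
    return s_abund + s_rare / c_ace + f1 / c_ace * gamma2


# Hypothetical sample: four rare Genes (counts 1, 1, 2, 3) and one abundant Gene.
print(round(ace([1, 1, 2, 3, 20]), 4))
```

For this sample $N_{rare} = 7$, $C_{ace} = 5/7$ and $\gamma^2_{ace} = 1/15$, so the estimate is $1 + 28/5 + 14/75 = 509/75$.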

Evenness

Assume that two Genes, A and B, occur in communities C and D, distributed as follows:

| | GeneA | GeneB |
|---|---|---|
| Community C | 5 | 5 |
| Community D | 2 | 8 |

So $\text{richness}_\text{C} = \text{richness}_\text{D}$, but $\text{evenness}_\text{C} > \text{evenness}_\text{D}$. The indices that describe evenness are Pielou's evenness [4] and Simpson's evenness [5].

Pielou's evenness, also known as Shannon's evenness, is the ratio of the actual Shannon index of the community to the maximum Shannon index attainable in a community with the same Gene richness. If all Genes have the same relative abundance, the value is 1.

$$J = \frac{H}{H_{max}} = \frac{H}{\log_{x} S}$$

$H$: Shannon index

$H_{max}$: the maximum Shannon index, reached when the abundances of all Genes are the same

$S$: richness of the Genes in the community

$x$: normally $x = e$, in which case the index can be called Pielou_e

Simpson's evenness, also called equitability, is the ratio of the Simpson effective number of Genes (i.e. Simpson diversity) to the richness of the Genes.

$$equitability = \frac{D_{ens}}{S}$$

$D_{ens}$: Simpson effective number of Genes

$S$: richness of the Genes in the community
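Both evenness indices can be checked against communities C and D from the table above. The sketch below assumes natural logarithms (Pielou_e) and assumes that $D_{ens}$ is the inverse Simpson value $1/\sum p_i^2$, a common definition that this document does not spell out; the function names are hypothetical.

```python
import math


def shannon(counts):
    """Shannon index with natural logarithm."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """J = H / ln(S); undefined when fewer than two Genes are present."""
    s = sum(1 for c in counts if c > 0)
    if s < 2:
        raise ValueError("Pielou's evenness needs at least two Genes")
    return shannon(counts) / math.log(s)


def simpson_evenness(counts):
    """equitability = D_ens / S, assuming D_ens = 1 / sum(p_i^2)."""
    total = sum(counts)
    s = sum(1 for c in counts if c > 0)
    d_ens = 1 / sum((c / total) ** 2 for c in counts if c > 0)
    return d_ens / s


community_c = [5, 5]
community_d = [2, 8]
for name, counts in (("C", community_c), ("D", community_d)):
    print(name, round(pielou_evenness(counts), 3), round(simpson_evenness(counts), 3))
```

Community C reaches the maximum value of 1 on both indices, while community D scores lower on both, matching the comparison in the text.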

Diversity

The indices most commonly used in metagenomic analysis are the Shannon index [6] and the Simpson index [5], both of which take into account the richness and the evenness of the Genes in the community.

The Shannon index, also known as the Shannon entropy index or the Shannon-Wiener index, takes into account both the richness and the evenness of the community. A large Shannon index indicates that the Genes in the sample are both rich and evenly distributed.

$$H_{shannon} = -\sum_{i=1}^{S_{obs}}\frac{n_i}{N}\ln\frac{n_i}{N}$$

$S_{obs}$: the number of OTUs actually observed (in this analysis, Genes)

$n_i$: the number of sequences assigned to the i-th OTU

$N$: total number of sequences
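A minimal Python sketch of the Shannon formula follows. The function name and the example counts are hypothetical; zero counts are skipped because unobserved entries do not contribute to the sum.

```python
import math


def shannon_index(counts):
    """H = -sum((n_i / N) * ln(n_i / N)) over the observed entries."""
    total = sum(counts)
    return -sum(n / total * math.log(n / total) for n in counts if n > 0)


even = [5, 5]      # two equally abundant Genes
uneven = [2, 8]    # same richness, less even
print(shannon_index(even) > shannon_index(uneven))  # True
```

For two equally abundant Genes the index equals $\ln 2$, the maximum for a richness of 2.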

The Simpson index is one of the indices used to estimate the microbial diversity in a sample. It describes the probability that two individuals drawn from a community in two consecutive samplings belong to the same species. Common and dominant Genes have the greatest influence on the index, so low-abundance Genes in the sample have little effect on it. It is calculated as follows:

$$D = \sum{p_i^2}$$

$p_i$: the relative abundance of the i-th Gene

The value of the Simpson index calculated by this formula lies in $[0, 1]$, and the larger the value, the lower the diversity. This is counterintuitive, so $1 - D$ is often used as the Simpson index instead.

Metagenomic analysis in Dr.Tom uses the following Simpson formula:

$$S = 1 - \sum{p_i^2}$$

$p_i$: the relative abundance of the i-th Gene

Therefore, the larger the Simpson index in the system, the higher the diversity of Gene.
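The $1 - \sum p_i^2$ form can be sketched as follows. The function name and the example counts are hypothetical; relative abundances are derived from raw counts inside the function.

```python
def simpson_index(counts):
    """Simpson index in the 1 - sum(p_i^2) form."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


even = [5, 5]      # 1 - (0.25 + 0.25) = 0.5
uneven = [2, 8]    # 1 - (0.04 + 0.64) = 0.32
print(simpson_index(even) > simpson_index(uneven))  # True
```

The evenly distributed community gets the larger value, consistent with the statement that a larger index means higher diversity.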

Hypothesis Test

The diversity index describes the diversity of microbial communities within a sample, but how the diversity differs between samples requires a hypothesis test (also often called a significance test). The commonly used parametric test method is T test/analysis of variance, and the commonly used nonparametric test method is Wilcoxon/Kruskal-Wallis test.

  • Parametric test: It is necessary to assume that the sample conforms to a specific distribution (usually a normal distribution) and then perform a statistical test on the mean and variance of the parameters. When the number of comparison groups is equal to 2, the T test is commonly used, and when the number of comparison groups is greater than 2, analysis of variance is commonly used.
  • Nonparametric test: When it is impossible to determine which distribution the sample belongs to, first sort the samples according to certain sorting rules, and then perform a statistical test on the ranking. This method has no requirements on data distribution, but its sensitivity will be lower than that of parametric tests. When the number of comparison groups is equal to 2, Wilcoxon is commonly used, and when the number of comparison groups is greater than 2, Kruskal-Wallis is commonly used.

The null hypothesis $H_0$ of the significance test is that the diversity indices of the two samples do not differ. When the test gives p < 0.05, the null hypothesis is generally rejected and the difference in α diversity between the two samples is considered significant.
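To make the rank-based approach concrete, here is a pure-Python sketch of the Kruskal-Wallis test for more than two groups. It is an illustration only, not the Dr.Tom implementation: the function names are hypothetical, the p-value uses the usual chi-square approximation with (number of groups - 1) degrees of freedom, and in practice a statistics package would normally be used.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1   # mean of ranks pos+1 .. end+1
        pos = end + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (integer df)."""
    y = x / 2
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    while a < df / 2:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q


def kruskal_wallis(groups):
    """Return (H statistic, approximate p-value) for a list of groups."""
    values = [v for g in groups for v in g]
    n = len(values)
    ranks = average_ranks(values)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * h - 3 * (n + 1)
    # Tie correction; equals 1 when all values are distinct.
    correction = 1 - sum(t ** 3 - t for t in Counter(values).values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)


# Hypothetical Shannon indices for three groups of three samples each.
groups = [[3.1, 3.4, 3.3], [4.0, 4.2, 4.1], [5.0, 5.3, 5.1]]
h, p = kruskal_wallis(groups)
print(p < 0.05)  # True: the groups are fully separated by rank
```

Because only the ranks enter the statistic, any three fully separated groups of three samples give the same result, $H = 7.2$.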

Data Visualization

The Dr. Tom system uses boxplots to visualize the results of Alpha Diversity Analysis.

FAQ

Q: What are the specific methods used to test Alpha diversity statistics?

A: According to the selected method and the number of comparison groups, the specific statistical test methods are as follows:

| | Parametric tests | Non-parametric tests |
|---|---|---|
| Group = 2 | T test | Wilcoxon |
| Group > 2 | Analysis of variance | Kruskal-Wallis |

Gene Beta Diversity

Distance and similarity

Any statistic that satisfies non-negativity, reflexivity and the triangle inequality can be called a distance. A distance describes how far apart two statistical objects are [7]. An object can be a single point on the axes or a community made up of several points. The greater the distance, the greater the difference between the objects.

The more attributes two objects have in common, the more similar they are. For example, the number and the length of edges are two attributes of a regular polygon: if two regular polygons have the same number of edges they are similar, and if both the number and the length of their edges are equal they are identical. For two communities, the attribute is their Gene composition. The more Genes the two communities share, and the closer the relative abundances of the shared Genes, the greater the similarity between the two communities. Similarity is described by a similarity index; a higher similarity index means more similar samples.

Note

All similarity indices can be converted into corresponding distances, but not all distances can be converted into similarity indices.

The essence of beta diversity

Beta diversity [1] uses the abundance information (gene, species or function) of each sample to calculate the distance or similarity between samples, and uses that distance to show whether the microbial communities differ significantly between samples (groups) or between habitats. It is therefore also known as between-habitat diversity.

Put simply, Beta diversity analysis focuses on the differences between samples.

Double zero problem

In microbial data, a microorganism is often undetected in both of two samples. What this joint absence of a Gene (double zero data) means determines whether it can be used as evidence that the two samples are similar. There are two situations:

  • Neither sampling site meets the survival conditions of the species, or the species has never spread there.
  • The species was not captured in the sample at the time of sampling, or its content in the sample is too low to be annotated.

Double zero data are common in microbial ecology. If the double zeros have the same meaning in both samples, they can be used as a basis for judging the similarity between the two. However, a metagenomic project is itself an exploration of an unknown environment, so it is usually impossible to determine whether double zero data have the same meaning in different samples. As the number of Genes increases, the probability of double zeros between samples increases, and so does this uncertainty.

Symmetry and asymmetry describe how an index handles the double zero problem. If double zero data are taken to have the same meaning and are used as a basis for judging similarity, the index is symmetric; otherwise it is asymmetric. In most cases an asymmetric index should be preferred, unless it can be established that the zero data have the same meaning.

Beta diversity measurement

Metagenomics projects usually adopt the Euclidean distance [8], the Bray-Curtis dissimilarity index (Bray-Curtis distance) [9] and the JSD distance (Jensen-Shannon divergence) to measure the differences between samples (groups). Beta diversity analysis is based on calculating the distance between every pair of samples and placing the distances in a table to form a distance matrix.

Of these measures, the Euclidean distance is symmetric (double zero data are treated as having the same meaning), and the other two are asymmetric. Therefore, the Euclidean distance should be used with caution when a sample pair contains many zeros.

Euclidean distance

Euclidean distance is a distance often used in multivariate analysis, and its formula is as follows:

$$E_{ij} = \sqrt{\sum_{n}(S_{in} - S_{jn})^2}$$

$i$, $j$: the two samples

$n$: the $n$-th Gene in the samples

$S_{in}$: abundance of the $n$-th Gene in sample $i$

$S_{jn}$: abundance of the $n$-th Gene in sample $j$

According to the formula, the Euclidean distance depends on the scale of the input abundances, and its value range is $[0, +\infty)$. The greater the Euclidean distance, the greater the difference between the two samples.
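The sketch below computes the Euclidean distance for every pair of samples and arranges the results as a distance matrix, the structure that Beta diversity analysis is built on. The sample names, abundances and function names are hypothetical, and the matrix is stored as a nested dictionary purely for readability.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long abundance vectors."""
    if len(a) != len(b):
        raise ValueError("samples must cover the same Genes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(samples, metric):
    """Pairwise distances as {sample: {sample: distance}}."""
    names = list(samples)
    return {a: {b: metric(samples[a], samples[b]) for b in names} for a in names}


# Hypothetical abundances of three Genes in three samples.
samples = {
    "sample1": [5, 5, 0],
    "sample2": [2, 8, 0],
    "sample3": [0, 1, 9],
}
matrix = distance_matrix(samples, euclidean)
for name, row in matrix.items():
    print(name, [round(d, 2) for d in row.values()])
```

Passing the metric as a parameter means the same helper can build a matrix from any other pairwise measure.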

Bray-Curtis distance

Bray-Curtis distance is one of the most commonly used measures of differences in microbial abundance. However, the Bray-Curtis calculation does not satisfy the triangle inequality required by the definition of a distance, so strictly speaking it is not a distance; the correct name is the Bray-Curtis dissimilarity index. This report nevertheless refers to it as the Bray-Curtis distance.

When calculating the Bray-Curtis distance, Genes that are not detected in either sample are ignored. The formula is as follows:

$$BC_{ij} = 1 - \frac{2C_{ij}}{S_i + S_j}$$

$i$, $j$: the two samples

$C_{ij}$: the sum, over all Genes, of the lower of the two abundances observed in samples $i$ and $j$

$S_i$: sum of the abundances of all Genes in sample $i$

$S_j$: sum of the abundances of all Genes in sample $j$

When relative abundance data are used, $S_i = S_j = 1$, so the formula simplifies to

$$BC_{ij} = 1 - C_{ij}$$

According to the formula, the value range of the Bray-Curtis distance is $[0, 1]$. $BC = 0$ indicates that the Gene composition of the two samples is identical, and $BC = 1$ indicates that the two samples share no Genes. The smaller the BC value, the higher the similarity between the two samples and the smaller the difference.
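A minimal Python sketch of the Bray-Curtis formula, using hypothetical abundance vectors and a hypothetical function name. Genes that are zero in both samples add nothing to $C_{ij}$, $S_i$ or $S_j$, which is how double zeros end up being ignored.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    if len(a) != len(b):
        raise ValueError("samples must cover the same Genes")
    c_ij = sum(min(x, y) for x, y in zip(a, b))   # shared (lower) abundances
    total = sum(a) + sum(b)                        # S_i + S_j
    if total == 0:
        raise ValueError("both samples are empty")
    return 1 - 2 * c_ij / total


print(bray_curtis([5, 5], [5, 5]))        # identical composition
print(bray_curtis([3, 0], [0, 7]))        # no shared Genes
# Appending a Gene absent from both samples leaves the value unchanged.
print(bray_curtis([5, 5], [2, 8]) == bray_curtis([5, 5, 0], [2, 8, 0]))
```

For the vectors `[5, 5]` and `[2, 8]`, $C_{ij} = 2 + 5 = 7$ and $S_i + S_j = 20$, giving $1 - 14/20 = 0.3$.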

JSD distance

JSD distance (Jensen-Shannon divergence, JSD divergence) is a distance index derived from the KL divergence (Kullback-Leibler divergence). It describes how much the probability distributions of two samples differ.

The JSD distance is also asymmetric, so Genes that are not detected in either sample are ignored. The formula is as follows:

$$JSD(P\parallel Q) = \frac{1}{2}D(P\parallel M) + \frac{1}{2}D(Q\parallel M)$$

where $M$ is calculated as

$$M = \frac{1}{2}(P+Q)$$

and $D$ is calculated as

$$D(P\parallel Q) = \sum_{i} P(i)\ln \frac{P(i)}{Q(i)}$$

$P$: the abundance distribution of the first sample

$P(i)$: abundance of the $i$-th Gene in the first sample

$Q$: the abundance distribution of the second sample

$Q(i)$: abundance of the $i$-th Gene in the second sample

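According to the formula, the JSD value is bounded: with the natural logarithm used above it lies in $[0, \ln 2]$, which corresponds to $[0, 1]$ when base-2 logarithms are used. The smaller the JSD value, the higher the similarity between the two samples and the smaller the difference.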
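The JSD formulas can be sketched in Python as follows. This is an illustration with hypothetical function names: abundances are first normalised to relative abundances so that $P$ and $Q$ are probability distributions, the natural logarithm is used as in the formula, and terms with $P(i) = 0$ are skipped, following the usual convention that $0 \ln 0 = 0$.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) with natural log; terms with P(i) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(a, b):
    """Jensen-Shannon divergence between two abundance vectors."""
    p = [x / sum(a) for x in a]
    q = [y / sum(b) for y in b]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # M = (P + Q) / 2
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


print(jsd([5, 5], [5, 5]))                               # identical samples
print(math.isclose(jsd([3, 0], [0, 7]), math.log(2)))    # no shared Genes
```

Because $M(i) > 0$ whenever $P(i) > 0$ or $Q(i) > 0$, the divergence against $M$ is always finite, unlike the plain KL divergence between $P$ and $Q$.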

FAQ

Q: What is the difference between Bray-Curtis, JSD, and Euclidean distance?

A: The Euclidean distance is a symmetric index that treats double zero data as having the same meaning. The other two are asymmetric indices that ignore double zero data. The Euclidean distance is more sensitive to species abundance than to species presence, and its calculation means its value has no upper bound. So, for species abundance data with many double zeros, Bray-Curtis and JSD are usually used, and the Bray-Curtis dissimilarity index is the most common choice.

Reference


  1. Whittaker, R. H. (1960). Vegetation of the Siskiyou Mountains, Oregon and California. Ecological Monographs, 30(3), 279–338. https://doi.org/10.2307/1943563

  2. Colwell, R. K., Mao, C. X., & Chang, J. (2004). Interpolating, Extrapolating, and Comparing Incidence-Based Species Accumulation Curves. Ecology, 85(10), 2717–2727. https://doi.org/10.1890/03-0557

  3. Chao, A., & Yang, M. C. K. (1993). Stopping Rules and Estimation for Recapture Debugging with Unequal Failure Rates. Biometrika, 80(1), 193–201. https://doi.org/10.1093/biomet/80.1.193

  4. Pielou, E. C. (1966). The Measurement of Diversity in Different Types of Biological Collections. Journal of Theoretical Biology, 13, 131–144. https://doi.org/10.1016/0022-5193(66)90013-0

  5. Simpson, E. H. (1949). Measurement of Diversity. Nature, 163(4148), 688–688. https://doi.org/10.1038/163688a0

  6. Shannon, C. E. (2001). A Mathematical Theory of Communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), 3–55.

  7. Dodge, Y., Cox, D., & Commenges, D. (2006). The Oxford Dictionary of Statistical Terms. Oxford University Press.

  8. Legendre, P., & Legendre, L. (2012). Numerical ecology (Third English edition). Elsevier.

  9. Bray, J. R., & Curtis, J. T. (1957). An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecological Monographs, 27(4), 325–349. https://doi.org/10.2307/1942268