Function distribution



In addition to the species distribution of microbial communities, metagenomic sequencing also yields functional information about those communities. As with species distribution, the volume of functional data is large, so it needs appropriate visualization to show the functional composition of each sample and to make comparison between samples/groups easy. Common visualization methods include stacked charts, abundance heatmaps, GraPhlAn charts, Circos charts, Krona charts, and pie charts.

Stacked Charts and Pie Charts

Stacked charts and pie charts are traditional data visualization tools. Pie charts can only display data for one sample/group, while stacked charts can display data for multiple samples/groups.

Abundance Heatmap

Heatmaps display the distribution of different functions across samples/groups. Placing functions and samples/groups in one chart makes it convenient to compare differences both between functions and between samples.

Circos Charts

A Circos chart [1] (also called a chord diagram) draws groups and functional compositions on the same circle and connects functions to groups, displaying the composition of different functions and the abundance differences of functions across samples/groups.

FAQ

Q: How is the function data processed before the distribution is plotted?

A: Showing too many elements in a single image makes it crowded and makes it difficult to extract useful information. For this reason, the function abundance data used for plotting needs to be screened first:

  • Classify to others: Keep the data that meets the filter conditions, and pool the data that does not meet the conditions into "others".
  • Filter: Keep the data that meets the filter conditions, and discard the data that does not meet the conditions.

Filter criteria:

  • The top N functions by relative abundance
  • Functions with a relative abundance greater than m
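
The two screening modes and two filter criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Dr.Tom implementation: the function names (`select_top_n`, `select_above`, `screen`), the "Others" label, and the example identifiers and abundances are all hypothetical.

```python
def select_top_n(abundance, n):
    """Names of the n functions with the highest relative abundance."""
    # Sort by abundance (descending); ties are broken by name for stable output.
    ranked = sorted(abundance.items(), key=lambda kv: (-kv[1], kv[0]))
    return {name for name, _ in ranked[:n]}


def select_above(abundance, m):
    """Names of the functions whose relative abundance is greater than m."""
    return {name for name, value in abundance.items() if value > m}


def screen(abundance, selected, mode="others"):
    """Apply a selection in 'others' (pool the rest) or 'filter' (drop the rest) mode."""
    if mode not in ("others", "filter"):
        raise ValueError("mode must be 'others' or 'filter'")
    kept = {name: value for name, value in abundance.items() if name in selected}
    if mode == "others":
        rest = sum(value for name, value in abundance.items() if name not in selected)
        if rest > 0:
            kept["Others"] = rest
    return kept


# Hypothetical relative abundances of five functions in one sample.
sample = {"F1": 0.40, "F2": 0.30, "F3": 0.20, "F4": 0.06, "F5": 0.04}

pooled = screen(sample, select_top_n(sample, 3), mode="others")
filtered = screen(sample, select_above(sample, 0.05), mode="filter")
```

In "others" mode the abundances still sum to the original total, which keeps stacked charts and pie charts at 100%; in "filter" mode the discarded functions simply disappear from the plot.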

Tips

Both "Classify to others" and "Filter" are applied according to the parameters set in the analysis scheme. However, for display reasons, the maximum number of functions shown is limited for some charts. For details, please check the description on the corresponding page.

Q: What are the clustering distance and clustering method of the heatmap? How do different clustering distances and clustering methods differ? What are the commonly used clustering distances and clustering methods?

A: The purpose of clustering is to identify discontinuous subsets of objects, that is, to group a data set. The result of clustering microbial data is a hierarchical clustering tree with a nested structure. Most clustering is based on distance. The Dr.Tom system provides six clustering distances: euclidean, maximum, manhattan, canberra, binary, and correlation. Their formulas and differences are as follows:

| Method | Formula | Introduction |
| --- | --- | --- |
| euclidean | $d_{euc}(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$ | Euclidean distance: the square root of the sum of the squared coordinate differences between the two samples. |
| maximum | $d_{che}(x,y) = \max_{i} \vert x_i - y_i \vert$ | Chebyshev distance: the maximum absolute difference between the coordinates of the two samples. |
| manhattan | $d_{man}(x,y) = \sum_{i=1}^{n} \vert x_i - y_i \vert$ | Manhattan distance: the sum of absolute differences. This distance can be used when the data contain several variable types, such as age, gender, and height. |
| canberra | $d_{can}(x,y) = \sum_{i=1}^{n} \frac{\vert x_i - y_i \vert}{\vert x_i \vert + \vert y_i \vert}$ | Canberra distance: this distance can be used when the samples are relatively similar. |
| binary | $d_{bin}(x,y) = 1 - \frac{a}{a+b+c}$ | Jaccard dissimilarity. a: number of species present in both samples; b: number of species present only in one sample; c: number of species present only in the other sample; d: number of species present in neither sample (not used in the formula). |
| correlation | $r_{x,y} = \frac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y}$ | Pearson correlation: used in correlation heatmaps. |
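
As a worked illustration of these formulas, the sketch below evaluates all six measures on two short, made-up abundance vectors using only the Python standard library. The function names and the example vectors are hypothetical, and skipping Canberra terms where both values are zero is an assumption made here to avoid dividing by zero; the page does not say how Dr.Tom handles that case, nor how the correlation coefficient is converted into a distance, so only the coefficient itself is computed.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def maximum(x, y):
    # Chebyshev distance: largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    # Assumption: terms where both values are zero are skipped (0/0 is undefined).
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if a != 0 or b != 0)


def binary(x, y):
    # Jaccard dissimilarity on presence/absence: 1 - a / (a + b + c).
    both = sum(1 for a, b in zip(x, y) if a != 0 and b != 0)    # a
    either = sum(1 for a, b in zip(x, y) if a != 0 or b != 0)   # a + b + c
    return 1 - both / either if either else 0.0


def pearson(x, y):
    # r = cov(x, y) / (sigma_x * sigma_y); the 1/n factors cancel out.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical abundances of four functions in two samples.
x = [1, 0, 3, 4]
y = [2, 2, 0, 4]

results = {
    "euclidean": euclidean(x, y),   # sqrt(1 + 4 + 9 + 0) = sqrt(14)
    "maximum": maximum(x, y),       # 3
    "manhattan": manhattan(x, y),   # 1 + 2 + 3 + 0 = 6
    "canberra": canberra(x, y),     # 1/3 + 2/2 + 3/3 + 0/8
    "binary": binary(x, y),         # 1 - 2/4 = 0.5
    "correlation": pearson(x, y),   # 2 / sqrt(80)
}
```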

Reference:

  • Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
  • Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate Analysis. Academic Press.
  • Borg, I. and Groenen, P. (1997) Modern Multidimensional Scaling. Theory and Applications. Springer.

Q: What are the commonly used methods for heatmap clustering? What is the application of each method?

A: The system provides three types of methods: connection-based hierarchical clustering, average aggregation clustering, and minimum variance clustering.

  • Connection-based hierarchical clustering: groups are joined according to the shortest or the longest distance between objects in the two groups. There are two types: single linkage (shortest distance) and complete linkage (longest distance).
  • Average aggregation clustering: according to whether weights are used (whether the number of objects in each group is taken into account) and how the distance is calculated (average distance: the average distance between the objects being added and the existing objects; centroid distance: the distance between the geometric centers of the groups), it is divided into four types: UPGMA, UPGMC, WPGMA, and WPGMC.
  • Minimum variance clustering: based on the least-squares criterion of the linear model, groups are merged so that the within-group sum of squares is minimized.
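
To make the difference between linkage rules concrete, here is a minimal, naive agglomerative clustering sketch covering single linkage, complete linkage, and UPGMA (average) only. It is an assumption-laden illustration rather than the system's implementation: the function name `agglomerate`, the return format, and the distance matrix are hypothetical, and the centroid-based, weighted, and minimum variance methods need different update rules that are not shown.

```python
def agglomerate(dist, method="average"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns a list of merges (members_a, members_b, height), in merge order.
    """
    linkage = {
        "single": min,                                        # shortest pairwise distance
        "complete": max,                                      # longest pairwise distance
        "average": lambda values: sum(values) / len(values),  # UPGMA
    }[method]

    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairwise = [dist[a][b] for a in clusters[i] for b in clusters[j]]
                height = linkage(pairwise)
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical distances between four samples.
D = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]

heights = {m: [h for _, _, h in agglomerate(D, m)]
           for m in ("single", "complete", "average")}
```

All three rules merge the same pairs here, but the height of the final merge differs (5, 10, and 7.5 respectively), which is what changes the shape of the resulting tree.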

Each type includes at least two specific methods, as shown in the table below. Methods marked with an asterisk are commonly used in metagenomics.

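| Type | Method | Feature |
| --- | --- | --- |
| Connection-based hierarchical clustering | single * | |
| Connection-based hierarchical clustering | complete * | |
| Average aggregation clustering | UPGMA * | Arithmetic average - equal weights |
| Average aggregation clustering | UPGMC | Centroid clustering - equal weights |
| Average aggregation clustering | WPGMA | Arithmetic average - unequal weights |
| Average aggregation clustering | WPGMC | Centroid clustering - unequal weights |
| Minimum variance clustering | ward.D | |
| Minimum variance clustering | ward.D2 | |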

Reference


  1. Krzywinski, M., Schein, J., Birol, İ., Connors, J., Gascoyne, R., Horsman, D., Jones, S. J., & Marra, M. A. (2009). Circos: An Information Aesthetic for Comparative Genomics. Genome Research, 19(9), 1639–1645. https://doi.org/10.1101/gr.092759.109