Species Distribution
Species distribution is a basic characteristic of a microbial community. Appropriate visualization methods display the species composition of samples intuitively and make it easier to compare samples/groups. Common visualization methods include stacked charts, abundance heatmaps, GraPhlAn charts, Circos charts, Krona charts, and pie charts.
Stacked Charts and Pie Charts
Stacked charts and pie charts are traditional data visualization tools. Pie charts can only display data for one sample/group, while stacked charts can display data for multiple samples/groups.
Abundance Heatmap
Heatmaps display the distribution of species across samples/groups. Because all species and all samples/groups appear in a single chart, a heatmap makes it easy to compare both species and samples.
GraPhlAn Chart
GraPhlAn [1] is a circular visualization for multi-level data. It displays hierarchical relationships as multiple rings, and heatmap information can be added on the outer rings, which makes it well suited to visualizing taxonomic classification.
Circos Chart
A Circos chart [2] (also called a chord diagram) places groups and species on the same circle and connects species to groups. It shows the microbial composition of each group/sample and how the abundance of each microbe differs between samples/groups.
Krona Chart
A Krona chart [3] is a visualization of multi-level data, generated by the Krona software as an interactive HTML result. Each ring in the Krona chart is a discrete pie chart. The HTML interaction lets users explore the data interactively.
FAQ
Q: How is the data processed for species distribution statistics?
A: Showing too many elements in a single image makes it crowded, and it becomes difficult to get useful information from it. For this reason, the species abundance data used for plotting needs to be screened in one of two ways:
- Classify to others: Keep the data that meets the filter criteria, and merge the data that does not meet them into a single "others" category.
- Filtering: Keep the data that meets the filter criteria, and discard the data that does not meet them.
Filter criteria:
- The top N species by relative abundance
- Species with relative abundance greater than m
Tips
Both 'Classify to others' and 'Filtering' are based on the parameters set in the analysis scheme. However, because of display constraints, the maximum number of species shown is limited for some charts. For details, see the description on the corresponding page.
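The two screening modes and the two filter criteria described above can be sketched as follows. This is a minimal illustration for a single sample, not the system's implementation: the function names, the `Others` label, and the example abundances are all hypothetical, and the system may rank species differently (for example across all samples rather than per sample).

```python
def select_species(abundance, top_n=None, min_abundance=None):
    """Return the species that pass the filter criteria, most abundant first.

    abundance: dict mapping species name -> relative abundance (hypothetical input).
    top_n: keep only the N most abundant species (criterion "top N").
    min_abundance: keep only species with abundance greater than m.
    """
    kept = [s for s, v in abundance.items()
            if min_abundance is None or v > min_abundance]
    kept.sort(key=lambda s: abundance[s], reverse=True)
    return kept[:top_n] if top_n is not None else kept


def classify_to_others(abundance, kept):
    """Keep the selected species and merge everything else into 'Others'."""
    result = {s: abundance[s] for s in kept}
    rest = sum(v for s, v in abundance.items() if s not in kept)
    if rest > 0:
        result["Others"] = rest
    return result


def filter_species(abundance, kept):
    """Keep the selected species and discard everything else."""
    return {s: abundance[s] for s in kept}


# Made-up relative abundances for one sample (illustrative only).
sample = {"Bacteroides": 0.40, "Prevotella": 0.30, "Roseburia": 0.15,
          "Blautia": 0.10, "Dorea": 0.05}

kept = select_species(sample, top_n=3)
merged = classify_to_others(sample, kept)   # 3 species plus "Others"
dropped = filter_species(sample, kept)      # 3 species only
```

With 'Classify to others' the abundances still sum to the original total, which keeps stacked charts comparable between samples; with 'Filtering' the discarded fraction disappears from the plot.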
Q: What are the clustering distance and clustering method of the heatmap? How do different clustering distances and clustering methods differ? What are the commonly used clustering distances and clustering methods?
A: The purpose of clustering is to identify discontinuous subsets of objects, that is, to divide the data into groups. The result of microbial clustering is a hierarchical clustering tree with a nested structure. Most clustering methods are based on distance. The Dr.Tom system provides six clustering distances: euclidean, maximum, manhattan, canberra, binary, and correlation. Their definitions and differences are as follows:
Method | Distance | Introduction |
---|---|---|
euclidean | Euclidean distance | The square root of the sum of squared differences between the coordinates of two objects. |
maximum | Chebyshev distance | The maximum absolute difference between the coordinates of two objects. |
manhattan | Manhattan distance | The sum of absolute differences between coordinates. This distance can be used when the data contain several types of variables, such as age, gender, and height. |
canberra | Canberra distance | This distance can be used when the samples are relatively similar. |
binary | Jaccard dissimilarity | Based on presence/absence counts, where a: number of species present in both samples; b: number of species present only in one sample; c: number of species present only in the other sample; d: number of species present in neither sample. |
correlation | Pearson correlation | Used in correlation heatmaps. |
Reference:
- Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
- Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate Analysis. Academic Press.
- Borg, I. and Groenen, P. (1997) Modern Multidimensional Scaling. Theory and Applications. Springer.
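To make the six distances concrete, here is a small sketch using their textbook definitions on two abundance vectors. It is an illustration under stated assumptions, not the system's code: the function names are hypothetical, the `correlation` distance is assumed to be one minus the Pearson coefficient (a common convention), and details such as how zero terms are handled in the Canberra distance may differ in the actual implementation.

```python
import math


def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def maximum(x, y):
    # Chebyshev distance: largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(x, y))


def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    # Sum of |a - b| / (|a| + |b|); terms where both values are 0 are skipped.
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def binary(x, y):
    # Jaccard dissimilarity on presence/absence: (b + c) / (a + b + c).
    both = sum(1 for p, q in zip(x, y) if p > 0 and q > 0)        # a
    only_one = sum(1 for p, q in zip(x, y) if (p > 0) != (q > 0))  # b + c
    shared_or_unique = both + only_one
    return only_one / shared_or_unique if shared_or_unique else 0.0


def correlation(x, y):
    # Assumed convention: 1 - Pearson r. Undefined for constant vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


# Two made-up abundance profiles over four species (illustrative only).
x = [1.0, 0.0, 3.0, 0.0]
y = [4.0, 0.0, 3.0, 2.0]
for name, fn in [("euclidean", euclidean), ("maximum", maximum),
                 ("manhattan", manhattan), ("canberra", canberra),
                 ("binary", binary), ("correlation", correlation)]:
    print(name, round(fn(x, y), 4))
```

Note that `binary` ignores abundance values entirely and that species absent from both samples (count d) do not enter the result, whereas the other distances depend on the magnitudes.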
Q: What are the commonly used methods for heatmap clustering? What is the application of each method?
A: The system provides three types of clustering methods: linkage-based hierarchical clustering, average agglomerative clustering, and minimum variance clustering.
- Linkage-based hierarchical clustering: groups are joined according to the shortest or longest distance between objects in the two groups. It includes two types: single linkage (shortest distance) and complete linkage (longest distance).
- Average agglomerative clustering: depending on whether weights are applied (that is, whether the number of objects in each group is taken into account) and on how the distance is calculated (average distance: the mean distance between the object being added and the objects already in the group; centroid distance: the distance to the geometric center of the group), it is divided into four types: UPGMA, UPGMC, WPGMA, and WPGMC.
- Minimum variance clustering: based on the least squares linear model, the within-group sum of squares is minimized.
Each type includes at least two specific methods, as shown in the table below. Methods marked with an asterisk are commonly used in metagenomics.
Type | Method | Feature |
---|---|---|
Linkage-based hierarchical clustering | single * | Shortest distance between objects in two groups |
 | complete * | Longest distance between objects in two groups |
Average agglomerative clustering | UPGMA * | Arithmetic average, equal weights |
 | UPGMC | Centroid clustering, equal weights |
 | WPGMA | Arithmetic average, unequal weights |
 | WPGMC | Centroid clustering, unequal weights |
Minimum variance clustering | ward.D | |
 | ward.D2 | |
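The difference between the `single`, `complete`, and UPGMA methods can be seen in a naive agglomerative sketch. This is a teaching example under stated assumptions rather than the system's algorithm: the function names are hypothetical, Euclidean distance is assumed, `average` stands for UPGMA, and the centroid, weighted, and Ward methods are left out for brevity.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(c1, c2, points, method):
    """Distance between two clusters, given as tuples of point indices."""
    pairwise = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if method == "single":      # shortest distance between members
        return min(pairwise)
    if method == "complete":    # longest distance between members
        return max(pairwise)
    if method == "average":     # UPGMA: mean over all member pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unsupported method: {method}")


def agglomerate(points, method="average"):
    """Repeatedly merge the two closest clusters.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a clustering tree is drawn from.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda p: cluster_distance(p[0], p[1], points, method))
        d = cluster_distance(a, b, points, method)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Four made-up one-dimensional objects (illustrative only).
points = [(0.0,), (1.0,), (3.0,), (7.0,)]
for method in ("single", "complete", "average"):
    heights = [round(d, 3) for _, _, d in agglomerate(points, method)]
    print(method, heights)
```

On this toy data all three methods merge the objects in the same order, but the merge heights differ: single linkage joins groups at the smallest heights and complete linkage at the largest, with UPGMA in between. On real abundance data the merge order itself can also change with the method.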
References
[1] Asnicar, F., Weingart, G., Tickle, T. L., Huttenhower, C., & Segata, N. (2015). Compact Graphical Representation of Phylogenetic Data and Metadata with GraPhlAn. PeerJ, 3, e1029. https://doi.org/10.7717/peerj.1029
[2] Krzywinski, M., Schein, J., Birol, İ., Connors, J., Gascoyne, R., Horsman, D., Jones, S. J., & Marra, M. A. (2009). Circos: An Information Aesthetic for Comparative Genomics. Genome Research, 19(9), 1639–1645. https://doi.org/10.1101/gr.092759.109
[3] Ondov, B. D., Bergman, N. H., & Phillippy, A. M. (2011). Interactive Metagenomic Visualization in a Web Browser. BMC Bioinformatics, 12, 385. https://doi.org/10.1186/1471-2105-12-385