Species annotation

Dr. Tom

Introduction

To explore the microbial composition of a sample, each read in the clean data is compared against a reference database to obtain the species information for that read. The annotation results of all reads together describe the composition of the community, and species abundance is then estimated from these results algorithmically.

Species Annotation

Kraken2 [1][2] is a k-mer-based species annotation tool. Compared with traditional sequence (or gene) alignment methods, Kraken2 is faster and more sensitive to base differences. The principle is as follows:

A mer is a monomeric unit; here, one mer can be regarded as one base, so a k-mer is a nucleotide sequence of k bases. Kraken2 first builds a sub-database from one or more annotation database sources as required, splitting the genome sequences of all included species into k-mers and storing them. An unknown sequence of length n that needs to be classified is first split into n - k + 1 k-mers, which are then looked up in the sub-database one by one. Every k-mer that hits the database is placed on a classification tree, and k-mers with no hit are ignored. Each node of the classification tree records the number of k-mers that hit it. The classification of the sequence is then derived from this tree using the lowest common ancestor (LCA) algorithm.
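The splitting step can be illustrated with a minimal Python sketch. The function name and the toy read are hypothetical and only demonstrate the n - k + 1 count; this is not how Kraken2 itself is implemented.

```python
def split_into_kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping k-mer of `sequence`, from left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n yields n - k + 1 k-mers (none when n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


read = "ATGCGTAC"  # hypothetical toy read, n = 8
kmers = split_into_kmers(read, k=5)
print(len(kmers), kmers)
```

With n = 8 and k = 5 the read yields 8 - 5 + 1 = 4 k-mers; a 150 bp read with k = 35 would yield 116.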

Species Annotation Principle of Kraken2

Plot 1: Principle of Kraken2
Step 1, Split the sequence to be predicted into k-mers (the small fragments in the figure);
Step 2, According to the LCA algorithm, place all k-mers on the nodes of the classification tree; the number on each node is the number of k-mers hitting that node. The hit nodes on the classification tree can be connected to form different paths;
Step 3, Sum the number of hit k-mers along each path; this sum is called the score;
Step 4, The classification of the path with the highest score is the species annotation result of the sequence. If two paths share the highest score, use the LCA algorithm to get their lowest common ancestor.
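The scoring and tie-breaking steps can be sketched on a toy classification tree. The taxonomy, the hit counts and all function names below are hypothetical, and the sketch assumes each path runs from the root to a hit node; it illustrates the idea rather than Kraken2's actual implementation.

```python
# Hypothetical toy taxonomy, child -> parent (the root has no parent).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "Escherichia coli": "Escherichia",
    "Escherichia albertii": "Escherichia",
    "Salmonella": "Bacteria",
}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lowest_common_ancestor(a, b):
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node also on a's path is the LCA.
    return next(n for n in path_to_root(b) if n in ancestors_of_a)


def classify(kmer_hits):
    """kmer_hits maps a tree node to the number of k-mers that hit it."""
    hit_nodes = [n for n, count in kmer_hits.items() if count > 0]
    if not hit_nodes:
        return None  # no k-mer hit the database
    # Score of a path = sum of k-mer hits on every node along it.
    scores = {
        node: sum(kmer_hits.get(n, 0) for n in path_to_root(node))
        for node in hit_nodes
    }
    best = max(scores.values())
    result = None
    for node, score in scores.items():
        if score == best:
            # Tied best paths are merged into their lowest common ancestor.
            result = node if result is None else lowest_common_ancestor(result, node)
    return result


print(classify({"Bacteria": 2, "Escherichia": 3, "Escherichia coli": 4, "Salmonella": 1}))
print(classify({"Escherichia": 3, "Escherichia coli": 2, "Escherichia albertii": 2}))
```

In the first call the path ending at Escherichia coli has the highest score (2 + 3 + 4 = 9). In the second call the two species paths tie at 5, so the read is assigned to their common ancestor, Escherichia.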

Kraken2 needs databases for species annotation. Dr. Tom provides two annotation databases and automatically selects the appropriate one according to the sample source:

Database | Construction method | Version
UHGG [3] | Public Kraken2 database | v2.0
NCBI NT [4] | Official Kraken2 prebuilt sub-library | k2_pluspf_16gb_20240112

Select the appropriate database according to the sample source:

  • Human intestinal samples: Use the Unified Human Gastrointestinal Genome (UHGG) database, which contains more than 200,000 human gut microbial genomes.

  • Other samples: Use a sub-database built from the NCBI NT database. The NCBI NT database collects genome, gene, transcript and other sequence data from the GenBank, RefSeq, TPA and PDB databases, which can be used for bioinformatics data mining. Because the full database is too large, a sub-library is constructed for species annotation.

Software | Version | Command
Kraken2 [1] | 2.1.2 | kraken2 --paired --use-names --gzip-compressed
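The command above lists only the mode flags. As a rough sketch of what a complete invocation might look like, the Python helper below assembles an argument list. The --db, --threads, --report and --output options are standard Kraken2 options but are not stated in this document, and the helper name and all file paths are hypothetical placeholders.

```python
import shlex


def build_kraken2_command(db_dir, read1, read2, report_path, output_path, threads=8):
    """Assemble a kraken2 argument list for one paired-end, gzipped sample."""
    return [
        "kraken2",
        "--db", db_dir,                # assumption: path to the selected database
        "--threads", str(threads),     # assumption: not given in the document
        "--paired", "--use-names", "--gzip-compressed",  # flags from the table
        "--report", report_path,       # assumption: per-taxon summary report
        "--output", output_path,       # assumption: per-read classification
        read1, read2,
    ]


cmd = build_kraken2_command(
    "db/uhgg",                         # hypothetical paths throughout
    "sample_1.fq.gz", "sample_2.fq.gz",
    "sample.kreport", "sample.kraken",
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.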

Bracken Species Abundance Determination

Kraken2 classifies and annotates each read, but not every read can be annotated to the species level; some reads can only be annotated to the genus level or higher. Simply counting reads per species from the Kraken2 annotations would therefore give inaccurate abundances. Bracken [5] (Bayesian Reestimation of Abundance after Classification with KrakEN) uses Bayes' theorem to re-assign reads that Kraken2 could only annotate at a higher level down to lower levels, in the form of probabilistic estimates; reads classified at the subspecies level are added directly to their species.

Through the calculation described above, Bracken produces more accurate abundance estimates. The resulting abundance table is used for subsequent statistical analysis.
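The re-assignment idea can be shown with a simplified Python sketch. It assumes that the posterior for each species is proportional to P(genus | species) times a prior taken from the reads assigned directly to that species. The likelihood values are made-up numbers (Bracken derives its own from the Kraken database), and the function and species names are hypothetical.

```python
def redistribute_genus_reads(species_reads, genus_reads, p_genus_given_species):
    """Re-assign genus-level reads to species with Bayes' theorem (toy version).

    species_reads: reads classified directly at each species
    genus_reads: reads classified only at the parent genus
    p_genus_given_species: assumed P(read stops at genus | read comes from species)
    """
    total = sum(species_reads.values())
    if total == 0:
        return dict(species_reads)  # no prior information, leave counts unchanged
    # Unnormalised posterior P(S | G) is proportional to P(G | S) * P(S).
    weights = {
        s: p_genus_given_species[s] * n / total
        for s, n in species_reads.items()
    }
    norm = sum(weights.values())
    if norm == 0:
        return dict(species_reads)
    return {s: n + genus_reads * weights[s] / norm for s, n in species_reads.items()}


estimate = redistribute_genus_reads(
    species_reads={"Species A": 60, "Species B": 20},
    genus_reads=20,
    p_genus_given_species={"Species A": 0.1, "Species B": 0.3},  # made-up values
)
print(estimate)
```

In this example the two species receive equal shares of the 20 genus-level reads: Species B has a third of Species A's direct reads, but its reads are assumed to be three times as likely to stop at the genus level. The total read count is preserved.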

Software | Version | Command
Bracken [5] | 2.6.1 | bracken -r 150 -l S
Note: -r is the read length, which may be 100 or 150.

Info

Species abundance tables can be found in two locations.

  1. Clean data. Sample names in these tables are the names used when you delivered the samples, or the reconfirmed names provided at the client manager's request. File paths:
  • TaxonomyAnalysis/Abundance/{Classification level}.relative.xls: relative abundance at the specified taxonomic level
  • TaxonomyAnalysis/Abundance/{Classification level}.absolute.xls: absolute abundance at the specified taxonomic level
  2. Report data. At the top right of the report webpage you can download the report data, which includes taxonomy abundance tables whose sample names are the names you set when submitting the analysis plan. File paths:
  • TaxonomyAnalysis/Rename_Abundance/{Classification level}.relative.xls: relative abundance at the specified taxonomic level
  • TaxonomyAnalysis/Rename_Abundance/{Classification level}.absolute.xls: absolute abundance at the specified taxonomic level

FAQ

Q: How do I download the annotated species abundance data?

A: In the system, click the "Download analysis results" button at the top of the page to download all the analysis results, which include the complete species abundance data.

Path: TaxonomyAnalysis/Rename_Abundance

  • {Classification level}.absolute.xls is the absolute abundance table for the specified taxonomic level
  • {Classification level}.relative.xls is the relative abundance table for the specified taxonomic level
  • {Classification level}.{data type}.xls.header is a temporary file produced during the analysis and can be ignored.

Q: Why do some projects have more annotations and more species than others?

A: The number of species annotation results is related to the following:

  1. The complexity of the environment: the more complex the environment, the more annotated species, and vice versa. Environments such as soil and water are high-complexity environments, while intestinal (feces) samples are low-complexity environments.
  2. Number of samples: As the number of samples increases, more sample-specific species are detected, which increases the overall number of annotated species.
  3. Sequencing depth: The more sequencing data there is, the more rare species in the sample can be annotated.

If only a few species are annotated in the sequencing results, and experimental factors (such as contamination) and sequencing factors (such as data amount) have been excluded, the possible reasons are:

  1. The samples come from low-complexity environments with fewer species;
  2. The number of samples is small and cannot cover most species;
  3. There is little research on the species of interest, so the database contains few representative sequences and little related information for them.

Q: What does NA mean in annotation results?

A: Species annotation in the metagenomic project uses Kraken2. Each sequencing read is broken into subsequences of length k (default k=35), called k-mers, which are then compared against the database. Each k-mer that matches the database is assigned to a node of the taxonomic tree, and those that do not match are ignored. Once all k-mers of a read have been processed, the lowest-level annotation on the tree is the species annotation result of the read. When the k-mers can only be annotated to higher levels, such as genus, family or class, the lower levels of the read are marked as NA.
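As a small illustration of how NA arises, the Python sketch below fills every rank that has no assignment with NA. The rank list, the function name and the example lineage are assumptions for illustration; the actual layout of the result tables is not described here.

```python
# Assumed rank order, from highest to lowest level.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def pad_lineage(assigned):
    """Return a full lineage in which ranks without an assignment are 'NA'."""
    return {rank: assigned.get(rank, "NA") for rank in RANKS}


# Hypothetical read whose k-mers could only be resolved down to the genus level.
lineage = pad_lineage({
    "kingdom": "Bacteria",
    "phylum": "Pseudomonadota",
    "class": "Gammaproteobacteria",
    "order": "Enterobacterales",
    "family": "Enterobacteriaceae",
    "genus": "Escherichia",
})
print(lineage["genus"], lineage["species"])
```

Here the genus is reported as Escherichia while the species column is NA, because no k-mer supported a species-level assignment.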

Q: What is the basis for the selection of species annotation databases?

A: The system automatically selects the database for species annotation according to the sample type. The UHGG database is used for samples from the human gut, and the NCBI NT database is used for samples from other sources.

Reference


  1. Wood, D. E., & Salzberg, S. L. (2014). Kraken: Ultrafast Metagenomic Sequence Classification Using Exact Alignments. Genome Biology, 15(3), R46. https://doi.org/10.1186/gb-2014-15-3-r46

  2. Wood, D. E., Lu, J., & Langmead, B. (2019). Improved Metagenomic Analysis with Kraken 2. Genome Biology, 20(1), 257. https://doi.org/10.1186/s13059-019-1891-0

  3. Almeida, A., Nayfach, S., Boland, M., Strozzi, F., Beracochea, M., Shi, Z. J., Pollard, K. S., Sakharova, E., Parks, D. H., Hugenholtz, P., Segata, N., Kyrpides, N. C., & Finn, R. D. (2021). A Unified Catalog of 204,938 Reference Genomes from the Human Gut Microbiome. Nature Biotechnology, 39(1), 105–114. https://doi.org/10.1038/s41587-020-0603-3

  4. Home - Nucleotide - NCBI. (n.d.). Retrieved May 6, 2022, from https://www.ncbi.nlm.nih.gov/nucleotide/

  5. Lu, J., Breitwieser, F. P., Thielen, P., & Salzberg, S. L. (2017). Bracken: Estimating Species Abundance in Metagenomics Data. PeerJ Computer Science, 3, e104. https://doi.org/10.7717/peerj-cs.104