Function annotation

Introduction

Functional Annotation

Functional annotation of metagenomes involves aligning the predicted non-redundant genes against relevant functional databases and calculating the abundance of each function.

For non-redundant genes, functional annotation is generally performed with the blastp mode of Diamond [1]. The databases to be searched are large and there are many genes to annotate, so traditional BLAST would consume substantial computational resources and time, whereas Diamond can be 500-20000 times faster while producing results consistent with BLAST. Diamond is therefore the preferred choice for large-scale sequence annotation against NR or other large protein databases.

Common functional databases:

  • KEGG [2], Kyoto Encyclopedia of Genes and Genomes; version: v101. KEGG is the main database for studying metabolic pathways, integrating genomic information, chemical information, system information, disease and health information.

  • COG [3], Clusters of Orthologous Groups; version: 20201125. COG is a database developed by NCBI for annotating orthologous proteins. It was originally constructed by classifying the proteins encoded in 21 complete genomes of bacteria, archaea, and eukaryotes according to their phylogenetic relationships.

  • eggNOG [4], evolutionary genealogy of genes: Non-supervised Orthologous Groups; version: 5.0. The eggNOG database is created and maintained by EMBL. It expands on NCBI's COG database, providing orthologous group (OG) information for proteins at different taxonomic levels, including eukaryotes, prokaryotes, and viruses.

    Note

    eggNOG is displayed as NOG in the system

  • Swiss-Prot [5], version: release-2021_04. This database is a high-quality, manually annotated, non-redundant protein dataset. It includes descriptions of hundreds of thousands of proteins, covering their functions, structural domains, subcellular localization, post-translational modifications, and variants. The data mainly come from published literature and E-value-validated analysis results.

  • BacMet [6], Antibacterial Biocide and Metal Resistance Genes Database; version: 20180311. This database is one of the commonly used resistance gene databases, focusing on biocide and metal resistance genes. It includes both manually curated, experimentally validated data and data predicted through searches of public databases.

  • CARD [7], The Comprehensive Antibiotic Resistance Database; version: 3.1.4. CARD is another commonly used resistance database, focusing on antibiotic resistance. CARD collects more than 1600 known antibiotic resistance genes. Note: The Resistance Gene Identifier (RGI) software is used for annotation of this database.

  • CAZy [8], Carbohydrate-Active enZYmes Database; version: 20240326. The Carbohydrate-Active Enzymes Database includes genes for enzymes capable of synthesizing and decomposing complex carbohydrates. It provides family information on enzyme sequences involved in the synthesis, metabolism, and transport of carbohydrates.

| Software | Version | Command |
| --- | --- | --- |
| Diamond [1] | 0.8.24 | diamond --evalue 1e-5 --threads 5 --outfmt 6 --seg no --max-target-seqs 20 --more-sensitive -b 0.5 --salltitles |
| RGI | 5.2.1 | Software default parameters; RGI is used only for annotating the CARD database |
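
As an illustration of how tabular alignment output (`--outfmt 6`) could be post-processed, the sketch below reads the 12 standard BLAST tabular columns and keeps one hit per gene. The function name `best_hits`, the rule "keep the highest bit score per gene", and the example identifiers are assumptions made for this example; the source does not describe how the pipeline selects among the up to 20 target sequences reported per gene.

```python
import io

# Standard column order of BLAST-style tabular output (--outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {gene: hit} keeping the highest-bitscore hit per query gene.

    The selection rule is an assumption for illustration only.
    """
    best = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        if row["evalue"] > max_evalue:
            continue  # discard hits weaker than the e-value cutoff
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Hypothetical alignment lines, not real pipeline output
example = (
    "gene_1\tsubject_A\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "gene_1\tsubject_B\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-20\t150\n"
    "gene_2\tsubject_C\t45.0\t100\t55\t0\t1\t100\t1\t100\t1e-3\t40\n"
)
hits = best_hits(io.StringIO(example))
for gene, hit in hits.items():
    print(gene, hit["sseqid"], hit["evalue"], hit["bitscore"])
```

In this example `gene_2` is dropped because its only hit has an e-value above the 1e-5 cutoff used in the command above.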

Functional Abundance Determination

Given the functional annotation results and the gene abundance table, the relative abundance of a function at a given level equals the sum of the relative abundances of the genes annotated to that function. The KEGG database has 5 levels, and the CAZy database has 3 levels.
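
The summation described above can be sketched as follows. This is a minimal illustration, assuming each gene maps to a single function at the level of interest; the function name `function_abundance` and the gene and function identifiers are hypothetical.

```python
from collections import defaultdict


def function_abundance(gene_abundance, gene_to_function):
    """Sum the relative abundance of all genes annotated to each function."""
    totals = defaultdict(float)
    for gene, abundance in gene_abundance.items():
        function = gene_to_function.get(gene)
        if function is not None:  # genes without annotation are skipped
            totals[function] += abundance
    return dict(totals)


# Hypothetical relative abundances and annotations for one sample
gene_abundance = {"gene_1": 0.25, "gene_2": 0.5, "gene_3": 0.125}
gene_to_function = {"gene_1": "func_A", "gene_2": "func_A", "gene_3": "func_B"}

result = function_abundance(gene_abundance, gene_to_function)
print(result)
```

Here `func_A` receives the combined abundance of `gene_1` and `gene_2`, and `func_B` receives the abundance of `gene_3` alone.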

Info

Functional abundance tables are delivered along with the clean data. Sample names in these tables are the names you provided when submitting the samples, or the reconfirmed names you provided at the request of the client manager. The file paths are:

  • FunctionAnalysis/Abundance/{annotation database}.{level, if possible}.normalized.xls: relative abundance
  • FunctionAnalysis/Abundance/{annotation database}.{level, if possible}.rawCounts.xls: absolute abundance
  • FunctionAnalysis/Abundance/{annotation database}.filter.xls: annotation result

FAQ

Q: What is the basis for selecting a functional database?

A: The Dr.Tom system provides functional annotation results from seven commonly used databases: KEGG, COG, eggNOG, Swiss-Prot, BacMet, CARD, and CAZy. Each database is tailored to a specific kind of data: for example, BacMet focuses on biocide and metal resistance genes, while CARD focuses on antibiotic resistance. Choose the functional annotation results that match the purpose of your analysis.

Q: How to download the functional annotation abundance table?

A: The functional abundance table is delivered offline together with Clean Data.

Q: Which versions of the annotation databases are used?

A: The annotation database versions are as follows:

| Database | Version |
| --- | --- |
| KEGG | v101.0 |
| COG | 20201125 |
| eggNOG | eggnog_5.0 |
| Swiss-Prot | release-2021_04 |
| BacMet2 | 20180311 |
| CARD | v3.1.4 |
| CAZy | 20240326 |

Q: Do the databases annotate microbiota function at the genus level? Do we have to choose the results of just one of these databases for interpretation?

A: No, functional annotations do not depend on any species annotation results. Functional annotation is based on sequence information, and multiple databases can be used for analysis at the same time.

Q: In the KEGG annotation results, map and pathway level3 correspond one-to-one, so why are their abundance values different?

A: The map and pathway level3 abundances are calculated with different algorithms. For the map, if a KO is annotated to multiple maps, its abundance is divided evenly among those maps. For pathway level3, the full KO abundance is added directly to each pathway.
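
The difference between the two calculations can be illustrated with a small sketch. It follows the description in the answer above; the function names and the KO and map identifiers are hypothetical, and the actual pipeline implementation may differ in its details.

```python
from collections import defaultdict


def map_abundance(ko_abundance, ko_to_maps):
    """Split each KO's abundance evenly across the maps it is annotated to."""
    totals = defaultdict(float)
    for ko, abundance in ko_abundance.items():
        maps = ko_to_maps.get(ko, [])
        for m in maps:
            totals[m] += abundance / len(maps)
    return dict(totals)


def level3_abundance(ko_abundance, ko_to_maps):
    """Add each KO's full abundance to every pathway it is annotated to."""
    totals = defaultdict(float)
    for ko, abundance in ko_abundance.items():
        for m in ko_to_maps.get(ko, []):
            totals[m] += abundance
    return dict(totals)


# Hypothetical KO abundances; KO_a belongs to two maps, KO_b to one
ko_abundance = {"KO_a": 0.5, "KO_b": 0.25}
ko_to_maps = {"KO_a": ["map_X", "map_Y"], "KO_b": ["map_X"]}

by_map = map_abundance(ko_abundance, ko_to_maps)
by_level3 = level3_abundance(ko_abundance, ko_to_maps)
print(by_map)
print(by_level3)
```

Because `KO_a` is shared by two maps, it contributes half of its abundance to each map in the first calculation but its full abundance to each pathway in the second, so the two tables differ even though the map and level3 entries correspond one-to-one.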

References

  1. Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and Sensitive Protein Alignment Using DIAMOND. Nature Methods, 12(1), 59–60. https://doi.org/10.1038/nmeth.3176

  2. Kanehisa, M. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1), 27–30. https://doi.org/10.1093/nar/28.1.27

  3. Galperin, M. Y., Makarova, K. S., Wolf, Y. I., & Koonin, E. V. (2015). Expanded Microbial Genome Coverage and Improved Protein Family Annotation in the COG Database. Nucleic Acids Research, 43(D1), D261–D269. https://doi.org/10.1093/nar/gku1223

  4. Huerta-Cepas, J., et al. (2019). eggNOG 5.0: A Hierarchical, Functionally and Phylogenetically Annotated Orthology Resource Based on 5090 Organisms and 2502 Viruses. Nucleic Acids Research, 47(D1), D309–D314. https://doi.org/10.1093/nar/gky1085

  5. Poux, S., Arighi, C. N., Magrane, M., Bateman, A., Wei, C.-H., Lu, Z., Boutet, E., Bye-A-Jee, H., Famiglietti, M. L., Roechert, B., & UniProt Consortium, T. (2017). On Expert Curation and Scalability: UniProtKB/Swiss-Prot as a Case Study. Bioinformatics, 33(21), 3454–3460. https://doi.org/10.1093/bioinformatics/btx439

  6. Pal, C., Bengtsson-Palme, J., Rensing, C., Kristiansson, E., & Larsson, D. G. J. (2014). BacMet: Antibacterial Biocide and Metal Resistance Genes Database. Nucleic Acids Research, 42(D1), D737–D743. https://doi.org/10.1093/nar/gkt1252

  7. Jia, B., Raphenya, A. R., Alcock, B., Waglechner, N., Guo, P., Tsang, K. K., Lago, B. A., Dave, B. M., Pereira, S., Sharma, A. N., Doshi, S., Courtot, M., Lo, R., Williams, L. E., Frye, J. G., Elsayegh, T., Sardar, D., Westman, E. L., Pawlowski, A. C., … McArthur, A. G. (2017). CARD 2017: Expansion and Model-Centric Curation of the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research, 45(D1), D566–D573. https://doi.org/10.1093/nar/gkw1004

  8. Lombard, V., Golaconda Ramulu, H., Drula, E., Coutinho, P. M., & Henrissat, B. (2014). The Carbohydrate-Active Enzymes Database (CAZy) in 2013. Nucleic Acids Research, 42(D1), D490–D495. https://doi.org/10.1093/nar/gkt1178