Function annotation
Introduction
Functional Annotation
Functional annotation of metagenomes involves aligning the predicted non-redundant genes against the relevant functional databases and calculating the abundance of each function.
For non-redundant genes, functional annotation is generally performed with the blastp mode of Diamond [1]. The target databases are large and there are many genes to annotate, so traditional BLAST would consume a great deal of computing resources and time. Diamond is reported to be 500 to 20,000 times faster while producing results consistent with BLAST, which makes it the preferred choice for large-scale annotation against NR or other large protein databases.
Common functional databases:
KEGG [2], Kyoto Encyclopedia of Genes and Genomes; version: v101. KEGG is the main database for studying metabolic pathways, integrating genomic information, chemical information, system information, disease and health information.
COG [3], Clusters of Orthologous Groups; version: 20201125. COG is a database developed by NCBI for annotating orthologous proteins. It was constructed by classifying the proteins encoded in 21 complete genomes of bacteria, archaea, and eukaryotes according to their phylogenetic relationships.
eggNOG [4], evolutionary genealogy of genes: Non-supervised Orthologous Groups; version: 5.0. The eggNOG database is created and maintained by EMBL. It extends NCBI's COG database and provides orthologous group (OG) information for proteins at different taxonomic levels, covering eukaryotes, prokaryotes, and viruses.
Note
eggNOG is displayed as NOG in the system
Swiss-Prot [5], version: release-2021_04. Swiss-Prot is a high-quality, manually annotated, non-redundant protein database. It includes descriptions of thousands of proteins, covering their functions, structural domains, subcellular localization, post-translational modifications, and variants. The data come mainly from published literature and from computational analysis results evaluated by curators.
BacMet [6], Antibacterial Biocide and Metal Resistance Genes Database; version: 20180311. BacMet is one of the commonly used resistance gene databases and focuses on biocide and metal resistance genes. It includes both manually curated, experimentally validated data and predicted data obtained by searching public databases.
CARD [7], The Comprehensive Antibiotic Resistance Database; version: 3.1.4. CARD is another commonly used resistance database and focuses on antibiotic resistance. It collects more than 1,600 known antibiotic resistance genes. Note: annotation against this database is performed with the Resistance Gene Identifier (RGI) software.
CAZy [8], Carbohydrate-Active enZYmes Database; version: 20240326. CAZy covers enzymes that synthesize and break down complex carbohydrates. It provides family information for enzyme sequences involved in the synthesis, metabolism, and transport of carbohydrates.
Software | Version | Command |
---|---|---|
Diamond [1] | 0.8.24 | diamond blastp --evalue 1e-5 --threads 5 --outfmt 6 --seg no --max-target-seqs 20 --more-sensitive -b 0.5 --salltitles |
RGI | 5.2.1 | Default parameters; this software is used only for annotation against the CARD database |
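The Diamond parameters in the table can be assembled into a complete invocation. The following is only a minimal sketch and not the pipeline's actual wrapper: the `blastp` subcommand follows the description in the text above, while the function name `build_diamond_command` and the file names passed to `--query`, `--db`, and `--out` are hypothetical placeholders.

```python
import shlex


def build_diamond_command(query_faa, database, output_tsv, threads=5):
    """Return a DIAMOND blastp argument list using the parameters from the table.

    query_faa, database and output_tsv are hypothetical placeholder paths.
    """
    return [
        "diamond", "blastp",
        "--query", query_faa,        # protein sequences of the non-redundant genes
        "--db", database,            # DIAMOND-formatted functional database
        "--out", output_tsv,         # tabular alignment output
        "--evalue", "1e-5",
        "--threads", str(threads),
        "--outfmt", "6",             # BLAST tabular format
        "--seg", "no",
        "--max-target-seqs", "20",
        "--more-sensitive",
        "-b", "0.5",
        "--salltitles",
    ]


command = build_diamond_command("genes.faa", "kegg.dmnd", "kegg.m8")
# Print the command for inspection; it could be run with subprocess.run(command, check=True).
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.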
Functional Abundance Determination
Functional abundance is calculated from the functional annotation results and the gene abundance table: at each functional level, the relative abundance of a function equals the sum of the relative abundances of the genes annotated to that function. The KEGG database has 5 levels, and the CAZy database has 3 levels.
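The summation described above can be sketched as follows. This is an illustrative example under stated assumptions, not the production code: the function name `sum_function_abundance`, the dictionary layout, and the toy values are hypothetical, and a gene annotated to several functions is assumed to contribute its full abundance to each of them.

```python
from collections import defaultdict


def sum_function_abundance(gene_abundance, gene_to_functions):
    """Sum gene relative abundances per function, separately for each sample."""
    totals = defaultdict(lambda: defaultdict(float))
    for gene, per_sample in gene_abundance.items():
        # Genes without an annotation contribute to no function.
        for function in gene_to_functions.get(gene, ()):
            for sample, value in per_sample.items():
                totals[function][sample] += value
    return {function: dict(samples) for function, samples in totals.items()}


# Hypothetical toy data: relative abundance of three genes in two samples.
gene_abundance = {
    "gene1": {"S1": 0.25, "S2": 0.5},
    "gene2": {"S1": 0.5, "S2": 0.25},
    "gene3": {"S1": 0.25, "S2": 0.25},
}
# Hypothetical annotation of each gene to a CAZy family.
gene_to_functions = {
    "gene1": ["GH13"],
    "gene2": ["GH13"],
    "gene3": ["GT2"],
}

function_abundance = sum_function_abundance(gene_abundance, gene_to_functions)
print(function_abundance)
```

In this toy example GH13 receives the combined abundance of gene1 and gene2 in each sample, while GT2 receives only that of gene3.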
Info
Functional abundance tables are delivered along with the clean data. Note that the sample names in these tables are the names supplied when the samples were submitted, or the names reconfirmed at the client manager's request. The file paths are:
- FunctionAnalysis/Abundance/{annotation database}.{level, if applicable}.normalized.xls: relative abundance
- FunctionAnalysis/Abundance/{annotation database}.{level, if applicable}.rawCounts.xls: absolute abundance
- FunctionAnalysis/Abundance/{annotation database}.filter.xls: annotation result
FAQ
Q: What is the basis for selecting a functional database?
A: The Dr.Tom system provides functional annotation results from seven commonly used databases: KEGG, COG, eggNOG, Swiss-Prot, BacMet, CARD, and CAZy. Each database is tailored to a specific kind of data. For example, BacMet focuses on biocide and metal resistance genes, while CARD focuses on antibiotic resistance. Choose the annotation results that match the purpose of your analysis.
Q: How to download the functional annotation abundance table?
A: The functional abundance table is delivered offline together with Clean Data.
Q: Which versions of the annotation databases are used?
A: The annotation database versions are as follows:
Database | Version |
---|---|
KEGG | v101.0 |
COG | 20201125 |
eggNOG | eggnog_5.0 |
Swiss-Prot | release-2021_04 |
BacMet2 | 20180311 |
CARD | v3.1.4 |
CAZy | 20240326 |
Q: Do the databases annotate microbiota function at the genus level? Do we have to pick the results of a single database for interpretation?
A: No. Functional annotation does not depend on any species annotation results. It is based on sequence information, and the results from multiple databases can be analyzed at the same time.
Q: In the KEGG annotation results, map and pathway level3 correspond one to one, so why are their abundance values different?
A: The two are calculated with different algorithms. For map, if a KO is annotated to multiple maps, the KO's abundance is divided evenly among those maps. For pathway level3, the KO abundance values are accumulated directly, so each pathway receives the full abundance of the KO.
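The difference between the two calculations can be illustrated with a small sketch. It is a reading of the answer above rather than the system's actual implementation: "evenly distributed" is assumed to mean that a KO's abundance is divided by the number of maps it is annotated to, and the function names, KO labels, and map labels are hypothetical.

```python
from collections import defaultdict


def map_abundance_split(ko_abundance, ko_to_maps):
    """Map abundance: a KO's abundance is divided evenly among its maps."""
    totals = defaultdict(float)
    for ko, value in ko_abundance.items():
        maps = ko_to_maps.get(ko, [])
        for map_id in maps:
            totals[map_id] += value / len(maps)
    return dict(totals)


def pathway_abundance_accumulated(ko_abundance, ko_to_pathways):
    """Pathway level3 abundance: each pathway receives the KO's full abundance."""
    totals = defaultdict(float)
    for ko, value in ko_abundance.items():
        for pathway in ko_to_pathways.get(ko, []):
            totals[pathway] += value
    return dict(totals)


# Hypothetical toy data: KO_A belongs to two maps, KO_B to one.
ko_abundance = {"KO_A": 0.5, "KO_B": 0.25}
ko_membership = {"KO_A": ["map_X", "map_Y"], "KO_B": ["map_X"]}

split = map_abundance_split(ko_abundance, ko_membership)
accumulated = pathway_abundance_accumulated(ko_abundance, ko_membership)
print(split)
print(accumulated)
```

With the same KO memberships, the split totals add up to the total KO abundance, whereas the accumulated totals count a multi-pathway KO once per pathway, which is why the two tables can differ for the same entry.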
Reference
1. Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and Sensitive Protein Alignment Using DIAMOND. Nature Methods, 12(1), 59–60. https://doi.org/10.1038/nmeth.3176
2. Kanehisa, M. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1), 27–30. https://doi.org/10.1093/nar/28.1.27
3. Galperin, M. Y., Makarova, K. S., Wolf, Y. I., & Koonin, E. V. (2015). Expanded Microbial Genome Coverage and Improved Protein Family Annotation in the COG Database. Nucleic Acids Research, 43(D1), D261–D269. https://doi.org/10.1093/nar/gku1223
4. Huerta-Cepas, J., et al. (2019). eggNOG 5.0: A Hierarchical, Functionally and Phylogenetically Annotated Orthology Resource Based on 5090 Organisms and 2502 Viruses. Nucleic Acids Research, 47(D1), D309–D314. https://doi.org/10.1093/nar/gky1085
5. Poux, S., Arighi, C. N., Magrane, M., Bateman, A., Wei, C.-H., Lu, Z., Boutet, E., Bye-A-Jee, H., Famiglietti, M. L., Roechert, B., & UniProt Consortium, T. (2017). On Expert Curation and Scalability: UniProtKB/Swiss-Prot as a Case Study. Bioinformatics, 33(21), 3454–3460. https://doi.org/10.1093/bioinformatics/btx439
6. Pal, C., Bengtsson-Palme, J., Rensing, C., Kristiansson, E., & Larsson, D. G. J. (2014). BacMet: Antibacterial Biocide and Metal Resistance Genes Database. Nucleic Acids Research, 42(D1), D737–D743. https://doi.org/10.1093/nar/gkt1252
7. Jia, B., Raphenya, A. R., Alcock, B., Waglechner, N., Guo, P., Tsang, K. K., Lago, B. A., Dave, B. M., Pereira, S., Sharma, A. N., Doshi, S., Courtot, M., Lo, R., Williams, L. E., Frye, J. G., Elsayegh, T., Sardar, D., Westman, E. L., Pawlowski, A. C., … McArthur, A. G. (2017). CARD 2017: Expansion and Model-Centric Curation of the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research, 45(D1), D566–D573. https://doi.org/10.1093/nar/gkw1004
8. Lombard, V., Golaconda Ramulu, H., Drula, E., Coutinho, P. M., & Henrissat, B. (2014). The Carbohydrate-Active Enzymes Database (CAZy) in 2013. Nucleic Acids Research, 42(D1), D490–D495. https://doi.org/10.1093/nar/gkt1178