Assembly, gene prediction, de-redundancy

Assembly

Assembly joins the short reads produced by sequencing into longer contiguous sequences using an assembly algorithm. Reads from next-generation sequencing are generally short, and longer sequences improve the efficiency and accuracy of downstream analysis and make better use of the reads.

Assembly strategy

MEGAHIT [1] is an efficient assembler based on k-mers and a de Bruijn graph strategy. It copes well with the uneven sequencing depth across different regions of a genome (or across genomes of different species) that is typical of metagenomic sequencing.
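To make the k-mer and de Bruijn graph idea concrete, here is a minimal toy sketch. It is not MEGAHIT's succinct de Bruijn graph implementation; the function name `build_de_bruijn` and the example reads are hypothetical. Each k-mer in a read contributes one edge from its first k-1 bases to its last k-1 bases.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # One edge per k-mer occurrence: prefix node -> suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


# Hypothetical overlapping reads, for illustration only.
reads = ["ATGGCG", "GGCGTA"]
graph = build_de_bruijn(reads, k=4)

for node, successors in graph.items():
    print(node, "->", successors)
```

The edge GGC -> GCG appears twice because the k-mer GGCG occurs in both reads. Edge multiplicity of this kind is what a minimum k-mer count filter operates on.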

Evaluating assembly quality

N50 is obtained by sorting the contigs/scaffolds from longest to shortest and accumulating their lengths. The contig/scaffold N50 is the length of the contig/scaffold at which the cumulative sum first reaches 50% of the total contig/scaffold length. A larger N50 is generally taken to indicate a better assembly result.
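The N50 definition above can be sketched in a few lines. This is a minimal illustration, assuming the input is a plain list of positive contig/scaffold lengths; the function name `n50` and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig/scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    cumulative = 0
    # Accumulate lengths from longest to shortest.
    for length in sorted(lengths, reverse=True):
        cumulative += length
        # Stop at the first contig that brings the sum to 50% of the total.
        if 2 * cumulative >= total:
            return length


contig_lengths = [100, 80, 60, 40, 20]  # hypothetical lengths
print(n50(contig_lengths))  # 80
```

Here the total length is 300, so the threshold is 150. The cumulative sum is 100 after the first contig and 180 after the second, so the N50 is 80.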

| Software | Version | Command |
| --- | --- | --- |
| MEGAHIT [1] | 1.2.9 | megahit --min-count 2 --k-min 93 --k-max 133 --k-step 10 --no-mercy --min-contig-len 200 --continue |

Note: k-min and k-max depend on read length:
- PE100: --k-min 53, --k-max 93
- PE150: --k-min 93, --k-max 133
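The read-length rule in the note can be captured in a small helper that assembles the argument list. This is only a sketch: the function name, the input file names and the output directory are hypothetical, and it assumes the remaining options stay the same for PE100 and PE150. The `-1`, `-2` and `-o` options are MEGAHIT's paired-end input and output directory options.

```python
# k-mer range per read length, taken from the note above.
K_RANGE_BY_READ_LENGTH = {100: (53, 93), 150: (93, 133)}


def megahit_command(read_length, reads_1, reads_2, out_dir):
    """Build a MEGAHIT argument list for PE100 or PE150 data."""
    k_min, k_max = K_RANGE_BY_READ_LENGTH[read_length]
    return [
        "megahit",
        "-1", reads_1, "-2", reads_2, "-o", out_dir,
        "--min-count", "2",
        "--k-min", str(k_min), "--k-max", str(k_max), "--k-step", "10",
        "--no-mercy", "--min-contig-len", "200", "--continue",
    ]


# Hypothetical file names, for illustration only.
cmd = megahit_command(100, "sample_1.fq.gz", "sample_2.fq.gz", "assembly_out")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where MEGAHIT is installed.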

Gene prediction

Signal sites of prokaryotic genes, such as promoters and terminators, are highly specific and easy to identify. We use MetaGeneMark [2] for de novo prediction of metagenomic genes. De novo prediction works from features of the sequence itself: it statistically describes the differing characteristics of coding and non-coding regions and builds a probability model to distinguish the two. It can therefore predict both known and unknown genes.

| Software | Version | Command |
| --- | --- | --- |
| MetaGeneMark [2] | 3.38 | gmhmmp -a -d -f G -m MetaGeneMark_v1.mod |

Remove redundant genes

The genes predicted from each sample must then have redundant sequences removed. CD-HIT [3] uses a greedy incremental clustering method. It first sorts the input sequences from longest to shortest. The longest sequence forms the first cluster and serves as its representative sequence. Each remaining sequence is then compared with the representative sequences found before it. Based on sequence similarity (the identity threshold is generally set to 95% and the coverage threshold to 90%), the sequence either joins one of the existing clusters or becomes the representative sequence of a new cluster. Clustering is complete once all sequences have been traversed.
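The greedy incremental procedure can be sketched as follows. This is a toy illustration, not CD-HIT's implementation: the helper `ungapped_identity` is a hypothetical stand-in for a real alignment, it assumes non-empty sequences, and because it places the shorter sequence entirely inside the representative, the coverage threshold is met by construction and only the identity threshold is checked.

```python
def ungapped_identity(rep, seq):
    """Best fraction of matching bases when seq is slid along rep without gaps."""
    best = 0.0
    for offset in range(len(rep) - len(seq) + 1):
        matches = sum(a == b for a, b in zip(rep[offset:], seq))
        best = max(best, matches / len(seq))
    return best


def greedy_cluster(seqs, identity=0.95):
    """Cluster sequences greedily, longest first; returns {representative: members}."""
    clusters = {}
    for seq in sorted(seqs, key=len, reverse=True):
        # Compare against representatives in the order they were created.
        for rep in clusters:
            if ungapped_identity(rep, seq) >= identity:
                clusters[rep].append(seq)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


# Hypothetical sequences, for illustration only.
gene_a = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTT"
gene_b = gene_a[5:35]   # exact fragment of gene_a
gene_c = "T" * 25       # unrelated sequence
clusters = greedy_cluster([gene_c, gene_b, gene_a])
print(len(clusters))  # 2
```

In this example gene_b joins the cluster represented by gene_a, while gene_c becomes the representative of its own cluster.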

| Software | Version | Command |
| --- | --- | --- |
| CD-HIT [3] | 4.8.1 | cd-hit-est -aS 0.9 -c 0.95 -d 0 -g 1 |

Gene abundance determination

After the non-redundant gene set has been constructed, TPM (Transcripts Per Million) is used to measure the abundance of each gene. Unlike raw read counts, TPM is normalized for both gene length and sequencing depth. It is calculated as follows:

$$
TPM_i = \frac{X_i}{\widetilde{l_i}} \times \left( \frac{1}{\sum_j \frac{X_j}{\widetilde{l_j}}} \right) \times 10^6
$$

$i$: the $i$-th gene
$\widetilde{l_i}$: length of the $i$-th gene
$X_i$: number of reads aligned to the $i$-th gene

The sum in the denominator runs over all genes $j$ in the sample.

TPM calculation process for a gene in a sample:

  1. Divide the number of reads aligned to the gene by the length of the gene (the length of the exon region, in kb) to get the number of reads per kilobase, RPK (Reads Per Kilobase).
  2. Sum the RPK values of all genes in the sample and divide the total by 10^6.
  3. Divide the RPK of the gene by the value obtained in step 2 to get its TPM.
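The three steps above can be sketched as a short function. This is a minimal illustration using plain gene lengths in bp and hypothetical read counts, and it assumes at least one gene has a non-zero count. Salmon estimates abundance with its own model, so this is not a reimplementation of what Salmon does.

```python
def tpm(read_counts, gene_lengths_bp):
    """Compute TPM for each gene from read counts and gene lengths (in bp)."""
    # Step 1: reads per kilobase (RPK) for each gene.
    rpk = [
        count / (length / 1000)
        for count, length in zip(read_counts, gene_lengths_bp)
    ]
    # Step 2: total RPK in the sample divided by 10^6.
    per_million = sum(rpk) / 1e6
    # Step 3: scale each RPK by the value from step 2.
    return [value / per_million for value in rpk]


# Hypothetical counts and lengths for three genes.
counts = [10, 20, 30]
lengths = [1000, 1000, 3000]
values = tpm(counts, lengths)
print(values)  # roughly 250000, 500000, 250000
```

The first and third genes have the same RPK (10 reads per kb) despite different raw counts, so they receive the same TPM, and the values for the sample sum to about 10^6.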

Gene abundance is calculated with Salmon [4].

| Software | Version | Command |
| --- | --- | --- |
| Salmon [4] | 1.6.0 | salmon quant -l A --validateMappings |

Info

Gene abundance tables are delivered with the Clean Data. Please note that the sample names in these tables are the names given when you delivered the samples, or the reconfirmed names that you provided at your client manager's request. The file paths are:

  • GeneAnalysis/Abundance/gene.relative.xls: relative abundance.
  • GeneAnalysis/Abundance/gene.absolute.xls: absolute abundance.

FAQ

Q: Can TPM be used for comparisons between different samples?

A: Yes. By construction, the TPM values of all genes in a sample sum to the same total (10^6) in every sample, so TPM is similar to the relative abundance in the species annotation results and can be compared between different samples/groups.

References

  1. Li, D., Liu, C.-M., Luo, R., Sadakane, K., & Lam, T.-W. (2015). MEGAHIT: An Ultra-Fast Single-Node Solution for Large and Complex Metagenomics Assembly Via Succinct De Bruijn Graph. Bioinformatics, 31(10), 1674–1676. https://doi.org/10.1093/bioinformatics/btv033

  2. Zhu, W., Lomsadze, A., & Borodovsky, M. (2010). Ab Initio Gene Identification in Metagenomic Sequences. Nucleic Acids Research, 38(12), e132. https://doi.org/10.1093/nar/gkq275

  3. Li, W., Jaroszewski, L., & Godzik, A. (2001). Clustering of Highly Homologous Sequences to Reduce the Size of Large Protein Databases. Bioinformatics (Oxford, England), 17(3), 282–283. https://doi.org/10.1093/bioinformatics/17.3.282

  4. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression. Nature Methods, 14(4), 417–419. https://doi.org/10.1038/nmeth.4197