Quality control and data filtering

Dr.Tom

Data filtering

Each library carries barcode sequences that distinguish the samples, so raw data usually contains barcode information, which must be removed. Raw data may also contain low-quality bases, uncertain bases (N), and reads that are too short. Using such reads in downstream analysis reduces the reliability of the results. The raw data is therefore quality-controlled with SOAPnuke before analysis to obtain high-quality data (clean data) and keep the downstream results accurate. The filtering steps are as follows:

  1. Eliminate reads containing ≥0.1% uncertain bases (N bases)
  2. Eliminate reads containing adapter sequences
  3. Eliminate reads that contain more than 50% of low-quality bases (bases with Q < 20)
  4. Use Bowtie2 to align the reads to the host genome, then remove the reads that belong to the host genome (if the sample is from human, mouse, or rat, reads are filtered directly against the corresponding genome; otherwise the host sequence must be provided)
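
The thresholds in steps 1 and 3 can be illustrated with a minimal Python sketch. It is not SOAPnuke's implementation: the function names are hypothetical, and it assumes Phred+33 quality encoding. It only shows how the two per-read rates are computed and compared with the cutoffs.

```python
# Illustrative sketch only; not SOAPnuke's code. Function names are hypothetical.
# Assumption: quality strings use Phred+33 encoding (ASCII code minus 33).

def n_fraction(seq: str) -> float:
    """Fraction of bases in the read that are uncertain (N)."""
    if not seq:
        return 1.0  # treat an empty read as failing
    return seq.upper().count("N") / len(seq)


def low_quality_fraction(qual: str, min_q: int = 20, offset: int = 33) -> float:
    """Fraction of bases whose Phred score is below min_q."""
    if not qual:
        return 1.0
    low = sum(1 for ch in qual if ord(ch) - offset < min_q)
    return low / len(qual)


def passes_filters(seq: str, qual: str,
                   max_n_rate: float = 0.001,
                   max_low_rate: float = 0.5,
                   min_q: int = 20) -> bool:
    """Apply step 1 (N rate) and step 3 (low-quality rate) to one read."""
    if n_fraction(seq) >= max_n_rate:          # step 1: >= 0.1% N bases
        return False
    if low_quality_fraction(qual, min_q) > max_low_rate:  # step 3: > 50% with Q < 20
        return False
    return True
```

With a 0.1% cutoff, a single N is enough to reject any read shorter than 1,000 bases: one N in a 100-base read is already a rate of 1%.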

| Software | Version | Command |
| --- | --- | --- |
| SOAPnuke [1] | 2.2.1 | SOAPnuke filter -l 20 -q 0.5 -n 0.001 -d -Q 2 -5 0 --adaMis 0.3 |
| Bowtie2 [2] | 2.4.4 | Software default parameters |
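
How host reads are separated after the Bowtie2 alignment is not described here. As a rough sketch under one assumption (a read counts as host-derived whenever it aligns to the host genome), the non-host reads are those with the unmapped bit (0x4) set in the SAM FLAG field. The function name is hypothetical, and the sketch handles single-end records only.

```python
# Illustrative sketch only; the pipeline's actual host-removal logic is not documented here.
# Assumption: any read that aligns to the host genome is treated as host-derived.

UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def non_host_read_names(sam_lines):
    """Return names of reads that did not align to the host genome."""
    names = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is the FLAG
        if flag & UNMAPPED:
            names.append(fields[0])  # column 1 is the read name
    return names
```

For paired-end data, a real workflow would also have to decide how to treat pairs in which only one mate aligns; this sketch does not address that case.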

FAQ

none

References


  1. Chen, Y., Chen, Y., Shi, C., Huang, Z., Zhang, Y., Li, S., Li, Y., Ye, J., Yu, C., Li, Z., Zhang, X., Wang, J., Yang, H., Fang, L., & Chen, Q. (2018). SOAPnuke: A MapReduce Acceleration-Supported Software for Integrated Quality Control and Preprocessing of High-Throughput Sequencing Data. GigaScience, 7(1). https://doi.org/10.1093/gigascience/gix120

  2. Langmead, B., & Salzberg, S. L. (2012). Fast Gapped-Read Alignment with Bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923