Quality control and data filtering
Data filtering
The sequencing library carries barcode sequences that distinguish the samples, so raw data usually contain barcode information, which must be removed. Raw data may also contain low-quality bases, uncertain bases (N), and reads that are too short. Using such reads in downstream analysis reduces the reliability of the results. SOAPnuke is therefore used to quality-control the raw data before analysis, yielding high-quality data (clean data) for the subsequent steps. The raw data are filtered as follows:
- Remove reads in which uncertain bases (N) make up ≥ 0.1% of the read
- Remove reads containing adapter sequences
- Remove reads in which more than 50% of the bases are low quality (Q < 20)
- Align the reads to the host genome with Bowtie2 and remove the reads that belong to the host (for human, mouse, or rat samples the host reference is chosen directly from the sample source; for any other host, the host sequence must be provided)
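The read-level criteria above can be illustrated with a minimal Python sketch. It is not SOAPnuke's implementation: the function name `passes_filters`, the exact-substring adapter check, and the Phred+33 quality encoding are assumptions made for illustration, and only the thresholds (0.1% N, Q < 20, 50%) come from the list above.

```python
# Illustrative sketch of the read-level filters; not SOAPnuke's actual code.
MAX_N_FRACTION = 0.001        # drop reads with N fraction >= 0.1%
LOW_QUAL_THRESHOLD = 20       # bases with Q < 20 count as low quality
MAX_LOW_QUAL_FRACTION = 0.5   # drop reads with > 50% low-quality bases
PHRED_OFFSET = 33             # assumption: Phred+33 encoded quality string


def passes_filters(seq: str, qual: str, adapter: str) -> bool:
    """Return True if a single read survives all three read-level filters."""
    if not seq or len(seq) != len(qual):
        return False  # malformed record: treat as failing

    # Filter 1: fraction of uncertain (N) bases.
    n_fraction = seq.upper().count("N") / len(seq)
    if n_fraction >= MAX_N_FRACTION:
        return False

    # Filter 2: adapter contamination (assumption: exact substring match).
    if adapter and adapter in seq:
        return False

    # Filter 3: fraction of low-quality bases.
    low = sum(1 for c in qual if ord(c) - PHRED_OFFSET < LOW_QUAL_THRESHOLD)
    if low / len(seq) > MAX_LOW_QUAL_FRACTION:
        return False

    return True
```

Under these assumptions, a single N is enough to reject any read of 1000 bases or fewer, because 1/length is then at least 0.001.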
Software | Version | Command |
---|---|---|
SOAPnuke [1] | 2.2.1 | SOAPnuke filter -l 20 -q 0.5 -n 0.001 -d -Q 2 -5 0 --adaMis 0.3 |
Bowtie2 [2] | 2.4.4 | Use software default parameters |
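As a rough sketch of the host-removal step, the snippet below assembles a Bowtie2 command line that keeps read pairs which do not align concordantly to the host index. Paired-end input, the helper name `build_host_filter_command`, all file names, and the use of `--un-conc-gz` to collect non-host reads are assumptions for illustration; the report only states that Bowtie2 is run with default parameters.

```python
import shlex


def build_host_filter_command(index_prefix, reads_1, reads_2,
                              nonhost_pattern, threads=4):
    """Assemble (but do not run) a Bowtie2 command for host-read removal."""
    return [
        "bowtie2",
        "-x", index_prefix,                # prebuilt host genome index prefix
        "-1", reads_1,                     # clean reads, mate 1
        "-2", reads_2,                     # clean reads, mate 2
        "-p", str(threads),                # threads only; alignment settings stay default
        "--un-conc-gz", nonhost_pattern,   # pairs that did not align concordantly
        "-S", "/dev/null",                 # discard the SAM output
    ]


if __name__ == "__main__":
    # Hypothetical file names; '%' is replaced by the mate number (1 or 2).
    cmd = build_host_filter_command(
        "host_index", "clean_1.fq.gz", "clean_2.fq.gz", "nonhost_%.fq.gz"
    )
    print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` on a system where Bowtie2 and the host index are available.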
References
[1] Chen, Y., Chen, Y., Shi, C., Huang, Z., Zhang, Y., Li, S., Li, Y., Ye, J., Yu, C., Li, Z., Zhang, X., Wang, J., Yang, H., Fang, L., & Chen, Q. (2018). SOAPnuke: A MapReduce Acceleration-Supported Software for Integrated Quality Control and Preprocessing of High-Throughput Sequencing Data. GigaScience, 7(1). https://doi.org/10.1093/gigascience/gix120
[2] Langmead, B., & Salzberg, S. L. (2012). Fast Gapped-Read Alignment with Bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923