Among epigenetic modifications, DNA methylation (DNAm) remains the most thoroughly investigated. It plays important roles in a myriad of biological processes, from cancer to aging. As novel methods for profiling DNA methylation arise and new data are generated in huge quantities, the tools for DNAm data processing are proliferating as well.
However, as often happens in bioinformatics, the tools are plentiful but unreliable. Some require a lot of time, resources, and effort to run; some are forgotten soon after publication; some are written in outdated versions of their languages and are therefore hard to incorporate into other pipelines; some are too specialized (and, again, outdated) to serve as a routine technique; and the "gold standard" tools such as Bismark require a lot of computational power and time. Moreover, there is no comprehensive comparison of how these tools perform on measures such as alignment error, since tool authors choose their own benchmarks. Some reviews have attempted to offer guidelines for choosing the best tools for this task, but they rarely include direct performance comparisons and do not cover the newest tools, e.g., machine learning-based ones.
We propose that students tackle the processing of raw DNAm data generated by bisulfite conversion (the most widely used approach for DNAm profiling) by building a full-stack Python-based pipeline. The pipeline would take raw reads as input; perform read QC and trimming, mapping to a genome, and methylation calling; and output a matrix of methylation fractions (beta values). The students do not need to derive their own mathematics for these steps; instead, they are welcome to survey the existing tools and incorporate the best of them into their pipeline. In total, the tasks are:
1) compare existing DNAm processing tools,
2) create a new Python package guiding users from raw reads to methylation matrices,
3) (if all goes well) compare tools for DNAm site annotation and differential methylation analysis, and
4) incorporate the best of them into the pipeline.
As a result, the students will be able to publish this pipeline (or, at the very least, present it at conferences) and pave the way for more unified, comprehensive, and yet user-friendly processing of DNA methylation data.
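To make the pipeline's intended output concrete, the last stage described above (turning per-site methylation calls into a matrix of methylation fractions) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the function names, the `(chromosome, position)` site key, the `(methylated, unmethylated)` count layout, and the `min_coverage` parameter are all hypothetical and are not taken from any existing tool. The fraction is assumed to be methylated reads divided by total reads at a site.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical data layout, assumed for illustration only.
Site = Tuple[str, int]    # (chromosome, 1-based position)
Counts = Tuple[int, int]  # (methylated reads, unmethylated reads)


def methylation_fraction(
    methylated: int, unmethylated: int, min_coverage: int = 1
) -> Optional[float]:
    """Return methylated / (methylated + unmethylated), or None if coverage is too low."""
    coverage = methylated + unmethylated
    # Treat uncovered or under-covered sites as missing rather than as 0.
    if coverage < max(min_coverage, 1):
        return None
    return methylated / coverage


def build_matrix(
    samples: Dict[str, Dict[Site, Counts]], min_coverage: int = 1
) -> Tuple[List[Site], List[str], List[List[Optional[float]]]]:
    """Build a sites-by-samples matrix; sites absent from a sample become None."""
    sites = sorted({site for calls in samples.values() for site in calls})
    names = sorted(samples)
    rows: List[List[Optional[float]]] = []
    for site in sites:
        row: List[Optional[float]] = []
        for name in names:
            counts = samples[name].get(site)
            if counts is None:
                row.append(None)
            else:
                row.append(methylation_fraction(*counts, min_coverage=min_coverage))
        rows.append(row)
    return sites, names, rows


# Toy input: two made-up samples with invented read counts.
example_samples = {
    "sample_a": {("chr1", 100): (8, 2), ("chr1", 200): (0, 5)},
    "sample_b": {("chr1", 100): (3, 3)},
}
example_sites, example_names, example_rows = build_matrix(example_samples)
```

With the toy input, rows correspond to sites and columns to samples, and a site missing from a sample is kept as a missing value instead of being reported as unmethylated. A real pipeline would read the counts from the chosen methylation caller's output and would likely use a dedicated table library; both choices are left open here.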