2023/24

Research projects

Academic year 2023/24

Transformation of a normal cell into a cancerous one is driven by the acquisition of cancer driver events, which can include specific activating mutations in oncogenes, protein-disrupting mutations in tumour suppressor genes, generation of gene fusions, loss of tumour suppressor genes, amplification of oncogenes, and epigenetic changes. During the last decade, more than 600 cancer driver genes have been identified by mutational analysis of large international cohorts comprising thousands of tumours (https://pubmed.ncbi.nlm.nih.gov/32778778/), but both theoretical expectations and practical observations point to a significantly higher number of cancer drivers across tumour types. The majority of high-level amplifications in cancers (more than 15 copies) are associated with extrachromosomal circular DNA (eccDNA), but only 25-35% of them harbour known oncogenes. At the same time, the presence of eccDNA in a cancer is itself very strong evidence of selection acting on the amplified region: without selection, a high eccDNA copy number cannot be maintained, because eccDNA is stochastically lost and diluted during cancer cell division. In this work, we will use a large cohort of almost 9000 tumours from the TCGA, META-PRISM and MET500 projects to identify rare cancer drivers and progression genes located in eccDNA regions.
Materials and methods: We will use previously calculated tumour copy number profiles, RNA-seq-based gene expression in cancers and normal tissues from different databases, and literature mining to prioritize genes that may drive tumour growth, progression, metastasis and invasion. The first step will be quality control of the copy number alteration profiles to exclude regions with known oncogenes and ambiguous regions.
In the second step, we will prioritize the most likely candidate driver genes in each region of high-level genomic amplification, using a score based on the level of gene expression and the occurrence of the gene in PubMed abstracts and relevant databases such as a kinase atlas. The third step will include pathway enrichment analysis of the identified candidate driver genes per tumour type and for primary versus metastatic cancers, as well as building genetic networks of the new drivers and known oncogenes to understand their relationships. During this work, students will mostly work with tabular files of different sizes (from 1 kB to 10 GB), R-based packages and online software (Gene Set Enrichment Analysis).
Expected results: Identification of new putative cancer driver genes and associated cellular pathways, and prioritization of genes for future in vitro experimental validation.
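The first two steps described above (filtering amplified regions, then scoring the genes inside them) could be prototyped roughly as follows. This is only a minimal sketch and not the project's actual method: the `Gene` and `Region` records, the score formula, the weights and the placeholder gene names are all assumptions made for illustration. Only the threshold of more than 15 copies and the three kinds of evidence (expression, PubMed occurrence, database membership) come from the description. Python is used here for illustration even though the project itself expects R-based packages.

```python
from dataclasses import dataclass
from math import log2, log10

# From the project text: a high-level amplification has more than 15 copies.
HIGH_LEVEL_COPIES = 15


@dataclass
class Gene:
    symbol: str
    tumour_expr: float   # expression in tumours (e.g. TPM); hypothetical field
    normal_expr: float   # expression in normal tissue; hypothetical field
    pubmed_hits: int     # number of PubMed abstracts mentioning the gene
    in_database: bool    # listed in a relevant database (e.g. a kinase atlas)


@dataclass
class Region:
    name: str
    copy_number: float
    genes: list


def filter_regions(regions, known_oncogenes):
    """Step 1: keep high-level amplifications that contain no known oncogene."""
    return [
        r for r in regions
        if r.copy_number > HIGH_LEVEL_COPIES
        and not any(g.symbol in known_oncogenes for g in r.genes)
    ]


def driver_score(gene, w_expr=1.0, w_lit=0.5, w_db=1.0):
    """Step 2: combine the three kinds of evidence into one score.

    The formula and the weights are illustrative assumptions only.
    """
    # log2 fold change tumour vs normal, with a pseudocount of 1
    expr = log2((gene.tumour_expr + 1) / (gene.normal_expr + 1))
    # log-scaled literature count, so heavily studied genes do not dominate
    lit = log10(gene.pubmed_hits + 1)
    db = 1.0 if gene.in_database else 0.0
    return w_expr * expr + w_lit * lit + w_db * db


def rank_genes(region):
    """Return the genes of a region, best candidate first."""
    return sorted(region.genes, key=driver_score, reverse=True)


# Toy data: GENE_A, GENE_B and GENE_C are placeholders, not real findings.
regions = [
    Region("R1", 20, [Gene("GENE_A", 99, 9, 9, True),
                      Gene("GENE_B", 9, 9, 99, False)]),
    Region("R2", 30, [Gene("MYC", 50, 5, 1000, False)]),  # known oncogene
    Region("R3", 4, [Gene("GENE_C", 10, 10, 0, False)]),  # too few copies
]

for region in filter_regions(regions, known_oncogenes={"MYC"}):
    for gene in rank_genes(region):
        print(region.name, gene.symbol, round(driver_score(gene), 2))
```

In a real analysis the weights would need to be justified, for example by checking how well the score recovers already known drivers.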

Among epigenetic modifications, DNA methylation (DNAm) remains the most thoroughly investigated; it plays important roles in a myriad of biological processes, from cancer to aging. As novel methods for profiling DNA methylation arise and new data are generated in huge quantities, tools for DNAm data processing are proliferating as well. However, as often happens in bioinformatics, the tools are plentiful but unreliable: some require a lot of time, resources, and effort to operate; some are forgotten soon after publication; some are written in outdated versions of languages (and so are hard to incorporate into other pipelines); some are too specialized (and, again, outdated) to serve as a routine technique; and the "gold standard" tools such as Bismark require a lot of computational power and time. Moreover, there is no comprehensive comparison of how these tools perform in terms of, for example, alignment errors, since tool developers choose their own benchmarks. Some review authors have attempted to suggest guidelines for choosing the best tools for this task, but these attempts are suboptimal (they rarely include direct performance comparisons) and do not cover the newest tools, e.g., machine learning-based ones.
We propose that students tackle the processing of raw DNAm data generated by bisulfite conversion (the most widely used approach for DNAm profiling) with a full-stack Python-based pipeline that takes raw reads as input, performs read QC and trimming, mapping to a genome, and methylation calling, and outputs a matrix of methylation fractions/beta values. The students do not have to devise their own mathematics for all this; instead, they are welcome to survey the existing tools and incorporate the best ones into their pipeline. In total, the tasks are:
1) compare existing DNAm processing tools,
2) create a new Python package guiding users from raw reads to methylation matrices,
3) (if all goes well) compare tools for DNAm site annotation and differential methylation analysis, and
4) include the best of them in the pipeline.
As a result, the students will be able to publish this pipeline (or, at the very least, present it at conferences) and pave the way for more unified, comprehensive, and yet user-friendly processing of DNA methylation data.
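To make the final output of such a pipeline concrete, here is a minimal sketch of its last stage only: turning per-site read counts into a matrix of methylation fractions. It assumes the upstream steps (QC, trimming, mapping, methylation calling) have already produced, for each sample, counts of methylated and unmethylated reads per site. The input layout, the function names and the `min_coverage` parameter are assumptions made for illustration and are not taken from any existing tool.

```python
def methylation_fraction(methylated, unmethylated, min_coverage=1):
    """Fraction of reads supporting methylation at one site.

    Returns None when coverage is below min_coverage (or zero), so that
    poorly covered sites are marked as missing instead of being guessed.
    """
    coverage = methylated + unmethylated
    if coverage < max(min_coverage, 1):
        return None
    return methylated / coverage


def methylation_matrix(calls, min_coverage=1):
    """Build a sites x samples matrix from per-sample count dictionaries.

    calls: {sample_name: {(chrom, pos): (methylated, unmethylated)}}
    Returns (sites, samples, rows), where rows[i][j] is the fraction for
    sites[i] in samples[j], or None if the site is not covered there.
    """
    samples = sorted(calls)
    sites = sorted({site for per_sample in calls.values() for site in per_sample})
    rows = []
    for site in sites:
        row = []
        for sample in samples:
            methylated, unmethylated = calls[sample].get(site, (0, 0))
            row.append(methylation_fraction(methylated, unmethylated, min_coverage))
        rows.append(row)
    return sites, samples, rows


# Toy counts for two samples and two CpG sites (made-up numbers).
calls = {
    "sample1": {("chr1", 100): (8, 2), ("chr1", 200): (0, 10)},
    "sample2": {("chr1", 100): (5, 5)},
}

sites, samples, rows = methylation_matrix(calls)
for site, row in zip(sites, rows):
    print(site, dict(zip(samples, row)))
```

Keeping missing values explicit (here as `None`) leaves the choice of imputation or filtering to the downstream differential methylation step.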

The last decade in bioinformatics has seen active development in the study of chromatin structure. Techniques such as 3C and Hi-C have made it possible to study the three-dimensional organization of DNA, also called the 3D genome. Improvements in sequencing methods and increased resolution have allowed the discovery of finer chromatin compaction structures, such as topologically associating domains (TADs). TADs form localized regions of DNA packaging. In this way they participate, for example, in controlling the expression level of certain DNA regions, forming a unique epigenetic landscape. The role of TADs in the chromatin structure of human, Drosophila and other organisms has now been studied in detail. It is now clear that, by influencing epigenetics, chromatin packaging can determine differences between cell types in multicellular organisms. For example, it has been shown that different human cell types do indeed have different patterns of TADs and other structures. This becomes particularly interesting when studying various diseases and disorders, as the 3D genome may shed some light on their pathoetiology. It is therefore critical to be able to statistically compare DNA packaging patterns between different types of samples. This can be thought of as analogous to differential gene expression analysis. Such methods have already been developed for structure types such as loops and contacts. However, a good tool for studying differential TADs is still lacking. There are some existing developments that will aid the work, and overall this project should be of great benefit to the bioinformatics community.
Objectives of the work:
- Exploring the tools available for finding differential chromatin structures, understanding the statistical and mathematical approaches that are used (a short set of such tools will be given).
- Designing how these statistical methods could be applied to compare TADs.
- Writing a standalone Python tool (Snakemake can be used) to calculate differential TADs.
- Running the tool on some data and evaluating it.
- Remember to rest and enjoy life.
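
As a possible starting point for the differential TAD objective, the sketch below compares TAD boundaries called in two conditions and reports which are shared and which are specific to one condition. It is a deliberately naive, non-statistical baseline and not an existing method: it assumes TADs on a single chromosome given as (start, end) intervals in base pairs, and the function names and the 40 kb tolerance are illustrative assumptions. A real tool would also need replicates and a proper statistical test, as in differential expression analysis.

```python
from bisect import bisect_left


def tads_to_boundaries(tads):
    """Collect the sorted, unique boundary positions of (start, end) TADs."""
    return sorted({pos for start, end in tads for pos in (start, end)})


def has_match(position, sorted_boundaries, tolerance):
    """True if some boundary lies within +/- tolerance of position."""
    # Index of the first boundary that is >= position - tolerance
    i = bisect_left(sorted_boundaries, position - tolerance)
    return i < len(sorted_boundaries) and sorted_boundaries[i] <= position + tolerance


def differential_boundaries(tads_a, tads_b, tolerance=40_000):
    """Split boundaries into shared and condition-specific sets.

    A boundary counts as shared if the other condition has a boundary
    within `tolerance` base pairs (an assumed value, roughly one Hi-C bin).
    Shared boundaries are reported at their positions in condition A.
    """
    a = tads_to_boundaries(tads_a)
    b = tads_to_boundaries(tads_b)
    return {
        "shared": [p for p in a if has_match(p, b, tolerance)],
        "only_a": [p for p in a if not has_match(p, b, tolerance)],
        "only_b": [p for p in b if not has_match(p, a, tolerance)],
    }


# Toy example: condition B splits the second TAD of condition A in two.
tads_a = [(0, 1_000_000), (1_000_000, 2_000_000)]
tads_b = [(0, 1_020_000), (1_020_000, 1_500_000), (1_500_000, 2_000_000)]

result = differential_boundaries(tads_a, tads_b)
print(result)
```

In this toy example the boundary at 1.5 Mb is reported as specific to condition B, while the slightly shifted boundary near 1 Mb is treated as shared because it falls within the tolerance.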
