Research projects
Academic year 2023/24
The surface proteins of pathogens are under strong selection because they are targeted by the host immune system. One of the mechanisms bacteria use to evade this pressure is phase variation. We will consider the case in which an inversion involving a pair of genes encoding variants of these surface proteins switches expression between the variants. The goal of the project, based on data on the histidine triad proteins of streptococci and the ANK proteins of Wolbachia, is to write a pipeline that searches for proteins whose genes undergo phase variation, and to package it neatly as a tool that takes an annotated set of genes and proteins as input and returns the candidates suspected of phase variation.
      Fundamentally, any multicellular organism is an ecosystem, as it hosts many symbiotic microorganisms on and inside its body. When we sequence the DNA or RNA of a whole organism, we therefore also get some idea of the diversity of the microbes associated with it. The goal of the project is to uncover the microorganisms dwelling in (or on) endemic Baikal amphipods (relatively small crustaceans living in many freshwater and marine environments, including the ancient Lake Baikal) and the dynamics of the microbiome under laboratory conditions. The particular steps will be: 1. Annotate the microorganisms in assembled transcriptomes using published databases. 2. Compile the core microbiome of the samples (and compare it to the known data on water microbiomes). 3. Compare the microbiomes of laboratory samples at different time points without any treatment, as well as under heat shock treatment.
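Step 2 above can be illustrated with a minimal sketch. It assumes that a "core" taxon is one detected in at least a chosen fraction of samples; the prevalence threshold, the function name `core_microbiome`, and the taxon and sample names are illustrative assumptions, not part of the project description.

```python
from collections import Counter


def core_microbiome(sample_taxa, min_prevalence=0.9):
    """Return taxa detected in at least `min_prevalence` of the samples.

    sample_taxa maps a sample name to the set of taxa annotated in it.
    """
    if not sample_taxa:
        return set()
    counts = Counter()
    for taxa in sample_taxa.values():
        counts.update(set(taxa))  # count each taxon once per sample
    n_samples = len(sample_taxa)
    return {t for t, c in counts.items() if c / n_samples >= min_prevalence}


# Hypothetical annotation results for three laboratory samples.
samples = {
    "t0": {"Flavobacterium", "Pseudomonas", "Rickettsia"},
    "t1": {"Flavobacterium", "Pseudomonas"},
    "heat_shock": {"Flavobacterium", "Acinetobacter"},
}
strict_core = core_microbiome(samples, min_prevalence=1.0)
relaxed_core = core_microbiome(samples, min_prevalence=0.6)
```

Presence/absence is the simplest possible definition; an abundance cutoff per sample could be added before building the sets.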
      Lake Baikal is an oligotrophic water body with a low concentration of nutrients. During the ice period, the upper layer of water near the bottom surface of the ice becomes increasingly enriched with nutrients due to salting-out processes (doi:10.1080/03680770.1998.11898179), which promotes the development of microorganisms. The goal of the project is to identify similarities and differences between the communities of bacteria and microeukaryotes living on the bottom surface of the ice and in the water column of Lake Baikal using metabarcoding of 16S and 18S rRNA gene fragments. In addition, we will try to determine the relationship of the communities with physical and chemical environmental factors. Further reading: Bashenkhaeva et al., 2015, 2023
      Transformation of a normal cell into a cancerous one is driven by the acquisition of cancer driver events, which can include specific activating mutations in oncogenes, protein-disrupting mutations in tumor suppressor genes, generation of gene fusions, loss of tumor suppressor genes, and amplifications of oncogenes, as well as epigenetic changes. During the last decade, more than 600 cancer-driver genes have been identified based on mutational analysis in large international cohorts comprising thousands of tumours (https://pubmed.ncbi.nlm.nih.gov/32778778/), but theoretical expectations and practical observations both point to a significantly higher number of cancer drivers in different tumour types. The majority of high-level amplifications in cancers (more than 15 copies) are associated with extrachromosomal circular DNA (eccDNA), but only 25-35% of them harbour known oncogenes. At the same time, the presence of eccDNA in a cancer is itself very strong evidence of selection acting on the amplified region (without selection, a high copy number of eccDNA cannot be maintained because of the stochastic loss and dilution of eccDNA during cancer cell division). In this work, we will use a large cohort of almost 9000 tumours from the TCGA, META-PRISM and MET500 projects to identify rare cancer drivers and progression genes located in eccDNA regions. Materials and methods: We will use previously calculated copy number profiles of tumours, RNA-seq-based expression of genes in cancers and normal tissues from different databases, and literature mining to prioritize genes that may be drivers of tumour growth, progression, metastasis and invasion. The first step of the work will include quality control of the copy number alteration profiles of cancers to exclude regions with known oncogenes and ambiguous regions. 
During the second step, we will prioritize the most likely candidate driver genes in each region of high-level genomic amplification, based on a score built from the level of gene expression and the occurrence of the gene in PubMed abstracts and in relevant databases such as the kinase atlas. The third step of the work will include pathway enrichment analysis of the identified candidate driver genes per tumour type and for primary versus metastatic cancers, and building genetic networks of the new drivers and known oncogenes to understand their relationship. During this work, students will mostly work with table files of different sizes (from 1 KB to 10 GB), R-based packages and online software (gene set enrichment analysis). Expected results: Identification of new putative cancer driver genes and associated cellular pathways. Prioritization of genes for future in vitro validation experiments.
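One possible shape for the second-step score is sketched below. The abstract names only the inputs (expression level, PubMed occurrence, database membership); the weighted-sum formula, the weights, the pseudocount and all names in the code are assumptions made for illustration.

```python
import math


def driver_score(tumour_expr, normal_expr, pubmed_hits, in_curated_db,
                 weights=(1.0, 0.5, 1.0), pseudocount=1.0):
    """Combine three kinds of evidence into one score (hypothetical formula)."""
    w_expr, w_lit, w_db = weights
    # Expression evidence: log2 fold change of tumour over normal tissue.
    log_fc = math.log2((tumour_expr + pseudocount) / (normal_expr + pseudocount))
    # Literature evidence: damped count of abstracts mentioning the gene.
    literature = math.log10(1 + pubmed_hits)
    # Database evidence: 1 if the gene is listed in a curated resource.
    database = 1.0 if in_curated_db else 0.0
    return w_expr * log_fc + w_lit * literature + w_db * database


def rank_candidates(genes):
    """Sort the genes of one amplified region by decreasing score."""
    return sorted(
        genes,
        key=lambda g: driver_score(g["tumour_expr"], g["normal_expr"],
                                   g["pubmed_hits"], g["in_curated_db"]),
        reverse=True,
    )


# Hypothetical genes from a single amplified region.
region = [
    {"name": "GENE_B", "tumour_expr": 7, "normal_expr": 7,
     "pubmed_hits": 0, "in_curated_db": False},
    {"name": "GENE_A", "tumour_expr": 31, "normal_expr": 3,
     "pubmed_hits": 99, "in_curated_db": True},
]
ranked_names = [g["name"] for g in rank_candidates(region)]
```

In practice the weights would have to be calibrated, for example against regions whose driver is already known.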
      Inherited retinal degenerations (IRDs) are genetically and phenotypically heterogeneous, with a molecular diagnostic yield of 50-70%. There are several reasons for the missing heritability in IRDs, such as deep intronic variants, non-coding regulatory variants, copy number variants, complex structural variants, or mobile element insertions. Whole genome sequencing (WGS) may identify causal variants in an additional 10-20% of unsolved cases. An analysis process driven by a clinician-scientist, combined with the education and research work of bioinformatics students, is expected to increase the diagnostic yield in these remaining 10-20% of cases.
      Project goals:
      1. Learn about IRDs and new horizons in their diagnostics and treatment
      2. Learn WGS analysis (using seqr software)
      3. Apply tools to reveal standard and hidden disease-causing genetic pathology
      4. Turn the learning process into a finished project suitable for publication.
      The field of IRDs is diverse, and one project can be split into several sub-projects, enough for several students.
      A genome-wide association study (GWAS) is planned for survivors of the Siege of Leningrad. Because of the strong influence of the external environment, it is assumed that variants that could alter metabolism and contribute to increased survival were positively selected. The following tasks will need to be performed as part of the project:
      - Processing raw sequencing data (data quality assessment, alignment, and variant calling to VCF files);
      - GWAS analysis to identify significant SNPs;
      - Assessing the impact of the identified SNPs on metabolism;
      - Comparing the identified SNPs with the frequencies of SNPs in other populations.
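The per-SNP association test in the second task can be sketched as follows. The sketch assumes a simple case/control design (survivors versus a reference group) and an allelic Pearson chi-square test on a 2x2 table of allele counts; real GWAS tools additionally correct for covariates and population structure, and the counts below are invented.

```python
import math


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi2 statistic, p-value).
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row)
    if 0 in row or 0 in col:
        return 0.0, 1.0  # monomorphic site or empty group: nothing to test
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts for one SNP.
chi2, p = allelic_chi2(case_alt=60, case_ref=40, ctrl_alt=40, ctrl_ref=60)
```

A conventional genome-wide significance threshold is p < 5e-8, which accounts for the very large number of SNPs tested.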
      Positional information plays a huge role in regeneration processes. In essence, it is the distribution of various kinds of signals in the space of the body that determine the positions of cells along its axes. It is believed that the ability to maintain persistent gene expression during the normal growth of an animal keeps chromatin in an open state and allows cells to respond quickly in the event of tissue damage. Not coincidentally, the maintenance of positional information in juvenile and adult states has been shown in animals demonstrating successful reparative regeneration. Rapid restructuring of positional information in response to body damage has also been shown, for example, in planarians and some annelids. As objects of study, we chose two annelids with different regenerative potencies: Pygospio elegans from the family Spionidae perfectly restores the anterior and posterior ends of the body, while Arenicola marina from the family Arenicolidae can barely heal the wound surface, and restores the lost posterior end of the body through hypertrophic growth of the remaining segments. We divided the body of each worm into 12 parts and isolated total RNA from each fragment. For each fragment, a transcriptome was obtained and assembled. The goal of the work is to search the obtained transcriptomes for sequences encoding conserved developmental factors, verify the sequences with BLAST, perform a phylogenetic analysis of the sequences found, and assess the distribution of transcripts in the intact body of the worms by constructing heat maps or distribution graphs. The number of sequences to analyze is open to discussion and may vary depending on the progress of the project.
      DNA methylation (DNAm) has been widely used to estimate epigenetic age as a proxy metric of biological age in various research settings, for example, to test whether a pro-longevity intervention such as rapamycin treatment or caloric restriction affects aging in short-term experiments. Multiple DNAm-based (and other omics-based) clock models have been described, but they all perform differently from each other, so researchers often resort to several clocks simultaneously to substantiate their findings. Unfortunately, since all of these clocks were developed independently, they must be installed from separate places, and then processed and trained anew, which is highly inconvenient and might affect reproducibility. Currently, there is an R (Bioconductor) package called methylclock which allows users to generate age predictions using a number of existing clock models, to check their correlation metrics, and to visualize them. However, it has several significant drawbacks (e.g., compatibility errors; lack of dataset normalization, QC, batch effect correction, and other processing steps; and lack of some well-known and widely employed clocks), and it is focused on DNAm only, all of which makes this package of limited use.
      We invite students to create a convenient and comprehensive pipeline for biological age estimation and comparison, in the course of which they will:
      1) translate methylclock from R into Python,
      2) create a pipeline for dataset downloading and processing (from methylation matrices to normalization, QC, etc.),
      3) add other clock models (built on other types of data including blood biochemistry, RNA-seq, etc. – optional), and
      4) improve prediction metrics calculation and visualization.
      The result of this project will be a publishable Python module for fast and easy-to-use estimation of biological age.
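As a rough illustration of what such a module would compute, the sketch below treats a clock as a linear model over CpG beta values, which is the general form of many published clocks (some apply an additional age transformation, omitted here). The probe identifiers, coefficients and function names are hypothetical and are not taken from methylclock.

```python
from statistics import median


def predict_age(betas, coefficients, intercept=0.0):
    """Apply a linear clock: intercept + sum(weight * beta) over the clock's CpGs."""
    missing = [cpg for cpg in coefficients if cpg not in betas]
    if missing:
        raise KeyError(f"sample lacks CpG probes required by the clock: {missing}")
    return intercept + sum(w * betas[cpg] for cpg, w in coefficients.items())


def median_absolute_error(predicted, chronological):
    """A common summary of clock accuracy, in years."""
    return median(abs(p - c) for p, c in zip(predicted, chronological))


# Hypothetical two-probe clock and one sample of beta values.
toy_clock = {"cg_hypothetical_1": 40.0, "cg_hypothetical_2": -20.0}
sample = {"cg_hypothetical_1": 0.75, "cg_hypothetical_2": 0.25, "cg_unused": 0.5}
age = predict_age(sample, toy_clock, intercept=30.0)
error = median_absolute_error([55.0, 40.0, 62.0], [50.0, 41.0, 60.0])
```

Keeping every clock as a plain coefficient table behind one `predict_age` function is one way to make several clocks directly comparable on the same processed matrix.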
      Among epigenetic modifications, DNA methylation (DNAm) remains the most thoroughly investigated feature, and it plays important roles in a myriad of biological processes from cancer to aging. As novel methods for profiling DNA methylation arise and new data are generated in huge quantities, the tools for DNAm data processing are proliferating as well. However, as often happens in bioinformatics, the tools are ample but unreliable: some of them require a lot of time, resources, and effort to handle, some are forgotten immediately after publication, some are written in outdated versions of languages (and so are hard to incorporate into other pipelines), some are too specialized (and, again, outdated) to be used as a routine technique, and the "gold standard" tools such as Bismark require a lot of computational power and time. Moreover, there is no comprehensive comparison of how all these tools perform in terms of alignment errors etc., since the tool authors select their own benchmarks. Some reviewers have attempted to suggest guidelines for choosing the best tools for this task, but these attempts are suboptimal (they rarely include direct performance comparisons) and don't include the newest tools, e.g., machine learning-based ones. We propose that students tackle the processing of raw DNAm data generated by bisulfite conversion (as it's the most widely used approach for DNAm profiling) via a full-stack Python-based pipeline that would take raw reads as input, perform read QC and trimming, mapping to a genome, and methylation calling, and provide a matrix of methylation fractions (beta values) as output. The students don't have to devise their own mathematics for all this; instead, they are welcome to survey the existing tools and incorporate the best ones into their pipeline. In total, the tasks are:
      1) compare existing DNAm processing tools,
      2) create a new Python package guiding users from raw reads to methylation matrices, and
      3) (if all goes well) compare tools for DNAm sites annotation and differential methylation analysis and
      4) include the best of them into the pipeline.
      As a result, the students will be able to publish (or, at the very least, present at conferences) this pipeline and pave the way for a more unified, comprehensive, and yet user-friendly processing of DNA methylomics data.
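The final output of such a pipeline, the beta value matrix, can be sketched as follows. In bisulfite sequencing a methylated cytosine is still read as C while an unmethylated one is read as T, so the methylation fraction at a site is methylated reads divided by total reads. The input layout, the minimum-depth filter and all names below are assumptions for illustration.

```python
def beta_value(methylated, unmethylated, min_depth=1):
    """Methylation fraction at one site, or None if coverage is too low."""
    depth = methylated + unmethylated
    if depth == 0 or depth < min_depth:
        return None
    return methylated / depth


def beta_matrix(calls, min_depth=5):
    """Build a site-by-sample matrix of beta values.

    calls maps sample -> {site: (methylated_reads, unmethylated_reads)}.
    Returns (sample order, {site: [beta per sample]}).
    """
    samples = sorted(calls)
    sites = sorted({site for per_sample in calls.values() for site in per_sample})
    rows = {}
    for site in sites:
        rows[site] = [
            beta_value(*calls[s].get(site, (0, 0)), min_depth=min_depth)
            for s in samples
        ]
    return samples, rows


# Hypothetical methylation calls for two samples.
calls = {
    "s1": {("chr1", 100): (8, 2), ("chr1", 200): (1, 1)},
    "s2": {("chr1", 100): (5, 5)},
}
samples, rows = beta_matrix(calls, min_depth=5)
```

Sites below the depth threshold are kept as `None` rather than dropped, so that every sample has the same rows.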
      VDJdb is the database of T cell receptor (TCR) sequences of known antigen specificity built upon the corpus of relevant immunological publications. However, the landscape of TCRs is enormously large and the immune repertoires have little overlap across individuals. This is the reason why it is difficult to find matches of TCRs of interest in VDJdb. There are different ways to handle this problem such as searching using k-mers and fuzzy matching together with applying basic NLP approaches such as Term Frequency - Inverse Document Frequency (TF-IDF) score. Recent advances in machine learning based methods such as Bidirectional Encoder Representations from Transformers (BERT) or Generative Pre-trained Transformer (GPT) models make it possible to find similar TCRs using predictions. The aim of this project is to create a language model which would allow predicting the closest TCR match and suggesting relevant publications that include associated TCRs.
      Aim: Create a search system that would suggest diseases and antigens associated with the given TCR CDR3 sequence and relevant publications.
      Objectives:
      - Create a language model (BERT) for TCR sequences
      - Get abstracts of publications mentioning TCRs and link the BERT model to them
      - Validate model accuracy by searching for TCRs similar to ones recognizing certain antigens and mentioned in publications available in VDJdb
      - Evaluate the model's suggestions based on a set of examples to check method feasibility
      - (optional) Embed the model into the VDJdb web server.
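Before a language model is trained, the k-mer and TF-IDF baseline mentioned in the description can serve as a reference point. The sketch below is that baseline only, not the proposed BERT model; the k-mer length, the smoothed IDF formula and the toy CDR3 strings are assumptions.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(cdr3s, k=3):
    """Compute smoothed IDF weights and a TF-IDF vector per database sequence."""
    docs = [Counter(kmers(s, k)) for s in cdr3s]
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    n = len(cdr3s)
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest(query, cdr3s, k=3):
    """Return the database sequence most similar to the query and its score."""
    idf, vectors = build_index(cdr3s, k)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(kmers(query, k)).items()}
    scores = [cosine(q, v) for v in vectors]
    best = max(range(len(cdr3s)), key=scores.__getitem__)
    return cdr3s[best], scores[best]


# Toy CDR3 strings standing in for database entries.
database = ["CASSLGQAYEQYF", "CASSIRSSYEQYF", "CAVNTGNQFYF"]
match, score = closest("CASSLGQGYEQYF", database)
```

Because rare k-mers get higher IDF weights, the shared `CASS...EQYF` framing contributes less to the score than the central, more variable part of the CDR3.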
      T cells are an essential part of the human adaptive immune system. Their generation includes a process called V(D)J recombination, in which one V, one D, and one J gene are selected from the set of germline genes in the TCR locus (as catalogued by IMGT) and rearranged to form a T-cell receptor that determines the antigen and pathogen specificity of the T cell. It has been found that some people carry copy number variations and allelic variants of certain V/D/J genes. This affects the repertoire structure dramatically, sometimes resulting in a lack of specific T cells essential for an immune response to certain pathogens. Recent studies have shown that amplifications and deletions are present for both TRA and TRB genes, but there is still no research on whether these deletions are coherent or not. This project aims to analyze the deletions in the TRA and TRB loci in a large cohort of T-cell repertoires in order to identify allelic variants and common haplotypes in Russian and American populations. It is also planned to analyze the interplay between TRA and TRB gene expression patterns in order to find potential compensatory mechanisms related to locus variation.
      Aim: Analyze copy number variation in the TRA and TRB loci for a large cohort of donors and find common and rare haplotypes.
      Objectives:
      - Detect TRA/TRB deletions and amplifications and estimate their population frequencies
      - Analyze coherence of amplifications and deletions forming haplotypes in TRA and TRB genes
      - Analyze co-expression patterns in V-V and V-J pairs for TRA/TRB genes
      - Analyze co-expression patterns between TRA and TRB genes
      - (optional) Compare found patterns between populations.
      Loss of heterozygosity (LOH) is a type of genetic abnormality in diploid organisms in which one copy of an entire gene and its surrounding chromosomal region is lost. In many cases LOH is accompanied by a heterozygous deletion of the chromosomal region. If LOH is not accompanied by a deletion, it may indicate uniparental disomy or consanguinity. The aim of this project is to develop a pipeline for detection of LOH regions in human exome and genome sequencing data.
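One simple starting point for such a pipeline is to look for long stretches without heterozygous calls. The sketch below assumes variants already reduced to "hom"/"het" labels and sorted by chromosome and position; the thresholds and the function name are illustrative, and telling a deletion apart from copy-neutral LOH would additionally require read depth.

```python
def find_loh_candidates(variants, min_sites=4, min_length=1_000_000):
    """Report runs of consecutive homozygous sites as candidate LOH regions.

    variants: iterable of (chrom, pos, genotype) sorted by chrom and pos,
    where genotype is "hom" or "het".
    Returns a list of (chrom, start, end, number_of_sites).
    """
    regions = []
    run = []  # consecutive homozygous sites on one chromosome

    def close_run():
        if len(run) >= min_sites and run[-1][1] - run[0][1] >= min_length:
            regions.append((run[0][0], run[0][1], run[-1][1], len(run)))
        run.clear()

    for chrom, pos, genotype in variants:
        # A heterozygous call or a new chromosome ends the current run.
        if run and (genotype != "hom" or chrom != run[0][0]):
            close_run()
        if genotype == "hom":
            run.append((chrom, pos))
    close_run()
    return regions


# Hypothetical genotype calls.
calls = [
    ("chr1", 100, "het"),
    ("chr1", 1_000, "hom"),
    ("chr1", 500_000, "hom"),
    ("chr1", 900_000, "hom"),
    ("chr1", 1_500_000, "hom"),
    ("chr1", 1_600_000, "het"),
    ("chr2", 100, "hom"),
    ("chr2", 200, "hom"),
]
candidates = find_loh_candidates(calls)
```

A single genotyping error breaks a run in this version, so a practical implementation would tolerate a small fraction of heterozygous calls per window.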
      Data from WGS-based non-invasive prenatal testing (NIPT), or cell-free DNA testing, contain exogenous (bacterial and viral) DNA. This information is too fragmentary for full microbiome studies, but it is still interesting for expanding NIPT functionality. Being a retrovirus, HIV cannot be directly detected in cell-free DNA data. So the first step will be to create residual virus and microbiome profiles for two datasets from two different sequencing platforms (ThermoFisher and BGI). Once the pipeline is established, students will proceed to the analysis of HIV-positive sequencing data and find out whether there are differences in exogenous DNA composition between HIV- and HIV+ NIPT samples that could serve as indirect signs of HIV infection. In general, the pipeline will include the extraction of unmapped reads, assignment of taxonomic labels, and subsequent data analysis.
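The first pipeline step, extracting unmapped reads, relies on the SAM flag field, where bit 0x4 marks a read as unmapped. The sketch below works on text SAM lines to stay dependency-free; the function name and the toy records are illustrative. On real BAM files the same filter is usually applied with `samtools view -f 4`.

```python
def unmapped_reads(sam_lines):
    """Yield (name, sequence, qualities) for reads whose FLAG has bit 0x4 set."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            yield fields[0], fields[9], fields[10]


# Toy SAM content: one header line, two mapped reads, two unmapped reads.
sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
    "r3\t99\tchr1\t200\t60\t4M\t=\t300\t104\tGGCC\tIIII",
    "r4\t77\t*\t0\t0\t*\t*\t0\t0\tCCAT\tIIII",
]
extracted = list(unmapped_reads(sam))
```

The extracted reads can then be written out as FASTQ and passed to a taxonomic classifier of the team's choice.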
      The last decade in bioinformatics has seen active development in the study of chromatin structure. Techniques such as 3C and Hi-C have made it possible to study the three-dimensional organization of DNA, the so-called 3D genome. Improvements in sequencing methods and increased resolution have allowed the discovery of finer chromatin compaction structures, such as topologically associating domains (TADs). TADs are responsible for the formation of localized DNA packaging regions. Thus, they participate, for example, in controlling the expression level of certain DNA regions, forming a unique epigenetic landscape. The role of TADs in the chromatin structure of human, Drosophila and other organisms has now been studied in detail. It is now clear that, by influencing epigenetics, chromatin packaging can determine differences between cell types in multicellular organisms. For example, it has been shown that different human cell types do indeed have different patterns of TADs and other structures. This becomes particularly interesting when studying various diseases and disorders, as the 3D genome may shed some light on their pathoetiology. In this regard, it is critically important to be able to statistically compare DNA packaging patterns between different types of samples. This can be thought of as analogous to differential gene expression analysis. Such methods have already been developed for structure types such as loops and contacts. However, a good tool for studying differential TADs is still lacking. There are some existing developments that will aid the work, and overall this project would be of great benefit to the bioinformatics community.
      Objectives of the work:
      - Exploring the tools available for finding differential chromatin structures, understanding the statistical and mathematical approaches that are used (a short set of such tools will be given).
      - Design of how these statistical methods could be applied to compare TADs.
      - Writing a standalone Python tool (Snakemake can be used) to calculate differential TADs.
      - Running the tool on some data, evaluating the tool.
      - Remember to rest and enjoy life.
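To make the comparison problem concrete, the sketch below performs the most naive possible check: it lists TAD boundaries of one condition that have no counterpart within a tolerance in the other. It involves no statistics and is not the differential method the project asks for; the tolerance value, function name and coordinates are assumptions.

```python
import bisect


def unmatched_boundaries(a, b, tolerance=40_000):
    """Boundaries in `a` with no boundary in `b` within `tolerance` base pairs.

    Both inputs are boundary positions on the same chromosome.
    """
    b_sorted = sorted(b)
    unmatched = []
    for pos in sorted(a):
        i = bisect.bisect_left(b_sorted, pos)
        # Only the nearest boundary on each side can be within tolerance.
        neighbours = b_sorted[max(i - 1, 0):i + 1]
        if not any(abs(pos - q) <= tolerance for q in neighbours):
            unmatched.append(pos)
    return unmatched


# Hypothetical boundary calls for two conditions.
condition_1 = [100_000, 500_000, 900_000]
condition_2 = [120_000, 880_000, 1_500_000]
only_in_1 = unmatched_boundaries(condition_1, condition_2)
only_in_2 = unmatched_boundaries(condition_2, condition_1)
```

A statistical approach would replace the fixed tolerance with a model of boundary strength across replicates, which is where the methods borrowed from differential loop and contact analysis come in.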
      Description: Bacteria and their viruses, called bacteriophages, are in a constant arms race: a huge number of bacterial defense systems have been discovered in recent years. These defense systems allow bacterial populations to prevent the production of phage progeny in many different ways. BREX (BacteRiophage EXclusion) is a poorly understood defense system that provides protection against a wide range of bacteriophages. It has been noticed that BREX loci may serve as hotspots for other defense systems, which provide complementary protection against BREX-resistant phages. We therefore hypothesize that a large-scale comparative analysis of these loci may lead to the discovery of novel defense systems.
      Project goals:
      - To implement a workflow for extraction and characterization of variable gene clusters within BREX loci;
      - To validate the pipeline on the open databases;
      - To arrange the implemented pipeline in the form of a Docker container to analyze newly sequenced data.
      Description: Plasmid libraries containing randomized metagenomic or genomic DNA inserts are widely used to identify genes of interest through the change in their frequency under external selection pressure. For example, the frequency of antibiotic resistance genes will increase upon treatment of the cell library with antibiotics. This approach could be adapted to identify viral triggers that cause bacteria to commit suicide through the activation of abortive immunity systems. As a proof of concept, we constructed a library of phage T5 genome fragments and expressed this library in a bacterial culture carrying the abortive immunity system PARIS, which should activate a toxic response upon sensing a specific T5 protein. Such viral triggers could be identified by a decrease in the frequency of the encoding genes after prolonged incubation with PARIS, compared to the change in frequency in PARIS-negative control cells, which reflects only the intrinsic toxicity of viral products. Once its efficiency is verified, this method will have multiple applications in microbial immunology (e.g., the search for inhibitors of immune systems). We invite students to conduct a bioinformatic analysis of sequencing data in order to test this method and identify possible viral triggers of the PARIS system.
      Project goals:
      - To implement a workflow for analysis of plasmid expression library sequencing for characterization of functional traits of genes;
      - To test the workflow on sequencing of the library of T5 phage genome fragments in order to find the triggers of the bacteria defense system (PARIS).
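The comparison described above (frequency change with PARIS relative to the change in the PARIS-negative control) can be sketched as a difference of log2 fold changes. The pseudocount, the function names and the fragment counts below are assumptions; a real analysis would also need replicates and a significance test.

```python
import math


def log2_frequency_change(before, after, pseudocount=0.5):
    """log2(frequency after / frequency before) for every fragment."""
    fragments = sorted(set(before) | set(after))

    def freqs(counts):
        total = sum(counts.get(f, 0) + pseudocount for f in fragments)
        return {f: (counts.get(f, 0) + pseudocount) / total for f in fragments}

    f_before, f_after = freqs(before), freqs(after)
    return {f: math.log2(f_after[f] / f_before[f]) for f in fragments}


def selection_specific_change(sel_before, sel_after, ctrl_before, ctrl_after):
    """Change under selection minus the change in the control culture."""
    sel = log2_frequency_change(sel_before, sel_after)
    ctrl = log2_frequency_change(ctrl_before, ctrl_after)
    return {f: sel[f] - ctrl[f] for f in sel if f in ctrl}


# Hypothetical read counts per library fragment.
library = {"frag_A": 100, "frag_B": 100, "frag_C": 100}
paris_after = {"frag_A": 10, "frag_B": 100, "frag_C": 100}
control_after = dict(library)
scores = selection_specific_change(library, paris_after, library, control_after)
candidate = min(scores, key=scores.get)  # most strongly depleted fragment
```

Subtracting the control change removes fragments that are depleted simply because their product is toxic on its own.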
      Project description: In prokaryotic genomes, genes are organized into operons. Operons combine genes that participate in the same biological process and allow them to be regulated together. This forms a kind of orchestra of genome regulation, where individual parts of the genome come into play only when necessary. Thus, studying prokaryotic genomes operon by operon rather than gene by gene can provide many new insights. If the genes of different processes form a single operon, we can better understand which processes are linked and how. By studying unknown genes in known operons, we can make new discoveries. All of this leads to the need to find and annotate operons. There are already some tools in this area, but almost all of them either produce implausible results or don't work at all. The only adequate tool is a web application, which is unsuitable for automated analysis, and its site was recently down for almost a month. That is why our task is to build our own working and convenient tool on the basis of existing developments.
      Project goals:
      - To study existing approaches to operon mapping;
      - To develop a new approach to operon mapping;
      - To implement the developed approach as a standalone tool.
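A common baseline that any new approach would be compared against is the distance-and-strand heuristic: adjacent genes on the same strand separated by a short intergenic gap are placed in one operon. The sketch below implements only this baseline for a single contig; the gap threshold, the `Gene` class and the gene coordinates are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def predict_operons(genes, max_gap=50):
    """Group adjacent same-strand genes whose intergenic gap is at most max_gap."""
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            prev = operons[-1][-1]
            if gene.strand == prev.strand and gene.start - prev.end <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


# Hypothetical annotation of one contig.
annotation = [
    Gene("g1", 1, 1000, "+"),
    Gene("g2", 1010, 2000, "+"),
    Gene("g3", 2500, 3000, "+"),
    Gene("g4", 3020, 4000, "-"),
    Gene("g5", 4010, 5000, "-"),
]
predicted = [[g.name for g in op] for op in predict_operons(annotation)]
```

Overlapping genes give a negative gap and are therefore merged, which matches the frequent case of overlapping start and stop codons within operons.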
      Evaluation of the functional effects of genetic variants is a crucial task for interpretation of NGS results in rare disease diagnostics. Besides, understanding of the functional consequences of genetic variants is no less important for enhancing our understanding of how and why variants may have different effects in different cases. Recently, the Genome Aggregation Database (gnomAD) released an updated version of the human genome variation dataset, now including as many as 800,000+ human exomes and genomes. The goal of this project is to utilize this dataset to improve the prediction of genetic variant effects, as well as to explore patterns of variation using different types of variants.
      To reach this goal, the team may work in several independent directions:
      1. Analysis of the effects of sequence context on variant frequencies. In this case, we'll be exploring the relationship between variant frequency and the corresponding codon change and/or neighboring codons for synonymous, missense, and putative loss-of-function (pLoF) variants.
      2. Determining the parts of the gene sequence under increased evolutionary constraint. For this task, we'll try to use the intragenic distribution of variants and the information on their frequency to obtain region-specific estimates of the strength of selective pressure in human protein-coding genes.
      3. Developing a tool that can use information about the density of genetic variants in gnomAD v4 to annotate genetic variants in the user input files and (potentially) predict their functional consequences based on their location.
          1. Perform a literature review on the topic (find all available datasets with genome and/or transcriptome data and adequate smoking annotation).
          2. Construct an analysis pipeline (i.e., expression estimation for arrays, and expression quantification plus QC for NGS).
          3. Harmonize the datasets (especially concerning smoking index types).
          4. Construct a platform-independent classifier of smoker status.
          5*. Construct a regression model predicting the smoker index.
          The following aspects of the implementation of this task can be highlighted:
          - Receiving and annotating datasets. Study of age dependencies of various quantities;
          - Analysis of individual datasets of various kinds;
          - Implementation of approaches for the integral study of different datasets;
          - Visualization of the results obtained;
          - Drawing general biological inferences regarding important signaling and metabolic pathways, changes in which can be correlated with basic aging processes.
          Our lab has recently published a gene regulatory network inference method called GRaNIE. It relies on paired gene expression and chromatin accessibility data to identify connections between transcription factors, cis-regulatory elements, and their target genes. In our current work, we are trying to adapt this method to single-cell data in a pseudobulk fashion. However, we still lack a systematic evaluation of the method's stability with respect to various internal parameters: FDR thresholds, reference datasets, clustering resolution for the pseudobulk aggregation, etc.
          The goals for the students are to:
          1. Learn to apply the GRaNIE method for GRN inference.
          2. Perform the benchmarking of the method under various parameters.
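One way to quantify stability across parameter settings is to compare the sets of inferred edges pairwise. The sketch below uses the Jaccard index for this; it does not call GRaNIE itself, and the edge representation, setting labels and names are assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two edge collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_stability(networks):
    """Jaccard index for every pair of parameter settings.

    networks maps a setting label to an iterable of (TF, peak, gene) edges.
    """
    return {
        (x, y): jaccard(networks[x], networks[y])
        for x, y in combinations(sorted(networks), 2)
    }


# Hypothetical edge lists exported from runs with different FDR thresholds.
runs = {
    "fdr_0.1": {("TF1", "peak1", "geneA"), ("TF1", "peak2", "geneB"),
                ("TF2", "peak3", "geneC")},
    "fdr_0.2": {("TF1", "peak1", "geneA"), ("TF1", "peak2", "geneB"),
                ("TF2", "peak3", "geneC"), ("TF3", "peak4", "geneD")},
    "fdr_0.3": {("TF3", "peak4", "geneD")},
}
stability = pairwise_stability(runs)
```

The same table can be computed separately for TF-peak and peak-gene links to see which part of the network is more sensitive to a given parameter.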
          Raw data from conventional microfluidics-based single-cell assays contain a lot of reads from droplets that did not encapsulate any cells at all, while a smaller fraction of the droplets contains only 1 cell. The most widely used method for identifying "empty droplets" is called emptyDrops; it relies on statistical testing of expression differences between the ambient RNA profile and individual droplets. The Zaugg lab is currently working on a combinatorial indexing-based method that makes it possible to greatly overload the 10x Chromium controller and computationally demultiplex droplets that contain more than 1 cell. This poses a number of potential issues for the emptyDrops algorithm: the underlying assumption that the RNA of empty droplets is sampled from the same pool of ambient RNA may not always hold, and systematic differences in coverage between indexed samples can bias cell calling against lower-quality samples.
          The students' goals would be to:
          1. Become familiar with microfluidics-based single-cell RNA-seq methods and their modifications based on combinatorial indexing.
          2. Establish a number of quality control checks to test for potential issues with the cell calling.
          3. Identify the existence and extent of the issues with empty droplet identification within our own datasets.
          4. Modify the emptyDrops algorithm/application strategy to handle combinatorial indexing/overloading use case. (Optional)