Research projects

During their studies at the Institute, each student works on several research projects. A project involves weekly meetings with the supervisor as well as 5-10 hours of independent work per week.

The projects are supervised by leading specialists from Russian and international research laboratories and companies working in bioinformatics and biotechnology.

If your laboratory or company has interesting bioinformatics problems for students, write to us.

The Bioinformatics Institute aims at world-class research, so at the end of each semester project all information is translated into English.
Research projects 2018/2019
Fall 2018
Spring 2019
Study of the spectrum of genetic variants in the TTN gene | Federal Almazov Medical Research Centre

students: Olga Lebedenko
scientific adviser: Artem Kiselev

The TTN gene with 363 coding exons encodes titin, a giant muscle protein spanning from the Z-disk to the M-band within the sarcomere. Titin has roles in assembling and maintaining sarcomere structure, flexibility, stability, stretch and force transmission. Mutations in the TTN gene have been associated with various cardiomyopathies.

The main aim of this study was to investigate the spectrum of genetic variants in the TTN gene in a group of patients with cardiomyopathy. We processed 151 samples from patients with different types of cardiomyopathy, sequenced with Haloplex custom targeted capture, using an SNP calling pipeline implemented in Snakemake, and annotated the variants with snpEff. Among the 418 SNPs discovered, 64.44% were missense and 35.56% were silent. PCA showed no clustering of SNPs by type of cardiomyopathy. Fisher's exact tests with Bonferroni correction were used to compare allele frequencies of the observed variants against the entire gnomAD population. The pathogenicity of the 12 statistically significant missense variants was predicted with SIFT, PolyPhen-2, Mutation Assessor, Provean and I-Mutant 3.0.

Almost all variants showed a neutral effect on protein structure and stability. The most interesting SNP was rs9808377 (I62T), which I-Mutant 3.0 predicted to affect the stability of the titin Fn3-102 subunit. This result may be explained by difficulties with multiple comparisons arising from the high rate of spontaneous mutation due to the enormous size of the TTN gene. Another possible reason is that we analyzed accompanying TTN variants in a cardiomyopathy group with confirmed, well-known causative mutations.
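
The allele frequency comparison described above can be illustrated with a minimal sketch in plain Python. It is not the project's pipeline: the function names and the allele counts are assumptions made for illustration only (302 is simply 2 alleles x 151 samples; the population counts are invented).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are at most as likely as the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical counts: (alt, ref) alleles in the cohort vs. a reference population.
tables = [
    (12, 290, 150, 9850),
    (3, 299, 100, 9900),
]
raw = [fisher_exact_two_sided(*t) for t in tables]
for table, p_adj in zip(tables, bonferroni(raw)):
    print(table, p_adj)
```

With 418 variants tested, the Bonferroni factor would be 418 rather than 2, which shows how strict the correction becomes for a gene of this size.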

A library of functions for express analysis of FASTA/FASTQ files | Bioinformatics Institute

students: Alena Kizenko, Alisa Morshneva, Polina Pavlova
scientific adviser: Eugene Bakin

Bioinformatics projects that involve processing FASTA/FASTQ files often require solving routine tasks, e.g. deduplication of sequences. A common approach is to write small scripts in Python/bash or to use existing programs, which can be complicated to use. Therefore, we decided to create a flexible tool containing functions for processing files with sequencing data. We created a program called BreakFAST, which is based on Python 3 and the following libraries: Biopython, argparse, pandas, numpy, matplotlib and re. The tool consists of three modules. The basic statistics module can be used for counting:

● minimum, maximum, mean, total length of reads;
● GC-content;
● quality scores;
● N bases.

Filtering module can be used for deleting:
● reads shorter than X;
● reads containing Ns;
● poor quality reads;
● duplicates;
● reads with a particular motif.

Matching module can be used for:
● joining reads from files;
● finding overlapping between files;
● subtracting sets of reads from files.

When applying the commonly suggested Biopython functions, we faced performance problems while parsing large volumes of data. To mitigate this when iterating over FASTA/FASTQ files, we compared SeqIO.parse and Iterator from Biopython. We found that using Iterator in the Filtering and Matching modules was optimal for iteration (a 10-fold speed gain). Notably, we compared the "delete reads shorter than X" function with the corresponding Trimmomatic function and found that BreakFAST uses up to 7 times less RAM, which may be useful when a computer's capacity is limited. As a result, BreakFAST is a simple and customizable tool that can potentially be extended with new modules and functions.
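
To make the filtering operations listed above concrete, here is a minimal sketch of three of them (minimum length, reads containing Ns, duplicates). It is not BreakFAST's code: it uses a hand-written FASTQ reader instead of Biopython so that it runs without dependencies, and the names read_fastq and filter_reads are assumptions.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (title, sequence, quality) from a 4-lines-per-record FASTQ stream."""
    while True:
        title = handle.readline().rstrip()
        if not title:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield title[1:], seq, qual


def filter_reads(records, min_length, drop_n=True, deduplicate=True):
    """Lazily drop short reads, reads containing N, and repeated sequences."""
    seen = set()
    for title, seq, qual in records:
        if len(seq) < min_length:
            continue
        if drop_n and "N" in seq.upper():
            continue
        if deduplicate:
            if seq in seen:
                continue
            seen.add(seq)
        yield title, seq, qual


# Tiny made-up input: r2 is too short, r3 contains N, r4 duplicates r1.
SAMPLE = (
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACG\n+\nIII\n"
    "@r3\nACGNACGT\n+\nIIIIIIII\n"
    "@r4\nACGTACGT\n+\nIIIIIIII\n"
)
kept = list(filter_reads(read_fastq(StringIO(SAMPLE)), min_length=5))
print([title for title, _, _ in kept])
```

Because both functions are generators, reads are processed one at a time; only the set of already seen sequences grows with the input, which is the price of deduplication.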

Visualization of signaling pathways basing on genes differential expression profile | First Pavlov State University of St. Petersburg

students: Stanislav Legkovoy, Olga Romanova
scientific advisers: Eugene Bakin, Oksana Stanevich

In recent years, the NCBI Gene Expression Omnibus (GEO) database has accumulated a significant amount of data obtained with mRNA microarrays, which are widely used for the analysis of differential expression profiles. In gene expression studies, proper visualization of the results is an important task. One of the best ways to do this is to use the R language and related packages for statistical analysis, preprocessing and visualization of expression data through interaction with the KEGG database.

The aim of this study was to visualize signaling pathways according to gene expression data obtained from the GEO database. To achieve this goal, we implemented an easy-to-use script based on the pathview, gage and GEOquery packages, which allowed us to obtain gene expression data directly from GEO and to find the most significant signaling pathways in KEGG PATHWAY. Using the developed script, we analyzed Affymetrix microarray data and identified and visualized the most significant signaling pathways involved in the reprogramming of lymphatic endothelial cells infected by human Kaposi's sarcoma-associated herpesvirus (KSHV).

Comparative analysis of NUMT in underground and terrestrial rodents | Zoological Institute RAS

students: Ekaterina Sytnik
scientific adviser: Olga Bondareva

A NUMT (nuclear mitochondrial DNA segment) is a fragment of mitochondrial DNA transposed into the nuclear genome. NUMTs are found in all eukaryotes but differ significantly in length and number among species. The particular factors associated with NUMTs have not yet been determined. Given the specificity of mitochondrial genes, habitat conditions may be one such factor.

The aim of this study was to estimate the number of NUMTs in underground and terrestrial rodents. In this work we analyzed only long (>300 bp) NUMTs of protein-coding regions. We used the GenBank database for mitochondrial and nuclear genomes (4 species for each group) and BLAST to search for NUMTs. We found that some genes, such as ND4L and ATP8, are unlikely to be included in NUMTs, which may be due to the small size of these genes. Some underground species were shown to have a larger number of long NUMTs, but it is not yet clear whether the same is true for the whole group.

Further study should include a larger number of species and dN/dS analysis for each gene to determine whether some of the NUMTs may have a functional role.
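
The length filter mentioned above could look roughly like the sketch below. It assumes, purely for illustration, that mitochondrial genes were used as BLAST queries against a nuclear genome and that the hits were saved in the standard 12-column tabular format (where the fourth column is the alignment length); the function name and the example hits are invented.

```python
from collections import Counter

MIN_LENGTH = 300  # bp, the threshold for "long" NUMTs given in the text


def count_long_hits(blast_lines, min_length=MIN_LENGTH):
    """Count hits longer than min_length per query (mitochondrial gene)."""
    counts = Counter()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        gene, aln_length = fields[0], int(fields[3])
        if aln_length > min_length:
            counts[gene] += 1
    return counts


# Made-up hits: query, subject, identity, length, then the remaining columns.
example = [
    "ND1\tscaffold_1\t92.5\t450\t30\t2\t1\t450\t1000\t1449\t1e-150\t600",
    "ND1\tscaffold_2\t90.0\t120\t12\t0\t1\t120\t500\t619\t1e-30\t150",
    "COX1\tscaffold_3\t88.0\t800\t90\t3\t1\t800\t7000\t7799\t0.0\t900",
    "ATP8\tscaffold_4\t85.0\t300\t45\t0\t1\t300\t20\t319\t1e-80\t310",
]
long_hits = count_long_hits(example)
print(dict(long_hits))
```

A gene shorter than the threshold can never produce a qualifying hit of its own, which is consistent with the explanation given above for ND4L and ATP8.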

The comparative analysis of MDR Klebsiella pneumoniae genome | Children's Scientific and Clinical Center for Infectious Diseases

students: Anastasia Kapanina, Nina Lukashina, Daria Likholetova
scientific advisers: Eugene Bakin, Oksana Stanevich, Sergey S. Sidorenko

Klebsiella pneumoniae is a Gram-negative bacterium known as an opportunistic, hypervirulent and multidrug-resistant hospital pathogen. Its resistance to the carbapenem group of antibiotics makes it one of the main threats during hospitalisation. The diversity of K. pneumoniae is studied by whole-genome sequencing (WGS) and multiple typing methods, including multi-locus sequence typing (MLST), which separate strains into different lineages. In our study we assembled and analysed the genomes of 22 isolates from different years and sources from Saint Petersburg hospitals to identify their origin and describe their pangenome.

Using the Kleborate tool, we found that our strains belong to common European and Asian MLST types (ST147, ST11, ST340 and ST395). All of them carry the NDM-1 and ParC resistance genes, and only one carries OXA-48. Based on the genes discovered in the strains, we listed the antibiotics that would be ineffective for their treatment. Using PlasmidFinder, we detected the presence of plasmid R27 of Salmonella typhi, which can be explained by contamination of the samples or by horizontal transfer between K. pneumoniae and S. typhi.

According to the existing literature, the MLST types obtained are widespread in Europe and Asia. However, a genome structure analysis is needed to obtain a more detailed picture of the origin of the strains. In conclusion, we can say that between 2012 and 2016 no new sequence types were introduced into the hospitals studied. The results of the pangenome analysis can be used in treatment prescription.

Comparative analysis of the human pathogens genomes Neisseria meningitidis | Children's Scientific and Clinical Center for Infectious Diseases

students: Anton Matiiv, Ilia Sheshukov
scientific advisers: Eugene Bakin, Oksana Stanevich, Sergey S. Sidorenko

Neisseria meningitidis, or meningococcus, often colonizes the mucous membrane of the oropharynx without causing visible symptoms, but it is also the main cause of bacterial meningitis and sepsis throughout the world. The epidemiological profile of N. meningitidis varies between populations and over time. The virulence of meningococcus is based on its plastic genome and on the expression of certain capsular polysaccharides and non-capsular antigens. Twelve different serogroups based on the polysaccharide capsule have been identified, but only six of them (A, B, C, W, X and Y) account for 90% of invasive meningococcal disease worldwide. Seven housekeeping genes are used for multilocus sequence typing (MLST) of meningococcal strains to determine their sequence types (ST).

The aim of our work was to compare whole-genome sequencing data of 20 Neisseria meningitidis samples isolated from carriers and patients, to perform phylogenetic analysis, and to find a connection with antibiotic resistance, virulence and carriage.

Before analyzing the sequences, we wrote a script for accessing NCBI and downloading reference genomes. We analyzed the profiles of antigen-encoding, virulence- and carriage-associated, and antibiotic resistance genes. We also searched for amino acid changes leading to penicillin resistance. To estimate the relationships between samples, phylogenetic trees were constructed from the isolate assemblies using CSI Phylogeny and REALPHY. We also constructed phylogenetic trees for carriage-associated genes to find out whether the samples would cluster according to their origin of isolation.

Finding of cis-regulatory elements in promoters | University of La Verne

students: Daria Balashova, Elena Polyakova
scientific adviser: Tatiana Tatarinova

We consider a genome-wide statistical approach for the detection of specific DNA sequence motifs based on similarities between the promoters of similarly expressed genes. A comprehensive landscaping of major regulatory motifs can contribute to understanding molecular mechanisms of many complex diseases.

Assuming that the function of promoter motifs is position-specific, and given gene expression data that provide reasonable measurements of transcript abundance and reflect promoter activity, we develop the cisExpress software. It includes an algorithm for finding statistically significant associations between gene expression and words of a defined length at a given position relative to the transcription start site. Subsequent optimization includes combining motifs that have small differences and clustering basic words of fixed size into larger composite motifs. Time series analysis based on Hidden Markov Models allows us to observe the significance of the found motifs over time. The tool is complemented by interactive graphical representations.

Analysis of the Drosophila melanogaster full genome sequences | EPAM Systems, Life Sciences department

students: Anna Namyatova
scientific adviser: Gennady Zakharov

Drosophila melanogaster is a model organism for studying insect genomes, and the results can be used to make predictions about human diseases. We had Illumina whole-genome sequences for two wild-type lines and two mutant lines (ts3 and X1). In the ts3 line, the defects were artificially induced, and behaviour was restored to normal after thermal shock. The defects in the X1 line were spontaneous and permanent. The aim was to compare the genomes of the mutant lines with each other and with those of the wild-type lines, and to find the genes responsible for the abnormalities in the structure and function of the nervous system. We used the following tools for our analysis:

1. FastQC. Sequence quality check.
2. Trimmomatic. Trimming low-quality nucleotides.
3. BWA. Read mapping to the reference genome.
4. Samtools. Creating, sorting and indexing the .bam files.
5. Picard. Adding read groups to the .bam files.
6. GATK. Variant calling.
7. Vcf-merge. Merging the mutations of the wild-type lines.
8. Rtg vcfeval. Comparing the mutations of each mutant line with those in the wild type.

The genomes were mapped against the reference genome Drosophila_melanogaster.BDGP6.dna_sm.toplevel.fa. FastQC showed that there were around 30 million reads in each genome and that read length in the raw sequences ranged between 35 and 76 bp. The total number of mutations in the wild type line was 1209056. Each mutant line had around 800000 mutations. There were 195743 unique mutations in the ts3 line and 174653 unique mutations in the X1 line. In the future we are going to use the SnpEff and SnpSift tools to annotate the mutations, to assign biological meaning to them and to exclude the nonsense mutations.
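
The idea behind the "unique mutations" counts can be sketched as a set difference. This is only an illustration under simplifying assumptions: variants are matched exactly by chromosome, position, reference and alternative allele, which does not reproduce how vcf-merge or rtg vcfeval treat equivalent representations of the same variant. The function names and the example records are invented.

```python
def read_variants(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from the data lines of a VCF."""
    variants = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # split multi-allelic records
            variants.add((chrom, int(pos), ref, allele))
    return variants


def unique_to_mutant(mutant, wild_types):
    """Variants of the mutant line absent from every wild-type line."""
    background = set().union(*wild_types)
    return mutant - background


# Made-up records for one mutant line and two wild-type lines.
wild_1 = read_variants(["2L\t100\t.\tA\tT\t.\n", "2L\t200\t.\tG\tC\t.\n"])
wild_2 = read_variants(["2L\t100\t.\tA\tT\t.\n", "3R\t50\t.\tC\tA,G\t.\n"])
mutant = read_variants([
    "##fileformat=VCFv4.2\n",
    "2L\t100\t.\tA\tT\t.\n",
    "3R\t50\t.\tC\tG\t.\n",
    "X\t999\t.\tT\tTA\t.\n",
])
print(sorted(unique_to_mutant(mutant, [wild_1, wild_2])))
```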

Transcriptional response of pea roots to symbiosis markers | All-Russia Research Institute for Agricultural Microbiology

students: Varvara Tvorogova
scientific advisers: Polina Kozyulina, Elena Dolgikh

Root nodules in legumes are symbiotic organs hosting nitrogen-fixing bacteria. At the beginning of nodule formation, bacteria enter the intercellular space of the root; therefore, the host plant needs an accurate recognition system that lets symbiotic bacteria into its tissues while blocking parasitic organisms from doing the same. The main external signals that provide such recognition are chitooligosaccharides of different lengths. Thus, chitooligosaccharides consisting of five monomers (co5) are markers of symbiotic bacteria, while chitooligosaccharides consisting of eight monomers (co8) are markers of parasitic organisms (insects and fungi).

The purpose of this study was to analyze MACE-sequencing data of pea (Pisum sativum) RNA from roots pretreated with co5 or co8 chitooligosaccharides. Using the previously obtained pea nodule transcriptome (Zhukov et al., 2015) and the Dedupe software from the BBTools package, we removed ambiguous transcripts and obtained an optimal reference transcriptome for our data. Then, using the DESeq2 and GSEABase packages, we analyzed differential gene expression in our samples and performed gene enrichment analysis. According to the results, co5 treatment shows more prominent differential gene expression than co8, probably because the reference transcriptome is incomplete. However, both co5 and co8 chitooligosaccharide treatments activate gene sets responsible for parasite-host interaction, chitin binding and cleavage, as well as numerous signaling pathways that include different phytohormones, receptor kinases and transcription factors.

Modeling of mouse chromosome banding pattern | Bioinformatics Institute

students: Yury Lebeda
scientific adviser: Yury Barbitoff

Differential chromosome staining is a method of staining chromosomes with special dyes to reveal certain discs or regions of the chromosome (also called chromosome bands). The resulting banding pattern is an important marker of genome architecture; however, no specific molecular determinants of it are known to date. Previously, our group discovered a relationship between the pattern of differential chromosome staining and several genomic features (ChIP-Seq tracks of Smc1a/Smc3, CTCF, polyA and polyT repeats). However, when we attempted to validate this relationship using the genome of Mus musculus, we found that the distribution of genomic elements within M. musculus bands differs from that observed in humans. In this project, we attempted to develop a model that predicts the border regions between bands on the basis of human genome data, and to apply this model to predict the banding of M. musculus chromosomes.

We built a random forest model to predict band borders based on the number of genomic elements (i.e., ChIP-Seq peaks or k-mers) lying in intervals of a given width inside and outside the band borders. For prediction, the M. musculus genome was cut into intervals of the same width with a sliding window, and the resulting intervals were annotated with the same features that were used to train the model. Unfortunately, all the models constructed failed to provide reasonable predictions despite high cross-validation AUC scores: for both the mouse and human genomes, the predicted band boundaries differed from the existing annotation. This can be explained by the large number of false positives, which becomes significant at large numbers of trials even with a small false-positive rate. Hence, a new model has to be sought to explain the nature of chromosome bands.
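
The false-positive argument above can be made concrete with a short calculation. All numbers below are invented for illustration and are not results of the project: they only show how a classifier with a seemingly small false-positive rate still produces mostly wrong calls when true borders are rare among the scanned windows.

```python
def expected_precision(n_windows, n_true_borders, sensitivity, false_positive_rate):
    """Expected share of predicted borders that are real."""
    true_pos = sensitivity * n_true_borders
    false_pos = false_positive_rate * (n_windows - n_true_borders)
    return true_pos / (true_pos + false_pos)


# Hypothetical scan: one million windows, one thousand real band borders,
# 90% sensitivity and a 1% false-positive rate.
precision = expected_precision(
    n_windows=1_000_000,
    n_true_borders=1_000,
    sensitivity=0.9,
    false_positive_rate=0.01,
)
print(f"expected precision: {precision:.3f}")
```

Under these assumed numbers, fewer than one in ten predicted borders would be real, even though the per-window error rate looks small.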

De novo assembly and analysis of Platynereis dumerilii (Nereididae, Annelida) transcriptome at different stages of regeneration | Saint-Petersburg State University

students: Natalia Zenkova, Ruslan Abasov
scientific adviser: Maxim Nesterenko

Regeneration, the regrowth or repair of cells, tissues and organs, is widely but non-uniformly represented among animal phyla. However, its potency is highly variable even within a single group. The object of this study is the polychaete Platynereis dumerilii (Nereididae, Annelida), which is capable of regenerating only its tail. RNA-seq data from different time points after amputation (0, 4, 12 and 24 hours, 2 and 4 days) from the "head" and "tail" sites of regeneration were analyzed. Libraries of corrected read pairs (Karect, Trimmomatic, BBtools) were used for de novo assembly of the reference transcriptome (Trinity).

The resulting assembly was characterized by high quality (TransRate score = 0.2441) and completeness (BUSCO vs Metazoa-odb9 = 99.5%). The amino acid sequences predicted by TransDecoder (N = 160381) were compared to the Swiss-Prot database using Diamond (e-value = 1e-10). More than 61% of the sequences were successfully annotated, and we assume that the sequences without hits include species-specific proteins. Based on the analysis of normalized expression levels (Salmon), sets of "associated" sequences were identified for each sample. We suggest that the incomplete overlap of "associated" sets, both between time points and between sites, indicates complex dynamics of gene activity during post-amputation events. However, the expression patterns of conserved regeneration genes (for instance, Piwi-, Vasa-, Wnt- and Notch-like) vary only slightly between the "head" and "tail" sites. Based on these results, it can be assumed that cell proliferation is not complete 4 days after amputation and that recovery of the damaged structure will be observed at later stages of regeneration.

Analysis of nonsense alleles of Caenorhabditis elegans genes | Lomonosov Moscow State University

students: Daria Chaplygina
scientific adviser: Nadezda Potapova

A nonsense mutation is a mutation that results in a premature stop codon. Most genes with nonsense alleles are translated into nonfunctional proteins, which makes such genes pseudogenes. The purpose of this study was to analyze the distribution of nonsense mutations in Caenorhabditis elegans genes and to directly measure the strength of negative selection acting on nonsense alleles. For this measurement, we calculated the average ratio of the number of nonsynonymous mutations to the number of synonymous mutations for each gene (pN/pS ratio). The pN/pS ratio obtained was then compared with the pN/pS ratio in genes without nonsense mutations. Genome sequences were processed with SAMtools, VarScan and SnpEff.

According to the results, most synonymous mutations are located at the 3′-end of the gene, where they are less harmful. It was also shown that nonsense alleles shared by many members of the population are rare, which must be due to negative selection against them. The average pN/pS ratio is about 1 for genes without nonsense alleles and slightly more than 1 (1.2) for genes with nonsense alleles. Such results would mean that negative selection does not act on any gene, which cannot be true. The error could be explained by incorrect variant annotation.
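
The averaging step described above can be written down in a few lines. The sketch follows the wording of the abstract literally (a ratio of mutation counts per gene, averaged over genes); the function name, the handling of genes without synonymous mutations and the example counts are assumptions, not details taken from the project.

```python
def mean_pn_ps(counts_by_gene):
    """Average of per-gene nonsynonymous/synonymous count ratios.

    counts_by_gene maps a gene name to (nonsynonymous, synonymous) counts.
    Genes with no synonymous mutations are skipped (an assumption of this sketch).
    """
    ratios = [n / s for n, s in counts_by_gene.values() if s > 0]
    return sum(ratios) / len(ratios) if ratios else float("nan")


# Made-up counts for three genes.
example_counts = {
    "gene_a": (2, 4),
    "gene_b": (3, 3),
    "gene_c": (1, 0),  # skipped: no synonymous mutations
}
print(mean_pn_ps(example_counts))
```

Normalising the counts by the number of nonsynonymous and synonymous sites in each gene is a common refinement that is not shown here.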

Prioritization of genetic variants | Bioinformatics Institute

students: Vasiliy Isaev, Liubov Lonishin
scientific adviser: Yury Barbitoff

The identification of deleterious mutations within candidate genes is a crucial step in elucidating the genetic basis of human disease; consequently, there is a need to focus on classifying the relevant mutations. The goal of our programme, called MutationsPriorityPredictionTool (MPPT), is to pick out these genetic variants from thousands of others in order to help clinicians and geneticists.

To calculate the priority score, we developed a tool that takes as input a VCF file and a set of simple configurations containing rules on how to calculate the mutation priority score from the parameters given in the file. After the calculation, the tool prints the top mutations according to a parameter specified by the user: the top 10% of mutations, the top 100 mutations, or another option. We tested our programme on whole-exome sequencing data obtained from the resource centre. The test sample was selected in accordance with the ClinVar database, and the results were compared with those of Franklin, which is based on the recommendations of the American College of Medical Genetics and Genomics (ACMG). The percentage of correct calls by MPPT was calculated, and the sensitivity and specificity of the method were determined. The accuracy of our programme is 67.5%, the sensitivity is about 100% (95% CI: 79.4% to 100%) and the specificity is 60.6% (95% CI: 53.9% to 67.3%). Testing on the whole dataset, we obtained 114 mutations above the threshold out of more than 22 thousand in total.

MPPT focuses on pathogenic variants without losing them, but also keeps some benign variants, which should be manually checked by a specialist after the run. In the future, we will add this functionality to NGB (New Genome Browser).
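
A rule-based priority score with a "top N" or "top fraction" cut-off, as described above, might look like the following sketch. It is not MPPT's implementation: the rule format (a threshold and a weight per annotation field), the field names impact_score and rarity, and all values are hypothetical.

```python
import math


def priority_score(variant, rules):
    """Sum the weights of all rules whose threshold the variant reaches."""
    score = 0.0
    for field, (threshold, weight) in rules.items():
        value = variant.get(field)
        if value is not None and value >= threshold:
            score += weight
    return score


def top_variants(variants, rules, top_n=None, top_fraction=None):
    """Rank variants by score and keep the top N or the top fraction."""
    ranked = sorted(variants, key=lambda v: priority_score(v, rules), reverse=True)
    if top_fraction is not None:
        top_n = math.ceil(len(ranked) * top_fraction)
    return ranked[:top_n]


# Hypothetical configuration: field -> (threshold, weight).
rules = {"impact_score": (20.0, 2.0), "rarity": (0.99, 1.0)}
variants = [
    {"id": "v1", "impact_score": 25.0, "rarity": 0.999},
    {"id": "v2", "impact_score": 10.0, "rarity": 0.995},
    {"id": "v3", "impact_score": 30.0, "rarity": 0.5},
]
print([v["id"] for v in top_variants(variants, rules, top_n=2)])
```

Keeping the rules in a plain mapping mirrors the idea of a user-editable configuration: changing priorities means editing data, not code.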

Assembly of yeast genome with Oxford Nanopore data | Bioinformatics Institute

students: Andrew Matveenko
scientific advisers: Yury Barbitoff, Alexander Predeus

Baker's yeast Saccharomyces cerevisiae is a widely used model organism. The Peterhof genetic collection (PGC) is a large laboratory stock unrelated to the yeast reference strain. Previously, several PGC strains were sequenced using Ion Torrent technology. However, the resulting assemblies were incomplete and required substantial improvement. We attempted to obtain a reference quality assembly of one PGC strain, 1A-D1628, using Oxford Nanopore Technology (ONT) sequencing.

Raw data were obtained from one ONT MinION flowcell, which generated 10.15 Gbp of total sequence (836x coverage). To create a draft genome assembly we used three long-read assemblers: Canu, Flye and wtdbg2. Canu produced the best results, with 17 large (> 50 kbp) contigs that correspond to the 16 yeast chromosomes and mitochondrial DNA. Flye was slightly worse, with 18 large contigs, as it failed to assemble chromosome III as a single molecule. Wtdbg2 failed to produce any sensible sequence. Comparison of the Canu assembly with the reference showed that it contained 105 misassemblies and a large number of mismatches and short indels. We also analyzed structural variations in the strain using the NGMLR-Sniffles pipeline. The results of the analysis were concordant with variations described previously in 1A-D1628.
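
As a quick consistency check of the coverage figure quoted above: coverage is the total number of sequenced bases divided by the genome size, so the two numbers from the text imply a genome size. The variable names are ours; the inputs are the values given in the abstract.

```python
total_bases = 10.15e9   # 10.15 Gbp of sequence from the flowcell
coverage = 836          # reported fold coverage

# coverage = total_bases / genome_size, so:
implied_genome_size = total_bases / coverage
print(f"implied genome size: {implied_genome_size / 1e6:.2f} Mbp")
```

The implied size of roughly 12 Mbp agrees with the known size of the S. cerevisiae genome.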

In conclusion, the data obtained from Oxford Nanopore sequencing can be used to analyse structural variations in the 1A-D1628 strain. However, the de novo assembly requires additional correction and polishing to reach reference quality. Several strategies can be used to achieve this goal. First, an alternative basecaller can be used to improve the quality of reads. Second, specifically for ONT, the assembly can be improved by polishing with the MinION raw signal using the Nanopolish tool. Finally, polishing the assembly with the Illumina reads obtained should improve the accuracy of the sequence, producing a high-quality reference that can be used for comparative genomic studies.
The work is supported by the RSF grant 18-14-00050.

The study of processes of gene gain and loss within Lactobacillus species | Zoological Institute RAS

students: Anastasia Kosolapova
scientific adviser: Olga Bondareva

The genus Lactobacillus includes Gram-positive non-sporulating bacteria known for their ability to produce lactic acid as a result of carbohydrate fermentation. To date, more than 180 species have been assigned to the genus. A hallmark of the genus is a high level of intra-group diversity. Firstly, the diversity is evident in the ecology of the group: many Lactobacillus species are associated with body cavities of humans and animals, for example the gastrointestinal and urogenital tracts, while others can be found on plants and in dairy and fermented products. Secondly, the genome size of Lactobacillus bacteria can vary between 1.2 Mb and 5 Mb.

The aim of this work was to study the connection between ecological specificity and genome organization in various strains of Lactobacillus and to analyze the influence of ecological specificity on the processes of gene gain and loss. As data for the analysis we used protein and CDS sequences for 185 Lactobacillus species (1708 strains) from the RefSeq database. Protein and CDS sequences of Lactococcus lactis subsp. lactis Il1403 were used as an outgroup. We classified the species into 7 groups based on ecological niche and identified orthologous proteins across strains using the Proteinortho5/POFF software. Further research should involve reconstruction of a phylogenetic tree based on full orthologous gene groups, followed by gain-loss analysis performed with the GLOOME software.

Detection of interchromosomal rearrangements from Hi-C data | ITMO University

students: Elena Kartysheva, Dmitriy Orekhov
scientific adviser: Nikita Alexeev

Chromosomal rearrangements disturb the complex 3D structure of the eukaryotic genome and may lead to various diseases, including cancer, so detecting them may be useful in early diagnostics. Hi-C is a relatively recent sequencing method that estimates 3D proximity between different regions of a sequenced genome; this type of data allows the detection of different chromosomal abnormalities. We have developed an algorithm that scans a Hi-C map and reports the presence of interchromosomal rearrangements together with the coordinates of their breakpoints. The algorithm relies on 2D convolution and a Gaussian mixture model (GMM) for filtering the data and detecting interchromosomal interactions; a sliding-window approach is then used for breakpoint localization. The method was tested on Hi-C maps obtained from human (H. sapiens) glioblastoma cells, showing both high precision and high recall.

Plasmid host range prediction based on CRISPR arrays. Plasmids CRISPR Cas systems search. | CAB SPbU

students: Mikhail Kongoev, Iana Fedorova
scientific adviser: Mikhail Rayko

Horizontal gene transfer plays a highly important role in the evolution of bacteria. Presumably, gene exchange between bacteria occurs via mobile genetic elements such as plasmids and bacteriophages. However, there is currently no reliable way to check whether a given plasmid can "travel" between bacteria of different origin, or how wide the plasmid host range could be. It is also important to be able to predict the plasmid host in metagenomic data, where we usually have dozens of novel plasmids without any information about the host species.

To answer this question, we analyzed CRISPR cassettes in bacterial genomes: repetitive sequences in bacterial DNA interspaced with unique "spacer" sequences, which were extracted from mobile genetic elements that infected the bacterium or its ancestors. Spacers in a CRISPR cassette can be considered a link between the plasmid and its host.

We used the CRISPR Finder spacers database and the RefSeq database of all plasmids known to date (November 2018). BLAST searches of spacers against plasmid sequences allowed us to determine plasmid host ranges, i.e. the variety of bacterial organisms in which a plasmid can exist. By taxonomy analysis we found some plasmids that can live in different families of organisms; they can be useful in genetic engineering as natural shuttle vectors. Taxonomy analysis showed that a number of plasmids have additional hosts besides the host they were assigned to in the RefSeq database: 543 BLAST hits point to additional hosts of a different genus, 29 to a different family, 19 to a different order, 12 to a different class and even 2 to a different phylum. Thus, plasmids do "travel" between bacterial species and can be important players in the process of evolution.
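
The taxonomic breakdown of additional hosts described above can be sketched as a comparison of two lineages rank by rank. The data model (one lineage dictionary per host), the function names and the placeholder taxon labels are assumptions made for illustration; they are not taken from the project.

```python
from collections import Counter

# Ranks ordered from the broadest to the narrowest.
RANKS = ["phylum", "class", "order", "family", "genus"]


def highest_differing_rank(lineage_a, lineage_b):
    """Return the broadest rank at which two lineages differ, or None."""
    for rank in RANKS:
        if lineage_a[rank] != lineage_b[rank]:
            return rank
    return None


def count_additional_hosts(hits, annotated_hosts):
    """Tally spacer hits by how far the spacer's host is from the annotated host.

    hits: iterable of (plasmid_id, lineage of the genome carrying the spacer).
    annotated_hosts: plasmid_id -> lineage of the host given in the database.
    """
    counts = Counter()
    for plasmid, spacer_host in hits:
        rank = highest_differing_rank(annotated_hosts[plasmid], spacer_host)
        if rank is not None:
            counts[rank] += 1
    return counts


# Placeholder taxa (P1, C1, ...) stand in for real taxonomy.
base = {"phylum": "P1", "class": "C1", "order": "O1", "family": "F1", "genus": "G1"}
annotated = {"plasmid_1": base}
hits = [
    ("plasmid_1", dict(base, genus="G2")),
    ("plasmid_1", dict(base, family="F2", genus="G3")),
    ("plasmid_1", dict(base)),  # same host as annotated, not counted
]
host_counts = count_additional_hosts(hits, annotated)
print(dict(host_counts))
```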

We also found that many plasmids carry their own CRISPR defense systems (10% of RefSeq plasmids). Some of these systems (10%) seem to be active: there are Cas1 genes near the CRISPR cassettes. The role of these systems in plasmid propagation, host fitness and their evolutionary relationship with the known chromosomal CRISPR-Cas systems is the subject of future research.

Automated pathway annotation for single-cell RNA-seq | Washington University in Saint Louis / ITMO University

students: Maria Firuleva
scientific adviser: Konstantin Zaitsev

Single-cell RNA-seq expands the opportunities to study biological differences between cells of interest by analyzing the individual transcriptome of each cell simultaneously, and to study cellular processes more deeply. The increasing pace of development of RNA-seq methods calls for automated approaches to process the huge amount of resulting data. Different cellular processes are mediated by different sets of genes (signaling pathways), and the expression of the corresponding genes changes when these pathways are activated or deactivated. The main goal of this project is to develop a method for automated annotation of pathways that are significantly upregulated in a single-cell dataset.

We developed a three-step approach to identify differentially expressed pathways, which is applied after running the usual single-cell RNA-seq pipeline with the Seurat package. First, we calculate how strongly each pathway is expressed in every cell. Second, by randomly sampling gene sets, we identify candidate cells in which pathways are upregulated more than expected at random. Third, we identify clusters that contain more candidate cells than expected at random, using the hypergeometric distribution. As a result, our program returns a matrix with cell clusters as columns and pathways as rows, whose values are adjusted p-values. The developed approach, combined with cumulative statistic approaches, makes it possible to quickly find significantly upregulated pathways in a single-cell dataset for all clusters and large gene set databases.
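
The third step above, testing whether a cluster contains more candidate cells than expected at random, can be sketched with the hypergeometric distribution in plain Python. This is an illustration, not the project's code: the function name and the cell counts are assumptions, and the multiple-testing adjustment mentioned in the text is not shown.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement
    from `population` items, `successes` of which are of interest."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical dataset: 1000 cells, 100 of them candidate cells for a pathway.
# One cluster has 50 cells, 20 of which are candidates.
p_value = hypergeom_sf(k=20, population=1000, successes=100, draws=50)
print(p_value)
```

Repeating this test for every pathway and cluster yields the pathway-by-cluster matrix of p-values, which would then need to be adjusted for the number of tests.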

Automated marker descriptor for single-cell RNA-seq | Washington University in Saint Louis / ITMO University

students: Daria Gorbach
scientific adviser: Konstantin Zaitsev

Method of single-cell RNA-seq provides an opportunity to detect gene expression and specific cellular processes from lots of cells simultaneously, and each cell type has its own combination of expressed markers, which helps to discriminate one cell population from another. Increasing rate of RNA-seq technologies demands automated methods to obtain increasing amount of that data. Popular approach of cell types identification is based on "one versus all" method, which compares gene expression profiles for one cellular cluster with all the rest. It is a way to find statistically significant markers for each cluster, however, this method fails to identify "unique" cluster markers and quite often reports markers that are not unique for a certain cluster. Cell surface markers are of particular interest for that kind of research, as they are most frequently serve as markers of specific cell types.

Our approach was to build an automated descriptor based on pairwise comparison of expressed markers. We used the MGI database for mouse cell surface proteins and Seurat, an R toolkit for single-cell data analysis. We compared the expression level of each cell surface marker between different cell subtypes (Th lymphocytes, macrophages, etc.) and obtained one or several markers expressed uniquely in each subtype. This method allows us to describe each cluster with a set of unique surface marker genes that identify that cell type.
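The pairwise idea can be sketched as follows. This is only an illustration under stated assumptions: the function name, the fold-change and minimum-expression thresholds, and the toy expression values are hypothetical, and the project itself worked in R with Seurat rather than with this code.

```python
def unique_markers(mean_expr, min_fold=2.0, min_expr=1.0):
    """Return, per cluster, genes that beat every other cluster pairwise.

    mean_expr maps cluster name -> {gene: mean expression in that cluster}.
    A gene is reported for a cluster only if it is expressed there and its
    level is at least `min_fold` times the level in each other cluster.
    """
    result = {}
    for cluster, profile in mean_expr.items():
        others = [p for name, p in mean_expr.items() if name != cluster]
        result[cluster] = sorted(
            gene
            for gene, level in profile.items()
            if level >= min_expr
            and all(level >= min_fold * other.get(gene, 0.0) for other in others)
        )
    return result

# Hypothetical mean expression of three surface marker genes.
expr = {
    "T_helper": {"Cd4": 8.0, "Cd14": 0.5, "Ptprc": 6.0},
    "macrophage": {"Cd4": 1.0, "Cd14": 9.0, "Ptprc": 6.5},
    "B_cell": {"Cd4": 0.5, "Cd14": 0.2, "Ptprc": 7.0},
}
markers = unique_markers(expr)
```

Unlike a "one versus all" comparison, a gene such as Ptprc that is high in every cluster is never reported, because it fails at least one pairwise comparison.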

The effect of X chromosome inactivation on the expression of autosomal genes | Bioinformatics Institute

students: Daria Kilina
scientific advisers: Yury Barbitov, Rostislav Skitchenko

X chromosome inactivation (XCI) silences the transcription of genes located on one of the X chromosomes to balance expression dosage between XX females and XY males. According to recent work by Tukiainen et al., X chromosome inactivation in humans is incomplete: up to one-third of X-chromosomal genes are expressed from both the active and inactive X chromosomes in female cells. However, the effects of XCI on the expression profile of autosomal genes have not yet been assessed. In this study we compared the expression of autosomal genes in cells that differ in which X chromosome copy is active.

To answer this question, we analyzed public single-cell RNA sequencing data of pancreatic islets from one female individual. We aligned the reads from each cell to a reference genome assembly using bowtie2. We then performed variant calling with samtools/bcftools and grouped the cells by active X chromosome through visual inspection of the alignments and SNP calls in IGV. Specifically, we grouped the cells by alleles at variant sites in the XIST gene, which is expressed from only one X chromosome. We then quantified gene expression levels with RSEM and used the limma plugin in the Phantasus browser to compare gene expression between the two groups of cells.

We identified some candidate genes whose expression depends on the active X chromosome copy; however, the differences in expression between the groups were not significant after multiple testing correction (P-value < 0.05, adjusted P-value > 0.05). Furthermore, principal component analysis (PCA) showed no clear separation between the two groups of cells, which may indicate a confounding effect of cell type or other factors. Hence, the effect of differential gene expression from the X chromosomes on the expression profile of autosomal genes needs further investigation.

Increasing the length of introns due to transposable elements | Vavilov Institute of General Genetics, Russian Academy of Sciences

students: Anastasiia Murzina
scientific adviser: Irina Poverennaya

Unlike exons, the coding regions of a gene, intron sequences are known to mutate at a high rate and therefore vary greatly, so that even the length of an intron can differ considerably between related organisms. A significant increase in intron length may be due to the accumulation of a large number of transposable elements (TEs) and repeats in introns. In this project, our goal was to determine how intron length depends on the number of transposable elements.

The Dfam database is a collection of repetitive DNA element sequence alignments. We used this database with RepeatMasker, which relies on hidden Markov models, to search for TEs in the human genome. On average, TEs and repeats account for 43 percent of intron length. Intron length correlates with TE length for introns longer than 300 nucleotides; however, for very long introns (> 12000 nucleotides) the correlation is stronger than for introns of average length.
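A length-stratified correlation of this kind could be computed roughly as below. This is a sketch only: the choice of the Pearson coefficient, the function names and the toy (intron length, TE length) pairs are assumptions for illustration and do not reproduce the project's data or its exact statistic.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_above(introns, min_length):
    """Correlation of intron length and TE length for introns longer than min_length."""
    selected = [(length, te) for length, te in introns if length > min_length]
    lengths, te_lengths = zip(*selected)
    return pearson(lengths, te_lengths)

# Hypothetical (intron length, total TE length) pairs in nucleotides.
introns = [(100, 0), (400, 100), (1000, 400), (5000, 2500),
           (15000, 9000), (30000, 18000)]
r_all = correlation_above(introns, 300)
r_long = correlation_above(introns, 12000)
```

Comparing `r_all` with `r_long` mirrors the comparison between introns longer than 300 nucleotides and very long introns.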

Analysis of factors affecting the course of chronic myeloid leukemia | Pavlov First Saint Petersburg State Medical University

students: Nadezda Pogodina, Irina Babkina
scientific advisers: Eugene Bakin, Oksana Stanevich

Chronic myeloid leukemia (CML) is a myeloproliferative disorder characterized by unregulated granulocytic proliferation. Standard treatment includes tyrosine kinase inhibitors (TKI) and hematopoietic stem cell transplantation (HSCT). In this project, our goal was to identify the factors that affect survival after HSCT. The most common method of survival analysis is the Kaplan-Meier approach. This method handles censored data, where observation ended before the event of interest occurred. We plotted overall and event-free survival curves. The cumulative survival probability was 37.9%. Then we analyzed single variables (conditioning regimen, phase of CML, etc.) and compared survival curves with the log-rank test. We identified one statistically significant factor: cyclophosphamide therapy after HSCT (p-value = 0.03). CML therapy has improved over time, so we used multivariate analysis to assess the influence of new treatment methods on survival, starting with a correlation test. The correlation matrix showed only weak associations, so with a specialist's help we selected the following factors: conditioning regimen, phase of CML, graft compatibility, TKI therapy, and cyclophosphamide therapy after HSCT. Ordination methods (PCA and MDS) showed 3 clusters of therapy factors corresponding to 3 eras: 1995-2006, 2007-2012 and 2013-2018. The survival curves of the eras also differed significantly (p-value = 0.009).
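For readers unfamiliar with the Kaplan-Meier estimator, the product-limit calculation can be sketched in a few lines. The function name and the five toy observations are hypothetical; this is not the project's code and does not reproduce its data.

```python
def kaplan_meier(durations, events):
    """Product-limit survival estimate.

    durations: follow-up time of each patient.
    events: 1 if the event (e.g. death) was observed, 0 if censored.
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without changing survival.
        at_risk -= leaving
    return curve

# Toy data: events at times 1, 3 and 5; censoring at times 2 and 4.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 0, 1])
```

Censored patients contribute to the number at risk up to their last observation, which is why the second drop in the toy curve is larger than the first.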

Search for key genes that initiate a change of the expression profile in a cell | Pavlov First Saint Petersburg State Medical University

students: Daria Gorbach
scientific advisers: Eugene Bakin, Oksana Stanevich

An important area of bioinformatics is the analysis of changes in a cell's gene expression profile under external factors (for example, viral infection). A large number of genes may be affected, but as a rule this large-scale process is initiated by a handful of key genes targeted by the virus. Additional information on gene relationships, stored in databases such as KEGG, makes it possible to identify this set. The goal of the project is to create an algorithm that reconstructs, from differential expression data, the cause-and-effect relationships of the processes that take place in a cell exposed to a virus.

Analysis of yeast genomes from the Peterhof genetic collection | Bioinformatics Institute

students: Anton Matiiv
scientific advisers: Yury Barbitov, Alexander Predeus

This project continues the project on assembling a yeast genome from Oxford Nanopore data. The student will finish assembling the reference genome of strain 1А-Д1628 and compare the genomes of other strains (74-Д694 and others) with the resulting reference assembly. The student will also search for genetic variants associated with particular phenotypic traits of mutant derivatives of 1А-Д1628, based on data obtained with Illumina technology.

Building an environment for processing exome sequencing data | Bioinformatics Institute

students: Varvara Tvorogova, Ruslan Abasov
scientific advisers: Yury Barbitov, Anton Shikov

This project is devoted to the analysis of whole-exome sequencing data for clinical purposes. The student(s) will get to grips with the latest technologies for variant discovery in the human genome (GATK4, WDL/Cromwell, Hail). The final goal of the project is to build a data analysis pipeline based on GATK4, validate it on standard benchmark datasets (Genome In A Bottle), and write a set of utilities for integrating the analysis with web applications and conveniently storing all metadata in an SQL database.

Implied weighting as a measure of clade support: automation of the task and comparative assessment of results | Saint Petersburg State University

students: Ekaterina Sytnik
scientific advisers: Lavrentii Glebovich Danilov, Fedor Vladimirovich Konstantinov

IW (implied weighting) is a method of differential character weighting in parsimony-based phylogenetic analysis. It is used for both molecular and morphological data, but it has become truly widespread in the work of morphologists. It is well known that characters (groups of data) are not equally good at reflecting the relationships of individual taxa: they may carry a strong or a weak phylogenetic signal, or none at all. The main idea of any type of weighting is to give relatively more weight to characters that fit the phylogeny well and less weight to homoplasies. An analysis of articles published over the last two decades that reconstruct phylogeny from morphological data shows that a posteriori weighting is used frequently. Yet researchers very rarely compare the results of SAW and IW and use only one of them. The choice is determined primarily by which program was used for the parsimony analysis. The popularity of PAUP, a paid package that was not developed for many years, has fallen catastrophically over recent decades, while that of TNT has grown considerably. This has led to the almost exclusive use of implied weighting in recently published phylogenetic reconstructions based on morphological data. Whether a posteriori weighting of data in parsimony analysis is justified has been actively debated, from theoretical and philosophical points of view, ever since the corresponding methods appeared (e.g., Kluge 1997, 2005, Turner & Zandee 1995, Goloboff 1995, Goloboff et al. 2008). The long-standing discussion of the biological meaning of IW in general, and of different values of K in particular, remains relevant and has become considerably livelier over the last three years (e.g., Tschopp et al., 2015; O'Reilly et al., 2016; Congreve & Lamsdell, 2016; Goloboff, Torres & Arias, 2017, etc.).
Both supporters and opponents of the concept of a posteriori weighting agree that it badly needs empirical material. It might seem that by now we already have the results of many studies carried out with IW. The main practical problem, however, is that IW allows the constant K, and therefore the function that determines character weights, to be varied within very wide limits. In the overwhelming majority of cases, researchers choose two to four values of K, compare the results and pick one of the trees as the basis for further work, sometimes without explaining the choice. A single value, K=3, the default in TNT, is also often used, although the author of IW and TNT himself acknowledged that low values of K, in the range 1-4, are undesirable (Goloboff, 2008). A posteriori weighting has largely turned from a method for assessing and detecting phylogenetic signal in data into a tool for obtaining topologies that are better resolved and, for one reason or another, more attractive to the researcher. It should be noted, however, that the "range of variation" of topologies at different K is determined by the data and is small when the phylogenetic signal is strong. One possible option is to use IW as a stress test (Legg et al., 2013; Garwood & Dunlop, 2014; Smith & Ortega-Hernandez, 2014). In other words, the range of K values over which a given clade persists can serve as a quantitative estimate of how strongly the data support it. In practice, proponents of using IW as a measure of branch support run the analysis with different values of K and then manually combine the resulting trees by the majority rule (majority-rule consensus tree). Most often, consecutive integer values of K in a certain range are used, for example from 3 to 15.
A more practical solution would be a TNT script that automates this process and, in parallel, compares the results (as a table and as a majority-rule tree) with traditional support measures (Bootstrap, GC-supports) calculated independently in each case. It is also important to save the results in a form readable by other programs, which can likewise be done with TNT scripts. The values of K themselves should obviously be standardized, for example through evenly spaced values of F.
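The idea of using the range of K values as a support measure can be sketched in Python. Two assumptions are made here: the character fit is taken in the form commonly attributed to Goloboff, f = K / (K + h), where h is the number of extra (homoplastic) steps, and the trees for each K are assumed to be already available as sets of clades. The function names and the toy clades are hypothetical; a real workflow would obtain the trees from TNT.

```python
def implied_fit(extra_steps, k):
    """Fit of one character under implied weighting, f = K / (K + h)."""
    return k / (k + extra_steps)

def clade_support(clades_by_k, clade):
    """Fraction of tested K values whose tree contains the clade."""
    clade = frozenset(clade)
    hits = sum(1 for clades in clades_by_k.values() if clade in clades)
    return hits / len(clades_by_k)

# Hypothetical result of three analyses: for each K, the set of clades
# (each clade is a frozenset of taxon names) found in the resulting tree.
clades_by_k = {
    3: {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    9: {frozenset({"A", "B"}), frozenset({"C", "D"})},
    15: {frozenset({"A", "B"}), frozenset({"C", "D"})},
}
support_ab = clade_support(clades_by_k, {"A", "B"})
support_cd = clade_support(clades_by_k, {"C", "D"})
```

With this form of the fit, a character with no homoplasy always has fit 1, and lower K penalizes each extra step more strongly, which is why the choice of K can change the preferred topology.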

Prediction of antimicrobial peptides from transcriptome data of a non-model species | Institute of Biology, Irkutsk State University

students: Irina Babkina
scientific adviser: Polina Drozdova

Idea: Lake Baikal has a unique deep-water freshwater fauna. Deep-water crustaceans feed on the carcasses of other lake inhabitants, which are infected with a variety of bacteria. It is logical to assume that they have ways of coping with this. We propose to check whether they have known or novel antimicrobial peptides: take published RNA sequencing data for deep-water species (where we expect high diversity) and littoral species (where we expect lower diversity) and see what can be predicted from them.

Goal: to predict and characterize the spectrum of antimicrobial peptides (AMPs) in deep-water amphipods in comparison with littoral ones.

From a technical point of view, the project offers a chance to work with RNA sequencing data, assemble a transcriptome and experiment with predicting peptide properties.

Opsin diversity in the transcriptomes of Baikal endemic amphipods | Institute of Biology, Irkutsk State University

students: Yana Fedorova, Alena Kizenko
scientific adviser: Polina Drozdova

Vision matters, including under water. Lake Baikal is home to 350 species of amphipods; they are very brightly colored (with intraspecific variation as well) and live at all depths. It should be noted that light of different wavelengths penetrates to different depths with different efficiency, and for Baikal cottoid fishes (predators that feed on amphipods, among other things) the maximum sensitivity of opsins is known to shift toward the blue region as habitat depth increases.

Goal: to study the diversity of genes encoding proteins of the visual system in Baikal amphipods and to look for a relationship with their phylogeny, habitat depth, brightness of coloration or other features of the species.

From a technical point of view, the project involves working with assembled transcriptomes, gene prediction, various alignment methods and phylogenetic tree construction. And, of course, playing the game of "run someone else's code".

Whole-genome Drosophila sequence analysis | EPAM Systems, Life Sciences department

students: Yury Lebeda, Anastasia Kosolapova
scientific adviser: Gennadiy Zacharov

The course is built around tasks in managing computations and analyzing genome sequencing data. The input data are whole-genome Illumina sequences of five Drosophila lines: three wild-type lines and two previously unsequenced mutants.
In the autumn semester the team built a pipeline (a set of utilities working together) from open-source tools for automated data analysis. It was used to complete the basic task: cleaning the raw reads, aligning them to the reference genome and detecting variations (meaningful differences from the reference genome) in all 5 lines.
In the second semester the team is to continue this work: clean and annotate the variations obtained, compare the lines with each other and try to find the variations responsible for the mutant phenotype.

Search for latent viruses in human whole-genome sequencing data | Genotek

students: Nadezda Pogodina, Alisa Morshneva, Yury Orlov
scientific adviser: Valery Ilinsky

It is known that most people carry viruses that do not cause any acute conditions (HPV, HHV, EBV, TTV and others). The project is to analyze human whole-genome sequencing data in order to search for such viruses. The diversity of latent viruses may differ across regions and among people with different genetic backgrounds.

Search for pathogenic variants (SNPs, indels) in exomes of patients with various forms of idiopathic cardiomyopathy | Federal Almazov Medical Research Centre

students: Daria Kilina
scientific adviser: Artem Kiselev

The Laboratory of Molecular Biology and Genetics of the Almazov Centre regularly performs exome sequencing of patients with idiopathic cardiomyopathies. As a rule, before exome sequencing the hereditary nature of the disease is assessed, and such patients are sequenced on small panels with targeted enrichment of genes described in the literature as causative in various forms of cardiomyopathy. If the small panel gives no result, WES is performed. WES combined with analysis of rare variants using allele frequency databases unambiguously identifies pathogenic variants in ~70% of such patients. The remaining patients need extended analysis.

Search for CNVs in exomes of patients with various forms of idiopathic cardiomyopathy | Federal Almazov Medical Research Centre

students: Olga Romanova
scientific adviser: Artem Kiselev

The Laboratory of Molecular Biology and Genetics of the Almazov Centre regularly performs exome sequencing of patients with idiopathic cardiomyopathies. As a rule, before exome sequencing the hereditary nature of the disease is assessed, and such patients are sequenced on small panels with targeted enrichment of genes described in the literature as causative in various forms of cardiomyopathy. If the small panel gives no result, WES is performed. WES combined with analysis of rare variants using allele frequency databases unambiguously identifies pathogenic variants in ~70% of such patients. The remaining patients need extended analysis.

Support for third-party graph formats in SPAdes | Saint Petersburg State University, Center for Algorithmic Biotechnology

students: Natalia Zenkova
scientific advisers: Anton Korobeynikov, Andrey Prjibelski

Many modern assemblers (including SPAdes) output not only genomic sequences (contigs) but also an assembly graph. However, far from all programs can take such graphs as input. This project is to implement support for various graph formats (FASTG/GFA) as input to the SPAdes repeat resolution module. This functionality will considerably broaden the range of uses of SPAdes; in particular, it will enable scaffolding based on graphs built by other assemblers.

Bayesian optimization for demographic history inference | ITMO University, Chebyshev Laboratory, PDMI RAS

students: Ilya Sheshukov
scientific advisers: Ekaterina Noskova, Viacheslav Borovitskiy

The demographic history of populations is a sequence of events such as migration, population splits and mergers, and changes in population size. Modern methods make it possible to build plausible hypotheses about the demographic history of populations from a set of genomes taken from their present-day representatives. One of the key steps in automated methods for inferring demographic history is the optimization of a function that is expensive to compute and whose gradient is not available.

The recently released tool GADMA (https://github.com/ctlab/GADMA) solves this problem with genetic algorithms. The project is to add support for Bayesian optimization methods to GADMA and to compare the two approaches.

Note that Bayesian optimization is not just another optimization algorithm that can be learned from Wikipedia in five minutes. It is a broad and modern topic with many unexpected capabilities, such as optimizing a function using approximations of varying accuracy, parallelizing the optimization, and so on (instead of Wikipedia see, for example, https://arxiv.org/abs/1807.02811).

Adaptations of fish to life at great depth | Faculty of Bioengineering and Bioinformatics, Moscow State University

students: Anna Namyatova, Elena Polyakova
scientific adviser: Nadezhda Potapova

Fish live in a not particularly hospitable environment, sometimes under colossal pressure. It is interesting to learn the mechanisms that nevertheless allow them to thrive at depths of several kilometers. Only a few candidate genes that may enable life under such conditions are known, and the number of such studies, like the number of genes, is limited.
The goal of the project is to compile a list of the data available for deep-sea fish and to understand which genes are involved in adaptation to a deep-sea lifestyle.

Interactive networks of phenotypes / genes / metabolic pathways | Genotek

students: Lyubov Lonishin
scientific adviser: Alexander Rakitko

For consultations on genetic test results, it is convenient to have information about a phenotype (for example, a disease) and its relationships at hand in a structured form. The project is to implement an R Shiny application that displays the following as interactive networks:
- relationships between phenotypes (what is a risk factor for what);
- gene networks (with the patient's variants marked on them);
- metabolic pathways.

The effect of smoking on the epigenome of human leukocytes | Belozersky Institute of Physico-Chemical Biology, Moscow State University; Institute of Bioengineering, Research Center of Biotechnology RAS

students: Polina Pavlova, Maria Firulyova
scientific advisers: Oleg Sergeyev, Yulia Medvedeva

Environmental factors, including chemicals, can cause epigenetic changes that can be traced in subsequent generations. The best-studied epigenetic changes are DNA methylation, small non-coding RNAs (which include transfer RNA (tRNA), microRNA (miRNA) and piRNA) and histone modification. Smoking remains one of the most harmful voluntary health risks.

The subject of the proposed molecular epidemiological study is DNA methylation data (RRBS) and small RNA profile data from peripheral blood leukocytes of young men aged 18, obtained in the parent prospective cohort study "Russian Children's Health. Male Reproductive SubStudy".

What is known from previous studies?
Exposure to persistent toxicants such as dioxins during puberty (Pilsner et al., 2018) and smoking (Kornienko et al., 2018, a Bioinformatics Institute student project) affect sperm DNA methylation at age 18.

What is unknown?
How smoking affects DNA methylation and changes in the small RNA and miRNA profile of peripheral blood leukocytes at age 18, and how closely the epigenome changes in sperm and leukocytes are linked in the same study participants.

Aim of the study:
To study the effect of smoking on DNA methylation and on changes in the small RNA and miRNA profile of blood leukocytes in young men.

Determining correct MD simulation parameters for studying protein dynamics (in collaboration with Prof. David Case, creator of the Amber force field) | Saint Petersburg State University / Purdue University

students: Stanislav Legkovoy, Olga Lebedenko
scientific adviser: Nikolai Skrynnikov

Molecular dynamics (MD) simulation of biomolecules is one of the most important and promising tools of structural biology. As a rule, such studies use the so-called NPT ensemble. However, using NPT with the standard set of barostat/thermostat parameters substantially slows down the dynamics. We identified this effect by comparing experimental data obtained by NMR spectroscopy with predictions derived from MD trajectories. In particular, we showed that the standard simulation method substantially overestimates (by more than a factor of 2) the characteristic rotational time of a globular protein in solution. We are currently pursuing the following goals.
(1) To find out how strongly this effect shows up in the motions of protein side chains.
(2) To establish to what extent the effect influences the dynamics of intrinsically disordered proteins.
(3) To examine its influence on the translational motion of proteins, especially disordered ones.
(4) To propose alternative simulation methods that reproduce the time scale of various forms of protein dynamics with high accuracy (for example, the NPT ensemble with a reduced friction coefficient, or the NVE ensemble).
To record MD trajectories, our laboratory is equipped with computers with latest-generation graphics processors. The project is carried out in collaboration with Prof. David Case (Rutgers University), creator and lead developer of Amber, one of the two most advanced force fields.

Inferring the most probable genotype and ethnicity of an individual from the genotype of their offspring | University of La Verne

students: Vasily Isaev
scientific adviser: Tatiana Tatarinova

Forensic practice faces the following problem: there is a rape victim and a child born as a result of the crime. The perpetrator (the father) is unknown. How can his description be obtained as accurately as possible from the genotypes of the mother and the child? Which methods (whole-genome sequencing or array-based genotyping) would be most effective? Can the father's ethnicity be determined reliably? We will look for answers to these questions this semester.

Local sequence alignment using intra-processor parallelism | University of Warwick

students: Dmitry Orekhov, Anastasiia Murzina
scientific adviser: Alexander Tiskin

Local alignment of character sequences is one of the fundamental problems of bioinformatics. Both fast heuristic methods (such as BLAST) and exact but more computationally expensive local alignment methods are widely used. The adviser of the proposed project developed a relatively simple and efficient method of exact local alignment based on the sliding window principle; its implementation produced biologically significant results published in 2010-2012. The efficiency of the implementation was achieved, among other things, by using Intel MMX instructions with low-level intra-processor SIMD (single instruction, multiple data) parallelism. In recent years this type of intra-processor parallelism has developed rapidly. Intel has successively introduced several new instruction sets for parallel data manipulation: SSE (several versions up to and including SSE4), AVX, AVX-512. Similar microprocessor architecture extensions have been implemented by other manufacturers.
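As a point of reference for what exact local alignment computes, here is a minimal scalar sketch of the classic Smith-Waterman dynamic programming recurrence with a linear gap penalty. It is not the adviser's sliding-window method and contains no SIMD code; the function name and the scoring parameters are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

score_exact = smith_waterman("GATTACA", "TTAC")
score_none = smith_waterman("AAAA", "TTTT")
score_gapped = smith_waterman("ACACACTA", "AGCACACA")
```

The inner loop performs the same few integer operations for every cell, which is what makes this kind of computation a natural target for SIMD instruction sets such as SSE or AVX.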

Association Rule Mining on genome regions using fishbone diagrams | JetBrains Research

students: Nina Lukashina, Daria Likholetova
scientific adviser: Petr Tsurinov

In biology, the search for associations is an important part of building hypotheses. Current approaches include searching for associations over gene sets (GREAT) and over genomic positions (ChIP-Atlas). We would like to implement one more approach: derive dependencies using Association Rule Mining and then visualize the results with Ishikawa diagrams. The new service can be tested on known papers about the relationship between various modifications (histone changes, methylation) and transcription factor binding sites, and then used to look for new patterns.
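The standard quality measures of an association rule (support, confidence and lift) can be sketched as follows, treating each genomic region as a set of features that overlap it. The function name and the toy regions are hypothetical and only illustrate the measures; they are not the project's service or data.

```python
def rule_metrics(regions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    regions: list of sets, each holding the features found in one region.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(regions)
    n_a = sum(1 for r in regions if antecedent <= r)
    n_c = sum(1 for r in regions if consequent <= r)
    n_both = sum(1 for r in regions if (antecedent | consequent) <= r)
    support = n_both / n
    confidence = n_both / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

# Hypothetical regions annotated with histone marks and a transcription factor.
regions = [
    {"H3K4me3", "CTCF"},
    {"H3K4me3", "CTCF"},
    {"H3K4me3"},
    {"H3K27me3"},
]
support, confidence, lift = rule_metrics(regions, {"H3K4me3"}, {"CTCF"})
```

A lift above 1 indicates that the consequent occurs with the antecedent more often than expected if the two were independent.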

Noisy peak calling | JetBrains Research

students: Daria Chaplygina
scientific adviser: Oleg Shpynov

Perform a comparative analysis of ChIP-Seq peak calling tools with respect to their robustness to the signal-to-noise ratio in experimental data.

Rescue failures | JetBrains Research

students: Daria Balashova
scientific adviser: Oleg Shpynov

The task is to improve the quality of an Ultra Low Input ChIP-Seq experiment (a new protocol that requires 10 thousand cells instead of 2-5 million) in the case of several donors, using a neural network.

Improving the SPAN peak calling model with a linear regression model | JetBrains Research

students: Elena Kartysheva
scientific adviser: Alexey Dievsky

Project task: improve SPAN, namely add the ability to further train the model using not only tracks but also additional information about the data, for example:
• GC-content
• Mappability
• Local BG Estimate

Towards detection of differential RNA editing events in transcriptomics datasets. | St.Petersburg State University

students: Andrey Matveenko
scientific advisers: Anastasia Samsonova, Alexander Kanapin

Throughout their lifetimes, RNA molecules undergo a variety of alterations that can result in beneficial or deleterious consequences for the organism. Modification of RNA nucleosides, known as RNA editing (RNAe), is a powerful instrument for the diversification of regulatory landscapes in eukaryotic genomes, promoting variability in the protein repertoire and silencing patterns, and fighting the genome instability mediated by mobile genetic elements. Abnormalities in the RNA editing process can be devastating for the whole organism, as they can provoke a wide spectrum of genetic disorders and cancers. Despite the relevance of RNA editing to human health, understanding of the process and its regulation is impeded by a lack of robust, straightforward experimental assays and computational methods for in-depth studies of the mechanism.

Modern bioinformatics algorithms for RNAe analysis focus mostly on the discovery of editing sites from transcriptomics data. However, these tools are not designed for the discovery of differential editing between experimental conditions and tissues, i.e. for estimating changes in the editing landscape or editing efficiency either for single nucleotides or for genes/transcripts.
In this project we aim to evaluate the applicability of existing tools and statistical approaches developed for similar data modalities, such as bisulfite sequencing data, to the discovery of differential RNAe events. The goal is to propose a prototype of a statistical framework and a blueprint of an algorithm for estimating statistically significant changes in editing efficiency at the single-nucleotide and/or gene level.

This project is a continuation of a successful project undertaken by Irina Shchukina (Bioinformatics Inst., class of 2017) who developed a framework for discovery of RNA editing events in RNASeq data.

Research projects 2017/2018
Autumn 2017
Spring 2018
Phylogenetic networks comparison | ITMO University

students: Anton Eliseev, Natalia Klimenko, Elena Pazhenkova
scientific adviser: Nikita Alekseev

Phylogenetic networks are used to visualize evolutionary relationships that include reticulations (such as hybridization). The number of reticulation edges is a widely used characteristic of networks; however, this measure is often identical for different topologies. We propose to use the number of possible convex colorings as a metric to distinguish networks with an equal number of hybridizations. The number of convex colorings shows how many homoplasy-free characters are possible within a given phylogeny.

Six species of Heliconius butterflies were chosen as a model group to test our algorithm. A peculiar trait of the genus Heliconius is the prevalence of interspecific hybridization, which appears in phylogenetic networks as reticulation events. As suggested earlier, H. heurippa and H. elevatus resulted from hybrid speciation [1, 2]. We analyzed 20 nuclear genes, obtained NJ trees for each gene, compared these trees using pairwise Branch Score Distances, concatenated the genes providing the most similar trees (distances up to 0.015), calculated hybridization networks and estimated the number of convex colorings for each network. The network with the largest number of convex colorings is congruent with the hypothesis of a hybrid origin of H. heurippa and H. elevatus.

Another part of the study concerned the phylogeny of potatoes. 420 potato plants previously classified into 29 species (7 cultivated and 22 wild) were analyzed using 15 plastid SSR markers. Because plastid genomes were analyzed, no hybridization was observed. We concentrated on building the most accurate phylogenetic tree for these data. Dendrograms were based on the Manhattan distance matrix. Cultivated and wild species of potato are clearly distinguishable. The division of Solanum tuberosum into the Andigenum and Chilotanum groups (according to [3]) is supported. The results of molecular analysis do not correspond to the classification based on morphological features.

1. Kronforst M.R., Papa R. The Functional Basis of Wing Patterning in Heliconius Butterflies: The Molecules Behind Mimicry. Genetics. 2015. 200(1): 1-19.
2. Salazar C., Baxter S.W., Pardo-Diaz C., Wu G., Surridge A., Linares M., Bermingham E., Jiggins C.D. Genetic Evidence for Hybrid Trait Speciation in Heliconius Butterflies. PLoS Genet. 2010. 6(4): e1000930.
3. Spooner D.M., Ghislain M., Simon R., Jansky S.H., Gavrilenko T. Systematics, Diversity, Genetics, and Evolution of Wild and Cultivated Potatoes. Bot. Rev. 2014. 80: 283–383.

Optimization of spectral network parameters | EPAM Lifescience

students: Evgenia Fedotova, Rostislav Skitchenko, Ksenia Cherenkova
scientific adviser: Gennadiy Zacharov
A pipeline for exome and targeted sequencing analysis was developed. Its results could be used by physicians to refine diagnoses. Problems such as pipeline deployment and control of utility versions and dependencies were solved using Docker. Pipeline quality was assessed on the NA12848 GIAB sample: precision 0.95 and sensitivity 0.78.

We analyzed sequencing results of a cardio panel for families whose members had been diagnosed with cardiomyopathy. Associations between variants and the clinical diagnosis of cardiomyopathy were found.

Construction of RNA fragment database | University of North Carolina at Chapel Hill

student: Alexandr Ilin
scientific advisers: J. Wang, N. Dokholyan
RNA plays a significant role in the regulation of gene expression at the transcriptional and translational levels. This is achieved through the appropriate spatial structure of the RNA molecule (i.e. its motifs), which is obtained after folding. The ability to predict the three-dimensional structure of an RNA oligonucleotide from its sequence is very important because this information can be used to construct molecules with a predefined structure, and thus with known properties and interaction targets. It therefore supports the design of new RNAs that can be used as medications against a wide spectrum of diseases caused by abnormal abundance of gene products.

In this work we developed an RNA secondary structure decomposition algorithm that decomposes a whole RNA into many motifs. Using this algorithm, we built a database of RNA 3D motifs by decomposing all the RNA 3D structures downloaded from the PDB. We devised an algorithm to analyze and compare the base interaction networks of different RNA 3D motifs, and used it to classify and cluster all the RNA 3D motifs in the database. We then used supervised machine learning to learn the relationship between the sequences and the base interaction networks of the clustered RNA 3D motifs.

Assembly of mammalian genomes using GemCode data | Center for Algorithmic Biotechnology, St. Petersburg State University

student: Angira Kekteeva
scientific advisers: Ivan Tolstoganov, Anton Bankevich
GemCode technology, developed by 10X Genomics, is actively used for the assembly of mammalian genomes. CloudSPAdes is a genome assembly algorithm that was designed for metagenome assembly. However, the algorithms in this tool that were developed for resolving repeats in the assembly graph can also be used successfully for the assembly of mammalian genomes.

In this work we examined existing metagenome assembly algorithms and analysed the disadvantages of using them for large genomes. Our analysis showed that the average number of close edges in a human genome assembly graph is higher than in a metagenome assembly graph, so additional methods are required for ordering long edges in mammalian genomes.

Searching for molecular markers of chromosome bands | Bioinformatics Institute

student: Alexandra Klimina
scientific adviser: Yury Barbitoff

Giemsa staining produces specific bands of different staining intensity on metaphase chromosomes (G-bands). There are known correlations between staining intensity and the degree of chromatin condensation, GC content, and replication timing. However, little is known about the molecular markers of this banding pattern.

The main purpose of this project was to develop a tool for analyzing genome-wide correlations between different genomic features. We implemented the tool in Java, with the option of running on a Spark cluster for distributed computation.

We chose the previously described projection test and Jaccard test to analyze the dependence between the reference (e.g., chromosome bands) and the query feature of interest. We estimated the significance of the correlation by sampling 1000 sets of randomly distributed intervals of the same length as the query feature, followed by a Kolmogorov-Smirnov normality test and a one-sample t-test to obtain the p-value of association.
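The Jaccard statistic and the random sampling step can be illustrated with a small sketch. This is not the project's Java/Spark implementation: intervals are assumed to be half-open `(start, end)` pairs on a single chromosome, all function names are hypothetical, and an empirical p-value is used here in place of the Kolmogorov-Smirnov and t-test steps described above.

```python
import random


def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(intervals):
    """Number of bases covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def jaccard(reference, query):
    """Covered bases in the intersection divided by covered bases in the union."""
    union = total_length(list(reference) + list(query))
    intersection = total_length(reference) + total_length(query) - union
    return intersection / union if union else 0.0


def shuffled(query, chrom_length, rng):
    """Place every query interval at a random position, keeping its length."""
    result = []
    for start, end in query:
        length = end - start
        new_start = rng.randrange(0, chrom_length - length + 1)
        result.append((new_start, new_start + length))
    return result


def null_distribution(reference, query, chrom_length, n=1000, seed=0):
    """Jaccard values for n randomly placed copies of the query."""
    rng = random.Random(seed)
    return [jaccard(reference, shuffled(query, chrom_length, rng)) for _ in range(n)]


def empirical_p_value(observed, null_values):
    """Share of random placements that score at least as high as the observed value."""
    hits = sum(1 for value in null_values if value >= observed)
    return (hits + 1) / (len(null_values) + 1)


reference = [(0, 100), (200, 300)]
query = [(50, 150), (250, 260)]
observed = jaccard(reference, query)
null_values = null_distribution(reference, query, chrom_length=1000)
p_value = empirical_p_value(observed, null_values)
```

In this toy example the two sets share 60 covered bases out of 250 in their union, so `observed` is 0.24.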

We tested our tool by analyzing the correlation between the chromosome banding pattern and features such as CpG islands, microsatellite repeats and DNase hypersensitivity regions. As expected, G-positive bands are positively correlated with microsatellite repeats and negatively correlated with CpG islands and with open chromatin (DNase hypersensitive regions). Thus, our tool can be used to further analyze genome-wide correlations between the banding pattern and diverse molecular features.

Association of methylation level CpG-islands and IQ-level | University of Houston

student: Daria Krytskaya
scientific adviser: Olga Naumova

Methylation is an epigenetic modification of DNA. It changes the activity of a DNA segment without changing its sequence. The most methylated regions are those rich in guanine and cytosine, called CpG islands.

In this work, we evaluated the methylation level of each of the 26640 known CpG islands as the average value over the region. Then we performed a correlation test, using the Benjamini-Hochberg procedure to control the false discovery rate.
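For readers unfamiliar with the Benjamini-Hochberg procedure, the following is a minimal sketch of how adjusted p-values can be computed. The function name and the example p-values are illustrative and are not taken from the project.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values, e.g. one per CpG island.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
q = benjamini_hochberg(p)
significant = [i for i, value in enumerate(q) if value <= 0.05]
```

With these example values only the first two tests remain significant at a false discovery rate of 0.05.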

The biological role of the significant regions was analyzed with the UCSC Genome Browser on the Human Feb. 2009 (GRCh37/hg19) assembly and with GeneCards.
Our results include predicted transcribed and translated regions and known proteins. For example, on chromosome 1 the most significant region is (38059428, 38063740), which contains the predicted transcripts ENST00000373062, ENST00000463351 and ENST00000488496 and the gene of the known protein GNL2 (Homo sapiens guanine nucleotide binding protein-like 2 (nucleolar)). For the region (54951893, 54957287) on chromosome 1 there are no predictions. On chromosome 2 the most significant region is (65213598, 65219212), which contains the gene of the known protein SLC1A4 (Solute Carrier Family 1 Member 4), a transporter of alanine, serine, cysteine and threonine. Transport of glucose by this protein is also predicted. Disorders associated with this protein include spastic tetraplegia, thin corpus callosum, progressive microcephaly and microcephaly.

Genome structure of Mycobacterium tuberculosis strains in different world regions |
Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University

students: Vladimir Klimov, Vladimir Molchanov
scientific adviser: Ekaterina Chernyaeva

Due to the high epidemiological significance of Mycobacterium tuberculosis and the constant growth of its genomic data, the analysis and systematization of genome data has become extremely important. For this reason our project was devoted to extending the Genome-based Mycobacterium tuberculosis Variation (GMTV) Database, which was developed by researchers of the Theodosius Dobzhansky Center for Genome Bioinformatics (Chernyaeva et al., 2014).

In this study we analyzed 999 M. tuberculosis strains isolated from patients in the Republic of Malawi. To this end we designed a pipeline for identifying single nucleotide polymorphisms (SNPs) and insertions/deletions (InDels) in M. tuberculosis whole-genome sequencing data.

The pipeline is based on BWA-MEM and GATK, which are widely used in this kind of study. Given the large amount of NGS data, we proposed a simple and rapid method to visualize and assess the quality control results produced by FastQC, using Python 3 regular expressions and plotting in R. All variant calls (.vcf files) were uploaded to the database; in the future these data could be used for clade-specific annotation, which makes it possible to identify strains without NGS methods.

Comparative analysis of natural selection effects across human populations |
Bioinformatics Institute

student: Julia Kornienko
scientific adviser: Yury Barbitoff

In this project we aimed to estimate natural selection effects across human populations based on the Genome Aggregation Database (gnomAD) dataset, which contains information about sequence variants in 123,136 human exomes and 15,496 genomes. To this end we calculated the number of protein-truncating variants (PTVs) both (i) per individual gene (based on GENCODE v.19) and (ii) per gene set (hallmarks and canonical pathways obtained from the MSigDB Collections) for six different populations (European, South Asian, Latino American, East Asian, African and Finnish).

We estimated the selective coefficients of heterozygous PTVs for different human populations from the constructed dataset in the same way as Cassa et al. (2017) and found that the distribution of selective coefficients, both per individual gene and per gene set, depends on the population size. Taking this into account, we evaluated the difference in the distribution of PTV allele counts among the populations and found that for 2040 of 12367 analyzed genes and for 746 of 1379 analyzed gene sets the selective effects were significantly population-dependent. Thus it is possible to conclude that selective effects for some genes do vary across populations.

Interestingly, we discovered significant enrichment of PTV alleles in the immune system-related pathways (IL-10, IL-13 and IFNG signaling) in the individuals of South Asian ancestry (SAS), with more than half of all PTVs discovered in the corresponding genes belonging to the SAS population. These results are concordant with some previous findings and emphasize the natural heterogeneity of selective effects.

De novo CDR3 annotation in VDJ RNA sequences | Center for Algorithmic Biotechnology, St. Petersburg State University

student: Kristina Krivonosova
scientific advisers: Andrey Slabodkin, Maria Chernigovskaya
The project relates to the construction and analysis of antibody repertoires. To build a repertoire, we look for mutations in antibodies by aligning the variable part of the immunoglobulin gene against a dedicated database of V, D and J genes (the germline). With this alignment, we can annotate the sequence: we mark the boundaries of the V gene and the J gene, as well as the boundaries of the three regions that determine the specificity of the antibody to the antigen (CDRs). In practice the third region (CDR3) is the most variable part of the immunoglobulin gene, so its boundaries are of the greatest interest. Unfortunately, for some data (for instance, in lymphoma) the level of mutation is so high that we cannot build an alignment against the germline.

In this work we developed a heuristic for VDJ sequence annotation that does not use alignments. The heuristic is based on searching for conserved regions in the source sequence to identify CDR3 regions. In practice this approach produces satisfactory results, with an accuracy of 95% on verified data sets.

Forming a panel of markers for the molecular-genetic diagnosis of congenital metabolic disorders | Parseq Lab

student: Ekaterina Nebozhatko
scientific advisers: Tamara Simakova, Anton Bragin

Predicting the deleterious effects of mutations on protein function is one of the main tasks of genetics. Researchers often use predictive tools for this. The main problem with predictive tools is the insufficient sensitivity and specificity of the classifiers. On average, the sensitivity is 80%, which means that 20% of the possibly pathogenic mutations may go unnoticed. This can adversely affect the success of treatment. Another approach is to use open databases in which information on pathogenic mutations in certain genes is collected and verified.

Parseq Lab is working on a large project to create a panel of markers. It includes 37 genes associated with 35 different diseases and 36 external sources. This work presents the part of the project covering a single database, PNDdb, and three genes: GCH1, QDPR and PTS. In the course of the work, a tool was implemented that exports the necessary data from the site and presents them in VCF format. The sensitivity of the predictive tools SIFT and PolyPhen was also assessed.

Khazars heritage in the world genomes | University of La Verne

student: Yury Orlov
scientific adviser: Tatiana Tatarinova

The Khazars were a semi-nomadic ethnic group that lived in the second half of the 1st millennium, occupying a large area north of the Caucasus between the Black and Caspian seas. At the end of the 10th century their state was destroyed and the Khazar Khanate disappeared as suddenly as it rose, leaving no legacy except its funerary mounds. There have always been many theories and guesses about the origin of the Khazars and their descendants, but it was impossible to draw solid conclusions about them. The aim of this project is to answer the questions of Khazar origin and genetic legacy by analysing ancient DNA (aDNA) extracted from the remains of three Khazar individuals.

In the course of the project, aDNA sequencing data were processed taking into account the specific library preparation and the degraded nature of aDNA. Reads were mapped to the HG38 reference genome. Bacterial contamination was found to exceed 75% (typical of aDNA), and the significant number of C-T transitions detected at the read ends showed that the studied DNA is indeed of ancient origin. Using the GATK package we performed SNP calling on the obtained data, and with the Admixture tool we found that the Khazars are a mix of North East Asian, Northern European, Mediterranean and South West Asian populations, as is expected of a well-mixed group of semi-nomadic people.

As a continuation of this work we are collecting various ancient and modern genomes to compare with the obtained data and finally draw a conclusion about the origin of the Khazars and their descendants in the modern world.

Russian Exomes. Part 1. | Bioinformatics Institute

students: Olga Poleshchuk, Ekaterina Izmailova
scientific adviser: Yury Barbitoff

Mutations in the protein-coding part of the genome cause numerous pathologies. Thus, whole-exome sequencing (WES) is a commonly used alternative to whole-genome sequencing in medical genetics and health-related studies. Some environmental adaptations also likely arose from changes in protein-coding regions, making exome sequencing a valuable tool for population genetics studies. Several large sequencing consortia (e.g., the Exome Aggregation Consortium (ExAC)) have collected data from hundreds of thousands of samples from Western populations, and a lot of research has been done using these data. The goal of our project was to develop a pipeline for variant analysis in the Russian population and apply it to ~570 WES samples.

First, we created a pipeline using Snakemake, one of the available tools for building workflows. The pipeline takes raw FASTQ files as input and outputs a combined annotated VCF for all samples in a batch. We successfully tested the pipeline and performed a preliminary analysis of variants in a dataset of 570 samples of Russian and CIS ancestry. We observed many novel variants common to the samples included in the study, most of them classified as missense mutations, intronic variants and synonymous substitutions. Thus, we took the first, preliminary steps in assessing the exome-wide genetic structure of the Russian population. Further data aggregation and analysis will help to fill the biggest gap on the genetic map of the world.

Speed-efficient data structures for cloudSPAdes | Center for Algorithmic Biotechnology, St. Petersburg State University

student: Evgen Polevikov
scientific advisers: Anton Bankevich, Ivan Tolstoganov

GemCode technology, recently introduced by 10X Genomics, is rapidly becoming essential for variant calling, diploid genome assembly and read alignment. The cloudSPAdes algorithm was recently developed at the Center for Algorithmic Biotechnology. The algorithm uses GemCode data to improve metagenome assembly quality. Currently, cloudSPAdes consumes a large amount of computational resources when assembling complex metagenomes. The cloudSPAdes assembly procedure consists of several stages. In the first stage it constructs the assembly graph using procedures implemented in the existing metaSPAdes pipeline. Then barcoded reads are aligned to the edges of the graph, so that every edge gets a particular set of barcodes. Every set is represented as a sorted array. Intersections of these sets are computed in order to estimate the genomic distance between long edges and determine their true ordering.

In this work we adapted a probabilistic data structure called containment min hash, which improves the current procedure for computing edge intersections. To estimate the intersection of two sets of barcodes A and B (assuming that A is smaller than B), we first create a Bloom filter from the larger set B. Then we take a random sample S from the smaller set A and test every element of S for membership in B using the Bloom filter. In this way we estimate the intersection of A and B.
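The estimation procedure described above can be sketched in a few lines. This is an illustration only, not the cloudSPAdes code: the `BloomFilter` class, the choice of 20 bits per element and 5 hash functions, the SHA-256 based hashing and the barcode names are all assumptions made for the example.

```python
import hashlib
import random


class BloomFilter:
    """A small Bloom filter backed by a bytearray."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def estimate_intersection(smaller, larger, sample_size, rng):
    """Estimate |smaller & larger| from a random sample of the smaller set."""
    if not smaller or not larger:
        return 0.0
    # Build the filter from the larger set (assumed: 20 bits per element).
    bloom = BloomFilter(num_bits=20 * len(larger), num_hashes=5)
    for barcode in larger:
        bloom.add(barcode)
    # Sample the smaller set and count how many sampled barcodes pass the filter.
    sample = rng.sample(sorted(smaller), min(sample_size, len(smaller)))
    hits = sum(1 for barcode in sample if barcode in bloom)
    # Scale the observed fraction up to the size of the smaller set.
    return hits / len(sample) * len(smaller)


# Toy barcode sets with a true intersection of 1000 elements.
set_a = {f"bc{i}" for i in range(0, 2000)}
set_b = {f"bc{i}" for i in range(1000, 6000)}
estimate = estimate_intersection(set_a, set_b, sample_size=500, rng=random.Random(0))
```

A Bloom filter never reports a stored element as absent, so the only sources of error in the estimate are sampling noise and the filter's small false positive rate.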

To benchmark containment min hash against the original sorted-array data structure, we constructed an assembly graph from a GemCode library sequenced from a mixture of 5 known bacterial species. We selected a set of 1492 edges longer than 5,000 bp from the assembly graph, computed the intersection for every ordered pair of these edges using containment min hash and compared it with the initial edge intersection procedure. Our analysis showed that the new algorithm works approximately 6 times faster. We also managed to decrease memory consumption: it is now enough to store about 60% of the data that the initial procedure uses.

Protein digestion patterns | University of North Carolina at Chapel Hill

student: Natalia Rodina
scientific advisers: Popov Konstantin, Dokholyan Nikolay

Digestion of proteins by proteasomes and proteases in cells produces a specific repertoire of peptides that can potentially bind to the MHC class I complex and be used to trigger an immune response against specific cancer cells. Thus, the aim of the present project was to create a tool for predicting the "peptide profiles" produced by protease cleavage in different tissue types in normal cells and primary tumors.
In the first step, microarray expression data for 19 tissue types in normal cells (726 arrays) and primary tumors (1,460 arrays) were collected from the MERAV database. PCA of the preprocessed expression data showed differences between normal tissue and primary tumor for every tissue type. For every tissue type, expressed genes were selected (values higher than the median of the quantile-normalized data), and only genes expressed in the primary tumor and not in the normal tissue were taken into account. For the selected genes, the amino acid sequences were retrieved from the NCBI protein database.
Information about cleavage sites of human cell proteases was downloaded from the Merops database and for every type of tissue only expressed proteases were selected.
In the next step, a tool for predicting peptide profiles was created. As input, the tool takes a selected tissue type. Then, from the built-in database, it takes the amino acid sequences of all genes and the cleavage site information for all proteases expressed in the selected tissue. The tool finds all possible cleavage sites in every protein for every protease and reports all peptides created by the cleavage of all proteins in the tissue.
The created tool will provide the possibility to predict peptide profiles for all tissue types and identify peptides that are specific for the cancer cells and can be used for targeted immune therapy.
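The cleavage step of such a tool can be sketched as follows. The sketch assumes a deliberately simplified specificity model in which each protease cuts after a fixed set of residues, and it assumes complete digestion; the protease names, the residue sets and the example sequence are illustrative and do not come from the project or from MEROPS.

```python
def cleavage_sites(sequence, cut_after):
    """Positions i where the bond between sequence[i - 1] and sequence[i] is cut."""
    return [i for i in range(1, len(sequence)) if sequence[i - 1] in cut_after]


def digest(sequence, proteases):
    """Cut a protein at every site of every protease and return the peptides."""
    cuts = set()
    for cut_after in proteases.values():
        cuts.update(cleavage_sites(sequence, cut_after))
    bounds = [0] + sorted(cuts) + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical, simplified specificities: residues after which each protease cuts.
PROTEASES = {
    "trypsin_like": {"K", "R"},
    "chymotrypsin_like": {"F", "W", "Y"},
}

peptides = digest("MAKRWLLFTK", PROTEASES)
```

For this toy sequence the sketch yields the peptides MAK, R, W, LLF and TK. A real tool would need richer specificity rules and would likely also consider partially digested products.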

Ti plasmid evolution and horizontal gene transfer | St. Petersburg State University

students: Shikov Anton, Zorin Evgeniy
scientific advisers: Alexandr Tkachenko, Mikhail Rayko

Agrobacterium species carry a special sequence called T-DNA in their Ti and Ri plasmids, which can be inserted into the plant genome. This feature is widely used in plant biotechnology. However, the insertion can become a stable part of the plant genome; thus, Agrobacterium species are able to carry out horizontal gene transfer to plants, which is quite rare in the plant kingdom.

The aim of our work was to detect new examples of horizontal gene transfer in plants and to reconstruct the phylogeny and evolution of Ti and Ri plasmids. To achieve this goal, we used HMMER tools to analyze available plant genomes and proteomes. In total, 66 proteomes and 45 genomes were scanned. The extracted hits were then used to make multiple alignments and build approximately 700 trees.

Unfortunately, we did not detect any explicit clustering of plant and bacterial sequences. Nevertheless, the analysis revealed a new example of horizontal gene transfer in Nicotiana tabacum that has not been described in the literature before. The bacterial protein riORF20 from Agrobacterium rhizogenes has two plant homologs. Interestingly, these two proteins are homologous to the C- and N-termini of riORF20, respectively. For this reason, we propose that DNA recombination occurred in N. tabacum after T-DNA insertion.

Enhancement of Export Option for a Genome Mappability Score Estimator | Bioinformatics Institute

student: Skalon Elizaveta
scientific adviser: Bakin Evgeniy
Mappability is a genome-wide function that indicates whether it is possible for any read to be unambiguously mapped to a given position. Mappability information can be crucial for an interpretation of such experiments as ChIPSeq, SNP-calling etc., where quantitative estimates or confident identification of variations are performed. There is a special metric called Genome Mappability Score (GMS), which quantifies the mappability. GMS measures a weighted probability of mapping certainty in a given place. If the GMS is zero in a given position, many identical reads from different loci may be equally mapped to this region. Otherwise, if the GMS is 100, a read mapped to this position is unique.

In this work, we extended the functionality of a fast and sufficiently accurate tool for GMS calculation, developed at the Bioinformatics Institute in 2016. First, we added the ability to write output in a variety of formats: not only Wig and BigWig, but also BED, BigBed and TDF are now supported. Second, the runtime of data export was reduced by implementing a multiprocessing mode, geometric expansion of arrays, and export of the GMS track directly to BigWig without the wigToBigWig converter.
These improvements made GMS computation more convenient and user-friendly.

Retention time for identification of natural products | Carnegie Mellon University

student: Vladimir Sukhov
scientific advisers: Alexey Gurevich, Husein Mohimani

Natural products (NPs) play an important role in pharmacology: many antibiotics, antiviral and antitumor agents are NPs. Thus, it is crucial to have methods for accurate discovery of new NPs. In the search for new NPs, false positive identifications may occur. To reduce their number, we use retention time (RT) as an additional correctness check for discovered NPs.

In this work, we applied machine learning methods to determine the possible RT range for peptides. Multiple regression was chosen as the primary technique. As the model, we considered the amino acid composition of a peptide, where each amino acid adds its own weight to the final RT value.
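An additive composition model of this kind can be illustrated with ordinary least squares on amino acid counts. The sketch below is not the project's code: it uses a reduced four-letter alphabet, invented weights and synthetic retention times, and it solves the normal equations directly (with a tiny ridge term for numerical stability) so that it needs nothing beyond the standard library.

```python
from collections import Counter


def composition(peptide, alphabet):
    """Feature vector: an intercept term followed by per-residue counts."""
    counts = Counter(peptide)
    return [1.0] + [float(counts[aa]) for aa in alphabet]


def fit_least_squares(rows, targets, ridge=1e-8):
    """Solve (X^T X + ridge * I) w = X^T y by Gaussian elimination."""
    n = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    weights = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * weights[k] for k in range(row + 1, n))
        weights[row] = (b[row] - tail) / a[row][row]
    return weights


def predict(peptide, weights, alphabet):
    return sum(w * x for w, x in zip(weights, composition(peptide, alphabet)))


ALPHABET = "AGLK"  # reduced alphabet, for illustration only
TRUE_WEIGHTS = [2.0, 0.5, -0.2, 3.0, -1.0]  # invented: intercept, A, G, L, K

train = ["AGLK", "AAL", "GGK", "LLK", "ALLG", "KKA", "GLLL", "AAGGK"]
rows = [composition(p, ALPHABET) for p in train]
# Synthetic retention times generated from the invented weights.
targets = [sum(w * x for w, x in zip(TRUE_WEIGHTS, row)) for row in rows]

weights = fit_least_squares(rows, targets)
predicted_rt = predict("LAKG", weights, ALPHABET)
```

Because the synthetic data are noise-free, the fitted weights recover the invented ones almost exactly. With real measurements the residual spread around the prediction would define the acceptable RT range for a candidate identification.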

As a result, the model was trained and tested. Model benchmarking demonstrated high accuracy of RT prediction and its potential for a significant reduction of false positive identifications.

Analysis of VH-replacement statistical properties based on public datasets | Center for Algorithmic Biotechnology, St. Petersburg State University and Pavlov First St. Petersburg State Medical University

students: Adel Gazizova, Anastasia Vinogradova
scientific advisers: Andrey Slabodkin, Maria Chernigovskaya, Oksana Stanevich

During its assembly, the immunoglobulin H locus (IgH) undergoes a process called VDJ recombination, during which random gene segments from the IgH germline are joined into the resulting gene sequence. This provides the primary specificity to antigens. However, occasionally the existing V gene can be partly replaced by a new one; this process is called VH replacement. There are various hypotheses regarding the contribution of VH replacement to antibody functionality.

In our work, we created a pipeline that identifies VH replacement in human antibody sequences. First we downloaded the data from GenBank, parsed the files and extracted the titles, which contained all the information about each sequence. Before starting the search, we divided the antibody sequences into clonal families, because the data must contain only clonally independent sequences in order to exclude false-positive results. Then, using a script we developed, we performed exact and inexact (with one possible mismatch) searches for VH replacement footprints in sequences from people with different phenotypes. We analyzed the results and found that the VH replacement frequency increases significantly in subjects infected with HIV-1, as well as in those vaccinated against pneumococcus.
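The exact and one-mismatch searches mentioned above can be sketched with a simple sliding window. The function name, the motif and the sequences below are made up for illustration and are not real VH replacement footprints or project data.

```python
def find_with_mismatches(sequence, motif, max_mismatches=1):
    """Return (start, mismatches) for every window within the mismatch limit."""
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        mismatches = sum(1 for a, b in zip(window, motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


MOTIF = "TACTGTG"  # hypothetical footprint motif

exact_hits = find_with_mismatches("ACGTGTATTACTGTGCGAGA", MOTIF, max_mismatches=0)
inexact_hits = find_with_mismatches("GGTACAGTGCC", MOTIF, max_mismatches=1)
```

Setting `max_mismatches=0` gives the exact search, and `max_mismatches=1` additionally accepts windows that differ from the motif at a single position.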

Web bot development for automation of requests to IMGT / V-quest | Bioinformatics Institute

students: Andrey Zolotarev, Alexandr Cheblokov
scientific adviser: Evgeniy Bakin

There are a lot of different web-services that give the user an opportunity to work with integrated databases and research tools, which are necessary for a number of scientific areas.

One of them is IMGT® (the international ImMunoGeneTics information system®), a high-quality knowledge resource in immunogenetics and immunoinformatics that provides data on immunoglobulins (antibodies), T-cell receptors and major histocompatibility (MH) proteins of human and other vertebrate species, as well as on the immunoglobulin superfamily, the MH superfamily and related proteins of the immune system of vertebrates and invertebrates. Unfortunately, this service is difficult to use for statistical analysis because of the limit on the number of sequences that can be submitted. IMGT allows the user to submit only 50 sequences per request, and the task is further complicated by the need to configure multiple query parameters.

The problem for statistical research is obvious: the scientist must spend a huge amount of time to process a dataset of even one thousand sequences.

The solution we developed is a web bot that allows the researcher to automate the processing of large amounts of data under the limitation described above. We concluded that Selenium WebDriver is a suitable basis for this task.
Selenium is an open-source software library that supports the most popular web browsers and is compatible with popular programming languages such as C#, Python, JavaScript and others. It emulates user behavior on the site, which makes it possible to set the parameters once and then repeat the settings through the Selenium API.
As a result of our work we present a program that automates requests to IMGT for huge datasets and includes an interface for configuring the search parameters specific to a particular task. The output of the program is a table in CSV format that contains the data required by the researcher.
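The 50-sequence limit means that any such bot has to split its input into batches before submitting them. A minimal sketch of that batching step is shown below; the function names and the toy records are hypothetical, and the browser automation itself (the Selenium part) is deliberately left out.

```python
def batches(records, batch_size=50):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def to_fasta(batch):
    """Format a batch of (name, sequence) pairs as FASTA text."""
    return "\n".join(f">{name}\n{sequence}" for name, sequence in batch)


# Toy dataset of 1000 sequences.
records = [(f"seq{i}", "ACGT") for i in range(1000)]
chunks = list(batches(records))
first_request_body = to_fasta(chunks[0])
```

A dataset of 1000 sequences is split into 20 batches here, each of which would correspond to one submission with the same query parameters.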

Using approximate calculations to speed up the peak calling procedure | Bioinformatics Institute

student: Viacheslav Borovitskiy
scientific adviser: Evgeniy Bakin
Peak calling is a computational procedure used to identify areas in a genome that have been enriched with aligned reads primarily as a consequence of performing a ChIP-sequencing experiment. There are several popular pieces of software which perform this procedure (most of them require substantial computational resources and time). Each one has its own set of parameters requiring adjustment to the particular experiment.

In this work, we address the time cost of parameter adjustment for the peak calling procedure. We present a prototype of a tool that uses fast machine learning / digital signal processing methods to approximate the result of a peak calling procedure for a given caller with a given set of parameters almost instantly.

First, we run the given caller with the given set of parameters on a small piece of data. We then use the results to train a linear classifier (a fast, time-series-optimized version of logistic regression).

Finally, we apply our trained classifier (followed by some threshold transformation) to the rest of the data to obtain an approximation of the result.

We tested our tool on several data sets from the Encyclopedia of DNA Elements (ENCODE). On "good" data the precision/recall scores are about 0.85/0.85. On "bad" data they are about 0.20/0.20. The tests suggest that we never overfit, meaning that the precision/recall scores on the training set determine those on the test set.

Regulatory network modeling based on analysis of ATAC-seq data from cancer cells | Institut Cochin, Laboratory "Computational Epigenetics of Cancer"

student: Anastasia Danchurova
scientific adviser: Valentina Boeva

The tumor cell state is governed by a complicated interplay between transcription factors that regulate gene expression and thus define cell fate. The concept of core, or master, transcription factors comprising Oct4, Sox2 and Nanog (also known as Yamanaka's factor family) postulates that a small number of transcription factors control the more numerous auxiliary transcription factors and play an essential role in determining cell fate. Recent data showed that these core transcription factors play a regulatory role in different types of cancer.

Because cancer is a disease associated with aberrant gene expression patterns, transcription factors, which serve as the convergence points of oncogenic signaling and are functionally altered in many cancers, hold great therapeutic promise. The more personalized this therapy is, the more effective it will be.

That is why ATAC-seq data are used in this project. Compared with the DNase-seq and MNase-seq methods, ATAC-seq compares favorably in simplicity of library preparation, speed and the number of cells required (500-50,000 cells), which together make it appropriate for clinical use.

In this project, we create a tool that combines ATAC-seq data with the human genome annotation and several databases and determines interactions between transcription factors and active promoters and enhancers. As a result, we expect to construct a graph representing all detected interactions. Analysis of this graph is intended to help identify the main transcription factors that may become effective targets for anti-cancer therapy.

Regulatory network modeling based on analysis of ATAC-seq data from cancer cells |
All-Russia Institute for Agricultural Microbiology

student: Yury Malovichko
scientific adviser: Evgeniy Andronov

Sinorhizobium meliloti is one of the so-called rhizobia, a group of α- and β-proteobacteria known for their ability to interact with legume plants, resulting in a stable mutualistic symbiosis in which the bacteria provide the plant with atmospheric nitrogen reduced to ammonium in exchange for organic carbon. The genome of rhizobia differs from that of Escherichia coli and other model prokaryotes and comprises one major chromosome and one or more symbiotic plasmids that determine the bacterium's host range and symbiosis efficacy. However, the rhizobial genome is also known for its flexibility, with symbiotic genes being rearranged within plasmids, between plasmids, or even between plasmids and the chromosome.

In this study, we aimed to test the hypothesis, based on RFLP and other molecular marker analyses, that two distinct genetic lines exist within the S. meliloti species, distinguished by linkage of particular alleles of the leu and betCB genes. We used an MLST approach with 10 loci previously suggested for genomic clustering of this species (see reference 1) and a Bayesian inference algorithm to build a tree showing the actual phylogeny of 12 isolates, 6 from each of the two presumed genomic lines. However, the results were ambiguous, indicating that the suggested loci are not universally applicable for MLST of S. meliloti. We are now looking for more informative loci that will shed light on the true phylogeny of these isolates and on the existence of the two genomic lines.

1. van Berkum P., Elia P., Eardly B.D. Multilocus sequence typing as an approach for population analysis of Medicago-nodulating rhizobia // J. Bacteriol. 2006. Vol. 188, No. 15. P. 5570–5577.
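
As a rough illustration of the MLST step, the sketch below turns per-locus sequences into allelic profiles and counts the loci at which two isolates differ. The isolate names, locus names and sequences are hypothetical, and the simple allele-mismatch distance is only a stand-in: the study itself builds a tree with Bayesian inference.

```python
from itertools import combinations

# Hypothetical toy data: isolate -> {locus: sequence}.
ISOLATES = {
    "iso1": {"locA": "ATGC", "locB": "GGTA", "locC": "TTAA"},
    "iso2": {"locA": "ATGC", "locB": "GGTT", "locC": "TTAA"},
    "iso3": {"locA": "ATGA", "locB": "GGTT", "locC": "TTAC"},
}


def allelic_profiles(isolates):
    """Number distinct sequences at each locus in order of first appearance."""
    allele_ids = {}  # locus -> {sequence: allele number}
    profiles = {}
    for name, loci in isolates.items():
        profile = {}
        for locus, seq in loci.items():
            known = allele_ids.setdefault(locus, {})
            # A sequence seen before keeps its number; a new one gets the next.
            profile[locus] = known.setdefault(seq, len(known) + 1)
        profiles[name] = profile
    return profiles


def profile_distance(p, q):
    """Number of loci at which two allelic profiles carry different alleles."""
    return sum(1 for locus in p if p[locus] != q[locus])


profiles = allelic_profiles(ISOLATES)
distances = {
    (a, b): profile_distance(profiles[a], profiles[b])
    for a, b in combinations(sorted(profiles), 2)
}
```

If the chosen loci were informative, isolates from the same presumed line would tend to have smaller distances to each other than to isolates from the other line.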

NGS-based metagenomic pathogen viruses and bacteria identification system |
Saint-Petersburg Pasteur Institute

student: Alexandr Bebyakov
scientific adviser: Alexandr Semenov

Most microbiological diagnostic methods are time-consuming and cannot detect non-culturable forms of pathogenic agents. The project assumes that nucleotide sequencing data from a mixed sample can be used to detect the presence of pathogens that cause especially dangerous infections, along with their characteristic pathogenicity factors, and thus to speed up decisions on measures to counter possible epidemics.

Applying state-of-the-art neural network architectures for predicting protein-binding sites |
ITMO University

student: Viacheslav Borovitskiy
scientific adviser: Tatiana Malygina

This project aims to improve the approach proposed in the paper by applying some modern neural network architectures.

Study and development of a macrophage metabolic model|ITMO University

student: Natalia Rodina, Alexandr Cheblokov
scientific adviser: Gainullina Anastasia, Sergushichev Alexey

Macrophages are first-line immune cells: they destroy pathogens (M1) and maintain tissue homeostasis (M2). A flux balance analysis (FBA) metabolic model makes it possible to see how metabolic pathways are coordinated at the level of the whole cell. However, the existing FBA model of macrophage metabolism contains a number of inaccuracies and therefore does not reflect the latest understanding of M1 activation derived from molecular biology experiments. The purpose of this project is to detect and correct the inaccuracies in the macrophage FBA model.

Biogeography of Arabidopsis|University of La Verne

student: Anton Eliseev, Kristina Krivonosova
scientific adviser: Tatiana Tatarinova

The goal of the project is to use Medicago and Arabidopsis genomes to identify correlations between genetics, climate, soil and other geographic factors, and to build a model linking the environment and the genome.

Statistical analysis of annotated genomes|University of La Verne

student: Poleshchuk Olga, Danchurova Anastasia
scientific adviser: Tatiana Tatarinova

Find correlations between sequence features and functional regions in different genomes:

  1. Plot sequence features such as TFBS, SNPs, methylation and RNA-seq coverage
  2. Map them onto promoter regions
  3. Find correlations
  4. Consider the implications for promoter prediction in complex and unannotated genomes
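
The mapping and correlation steps above can be sketched in a few lines of Python. The promoter coordinates and feature positions are hypothetical toy values, and the use of the Pearson coefficient is an assumption; the project may rely on other statistics.

```python
from math import sqrt

# Hypothetical toy data on a single chromosome (assumptions for illustration).
PROMOTERS = [(0, 100), (200, 300), (400, 500)]  # half-open intervals
TFBS_POSITIONS = [10, 20, 30, 210, 220, 410]
SNP_POSITIONS = [15, 40, 60, 90, 250, 260, 450]


def count_per_region(positions, regions):
    """Count how many feature positions fall inside each region."""
    return [sum(1 for p in positions if start <= p < end) for start, end in regions]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


tfbs_counts = count_per_region(TFBS_POSITIONS, PROMOTERS)
snp_counts = count_per_region(SNP_POSITIONS, PROMOTERS)
r = pearson(tfbs_counts, snp_counts)
```

With only three promoters the coefficient is of course not meaningful; the sketch only shows how per-promoter feature counts feed into a correlation.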

Finding novel variations of germline Immunoglobulin genes using WGS data|University of California San Diego

student: Alexandr Ilin
scientific adviser: Yana Safonova

The variety of immunoglobulin germline genes (V, D, and J) is a key component of antibody repertoire diversity. The highly repetitive structure of the Ig loci and a lack of natural selection result in an elevated polymorphism rate in immunoglobulin germline genes. In this project, we want to analyze variations of the Ig loci in several human populations and describe the differences between them.

Long read mapping improvements for Flye assembler|University of California San Diego

student: Evgeny Polevikov
scientific adviser: Mikhail Kolmogorov

minimap2 is a versatile pairwise aligner for genomic and spliced nucleotide sequences written in C. The goal of this project is to write a C++ wrapper for this tool in order to incorporate it into Flye.

In src/example.cpp you can find a usage example of the minimap2 API with a C++ interface. The example shows how to build an index and how to use this index to find overlaps for PacBio reads.

We have recently released the Flye assembler for long and noisy reads (PacBio, Oxford Nanopore). The assembly results seem very promising in comparison with the current state-of-the-art approaches.

As a successor of the ABruijn assembler, Flye uses a solid k-mer based approach to find overlaps between noisy reads, which is (relatively) fast but might not be optimal in terms of memory usage and flexibility of parameter choice. On the other hand, minimap2 seems to be very memory efficient while showing the best sensitivity/specificity among long read aligners. As minimap2 also exposes an API that can be called from C++, we want to explore the possibility of replacing our solid k-mer approach with minimap2. We expect that this change will significantly reduce the memory usage bottleneck while also improving the assembly accuracy.
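
To make the baseline concrete, here is a minimal Python sketch of the general idea behind solid k-mer overlap detection: k-mers seen in enough reads are treated as "solid", and read pairs sharing many solid k-mers become overlap candidates. This is not Flye's implementation; the reads, the value of k and both thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy reads (real long reads are thousands of bases long).
READS = {
    "r1": "ACGTACGGTCA",
    "r2": "ACGGTCATTGC",
    "r3": "TTTTTTTTTTT",
}


def kmers(seq, k):
    """All substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def solid_kmers(reads, k, min_reads):
    """K-mers that occur in at least min_reads different reads."""
    counts = Counter(km for seq in reads.values() for km in set(kmers(seq, k)))
    return {km for km, n in counts.items() if n >= min_reads}


def candidate_overlaps(reads, k=4, min_reads=2, min_shared=3):
    """Return {(read_a, read_b): number of shared solid k-mers} for likely overlaps."""
    solid = solid_kmers(reads, k, min_reads)
    per_read = {name: set(kmers(seq, k)) & solid for name, seq in reads.items()}
    result = {}
    for a, b in combinations(sorted(reads), 2):
        shared = len(per_read[a] & per_read[b])
        if shared >= min_shared:
            result[(a, b)] = shared
    return result


overlaps = candidate_overlaps(READS)
```

Keeping every read's k-mer set in memory, as this sketch does, hints at why such an index can become a memory bottleneck on large datasets.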

Mediation of effects of persistent chemicals on the human sperm epigenome|A.N. Belozersky Research Institute of Physico-Chemical Biology, Moscow State University, Institute of Bioengineering, Research Center of Biotechnology RAS

student: Julia Kornienko
scientific adviser: Oleg Sergeev, Yulia Medvedeva

Spermatogenesis and sperm maturation involve a cascade of epigenetic changes (Wu et al., 2015). Studying the sperm epigenome is a very promising direction for several reasons. First, the specific effects of diverse environmental factors, including chemical ones, on epigenetic marks are clearly understudied. Second, the epigenetic changes detected are associated with sperm quality and quantity. Third, and perhaps most importantly, reprogramming of the germ cell epigenome can be passed on to the next generation, which may impair the development of offspring both at the embryonic stage and later in life.

- The subject of the proposed molecular epidemiological study is sperm DNA methylation data (WGBS and RRBS) together with data on environmental factors, pubertal development and lifestyle, collected in the parent prospective cohort study "Russian Children's Health. Male Reproductive SubStudy", which began in 2003.

- What is known from the parent study?
Exposure to persistent toxicants such as dioxins during pubertal development affects both sperm DNA methylation (Pilsner et al., 2018) and the decline in semen quality (Minguez-Alarcon et al., 2017) at the age of 18.

- What is unknown?
What contribution is made by other environmental factors, smoking in particular, that may adversely affect the sperm epigenome and semen quality during spermatogenesis?
What contribution do different patterns of pubertal development (accelerated, normal, delayed) make to changes in the sperm epigenome and sperm quality?

- Aim of the study:
To examine the role of smoking and the pace of pubertal maturation as mediators of the effect of dioxins on sperm DNA methylation.

Role of protein dimerization|Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill

student: Orlov Iurii
scientific adviser: Nikolay Dokholyan

We would like to understand why nature evolves proteins to function as dimers, and whether an oligomeric protein structure is evolutionarily preferable to a monomeric one.

General plan:
  1. detect core residues (CR) responsible for structure formation
  2. determine how the number of CRs grows with protein length
  3. compare obtained results with dimeric proteins
  4. go further for larger oligomers (n-mers) to find the most preferable n

Model of the N1 zone formation in human antibodies dimerization|Bioinformatics institute

student: Elena Pazhenkova
scientific adviser: Evgeniy Bakin, Oksana Stanevich

The N1-zone is a variable region of human antibody DNA, formed as a result of VDJ recombination and providing diversity of antigen-binding regions. N1-zone generation is a complicated process that includes the formation of palindromes at the 5' and 3' ends and the addition of up to 20 random nucleotides to the 5' and 3' ends, followed by non-homologous end joining. Thus, the length of the N1-zone depends on several random events. However, the N1-zone sometimes contains so-called footprints, which arise from VH-replacement, and recent studies have shown that the length of CDR3 (including V3', N1, D, N2 and J5') is correlated with the number of footprints (Meng et al., 2014). In this project we want to find out whether VH-replacement is a random event by fitting a statistical model of N1-zone formation and estimating its parameters using the maximum likelihood method.

Search for multiple associations in GWAS data|Bioinformatics institute

student: Shikov Anton
scientific adviser: Yury Barbitoff

In 2017, the UK Biobank made the largest release of genetic data in history (500,000 individuals). Benjamin Neale's group ran a rapid large-scale association analysis (GWAS) for more than 2,000 phenotypes and made the results publicly available. Since then, a huge number of preprints devoted to searching for interesting signals in these data have appeared on the bioRxiv server. This project focuses on finding markers associated with multiple phenotypes and on studying the mechanisms that mediate these effects.
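
A minimal Python sketch of the search for multi-phenotype markers is given below. The variant identifiers, phenotypes and p-values are hypothetical, and the conventional genome-wide significance threshold of 5e-8 is used as an assumption; the project may apply a different threshold or additional filtering such as linkage disequilibrium pruning.

```python
from collections import defaultdict

GENOME_WIDE_P = 5e-8  # conventional threshold, assumed here

# Hypothetical summary statistics: (variant, phenotype, p-value).
ASSOCIATIONS = [
    ("var1", "height", 1e-12),
    ("var1", "BMI", 3e-9),
    ("var1", "asthma", 0.2),
    ("var2", "height", 4e-10),
    ("var3", "BMI", 1e-8),
    ("var3", "asthma", 2e-8),
    ("var3", "height", 6e-8),
]


def multi_phenotype_markers(associations, p_threshold=GENOME_WIDE_P, min_phenotypes=2):
    """Return {variant: sorted phenotypes} for variants significant in several traits."""
    hits = defaultdict(set)
    for variant, phenotype, p_value in associations:
        if p_value < p_threshold:
            hits[variant].add(phenotype)
    return {
        variant: sorted(phenotypes)
        for variant, phenotypes in hits.items()
        if len(phenotypes) >= min_phenotypes
    }


markers = multi_phenotype_markers(ASSOCIATIONS)
```

Here `var2` is dropped because it is significant for only one phenotype, and the `height` association of `var3` is ignored because its p-value does not pass the threshold.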

Identifying differentially expressed transposons across four life-cycle stages of Fasciola hepatica|Institute of Cytology RAS

student: Elisaveta Scalon
scientific adviser: Anna Soloveva, Nikolay Panyushev

The project involves searching for mobile elements whose expression level varies with the life-cycle stage of Fasciola hepatica.
Transcriptomes of all life-cycle stages of F. hepatica have been sequenced, and the data are available in the Sequence Read Archive. We plan to perform reference-based transcriptome assembly, identify mobile element sequences and determine their transcription levels. The output will be data on the presence or absence of major mobile element transcripts specific to individual life-cycle stages of F. hepatica.

Evolution analysis of genes associated with apomixis in Brassicaceae family|CAB SPbU

student: Rostislav Skitchenko
scientific adviser: Mike Raiko

  • Perform a comparative phylogenetic analysis of the genomes of seven plants.
  • Find patterns linking specific genes to apomictic plant forms.
  • Find orthologous genes in other representatives of the Brassicaceae family.
  • Build trees for the genes of interest.