Research

We have developed many tools for the compression, processing, and analysis of bioinformatics data. Some of them were designed by our group in close collaboration with external researchers. Below you can find a short description of what we did.

We are also open to collaboration, so if you like our tools and think that some of them need customization to fit your goals, or you want to create something brand new and ambitious, feel free to contact us.

Compression of sequencing data

Genome sequencing is usually the first step toward understanding a species' genome. It is relatively cheap and thus popular. Over the years, many technologies have been developed. Currently, the most popular are those by Illumina (2nd gen sequencers) and by Oxford Nanopore and PacBio (3rd gen sequencers). Sequencing usually produces huge amounts of data: in a single experiment for the human genome, we obtain well over 100GB of plain-text files in FASTQ format. Thus, the first problem is how to store and transfer such huge amounts of data.
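For readers unfamiliar with the format: a FASTQ file stores each read as a four-line record, i.e., an identifier line starting with '@', the base calls, a '+' separator, and one Phred quality character per base. A made-up example record:

```
@read_0001 example
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
```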

In 2011, we developed DSRC, one of the first specialized compressors for FASTQ files. It was designed for 2nd generation data, i.e., reads of short (usually fixed) length. Its newer version, DSRC 2, compresses significantly better than the commonly-used gzip and runs at speeds of 500MB/s and more with a small memory footprint. Then, we experimented with reorganizing the reads in the input dataset to achieve much better compression ratios. The experimental tool ORCOM implemented these ideas, but focused only on the bases (not IDs and quality scores). FaStore, its successor, is a fully-fledged compressor of FASTQ files. It offers a compromise between compression ratio and processing time. Finally, we developed FQSqueezer, which takes a completely different approach: it borrows the best from the PPM and DMC general-purpose compression schemes to offer even better compression ratios than the tools based on read reorganization. The price is, however, slow processing and large memory requirements. Thus, FQSqueezer should be seen as a demonstration of what is possible rather than a tool useful for everyone.
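To give an intuition for the read-reordering idea behind ORCOM and FaStore, here is a toy Python sketch (not their actual algorithm; they bucket reads by minimizers and use dedicated encoders). It compresses a set of overlapping reads with gzip before and after a simple lexicographic sort; merely placing similar reads next to each other already helps a window-based compressor:

```python
import gzip
import random

# Toy data: reads sampled from one short reference, so many of them overlap.
random.seed(0)
ref = "".join(random.choice("ACGT") for _ in range(10_000))
reads = [ref[i:i + 100] for i in (random.randrange(9_900) for _ in range(5_000))]

def gzipped_size(read_list):
    return len(gzip.compress("\n".join(read_list).encode()))

print("original order:", gzipped_size(reads))
# Sorting groups reads with shared prefixes, so gzip's 32 KB window sees many
# near-duplicates; ORCOM/FaStore exploit read similarity far more effectively.
print("sorted order:  ", gzipped_size(sorted(reads)))
```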

In 2021, we focused on the data produced by 3rd gen sequencers (ONT, PacBio). These reads are much longer (even 1M+ bases) and noisier than those produced by Illumina equipment. The differences are so large that the existing tools behave poorly here. CoLoRd is our answer to these problems. It is the first compressor for such data that significantly outperforms the general-purpose gzip in terms of compression ratio. It is also relatively fast (50–100 MB/s and more), so we believe it can be useful in practice and can change how 3rd gen reads are stored and transferred.

Compression of genome collections

Sequencing data are just the input to pipelines that let us reconstruct longer (ideally whole) genomic sequences. A number of large sequencing projects have been launched over the years, with the 1000 Genomes Project as one of the most famous. Its output was a set of whole genomic sequences.

In 2011, we developed GDC. In experiments with 70 human genomes, we observed compression ratios of about 1000-fold. Its successor, GDC 2, is even better: on a set of 2184 human genomes we saw a 9500-fold compression ratio. This was impressive, but the tool had some limitations. The most important was that it assumed the input genomes were sets of chromosomes, which is not always true. Moreover, the datasets used in the evaluation of GDC and competing tools were rather “simplified” genomes. They were produced from genotyping and 2nd gen sequencing experiments, and thus usually did not contain longer variants, so they were much easier to compress than complete genomes. In 2021, the first datasets containing much more complete assemblies of human genomes were published by the Human Pangenome Project. Shortly afterwards, we developed a new algorithm, AGC, which is based on a different approach and is able to cope with such genome assemblies. The compression ratios are not as impressive here (e.g., an approx. 300GB collection of human genomes was compacted to 1.5GB, i.e., roughly 200-fold), but this is still the best result proposed to date. Moreover, AGC offers an API that allows working directly on the compressed genomes without decompressing them before further use. The Human Pangenome Project already distributes its collection of genomes in AGC format, and we believe that other genome sequencing projects will also find AGC useful.

Compression of genotype collections

Obtaining complete genome sequences (especially longer ones) was (almost) impossible in the 2010s, at least on a large scale. Therefore, many projects focused on genotyping, i.e., obtaining lists of typical variants for a species and determining which variants are present in the examined genomes. Such datasets are usually distributed as VCF files.
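For illustration, a tiny made-up fragment of such a VCF file: each data row describes one variant with respect to the reference, and the per-sample columns store the genotypes (e.g., 0|1 means the alternate allele is present on one of the two haplotypes):

```
##fileformat=VCFv4.2
#CHROM  POS    ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMPLE_A  SAMPLE_B
1       10177  .   A    AC   100   PASS    .     GT      0|1       0|0
1       10352  .   T    TA   100   PASS    .     GT      1|1       0|1
```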

We started to work on such data in 2013. Our first tool was TGC, an experimental compressor designed to see what compression ratio is achievable for a large collection of human genomes. In experiments with the same set of 2184 sequences as used in the GDC 2 evaluation, we saw approx. 15500-fold compression. More or less at the same time we developed MuGI, an index over a collection of genomes that supports k-mer queries. The next tool was GTC, a compressor of VCF files that was able to represent a collection of genotypes from 27,165 humans (a 4.3TB VCF file) in as little as 4GB. Moreover, it supported various types of queries. Finally, we focused on the maximum compression of VCF files, as we wanted to see what is possible here. We developed GTShark and VCFShark. The former compacts only the genotype data as much as possible, while the latter also supports the other types of data present in VCF files, as well as VCF files without genotype information. GTShark can also be used to measure the similarity of a newly genotyped sample to various populations.

Multiple sequence alignment of proteins

Multiple sequence alignment (MSA) of proteins is one of the most important analyses in molecular biology. The problem cannot be solved optimally in any reasonable time, so various heuristics are in common use. Moreover, ever larger protein families are being published; e.g., the largest family in Pfam (v33.1, NCBI version) contains almost 3M proteins.
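As a toy illustration of the problem, an MSA inserts gap symbols '-' so that corresponding (homologous) residues of all sequences end up in the same column; the three short sequences below are made up:

```
seq1  MK-TAYIAKQR
seq2  MKVTAYIA-QR
seq3  M--TAYLAKQR
```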

In 2014, we developed QuickProbs, focusing on providing the most accurate alignments possible. We continued in this direction in its second version. The price, however, was processing time. If you have a modern GPU and want high-quality alignments, consider QuickProbs 2 for families of up to 1,000 (maybe 2,000) sequences. To address the problem of huge families, we designed FAMSA. It is able to process families of millions of sequences in minutes, a scale impossible for competing tools.

We also work on the compression of large MSAs. If you need to store large MSAs, you can try CoMSA. For large families, it offers a representation an order of magnitude more compact than gzip.

Our latest project in this field concerns the efficient storage of protein structures. Our new tool, ProteStAr, allows storing millions of files in a single archive with a compression ratio a few times better than gzip, while still providing fast random access.

Counting and processing of k-mers

Counting k-mers in sequencing datasets looks like a simple task. In fact, it is simple if the dataset is small. In real FASTQ files, however, we can have billions of unique k-mers and trillions of k-mers in total, and naive solutions fail.
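A minimal sketch of the naive approach (an in-memory hash map, written here in Python) shows where it breaks down: with billions of distinct k-mers the dictionary alone would need hundreds of gigabytes of RAM, which is exactly the scenario that KMC's disk-based, multi-stage design is built for:

```python
from collections import Counter

def count_kmers(sequences, k=27):
    """Naive in-memory k-mer counting; fine only for small inputs."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers with ambiguous bases
                counts[kmer] += 1
    return counts

# Works for a handful of reads...
print(count_kmers(["GATTACAGATTACAGATTACAGATTACA"], k=7).most_common(3))
# ...but a whole-genome sequencing run yields billions of distinct k-mers,
# and a Python dict entry costs dozens of bytes, so memory use explodes.
```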

In 2013, we published the first version of KMC, a tool designed to count k-mers in datasets of any size. In later releases, we extended the package with KMC tools, which allows performing various types of operations (e.g., set operations) on sets of k-mers. In one of our experiments, we used a small workstation with memory limited to 33GB to count the k-mers in a 1.7TB FASTQ file in just 1.5h. KMC is currently our most popular tool and is used in many applications. One of them is RECKONER, our corrector of errors in 2nd gen sequencing data. Another is CoMeta, a tool for the classification of reads in metagenomic experiments.

Kmer-db is our tool that collects k-mers from many samples and allows comparing these samples using their shared k-mers. In one of our experiments, we counted all k-mers in a dataset of 661K bacterial genomes using less than 1TB of RAM. One of the various applications of comparing k-mer sets is PHIST, our tool for predicting prokaryotic hosts of phages from (meta)genomic sequences.
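The core idea behind such comparisons can be sketched in a few lines: derive a similarity measure, e.g., the Jaccard index, from the sizes of the shared and combined k-mer sets. The sequences below are made up; Kmer-db computes this kind of statistic for hundreds of thousands of samples at once:

```python
def kmer_set(seq, k=18):
    """All k-mers of a sequence (toy version: no canonical form, no filtering)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(seq_a, seq_b, k=18):
    """Fraction of k-mers shared between two samples."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    return len(a & b) / len(a | b)

# Two made-up sequences differing by a single substitution.
s1 = "ACGTACGGTTCAGGCATTACGGATCCAGGTTACGATCAGGT"
s2 = "ACGTACGGTTCAGGCATTACGGATCCAGGTAACGATCAGGT"
print(round(jaccard(s1, s2), 3))
```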

Read mapping

When sequencing is over, we obtain a possibly huge number of reads. One of the typical next steps is mapping them onto a reference genome. If the number of reads is small, there are great tools for this, like BWA-MEM, Minimap2, and Bowtie2. Nevertheless, in whole-genome sequencing experiments, the number of reads is usually huge. This is the moment at which you can consider using Whisper. For WGS Illumina data it is a few times faster than BWA-MEM, Minimap2, and Bowtie2, while preserving similar variant-calling quality.

External projects to which we contributed

Our tools are used in many pipelines or as parts of tools developed by other labs. Sometimes we are also involved in closer cooperation, modifying some of our tools for special needs or helping design new tools. Some examples of such collaborations are: