I am a computer scientist whose main research interests are computational biology, bioinformatics, and data compression. I hold a degree in Information and Communication Technologies from the University of Aveiro (Portugal), part of which was completed at the Pontifical University of Salamanca (Spain). Afterwards, I worked in the private sector for a couple of years. I then rejoined the University of Aveiro, where I completed a Ph.D. in Informatics (2016) and a postdoc in Computer Science (2019). In 2019, I worked as a Bioinformatician at the University of Helsinki (Finland). Currently, I am an Auxiliary Scientist/Professor at the DETI/IEETA of the University of Aveiro and an Invited Scientist at the Department of Virology of the University of Helsinki. My memberships include the Super Dimension Fortress, the APRP, and the ESCV.

        

Students

Current:
A. Cerqueira (MS Student)
A. Ferrolho (MS Student)
D. Yamunaque (MS Student)
D. Lei (MS Student)
M. J. Sousa (PhD Student)
M. Fernandes (MS Student)
M. Pinheiro (MS Student)
R. Vieira (MS Student)

♕ I am currently looking for motivated PhD and MSc students,
particularly in the areas of computational biology, bioinformatics,
and information theory. If you are interested, please contact me.

Alumni:
MS. A. Lourenço
Dr. J. M. Silva
MS. M. Silva
Dr. M. Hosseini
MS. M. Gaspar
MS. R. Soares
MS. T. Fonseca


Projects

We develop new mathematical and computational models and implement them efficiently as computer programs for biomedical, anthropological, and coding applications. Our work addresses both the statistical and the algorithmic aspects of these problems, creating innovative data mining and machine learning methodologies. We are currently working on projects such as the development of efficient biological data compression tools, reconstruction and analysis of ancient and extant viral genomes, identification of specific viral signatures, genomic variation quantification, classification of unknown sequences, and metagenomic analysis of ancient DNA samples. The following word map summarizes part of our work.

Selected Publications



AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data. GigaScience, 2023. J. M. Silva, W. Qi, A. J. Pinho, D. Pratas.


Unmasking the tissue-resident eukaryotic DNA virome in humans. Nucleic Acids Research, 2023. L. Pyöriä, D. Pratas, M. Toppinen, K. Hedman, A. Sajantila, M. F. Perdomo.


The complexity landscape of viral genomes. GigaScience, 2022. J. M. Silva, D. Pratas, T. Caetano, S. Matos.

The haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar reveal novel pan-genome and allele-specific transcriptome features. GigaScience, 2022. W. Qi, Y. Lim, A. Patrignani, P. Schläpfer, A. Bratus-Neuenschwander, S. Grüter, C. Chanez, N. Rodde, E. Prat, S. Vautrin, M. Fustier, D. Pratas, R. Schlapbach, W. Gruissem.


The Human Bone Marrow Is Host to the DNAs of Several Viruses. Frontiers in cellular and infection microbiology, 2021. M. Toppinen, A. Sajantila, D. Pratas, K. Hedman, M. F. Perdomo.


Efficient DNA sequence compression with neural networks. GigaScience, 2020. M. Silva, D. Pratas, A. J. Pinho.


Persistent minimal sequences of SARS-CoV-2. Bioinformatics, 2020. D. Pratas, J. M. Silva.


The landscape of persistent human DNA viruses in femoral bone. Forensic Science International: Genetics, 2020. M. Toppinen, D. Pratas, E. Väisänen, M. Söderlund-Venermo, K. Hedman, M. F. Perdomo, A. Sajantila.


A hybrid pipeline for reconstruction and analysis of viral genomes at multi-organ level. GigaScience, 2020. D. Pratas, M. Toppinen, L. Pyöriä, K. Hedman, A. Sajantila, M. F. Perdomo.


Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements. GigaScience, 2020. M. Hosseini, D. Pratas, B. Morgenstern, A. J. Pinho.


Cryfa: a secure encryption tool for genomic data. Bioinformatics, 2019. M. Hosseini, D. Pratas, A. J. Pinho.


An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Scientific Reports, 2015. D. Pratas, R. M. Silva, A. J. Pinho, P. J. S. G. Ferreira.


Three minimal sequences found in Ebola virus genomes and absent from human DNA. Bioinformatics, 2015. R. M. Silva, D. Pratas, L. Castro, A. J. Pinho, P. J. S. G. Ferreira.


MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics, 2014. A. J. Pinho, D. Pratas.


GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Research, 2012. A. J. Pinho, D. Pratas, S. P. Garcia.

Selected tools




  • AlcoR: a toolkit for alignment-free simulation, computation, and visualization of low-complexity regions in biological data.
  • EAGLE2: an ultra-fast and alignment-free tool to compute minimal sequences from viral genomes that are absent from hosts.
  • GeCo3: a series of efficient compressors (GeCo, GeCo2, GeCo3) for genomic sequences (reference and reference-free approaches).
  • JARVIS2: a series of high-ratio reference-free data compressors for DNA sequences, FASTA, and FASTQ data.
  • FALCON-meta: a highly sensitive and ultra-fast tool for analysis of metagenomic samples using a reference multi-FASTA database.
  • TRACESPipe: an automatic pipeline for reconstruction and analysis of viral and human-host genomes at multi-organ level.
  • Cryfa: an ultra-fast, secure encryption tool for genomic data that is able to compact FASTA and FASTQ data.
  • Smash: an alignment-free tool to find and visualize rearrangements in pairs of DNA sequences (FASTA format).
  • Chester: a memory-efficient tool to map and visualize evolutionary regions in multiple genomes (FASTA and FASTQ format).
  • GTO: a complete toolkit for analysis of genomic and proteomic data providing pipe support for easy integration with existing tools.
  • AC2: a series of efficient compressors (AC, AC2) for amino acid sequences (reference and reference-free approaches).

Datasets

To promote the development of efficient computational models for the minimal lossless representation of DNA and amino acid sequences, we maintain two benchmarks. The latest top developments involve the use of neural networks for context mixing and cache hashes in weighted stochastic repeat models. Please click on the following images to download the sequence data and benchmark a new compression algorithm.


        

Challenge

Provide a data compressor that improves the lossless and reference-free minimal representation of a human genome sequence (T2T Chm13 version 2.0 [article, sequence]).

Top 5 entries:

Ranking  Bytes        Bps    Time (m)  RAM (GB)  Program  Replication  Factor
1        544,059,173  1.396  389       28.8      JARVIS2  Run51        30%
2        544,267,353  1.396  420       27.4      JARVIS2  Run50        30%
3        544,292,577  1.397  399       26.9      JARVIS2  Run49        30%
4        545,960,947  1.401  283       26.9      JARVIS2  Run48        30%
5        549,594,830  1.410  284       11        JARVIS2  Run47        30%
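
For readers preparing an entry, the Bps column can be read as bits per symbol: the compressed size in bits divided by the number of symbols in the input sequence. The following Python sketch illustrates that arithmetic under this assumption. The function name bits_per_symbol, the synthetic sequence, and the use of lzma are illustrative choices only; lzma is a general-purpose stand-in from the standard library and is not one of the compressors listed above.

```python
import lzma
import random


def bits_per_symbol(compressed_bytes: int, num_symbols: int) -> float:
    """Average number of bits spent per input symbol (hypothetical helper)."""
    if num_symbols <= 0:
        raise ValueError("num_symbols must be positive")
    return 8 * compressed_bytes / num_symbols


# Illustrative numbers only (not taken from the leaderboard):
# 250,000 bytes for 1,000,000 symbols is exactly 2 bits per symbol,
# the cost of a plain 2-bit encoding of a four-letter alphabet.
print(f"{bits_per_symbol(250_000, 1_000_000):.3f} bps")  # prints "2.000 bps"

# Stand-in measurement on a synthetic sequence (assumption: uniform random
# bases, which a real genome is not), using a general-purpose compressor.
random.seed(0)
sequence = "".join(random.choices("ACGT", k=10_000)).encode("ascii")
compressed = lzma.compress(sequence)
synthetic_bps = bits_per_symbol(len(compressed), len(sequence))
print(f"lzma on synthetic data: {synthetic_bps:.3f} bps")
```

A new compressor would be evaluated the same way, replacing the lzma call with its own output size and the synthetic sequence with the benchmark sequence.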


Full leaderboard here

Contact

IEETA, University of Aveiro
Campus Universitário Santiago, 3810-193
Aveiro, Portugal

Phone: +351 234 370 506



IEETA/LASI, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro
DETI, Department of Electronics, Telecommunications and Informatics, University of Aveiro
DV, Department of Virology, University of Helsinki