D. Pratas @ Interdisciplinary Informatics

About

Diogo Pratas is a researcher in interdisciplinary informatics, dedicated to developing computational methods that bridge informatics with biomedical, anthropological, and historical research. His work focuses on analyzing and interpreting complex data. Current projects include efficient compression of biological data, reconstruction and analysis of viral genomes, metagenomic studies of ancient DNA, and the application of machine learning to date and localize ancient artifacts such as metals, texts, and genetic material.

Research

Bioinformatics: Advanced genetic and structural analysis through the development of efficient computational tools. This includes genome classification, identification of patterns in biological sequences, and the reconstruction of ancient and modern genomes (mostly viruses and mitogenomes).
Computational Biology: Modeling and simulating biological systems to explore functional genomics and evolutionary dynamics. This line involves developing interdisciplinary methods and tools to interpret complex biological patterns across species and time scales, and includes applied solutions to environmental challenges, such as plastic biodegradation.
Computational Medicine: Development and application of innovative computational methods for virus detection in human tissues, virome analysis (including in disease and transplantation contexts), cytogenetic analysis, and identification of therapeutic targets, contributing to diagnosis, prognosis, and treatment in medicine.
Cultural Heritage: Application of machine learning and data-mining techniques to preserve and analyze traditional knowledge. Projects include dating and localizing ancient artifacts—such as metal objects, manuscripts, and DNA samples—to reconstruct provenance, reveal transmission patterns, and advance cultural-heritage stewardship.
Data Compression: Development of efficient algorithms to process and interpret large-scale datasets, emphasizing statistical and algorithmic methods, format transformations, parameter optimization, and machine learning to enable data reduction, pattern discovery, and the extraction of meaningful structure from complex data.
Algorithmic Information Theory: Development of methodologies for generating and analyzing pseudo-random Turing Machines to identify those that produce statistically complex tape behaviors, along with the development of computational approaches to discover short Turing Machines that describe exact or approximate digital objects and to compute approximations of their logical depth.

Students

PhD Student

M. J. P. Sousa - Intelligent reconstruction and analysis of viral genomes.

Graduated Students

P. Pinto - Text dating using machine learning.

Master's Students

A. Gomes - Human Virus Genomics and distribution.
D. Silva - Detection of Machine-Generated Text.
J. Gaspar - Compression of Astronomical Data.
R. Dias - Complexity analysis of music.
S. Almeida - Analysis of Relative Absent Words for Innovative Diagnosis.

Theses and Dissertations

Automatic DNA classification of organisms contained in ancient samples. Luís Marques (2025).
Temporal Text Classification with Machine Learning. Paulo Pinto (2025).
Automatic plant gene regions annotation using supervised Machine Learning. Bruna Simões (2025).
Exploring microalgae-enzymes for sustainable plastic biodegradation. Diana Lourenço (2025).
Machine Learning Approach for Non-Destructive Coin Dating. Ricardo Dias (2025).
A machine learning approach for authentication of ancient DNA samples. Denis Yamunaque (2025).
Optimization of a genomic data compressor using metameric genetic algorithms. Rita Ferrolho (2024).
Machine learning-enhanced optimization of plastic-degrading enzymes for sustainable ocean cleanup. Clara Cerqueira (2024).
Study of the impact of data compression on energy consumption reduction. Dinis Lei (2024).
Genomic diversity and zoonotic potential of hepatitis E virus in European rabbits: implications for diagnostic and therapeutic approaches. Margarida Pinheiro (2024).
Designing optimal 3D enzyme computational models for efficient plastic degradation. Mariana Fernandes (2024).
Designing in-silico aptamers for potential use in marine bioremediation. Rafael Vieira (2024).
Automatic reconstruction of persistent human virus sequences. Maria J. P. Sousa (2023).
Improving a database of cyanobacterial bioactive compounds that can be used for therapeutic approaches in human diseases. Renato Soares (2023).
Impact of sorting in DNA sequence compression. Tiago Fonseca (2023).
Algorithmic information approximations in data analysis. Jorge M. Silva (2023).
Reconstruction and classification of unknown DNA sequences. Alexandre Lourenço (2021).
Efficient biosequence compression using neural networks. Milton Silva (2021).
Compression models and tools for omics data. Morteza Hosseini (2020).
Automatic system for approximate and noncontiguous DNA sequences search. Manuel Gaspar (2017).

Publications

Journal Articles:

M. J. P. Sousa, M. Toppinen, L. Pyöriä, K. Hedman, A. Sajantila, M. F. Perdomo, D. Pratas*. An evaluation of computational methods for reconstruction of human viral DNA genomes. GigaScience, 2025.
M. J. P. Sousa, A. J. Pinho, D. Pratas*. JARVIS3: an efficient encoder for genomic data. Bioinformatics, 2024.
J. M. Silva, A. J. Pinho, D. Pratas*. AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data. GigaScience, 2024.
R. Soares, L. Azevedo, V. Vasconcelos, D. Pratas, S. Sousa, J. Carneiro. Machine Learning-Driven Discovery and Database of Cyanobacteria Bioactive Compounds: A Resource for Therapeutics and Bioremediation. Journal of Chemical Information and Modeling, 2024.
L. Pyöriä, D. Pratas, M. Toppinen, P. Simmonds, K. Hedman, A. Sajantila, M. F. Perdomo. Intra-host genomic diversity and integration landscape of human tissue-resident DNA virome. Nucleic Acids Research, 2024.
L. Hannolainen, L. Pyöriä, D. Pratas, J. Lohi, S. Skuja, S. Rasa-Dzelzkaleja, M. Murovska, K. Hedman, T. Jahnukainen, M. F. Perdomo. Perinnöllinen herpesvirus elinsiirron kiusana [Inherited herpesvirus as a nuisance in organ transplantation]. Duodecim, 2024.
L. Hannolainen, L. Pyöriä, D. Pratas, J. Lohi, S. Skuja, S. Rasa-Dzelzkaleja, M. Murovska, K. Hedman, T. Jahnukainen, M. F. Perdomo. Reactivation of a transplant recipient’s inherited human herpesvirus 6 and implications to the graft. The Journal of Infectious Diseases, 2024.
J. M. Silva, W. Qi, A. J. Pinho, D. Pratas*. AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data. GigaScience, 2023.
J. Carneiro, F. Pascoal, M. Semedo, D. Pratas, M. P. Tomasino, A. Rego, M. F. Carvalho, A. P. Mucha, C. Magalhães. Mapping human pathogens in wastewater using a metatranscriptomic approach. Environmental Research, 2023.
J. Carneiro, R. P. Magalhães, V. M. de la Oliva Roque, M. Simões, D. Pratas, S. Sousa. TargIDe: a machine-learning workflow for target identification of molecules with antibiofilm activity against Pseudomonas aeruginosa. Journal of Computer-Aided Molecular Design, 2023.
L. Pyöriä, D. Pratas, M. Toppinen, K. Hedman, A. Sajantila, M. F. Perdomo. Unmasking the Tissue-Resident Eukaryotic DNA Virome in Humans. Nucleic Acids Research, 2023.
L. Pyöriä, D. Pratas, M. Toppinen, K. Hedman, A. Sajantila, M. F. Perdomo. Elimistömme on lukuisten terveyteemme vaikuttavien virusten koti [Our body is home to numerous viruses that affect our health]. Duodecim, 2023.
M. K. Jauhiainen, U. Mohanraj, M. Lehecka, M. Niemelä, T. P. Hirvonen, D. Pratas, M. F. Perdomo, M. Söderlund-Venermo, A. A. Mäkitie, S. T. Sinkkonen. Herpesviruses, polyomaviruses, parvoviruses, papillomaviruses, and anelloviruses in vestibular schwannoma. Journal of NeuroVirology, 2023.
J. M. Silva, D. Pratas, T. Caetano, S. Matos. The complexity landscape of viral genomes. GigaScience, 2022.
W. Qi, Y. Lim, A. Patrignani, P. Schläpfer, A. Bratus-Neuenschwander, S. Grüter, C. Chanez, N. Rodde, E. Prat, S. Vautrin, M. Fustier, D. Pratas, R. Schlapbach, W. Gruissem. The haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar reveal novel pan-genome and allele-specific transcriptome features. GigaScience, 2022.
O. I. Mielonen, D. Pratas, K. Hedman, A. Sajantila, M. F. Perdomo. Detection of Low-Copy Human Virus DNA upon Prolonged Formalin Fixation. Viruses, 2022.
M. Toppinen, A. Sajantila, D. Pratas, K. Hedman, M. F. Perdomo. The Human Bone Marrow Is Host to the DNAs of Several Viruses. Frontiers in Cellular and Infection Microbiology, 2021.
J. Monteiro, D. Pratas, A. Videira, F. Pereira. Revisiting the Neurospora crassa mitochondrial genome. Letters in Applied Microbiology, 2021.
M. Silva*, D. Pratas*, A. J. Pinho. AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models. Entropy, 2021.
J. M. Silva, D. Pratas, R. Antunes, S. Matos, A. J. Pinho. Automatic analysis of artistic paintings using information-based measures. Pattern Recognition, 2021.
J. R. Almeida, D. Pratas, J. L. Oliveira. A semi-automatic methodology for analysing distributed and private biobanks. Computers in Biology and Medicine, 2021.
D. Pratas*, J. M. Silva. Persistent minimal sequences of SARS-CoV-2. Bioinformatics, 2020.
D. Pratas*, M. Toppinen, L. Pyöriä, K. Hedman, A. Sajantila, M. F. Perdomo. A hybrid pipeline for reconstruction and analysis of viral genomes at multi-organ level. GigaScience, 2020.
M. Silva*, D. Pratas*, A. J. Pinho. Efficient DNA sequence compression with neural networks. GigaScience, 2020.
M. Toppinen, D. Pratas, E. Väisänen, M. Söderlund-Venermo, K. Hedman, M. F. Perdomo, A. Sajantila. The landscape of persistent human DNA viruses in femoral bone. Forensic Science International: Genetics, 2020.
M. Hosseini, D. Pratas, B. Morgenstern, A. J. Pinho. Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements. GigaScience, 2020.
J. R. Almeida, A. J. Pinho, J. L. Oliveira, O. Fajarda, D. Pratas. GTO: A toolkit to unify pipelines in genomic and proteomic research. SoftwareX, 2020.
J. M. Silva, E. Pinho, S. Matos, D. Pratas. Statistical Complexity Analysis of Turing Machine Tapes with Fixed Algorithmic Complexity Using the Best-Order Markov Model. Entropy, 2020.
D. Pratas*, M. Hosseini, J. M. Silva, A. J. Pinho. A reference-free lossless compression algorithm for DNA sequences using a competitive prediction of two classes of weighted models. Entropy, 2019.
M. Hosseini, D. Pratas, A. J. Pinho. Cryfa: a secure encryption tool for genomic data. Bioinformatics, 2019.
M. Hosseini, D. Pratas, A. J. Pinho. AC: A Compression Tool for Amino Acid Sequences. Interdisciplinary Sciences: Computational Life Sciences, 2019.
J. M. Carvalho, S. Brás, D. Pratas, J. Ferreira, S. C. Soares, A. J. Pinho. Extended-Alphabet Finite-Context Models. Pattern Recognition Letters, 2018.
D. Pratas*, M. Hosseini, G. Grilo, A. J. Pinho, R. Silva, T. Caetano, J. Carneiro, F. Pereira. Metagenomic Composition Analysis of an Ancient Sequenced Polar Bear Jawbone from Svalbard. Genes, 2018.
D. Pratas*, R. Silva, A. J. Pinho. Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes. Entropy, 2018.
M. Hosseini, D. Pratas, A. J. Pinho. A Survey on Data Compression Methods for Biological Sequences. Information, 2016.
D. Pratas*, R. Silva, A. J. Pinho, P. J. S. G. Ferreira. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Scientific Reports, 2015.
R. Silva*, D. Pratas*, L. Castro, A. J. Pinho, P. J. S. G. Ferreira. Three minimal sequences found in Ebola virus genomes and absent from human DNA. Bioinformatics, 2015.
L. Matos, A. J. R. Neves, D. Pratas, A. J. Pinho. MAFCO: A compression tool for MAF files. PLoS ONE, 2015.
A. J. Pinho, D. Pratas. MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics, 2014.
D. Pratas*, A. J. Pinho, J. M. O. S. Rodrigues. XS: a FASTQ read simulator. BMC Research Notes, 2014.
A. J. Pinho, S. P. Garcia, D. Pratas, P. J. S. G. Ferreira. DNA sequences at a glance. PLoS ONE, 2013.
L. Matos, D. Pratas, A. J. Pinho. A compression model for DNA multiple sequence alignment blocks. IEEE Transactions on Information Theory, 2013.
S. P. Garcia, J. M. O. S. Rodrigues, S. Santos, D. Pratas, V. Afreixo, C. A. C. Bastos, P. J. S. G. Ferreira, A. J. Pinho. A genomic distance for assembly comparison based on compressed maximal exact matches. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2013.
A. J. Pinho, D. Pratas, S. P. Garcia. GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Research, 2012.

International Conference Articles:

D. Yamunaque, A. J. Pinho, A. Sajantila, D. Pratas. A Machine Learning Method for Authentication of Human Ancient Mitochondrial DNA. IbPRIA 2025, Coimbra, Portugal, July 2025.
D. Lei, D. Yamunaque, A. J. Pinho, D. Pratas. ECOmpress: A web tool for boosting energy efficiency through data compression. IbPRIA 2025, Coimbra, Portugal, July 2025.
A. J. Pinho, D. Pratas. Optimization of data compression parameters using genetic algorithms. DCC 2025, Snowbird, United States, March 2025.
L. Almeida, P. Rodrigues, D. Magalhães, A. J. Pinho, D. Pratas. AIDetx: a compression-based method for identification of machine-learning generated text. DCC 2025, Snowbird, United States, March 2025.
M. J. P. Sousa, A. J. Pinho, D. Pratas. Improving the generation of viral consensus sequences using adaptive models. EUSIPCO 2024, Lyon, France, August 2024.
M. J. P. Sousa, A. J. Pinho, D. Pratas*. A sensitive compression-based method for filtering targeted FASTQ sequencing reads. EUSIPCO 2024, Lyon, France, August 2024.
T. Fonseca, M. J. P. Sousa, A. J. Pinho, D. Pratas. A sorting tool for improving FASTA data compression tools. EUSIPCO 2024, Lyon, France, August 2024.
A. J. Pinho, D. Pratas. Copy models for protein sequence compression. DCC 2024, Snowbird, United States, March 2024.
D. Pratas*, A. J. Pinho. An experimental sorting method for improving metagenomic data encoding. DCC 2024, Snowbird, United States, March 2024.
M. J. P. Sousa, D. Pratas. A method for improving the generation of consensus sequences. Workshop on Informatics Engineering Research, Porto, Portugal, 2024.
J. M. Silva, D. Pratas, S. Matos. Exploring Kolmogorov Complexity Approximations for Data Analysis: Insights and Applications. DoCEIS 2023, Caparica, Portugal, pp. 161–174, 2023.
D. Pratas*, A. J. Pinho. JARVIS2: a data compressor for large genome sequences. DCC 2023, Snowbird, United States, March 2023.
J. M. Silva, D. Pratas, T. Caetano, S. Matos. Feature-Based Classification of Archaeal Sequences Using Compression-Based Methods. IbPRIA 2022, Aveiro, Portugal, May 2022.
M. Hosseini, D. Pratas, A. J. Pinho. A probabilistic method to find and visualize distinct regions in protein sequences. EUSIPCO 2019, A Coruña, Spain, September 2019.
D. Pratas*, M. Hosseini, A. J. Pinho. GeCo2: An optimized tool for lossless compression and analysis of DNA sequences. PACBB 2019, Ávila, Spain, June 2019.
D. Pratas*, M. Hosseini, A. J. Pinho. Visualization of similar primer and adapter sequences in assembled archaeal genomes. PACBB 2019, Ávila, Spain, June 2019.
A. J. Pinho, D. Pratas. An application of data compression models to handwritten digit classification. ACIVS 2018, September 2018.
D. Pratas*, A. J. Pinho. Metagenomic composition analysis of sedimentary ancient DNA from the Isle of Wight. EUSIPCO 2018, Rome, Italy, September 2018.
D. Pratas*, A. J. Pinho. A DNA sequence corpus for compression benchmark. PACBB 2018, Toledo, Spain, June 2018.
D. Pratas*, M. Hosseini, A. J. Pinho. Compression of amino acid sequences. PACBB 2018, Toledo, Spain, June 2018.
M. Gaspar, D. Pratas, A. J. Pinho. NET-ASAR: a tool for DNA sequence search based on data compression. PACBB 2018, Toledo, Spain, June 2018.
D. Pratas*, M. Hosseini, A. J. Pinho. Cryfa: a tool to compact and encrypt FASTA files. PACBB 2017, Porto, Portugal, June 2017.
M. Hosseini, D. Pratas, A. J. Pinho. On the role of inverted repeats in DNA sequence similarity. PACBB 2017, Porto, Portugal, June 2017.
D. Pratas*, M. Hosseini, A. J. Pinho. Substitutional tolerant Markov models for relative compression of DNA sequences. PACBB 2017, Porto, Portugal, June 2017.
D. Pratas*, A. J. Pinho. On the approximation of the Kolmogorov complexity for DNA sequences. IbPRIA 2017, Faro, Portugal, June 2017.
D. Pratas*, M. Hosseini, R. Silva, A. J. Pinho, P. J. S. G. Ferreira. Visualization of distinct DNA regions of the modern human relative to a Neanderthal genome. IbPRIA 2017, Faro, Portugal, June 2017.
A. J. Pinho, D. Pratas, P. J. S. G. Ferreira. Authorship attribution using relative compression. DCC 2016, Snowbird, United States, March 2016.
D. Pratas*, A. J. Pinho, P. J. S. G. Ferreira. Efficient compression of genomic sequences. DCC 2016, Snowbird, United States, March 2016.
A. J. Pinho, D. Pratas, P. J. S. G. Ferreira. A new compressor for measuring distances among images. ICIAR 2014, Vilamoura, Portugal, October 2014.
D. Pratas*, A. J. Pinho. Exploring deep Markov models in genomic data compression using sequence pre-analysis. EUSIPCO 2014, Lisbon, Portugal, September 2014.
D. Pratas*, A. J. Pinho. A conditional compression distance that unveils insights of the genomic evolution. DCC 2014, Snowbird, United States, March 2014.
A. J. Pinho, D. Pratas, P. J. S. G. Ferreira. Information profiles for DNA pattern discovery. DCC 2014, Snowbird, United States, March 2014.
S. P. Garcia, J. M. O. S. Rodrigues, D. Pratas, A. J. Pinho. Comparing maximal exact repeats in human genome assemblies using a normalized compression distance. ISMB 2012, Long Beach, United States, July 2012.
L. Matos, D. Pratas, A. J. Pinho. Compression of whole genome alignments using a mixture of finite-context models. ICIAR 2012, Aveiro, Portugal, June 2012.
D. Pratas*, A. J. Pinho. On the Detection of Unknown Locally Repeating Patterns in Images. ICIAR 2012, Aveiro, Portugal, June 2012.
D. Pratas*, A. J. Pinho, S. P. Garcia. Exon: A web-based software toolkit for DNA sequence analysis. PACBB 2012, Salamanca, Spain, March 2012.
D. Pratas*, A. J. Pinho, S. P. Garcia. Computation of the normalized compression distance of DNA sequences using a mixture of finite-context models. Bioinformatics 2012, Vilamoura, Portugal, February 2012.
A. J. Pinho, D. Pratas, S. P. Garcia. Complexity profiles of DNA sequences using finite-context models. USAB 2011, Graz, Austria, November 2011.
A. J. Pinho, D. Pratas, P. J. S. G. Ferreira, S. P. Garcia. Symbolic to numerical conversion of DNA sequences using finite-context models. EUSIPCO 2011, Barcelona, Spain, August 2011.
A. J. Pinho, D. Pratas, P. J. S. G. Ferreira. Bacteria DNA sequence compression using a mixture of finite-context models. SSP 2011, Nice, France, June 2011.
D. Pratas*, C. A. C. Bastos, A. J. Pinho, A. J. R. Neves, L. Matos. DNA synthetic sequences generation using multiple competing Markov models. SSP 2011, Nice, France, June 2011.
D. Pratas*, A. J. Pinho. Compressing the human genome using exclusively Markov models. PACBB 2011, Salamanca, Spain, April 2011.

National Conference Articles

M. J. P. Sousa, M. F. Perdomo, D. Pratas. A method to analyse and classify viral sequences accurately. RecPad 2025, Aveiro, Portugal, October 2025.
J. Contente, A. Martins, D. Pratas, A. J. Pinho, S. Gouveia. An approach to enhance compression of categorical time series. RecPad 2025, Aveiro, Portugal, October 2025.
A. Martins, D. Pratas, A. Pinho, S. Gouveia. Finding relevant features to enhance compression of categorical time series. RecPad 2024, Covilhã, Portugal, October 2024.
M. J. P. Sousa, D. Pratas. A method for accurate reconstruction of persistent human viral sequences. RecPad 2023, Coimbra, Portugal, October 2023.
M. J. P. Sousa, D. Pratas. A survey on computational tools for human viral genomes reconstruction. RecPad 2022, Leiria, Portugal, October 2022.
M. J. P. Sousa, R. Ferrolho, T. Fonseca, A. J. Pinho, D. Pratas. Improving the compression of a complete Telomere-to-Telomere (T2T) human genome sequence. RecPad 2022, Leiria, Portugal, October 2022.
J. M. Silva, D. Pratas, T. Caetano, S. Matos. Archaea Taxonomic Classification. RecPad 2021, Évora, Portugal, November 2021.
J. M. Silva, D. Pratas, S. Matos. Comparison and Evaluation of Information-based Measures in Images. RecPad 2020, Évora, Portugal, October 2020.
M. Hosseini, D. Pratas, A. J. Pinho. Clustering DNA sequences by relative compression. RecPad 2019, Porto, Portugal, October 2019.
J. M. Silva, D. Pratas, S. Matos. Evaluation of Statistical Complexity in Viral Genome Sequences. RecPad 2019, Porto, Portugal, October 2019.
M. Hosseini, D. Pratas, A. Amorim, J. Carneiro. Improving the detection of mtDNA rearrangements using a fast and accurate algorithm. ENBE 2019, Porto, Portugal, November 2019.
A. Teixeira, D. Pratas, A. J. Pinho, R. Silva. Evolutionary insights from the comparative analysis of hominid genomes. RecPad 2018, Coimbra, Portugal, October 2018.
C. Figueiredo, D. Pratas, A. J. Pinho, R. Silva. Identification of antifungal targets using alignment-free methods. RecPad 2018, Coimbra, Portugal, October 2018.
D. Pratas*, R. Silva, A. J. Pinho, P. J. S. G. Ferreira. Detection and visualization of regions of human DNA not present in other primates. RecPad 2015, Faro, Portugal, October 2015.
D. Pratas*, R. Silva, A. J. Pinho. Large-scale inversions between human reference assemblies. RecPad 2014, Covilhã, Portugal, October 2014.
R. Silva, L. Castro, D. Pratas, A. J. Pinho. Towards personalized medicine: ebola virus absent words in the human genome. RecPad 2014, Covilhã, Portugal, October 2014.
D. Pratas*, A. J. Pinho. Insights into primates genomic evolution using a compression distance. RecPad 2013, Lisbon, Portugal, November 2013.
D. Pratas*, A. J. Pinho. On the compression of FASTQ quality-scores. RecPad 2012, Coimbra, Portugal, October 2012.
D. Pratas*, A. J. Pinho. M6: a method for compressing complete genomes using Markov models. DSIE 2012, Porto, Portugal, January 2012.
D. Pratas*, S. P. Garcia, A. J. Pinho. Analysis of patterns in S. pombe genome through compression-based complexity profiles. RecPad 2011, Porto, Portugal, October 2011.
D. Pratas*, A. J. Pinho. Analysis of DNA sequences using finite-context modelling and compression. RecPad 2010, Vila Real, Portugal, October 2010.
D. Pratas*, A. J. Pinho, A. J. R. Neves, C. A. C. Bastos. DNA synthetic sequences generated by finite-context models. RecPad 2010, Vila Real, Portugal, October 2010.

Book Chapters

A. J. Pinho, D. Pratas, S. P. Garcia. Compressing resequencing data with GReEn. In: Deep Sequencing Data Analysis, ed. Noam Shomron, Humana Press (Methods in Molecular Biology, Vol. 1038), pp. 27–37, July 2013.

Other

P. J. N. Pinto, A. J. Pinho, D. Pratas*. Decoding The Past: Explainable Machine Learning Models For Dating Historical Texts. bioRxiv, 2025.
L. L. Marques, A. J. Pinho, D. Pratas*. Metagenomic Classification of Ancient Viruses. bioRxiv, 2025.
B. Simões, A. J. Pinho, D. Pratas*. Automated Annotation of Plant Gene Regions Using Supervised Machine Learning. bioRxiv, 2025.
R. Ferrolho, A. J. Pinho, D. Pratas*. Optimizing Genomic Data Compression with Genetic Algorithms. bioRxiv, 2025.
M. J. P. Sousa, M. Toppinen, L. Pyöriä, K. Hedman, A. Sajantila, M. F. Perdomo, D. Pratas*. Comparative evaluation of computational methods for reconstruction of human viral genomes. bioRxiv, 2025.
D. Pratas*, A. J. Pinho, R. M. Silva, J. M. O. S. Rodrigues, M. Hosseini, T. Caetano, P. J. S. G. Ferreira. FALCON-meta: A method to infer metagenomic composition of ancient DNA. bioRxiv, 2018.

Books

D. Figueiredo, C. Martín-Vide, D. Pratas, M. A. Vega-Rodríguez. Algorithms for Computational Biology. Springer International Publishing, June 2017.
D. Pratas. Compression and analysis of genomic data. PhD Thesis, University of Aveiro, Portugal, 2016.

Outreach

'Pipeline of pipelines' to improve viral genome reconstruction in a clinical setting

UAveiro, 15-01-2026

Paintings: UAveiro software identifies authors and classifies artistic movements

UAveiro, 29-07-2022

UAveiro paves the way for effective treatment of the Ebola virus

UAveiro, 28-07-2015

Resources

Software

TRACESPipe is a hybrid pipeline for efficient reconstruction and analysis of viral and host genomes that can be applied at the multi-organ level for clinical or ancient DNA (aDNA) purposes.

AltaiR is a fast, flexible C toolkit for alignment-free temporal analysis of multi-FASTA data in large genomic collections, suitable for use in potential epidemic scenarios.

AlcoR is an alignment-free toolkit for analysis of low-complexity regions in biological data, supporting mapping, masking, simulation and visualization.

JARVIS3 is an efficient lossless data (de)compression tool tailored for genomic sequences, with an extension for compressing FASTA and FASTQ data.

FALCON is an ultra-fast method to infer metagenomic composition of sequenced reads while minimizing false positives and maximizing accuracy.

SPARK is a toolkit to simulate, search, and analyze exact or approximate Turing Machines using alignment-free methods and colorized visualizations.

CRYFA is an ultrafast encryption tool designed specifically for genomic data; it can also compress FASTA/FASTQ data by a factor of three.

SMASH is a compression-based and alignment-free method to automatically find and visualise rearrangements between pairs of DNA sequences.

AC is a lossless compressor for amino acid sequences (proteins) that uses cooperation between multiple context models.

GeCo3 is an efficient lossless genomic compressor that uses a neural network for expert mixing. It supports relative and conditional compression.

Chester is a probabilistic tool that uses Bloom filters to map and visualize evolutionary regions (relative singularity regions).

More software is available on GitHub.

Datasets

DNA sequence corpus in sequence format (ACGT)

Amino acid sequence corpus (sequence only)

Human virus genome sequences (FASTA format)

XRF, UV-VIS, and Weight Data for 250 Portuguese Coins

Classes

Algorithmic Information Theory

Algorithmic Information Theory (AIT, or TAI in Portuguese) is a field at the intersection of computer science, mathematics, and information theory that explores how information can be measured, represented, and processed using algorithms. It provides a deep understanding of concepts such as data compression, randomness, machine learning, and the limits of computation. AIT equips students with fundamental tools to reason about what information is, how to model sources of data, and how efficiently data can be encoded or processed. These ideas are essential for anyone working in areas like machine learning, cryptography, data science, and theoretical computer science. The AIT course at the University of Aveiro combines both theoretical and practical components. In addition to lectures, students engage in hands-on learning through three group-based practical assignments, where they apply the concepts studied in class to real problems. This structure helps students to consolidate their understanding and develop collaborative problem-solving skills that are valuable in both research and industry.
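The link between compression and information measurement that the course builds on can be illustrated with the normalized compression distance (NCD), a standard compression-based approximation of a distance defined through Kolmogorov complexity. The following is a minimal sketch, assuming zlib as a stand-in general-purpose compressor (not a tool named by the course); the function names are illustrative.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a crude complexity proxy)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    repetitive = b"ACGT" * 500
    rng = random.Random(0)  # fixed seed so the run is repeatable
    noise = bytes(rng.getrandbits(8) for _ in range(2000))
    # A sequence is expected to be closer to itself than to unrelated noise.
    print("NCD(repetitive, repetitive):", ncd(repetitive, repetitive))
    print("NCD(repetitive, noise):     ", ncd(repetitive, noise))
```

Values near 0 suggest that one input is largely predictable from the other, while values near 1 suggest little shared structure. A stronger compressor generally gives a tighter approximation.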

Challenge

Human genome sequence compression

The challenge is to develop a data compressor that improves the lossless, reference-free minimal representation of a complete human genome sequence, specifically version 2.0 of the T2T-CHM13 human genome. This fully sequenced genome is described in detail in the corresponding scientific article, which serves as the foundation for the task. Participants are invited to propose innovative compression methods that advance the state of the art, with results compared against existing solutions on the current leaderboard. The compressor and the decompressor must each complete within 48 hours (wall-clock time) and use no more than 32 GB of RAM at peak. The leaderboard is currently led by the JARVIS series, specifically JARVIS3 and JARVIS2. These DNA-specific compressors are capable of reducing the size of the human genome by nearly one-third. JARVIS3 achieves approximately 7% better compression than nncp, a Transformer-based general-purpose data compression tool.
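One way to check a candidate against the time and memory limits above is to wrap each run and read back the child process's peak resident set size. The sketch below is an illustration under stated assumptions, not the challenge's official measurement procedure: it is Unix-only, assumes Linux semantics where ru_maxrss is reported in KiB, reads "32 GB" as 32 × 10^9 bytes, and uses a placeholder command in place of a real compressor.

```python
import resource
import subprocess
import sys
import time

MAX_SECONDS = 48 * 3600      # 48 hours of wall-clock time
MAX_RAM_BYTES = 32 * 10**9   # assumption: "32 GB" read as decimal gigabytes


def run_within_limits(cmd):
    """Run cmd and return (elapsed_seconds, peak_bytes, within_ram_limit)."""
    start = time.monotonic()
    # subprocess.TimeoutExpired is raised if the wall-clock limit is exceeded.
    subprocess.run(cmd, check=True, timeout=MAX_SECONDS)
    elapsed = time.monotonic() - start
    # Largest peak RSS among children waited for so far by this process;
    # reported in KiB on Linux (macOS reports bytes instead).
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    peak_bytes = peak_kib * 1024
    return elapsed, peak_bytes, peak_bytes <= MAX_RAM_BYTES


if __name__ == "__main__":
    # Placeholder command; substitute the real compressor or decompressor call.
    elapsed, peak, ok = run_within_limits([sys.executable, "-c", "pass"])
    print(f"{elapsed:.2f} s, {peak / 1e9:.3f} GB peak, within RAM limit: {ok}")
```

Because RUSAGE_CHILDREN reports the maximum over all children waited for so far, measuring the compressor and the decompressor in separate invocations keeps their peaks apart.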

Cassava genome sequence compression

The task is to build a lossless, reference-free compressor that minimizes the representation of the cassava TME204 genome (762,392,783 DNA symbols). The target assembly is the one linked in the repository (article and sequence). Results are ranked by smallest compressed size in bytes (a proxy for Kolmogorov complexity), and a baseline of 2 bits per symbol is used to report a compression factor. Runtime and peak RAM are recorded for transparency; the reference runs are executed on a Linux desktop (Intel i7-6700, 31.2 GiB RAM). Participants are invited to propose innovative methods and compare them against the current leaderboard provided in the repository. At present, the top entries are JARVIS2 variants, with the best run compressing TME204 to 69,078,516 bytes (0.7249 bits per symbol, about 64% smaller than the 2-bit baseline) in about 193 minutes using ~11.3 GB of RAM, outperforming general-purpose compressors and prior genome-specific tools (e.g., GeCo3, NAF) on this dataset. Reproducible scripts are provided (Install_Tools.sh, GetCassava.sh, RunCassava.sh) to install the tools, fetch the data, and rerun the benchmark.
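The reported 0.7249 bits per symbol and the roughly 64% figure can be reproduced from the two sizes given above. The following is a minimal sketch, assuming the percentage denotes the size reduction relative to the 2-bits-per-symbol baseline (an interpretation that matches the quoted number, not a definition taken from the repository); the function names are illustrative.

```python
SYMBOLS = 762_392_783            # DNA symbols in the TME204 sequence
COMPRESSED_BYTES = 69_078_516    # best reported compressed size
BASELINE_BITS_PER_SYMBOL = 2.0   # four nucleotides -> 2 bits each, uncompressed


def bits_per_symbol(compressed_bytes: int, symbols: int) -> float:
    """Average number of bits spent per DNA symbol."""
    return 8 * compressed_bytes / symbols


def size_reduction(bps: float, baseline: float = BASELINE_BITS_PER_SYMBOL) -> float:
    """Fraction by which the compressed size undercuts the baseline encoding."""
    return 1 - bps / baseline


bps = bits_per_symbol(COMPRESSED_BYTES, SYMBOLS)
print(f"{bps:.4f} bits per symbol")                            # 0.7249
print(f"{size_reduction(bps):.1%} smaller than the baseline")  # 63.8%
```

The same two functions apply to any leaderboard entry, since only the compressed size in bytes changes between runs.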

Contact

Address

IEETA, University of Aveiro,
Campus Universitário Santiago, 3810-193
Aveiro, Portugal.

Email

pratas@ua.pt

Phone

+351 234 370 506

Affiliation

IEETA/LASI, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Aveiro, Portugal.
DETI, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal.
DV, Department of Virology, University of Helsinki, Helsinki, Finland.