In the EPO pipeline, Enredo, Pecan and Ortheus are just the tip of the iceberg. Together, these methods rely on homology and synteny information that is derived from anchor sequence alignments. This page provides a summary of what the anchors are, how they are generated, and how they are used.
Anchor generation
The anchors represent short regions of conservation, typically about 100 bases in length, from a phylogenetically representative subset of the species we wish to align. As such, anchors only have to be regenerated when the subset of species used is no longer representative.
The anchor set is generated from pairwise alignments (LastZ-Net) of each non-reference species to a selected reference species. All the species chosen to generate the anchors must have a pairwise alignment with the selected reference species. The pairwise alignments are stacked together based on their coordinates on the reference, and the resulting regions are realigned with Pecan. GERP is then used to identify conserved regions, from which the anchor sequences are defined.
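The last step above, going from conservation scores to candidate anchor regions, can be illustrated with a minimal sketch. This is not GERP and not the pipeline's own code: it assumes a hypothetical list of per-column conservation scores for one realigned region, and the function name `conserved_runs`, the `threshold` and the `min_len` parameters are all made up for illustration.

```python
from typing import List, Tuple


def conserved_runs(
    scores: List[float], threshold: float, min_len: int
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column intervals in which every
    score is >= threshold and the run is at least min_len columns long.

    Hypothetical stand-in for a conservation caller; the real pipeline
    uses GERP for this step.
    """
    runs: List[Tuple[int, int]] = []
    start = None  # start column of the run currently being extended
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at column i; keep it only if it is long enough.
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the region.
    if start is not None and len(scores) - start >= min_len:
        runs.append((start, len(scores)))
    return runs
```

With anchors of about 100 bases in mind, `min_len` would be set accordingly; the toy values in a quick check can be much smaller.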
We consider a good anchor set to contain hundreds of thousands of anchors, so that the genomes are covered as completely as possible.
Anchor mapping
The anchor set is mapped (currently we use exonerate) against all the genomes we wish to include in the final alignment. Overlapping anchors are filtered so that any particular genomic location is associated with, at most, one anchor.
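The overlap filter can be sketched as follows. The text only states the outcome (at most one anchor per genomic location), so the rule used here, keeping the highest-scoring hit among overlapping ones, is an assumption, as are the `Hit` record and the `filter_overlaps` function name.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One mapped anchor (hypothetical record layout)."""
    anchor_id: str
    seq: str      # chromosome or scaffold name
    start: int    # inclusive
    end: int      # exclusive
    score: float


def filter_overlaps(hits: List[Hit]) -> List[Hit]:
    """Greedily keep hits, best score first, dropping any hit that
    overlaps an already kept hit on the same sequence."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        clashes = any(
            hit.seq == k.seq and hit.start < k.end and k.start < hit.end
            for k in kept
        )
        if not clashes:
            kept.append(hit)
    # Return in genomic order for readability.
    return sorted(kept, key=lambda h: (h.seq, h.start))
```

The pairwise check is quadratic and is kept for clarity only; with hundreds of thousands of anchors, a sort-and-sweep or interval tree would be the more realistic choice.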
This step is computationally expensive. However, because it runs independently for each genome, it can be modelled as a cumulative process: assuming the anchors have not changed, we only need to run it for the new assemblies, and can reuse the mappings from existing assemblies.
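The reuse of existing mappings amounts to a simple cache keyed by assembly. The sketch below assumes a hypothetical `map_fn` callable that performs the expensive mapping for one assembly and an in-memory dictionary as the store; how the actual pipeline stores and looks up mappings is not described here.

```python
from typing import Callable, Dict, Iterable, List


def update_mappings(
    assemblies: Iterable[str],
    cache: Dict[str, List[str]],
    map_fn: Callable[[str], List[str]],
) -> List[str]:
    """Map the anchors only against assemblies missing from the cache.

    Returns the names of the assemblies that were actually mapped.
    Valid only while the anchor set itself is unchanged; a new anchor
    set would require starting from an empty cache.
    """
    newly_mapped: List[str] = []
    for name in assemblies:
        if name not in cache:
            cache[name] = map_fn(name)  # the expensive step
            newly_mapped.append(name)
    return newly_mapped
```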
Genome alignment
Enredo extracts a list of homologous genomic regions from the positions where the anchors have mapped. These homologous regions are then aligned with Pecan or Ortheus. Ortheus uses Pecan to align the sequences and additionally generates a consensus sequence for each ancestral node in a tree. This ancestral sequence is predicted for each aligned region.