Ensembl Canonical transcript
The Ensembl Canonical transcript is a single, representative transcript identified at every locus. For accurate analysis, we recommend that more than one transcripts at a locus may need to be considered, however, we designate a single Ensembl Canonical transcript per locus to provide consistency when only one transcript is required e.g. for initial display on the Ensembl (or other) website and for use in genome-wide calculations e.g. the Ensembl gene tree analysis.
For protein-coding genes, we aim to identify the transcript that, on balance, has the highest coverage of conserved exons, highest expression, longest coding sequence and is represented in other key resources, such as NCBI and UniProt. To identify this transcript, we consider, where available, evidence of functional potential (such as evolutionary conservation of a coding region, transcript expression levels), transcript length and evidence from other resources (such as concordance with the APPRIS1 ’principal isoform’ and with the UniProt/Swiss-Prot ‘canonical isoform’).
Computational pipeline details for the Ensembl Canonical
The algorithm uses the data sources listed below. A score is assigned for each component where data are available for a species. While the pipeline has been developed in a species-agnostic manner, currently, the complete set of input data are available only for human and mouse. However, the recent update of the reference mouse assembly to GRCm38 requires that all data be remapped and analysis be recalculated on the new reference, which prevents the pipeline from being run on mouse for the time being. The pipeline can be run successfully using only a partial subset of the data and analysis, however, reference assembly updates for other species (e.g. rat) have again reduced the practicality of running the pipeline on any non-human genome. In time, we anticipate that suitable data will be available for more species. We also note that the majority of data included in the pipeline is suited for the analysis of protein-coding genes. As a greater understanding of the functionally important regions of lncRNAs is gained, it is intended that the pipeline will be extended to include the data types that support their identification.
Algorithm selection
- Conservation:
- PhyloCSF is a comparative genomics method (PMID 21685081) to calculate the evolutionary conservation of protein-coding potential for the CDS portion of each transcript. The number of bases that have a positive value in the PhyloCSF data for each exon of a transcript are counted and divided by the length of each coding exon. This score is then normalised with respect to all transcripts at the locus to produce the “PhyloCSF score”.
- Expression, using two types of data:
- RNA-seq supported intron data (e.g. Intropolis in human (PMID 28038678)) are used to calculate the expression of each transcript. Supporting reads for each intron (RNAseq reads where the alignment is split by the intron) are summed to obtain an overall intron count. The total counts are divided by the number of introns. This value is then normalised with respect to all transcripts at the locus to produce the “Intropolis score”. Intropolis is a compilation of exon-exon junctions and the expression, overall and tissue-specific,associated with these junctions.
- CAGE (Cap Analysis of Gene Expression) data (PMID 24670764) are used to calculate the expression of transcripts that start at different promoters. This deep sequencing technique captures the 5’ ends seen in the sample. The algorithm sums the CAGE read counts that overlap the 5’UTR of each transcript. This value is then normalised with respect to all transcripts at the locus to produce the “CAGE score”.
- Concordance with the APPRIS Principal (P1) CDS isoform:
- The APPRIS database (PMID: 29069475) uses functional annotation and cross-species conservation to select a principal coding isoform. The algorithm assigns a score to all transcripts at the locus whose CDS is identical to the APPRIS’ Principal (P1) CDS isoform. This is the “APPRIS score”.
- Concordance with the UniProt canonical protein isoform:
- UniProt (PMID: 30395287) has defined a canonical protein isoform for all human protein-coding genes. This is only partially available for mouse and rat. For other species there is a single protein representative per proteome. The algorithm assigns a score to all transcripts at the locus that encode the UniProt canonical isoform. This is the “UniProt score”.
- Length, using two considerations:
- CDS length: The algorithm determines the CDS length of each transcript and then normalises the length with respect to all transcripts at the locus to generate the “Length score”.
- Length override: The algorithm determines the CDS length of each transcript and disqualifies any transcript whose CDS length is 75% or less of the longest CDS at the locus. This step aims to avoid conservation bias towards shorter isoforms and the override may result in the selected transcript being the one with the second highest (or lower) overall score.
- Clinical variation (for human):
- The location of variants that have been classified by clinical experts as “Pathogenic” or “Likely Pathogenic” point to regions of a transcript that are functional. The algorithm identifies all pathogenic or likely pathogenic variants from the public domain (e.g. ClinVar, HGMD public) and identifies the transcript(s) that covers the largest number of variants. This information is provided as an algorithm output for manual review by curators.
- Partial transcript status:
- Ensembl/GENCODE models are built to the extent of existing evidence, such that some transcripts are not full-length due to lack of evidence. These partial transcripts may contain unique well-supported coding regions. However, on the very rare occasions that a partial transcript obtains the highest score, the algorithm disqualifies it and selects the highest scoring full-length transcript.
“Default” selection: in the absence of the data above (which currently applies to all non-human genomes), transcripts prioritised accordingly, choosing the one with the longest combined exon length:
- protein coding
- NMD
- non-stop decay
- polymorphic pseudogene
For everything, if required, the final disambiguation step is the lowest stable ID number (i.e. the oldest).
Note: for genes annotated on older reference assemblies, the “default selection” outlined above was used. For example, the Ensembl Canonical for a protein-coding gene on GRCh37 is the transcript with the longest CDS because the Ensembl annotation on GRCh37 has been frozen since September 2013.
Collaborative selection: Collaborative efforts with other research groups, including manual review, may lead to override of the algorithm’s choice (e.g. MANE project described below).
Selection for patches and haplotypes (novel patches): The algorithm is optimised for transcripts annotated on the primary assembly. For genes with known issues in the primary reference assembly, the Ensembl Canonical is chosen from the annotation on fix patches (improved sequences for known assembly errors). For genes with multiple representations in the assembly, an Ensembl Canonical is selected from each alternate sequence or novel patch.
The Matched Annotation from NCBI and EMBL-EBI project (MANE)
For human transcripts, the Ensembl Canonical transcript undergoes additional review, in a collaboration between EMBL-EBI and NCBI to define a joint default transcript set (the MANE project). If approved jointly by manual curators at both groups, the Ensembl Canonical is designated as the MANE Select transcript at that locus. The approved Ensembl Canonical must also perfectly align to the GRCh38 assembly and be identical to the corresponding RefSeq transcript (CDS and both UTRs). Occasionally, this review may result in the designation of a different transcript than the algorithmically selected Ensembl Canonical (i.e. the one with the highest score from the computational selection). Where a MANE Select transcript has not yet been determined, the Ensembl Canonical is presented.
The Ensembl Canonical transcript for non-protein-coding gene biotypes are currently calculated as follows:
- lncRNAs: The Ensembl Canonical is the transcript at the locus with the longest genomic span.
- Pseudogene: The Ensembl Canonical is the transcript at any pseudogene locus with a pseudogene biotype (for example, processed_pseudogene, transcribed_processed_pseudogene).
- Small RNAs: The Ensembl Canonical is the transcript at any small RNA locus with a small RNA biotype(eg miRNA, rRNA) - genes of this biotype generally only have one transcript per locus
- IG/TR: The Ensembl Canonical is the transcript at any IG/TR locus with an IG/TR biotype (eg IG_C_gene, TR_J_gene, IG_V_pseudogene) - genes of this biotype generally only have one transcript per locus. 1. Rodriguez, J. M. et al. APPRIS: annotation of principal and alternative splice isoforms. Nucleic Acids Res. 41, D110–D117 (2013). 2. Nellore, A. et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol. 17, 266 (2016).