Transcript flags
Transcript flags can be used to identify the highest quality or most relevant transcripts for your studies. We use these flags:
- MANE Select
- Ensembl Canonical
- Transcript support level (TSL)
- APPRIS
- GENCODE Basic
- 5' and 3' incomplete
MANE (Matched Annotation between NCBI and EBI) Select
To determine the MANE Select transcript, Ensembl and NCBI each independently identify the transcript they believe is the most biologically relevant. Where these match, the transcript is labelled as MANE in both databases. The transcripts are identical in both databases: they have the same splicing structure, the same 5' and 3' UTRs, the same start and end, and a sequence that matches the reference genome.
Ensembl Canonical
The Ensembl Canonical transcript is the one that is, on balance, the most conserved, most highly expressed, has the longest coding sequence and is represented in other key resources, such as NCBI and UniProt.
Transcript support level
The Transcript Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl.
It is important to understand how to assess transcript annotations that you see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative.
TSL Method
The mRNA and EST alignments are compared to the GENCODE transcripts, and the transcripts are scored according to how well the alignment matches over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support for the claim that a GENCODE transcript annotation is actually expressed in humans. Human transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, and BLAT RNA and EST alignments from the UCSC Genome Browser Database, are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments.
Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis.
Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ.
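The intron criteria above can be illustrated with a minimal Python sketch. It is not the GENCODE implementation: the exon tuples, the function names and the decision to treat any extra intron in the evidence as an "unannotated exon" signal are all assumptions made for illustration. Small insertions and deletions are not modelled, because only exon blocks are compared.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # hypothetical (start, end), 1-based inclusive


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return the intron intervals between consecutive exons."""
    ordered = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]


def alignment_supports(annotation: List[Exon], evidence: List[Exon]) -> bool:
    """True if one evidence alignment supports a multi-exon annotation.

    Assumption: support means the evidence intron chain equals the
    annotated intron chain exactly. Transcript start and end may differ,
    because only intron boundaries are compared.
    """
    annotated = introns(annotation)
    if not annotated:
        raise ValueError("single-exon annotations are not evaluated")
    return introns(evidence) == annotated


annotation = [(100, 200), (300, 400), (500, 600)]
print(alignment_supports(annotation, [(150, 200), (300, 400), (500, 580)]))
print(alignment_supports(annotation, [(100, 201), (300, 400), (500, 600)]))
```

The first call differs from the annotation only at the transcript ends, so it counts as support; the second shifts an intron boundary by one base and does not.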
TSL Categories
The following categories are assigned to each of the evaluated annotations:
- tsl1 – all splice junctions of the transcript are supported by at least one non-suspect mRNA
- tsl2 – the best supporting mRNA is flagged as suspect or the support is from multiple ESTs
- tsl3 – the only support is from a single EST
- tsl4 – the best supporting EST is flagged as suspect
- tsl5 – no single transcript supports the model structure
- tslNA – the transcript was not analysed for one of the following reasons:
  - pseudogene annotation, including transcribed pseudogenes
  - human leukocyte antigen (HLA) transcript
  - immunoglobulin gene transcript
  - T-cell receptor transcript
  - single-exon transcript (will be included in a future version)
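One way to read the categories above is as a decision cascade. The sketch below is an interpretation for illustration only, not the actual TSL pipeline: the `Evidence` class, the `assign_tsl` function and the exact order of the checks are assumptions. Each `Evidence` item stands for one alignment that already supports the full intron structure of the transcript.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    kind: str      # "mRNA" or "EST" (hypothetical field)
    suspect: bool  # True if flagged as suspect


def assign_tsl(supporting: List[Evidence], evaluated: bool = True) -> str:
    """Map the supporting alignments of one transcript to a TSL category."""
    if not evaluated:  # e.g. pseudogene, HLA, IG/TR or single-exon transcript
        return "tslNA"
    mrnas = [e for e in supporting if e.kind == "mRNA"]
    ests = [e for e in supporting if e.kind == "EST"]
    good_ests = [e for e in ests if not e.suspect]
    if any(not e.suspect for e in mrnas):
        return "tsl1"  # at least one non-suspect mRNA
    if mrnas or len(good_ests) >= 2:
        return "tsl2"  # only suspect mRNAs, or multiple ESTs
    if len(good_ests) == 1:
        return "tsl3"  # a single EST
    if ests:
        return "tsl4"  # only suspect ESTs
    return "tsl5"      # nothing supports the full structure


print(assign_tsl([Evidence("mRNA", True), Evidence("EST", False)]))
```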
APPRIS
APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes (Rodriguez et al. 2012).
APPRIS tries to select a single CDS variant for each gene as the 'PRINCIPAL' isoform, although this is not always possible. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most significant.
- PRINCIPAL:1 Transcript(s) expected to code for the main functional isoform based solely on the APPRIS core modules.
- PRINCIPAL:2 Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant.
- PRINCIPAL:3 Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants has a distinct CCDS identifier, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated.
- PRINCIPAL:4 Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
- PRINCIPAL:5 Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.
For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way:
- ALTERNATIVE:1 Candidate transcript model(s) that are conserved in at least three tested species.
- ALTERNATIVE:2 Candidate transcript model(s) that appear to be conserved in fewer than three tested species.
Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website.
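When sorting transcripts by their APPRIS label, a small ranking function can help. The sketch below is only an illustration: the `appris_rank` name is hypothetical, and placing every PRINCIPAL label ahead of every ALTERNATIVE label, with untagged ("Minor") transcripts last, is an assumption rather than something APPRIS prescribes.

```python
from typing import Optional, Tuple


def appris_rank(label: Optional[str]) -> Tuple[int, int]:
    """Return a sort key for an APPRIS label; lower sorts first."""
    if not label:
        return (2, 0)  # untagged, treated as "Minor"
    kind, _, level = label.partition(":")
    tiers = {"PRINCIPAL": 0, "ALTERNATIVE": 1}
    if kind not in tiers or not level.isdigit():
        raise ValueError(f"unrecognised APPRIS label: {label!r}")
    return (tiers[kind], int(level))


labels = ["ALTERNATIVE:2", None, "PRINCIPAL:3", "PRINCIPAL:1", "ALTERNATIVE:1"]
print(sorted(labels, key=appris_rank))
```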
GENCODE Basic
The GENCODE collaboration produces the gene sets displayed in Ensembl for human and mouse. GENCODE Comprehensive contains all genes and transcripts in the GENCODE gene set.
GENCODE Basic is a subset of GENCODE Comprehensive. GENCODE Basic contains at least one transcript for every gene in the GENCODE geneset regardless of biotype - i.e. every gene is represented in the GENCODE basic set.
For protein-coding genes, only full-length protein coding transcripts (those that contain a complete CDS from start codon to stop codon) are included in the GENCODE Basic set. For protein-coding genes with one or more full-length transcripts, any partial transcripts (containing CDS_start_NF ('Not Found') and/or CDS_end_NF tags), NMD transcripts and transcripts with no annotated CDS ('retained_intron' and 'processed_transcript' biotypes) within the same gene will be excluded from GENCODE Basic.
For lncRNA genes, GENCODE Basic contains the smallest number of transcripts required to represent 85% of the splice sites in the gene.
For pseudogenes, sRNA and IG/TR there is almost always only 1 transcript per gene and as such that transcript is included in GENCODE Basic.
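To make the lncRNA rule above concrete, here is a rough Python sketch that picks transcripts until 85% of a gene's splice sites are represented. It uses a greedy heuristic, which is an assumption: greedy selection does not guarantee the smallest possible set, and the source does not say how GENCODE computes it. The function name and the input layout (transcript ID mapped to a set of splice-site positions) are hypothetical.

```python
import math
from typing import Dict, List, Set


def greedy_cover(splice_sites: Dict[str, Set[int]], fraction: float = 0.85) -> List[str]:
    """Greedily choose transcripts until `fraction` of all splice sites are covered."""
    all_sites: Set[int] = set().union(*splice_sites.values())
    needed = math.ceil(fraction * len(all_sites))
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(splice_sites)
    while len(covered) < needed:
        # Pick the transcript adding the most uncovered sites;
        # ties go to the alphabetically first transcript ID.
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen


gene = {"T1": {1, 2, 3, 4, 5, 6}, "T2": {5, 6, 7, 8, 9}, "T3": {10}, "T4": {1, 2}}
print(greedy_cover(gene))
```

In this toy gene there are ten splice sites, so nine must be represented; T1 and T2 together cover nine, and T3 and T4 are left out.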
Rules for GENCODE Basic
We worked with GENCODE to decide how to tag transcripts as 'Basic'. These are the rules that we use to tag which transcripts are included in the GENCODE Basic set, for each gene:
1. Loop through all protein-coding (and similar biotype) transcripts and tag all the complete (CDS start- and end found) transcripts. If none of the transcripts are complete, tag only the transcript(s) with the longest CDS.
2. Loop through all the small noncoding and antisense transcripts and tag all the complete (mRNA start- and end found) transcripts. If none are complete, loop through the long-noncoding transcripts too and then tag only the transcript(s) with the longest combined exon length.
3. Combine the results from steps (1) and (2) and this is what is displayed as ‘GENCODE Basic’.
4. If, after step (3), we've got an empty basket and no transcripts in the gene are tagged as 'Basic', we look for pseudogene transcripts and tag all the pseudogene transcripts that we find.
5. Finally, if we've still got no transcripts tagged from steps (1), (2) or (4), then we tag transcripts with 'problematic' biotypes, i.e. retained_intron, TEC, ambiguous_ORF and disrupted_domain.
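The rules above can be sketched in Python as a cascade over the transcripts of one gene. This is a simplified reading, not the production tagging code: the `Transcript` class, its `group` values and the pooling of small noncoding, antisense and long-noncoding transcripts in the second rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Transcript:
    tid: str
    group: str      # "coding", "small_nc", "lncrna", "pseudogene" or "problematic"
    complete: bool  # start and end found
    length: int     # CDS length for coding, combined exon length otherwise


def _longest(ts: List[Transcript]) -> Set[str]:
    """IDs of the transcript(s) sharing the maximum length."""
    if not ts:
        return set()
    top = max(t.length for t in ts)
    return {t.tid for t in ts if t.length == top}


def tag_basic(transcripts: List[Transcript]) -> Set[str]:
    by_group = lambda g: [t for t in transcripts if t.group == g]
    # Rule 1: complete coding transcripts, else the longest CDS.
    coding = by_group("coding")
    basic = {t.tid for t in coding if t.complete} or _longest(coding)
    # Rule 2: complete small noncoding/antisense transcripts, else the
    # longest of those pooled with the long-noncoding transcripts.
    small = by_group("small_nc")
    complete_small = {t.tid for t in small if t.complete}
    basic |= complete_small or _longest(small + by_group("lncrna"))
    # Rule 4: fall back to all pseudogene transcripts.
    if not basic:
        basic = {t.tid for t in by_group("pseudogene")}
    # Rule 5: fall back to transcripts with 'problematic' biotypes.
    if not basic:
        basic = {t.tid for t in by_group("problematic")}
    return basic


gene = [Transcript("A", "coding", True, 900),
        Transcript("B", "coding", False, 1200),
        Transcript("C", "problematic", False, 500)]
print(tag_basic(gene))
```

Here only transcript A is tagged: it is the one complete coding transcript, so the longer but incomplete B and the retained-intron-like C are left out.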
5' and 3' incomplete
If a transcript is annotated from a protein fragment, resulting in a transcript having an incomplete ORF, then it will be labelled as either 5' incomplete if the N terminus is missing, 3' incomplete if the C terminus is missing, or both. This only considers ORFs; this label will not be applied to transcripts which have a start and stop codon but missing or truncated UTRs.
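As a minimal illustration of the labelling above, the following sketch derives the labels from two booleans. The function and parameter names are hypothetical, and equating a missing N terminus with "CDS start not found" (and a missing C terminus with "CDS end not found") is an assumption. UTRs play no part, matching the ORF-only rule.

```python
from typing import List


def incomplete_labels(cds_start_found: bool, cds_end_found: bool) -> List[str]:
    """Return the incompleteness labels for a transcript's ORF."""
    labels = []
    if not cds_start_found:  # N terminus missing
        labels.append("5' incomplete")
    if not cds_end_found:    # C terminus missing
        labels.append("3' incomplete")
    return labels


print(incomplete_labels(cds_start_found=False, cds_end_found=True))
```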