Variation File Format - Definition and supported options

The Variant Effect Predictor (VEP) tool, which appears as an option when you click on Manage your Data, allows you to upload a set of variation data and predict the effect of the variants.

Note that the input and output formats are completely different.

Input format

Data must be supplied in a simple tab-separated format, containing five columns, all required:

  1. chromosome - just the name or number, with no 'chr' prefix
  2. start
  3. end
  4. allele - pair of alleles separated by a '/', with the reference allele first
  5. strand - defined as + (forward) or - (reverse)

For example:

1   881907    881906    -/C   +
5   140532    140532    T/C   +
12  1017956   1017956   T/A   +
2   946507    946507    G/C   +
14  19584687  19584687  C/T   -
19  66520     66520     G/A   +
8   150029    150029    A/T   +
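
The five-column layout above can be checked mechanically before upload. The following is a minimal sketch, not part of the VEP itself: the names InputVariant and parse_input_line are hypothetical, and the function splits on any run of whitespace (an assumption, so that it accepts both real tab-separated files and the space-aligned examples shown here).

```python
from typing import NamedTuple


class InputVariant(NamedTuple):
    """One row of the five-column input format (hypothetical helper type)."""
    chromosome: str
    start: int
    end: int
    allele: str
    strand: str


def parse_input_line(line: str) -> InputVariant:
    """Parse and validate one input line.

    Assumption: columns are separated by tabs or runs of spaces.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    chromosome, start, end, allele, strand = fields
    if chromosome.lower().startswith("chr"):
        raise ValueError("chromosome must be given without a 'chr' prefix")
    if "/" not in allele:
        raise ValueError("alleles must be separated by '/'")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return InputVariant(chromosome, int(start), int(end), allele, strand)
```

Calling parse_input_line on each line of a file and collecting the ValueError messages gives a quick pre-upload check. The rejection of names beginning with 'chr' is a strict reading of the rule in column 1 and may need relaxing for unusual sequence names.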

An insertion is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand of chromosome 8 is indicated as follows:

8   12601     12600     -/C   +

A deletion is indicated by the exact nucleotide coordinates. For example, a three base pair deletion of nucleotides 12600, 12601, and 12602 of the reverse strand of chromosome 8 will be:

8   12600     12602     CGT/- -
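
The coordinate conventions for insertions and deletions can be expressed as a small check. This is a sketch under stated assumptions: the function name classify_variant and the labels it returns ("insertion", "deletion", "substitution", "other") are illustrative and are not terms defined by the VEP.

```python
def classify_variant(start: int, end: int, allele: str) -> str:
    """Classify a variant from its coordinates and 'ref/alt' allele string.

    Hypothetical helper; the returned labels are chosen for illustration.
    """
    ref, _, alt = allele.partition("/")
    span = end - start + 1  # number of reference bases covered

    # Insertion: start = end + 1, with '-' as the reference allele.
    if start == end + 1 and ref == "-":
        return "insertion"
    # Deletion: exact coordinates of the deleted bases, '-' as the variant allele.
    if alt == "-" and ref != "-" and span == len(ref):
        return "deletion"
    # Anything else that replaces the covered bases with other bases.
    if ref != "-" and alt != "-" and span == len(ref):
        return "substitution"
    return "other"
```

For the two examples above, classify_variant(12601, 12600, "-/C") gives "insertion" and classify_variant(12600, 12602, "CGT/-") gives "deletion".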

The following input file formats are also supported:

When using the web VEP, ensure that you have the correct file format selected from the drop-down menu. The VEP script is able to auto-detect the format of the input file.

Output format

For each variation, the tool predicts the consequence, the amino acid position and change (if the variation falls within a protein), and the identifiers of known variations that occur at that position. The output columns are:

  1. Uploaded variation - as chromosome_start_alleles
  2. Location - in standard coordinate format (chr:start or chr:start-end)
  3. Allele - the variant allele used to calculate the consequence
  4. Gene - Ensembl stable ID of affected gene
  5. Feature - Ensembl stable ID of feature
  6. Feature type - type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature.
  7. Consequence - consequence type of this variation
  8. Relative position in cDNA - base pair position in cDNA sequence
  9. Relative position in CDS - base pair position in coding sequence
  10. Relative position in protein - amino acid position in protein
  11. Amino acid change - only given if the variation affects the protein-coding sequence
  12. Codons - the alternate codons with the variant base highlighted as bold (HTML) or upper case (text)
  13. Corresponding variation - identifier of existing variation
  14. Extra - this column contains extra information as key=value pairs separated by ";". The keys are as follows:
    • HGNC - the HGNC gene identifier
    • ENSP - the Ensembl protein identifier of the affected transcript
    • HGVSc - the HGVS coding sequence name
    • HGVSp - the HGVS protein sequence name
    • SIFT - the SIFT prediction and/or score, with both given as prediction(score)
    • PolyPhen - the PolyPhen prediction and/or score
    • Condel - the Condel consensus prediction and/or score
    • MOTIF_NAME - the source and identifier of a transcription factor binding profile (TFBP) aligned at this position
    • MOTIF_POS - the relative position of the variation in the aligned TFBP
    • HIGH_INF_POS - a flag indicating if the variant falls in a high information position of the TFBP
    • MOTIF_SCORE_CHANGE - the difference in motif score of the reference and variant sequences for the TFBP
    • CANONICAL - a flag indicating if the transcript is denoted as the canonical transcript for this gene
    • CCDS - the CCDS identifier for this transcript, where applicable
    • INTRON - the intron number (out of total number)
    • EXON - the exon number (out of total number)
    • DOMAINS - the source and identifier of any overlapping protein domains

Empty values are denoted by '-'. Further fields in the Extra column can be added by plugins or using custom annotations in the VEP script. Output fields can be configured using the --fields flag when running the VEP script.

11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000525319  Transcript         NON_SYNONYMOUS_CODING   742  716  239  T/N  aCc/aAc  -  SIFT=deleterious(0);PolyPhen=unknown(0)
11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000534381  Transcript         5_PRIME_UTR             -    -    -    -    -        -  -
11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000529055  Transcript         DOWNSTREAM              -    -    -    -    -        -  -
11_224585_G/A    11:224585   A  ENSG00000142082  ENST00000529937  Transcript         INTRONIC,NMD_TRANSCRIPT -    -    -    -    -        -  HGVSc=ENST00000529937.1:c.136-346G>A
22_16084370_G/A  22:16084370 A  -                ENSR00000615113  RegulatoryFeature  REGULATORY_REGION       -    -    -    -    -        -  -
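
Rows like those above can be read back into structured records. The following is a minimal sketch rather than an official parser: the dictionary keys in OUTPUT_COLUMNS and all function names are chosen for illustration, columns are assumed to be separated by tabs or runs of spaces, and multiple consequences are assumed to be comma-separated, as the INTRONIC,NMD_TRANSCRIPT row suggests.

```python
from typing import Dict, Tuple

# Illustrative keys, one per documented output column (not official names).
OUTPUT_COLUMNS = [
    "uploaded_variation", "location", "allele", "gene", "feature",
    "feature_type", "consequence", "cdna_position", "cds_position",
    "protein_position", "amino_acid_change", "codons",
    "corresponding_variation", "extra",
]


def parse_location(location: str) -> Tuple[str, int, int]:
    """Split 'chr:start' or 'chr:start-end' into (chr, start, end)."""
    chromosome, _, span = location.partition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else start
    return chromosome, start, end


def parse_extra(extra: str) -> Dict[str, str]:
    """Turn 'KEY=value;KEY=value' into a dict; '-' means no extra data."""
    if extra == "-":
        return {}
    pairs = (item.partition("=") for item in extra.split(";") if item)
    return {key: value for key, _, value in pairs}


def parse_output_line(line: str) -> dict:
    """Parse one data row of the output into a dict keyed by OUTPUT_COLUMNS."""
    # Limit the split so that the Extra column is kept as a single field.
    fields = line.strip().split(None, len(OUTPUT_COLUMNS) - 1)
    if len(fields) != len(OUTPUT_COLUMNS):
        raise ValueError(f"expected {len(OUTPUT_COLUMNS)} columns: {line!r}")
    raw = dict(zip(OUTPUT_COLUMNS, fields))
    # Empty values are denoted by '-'; represent them as None.
    row = {key: (None if value == "-" else value) for key, value in raw.items()}
    row["location"] = parse_location(raw["location"])
    row["consequence"] = raw["consequence"].split(",")
    row["extra"] = parse_extra(raw["extra"])
    return row
```

With this sketch, the first example row yields row["extra"]["SIFT"] equal to "deleterious(0)", and the regulatory feature row yields row["gene"] equal to None and an empty extra dictionary.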

The VEP script will also add a header to the output file. The header records which databases were connected to and includes a key describing the key/value pairs used in the Extra column.

## ENSEMBL VARIANT EFFECT PREDICTOR v2.4
## Output produced at 2012-02-20 16:09:38
## Connected to homo_sapiens_core_66_37 on ensembldb.ensembl.org
## Using API version 66, DB version 66
## Extra column keys:
## CANONICAL    : Indicates if transcript is canonical for this gene
## CCDS         : Indicates if transcript is a CCDS transcript
## HGNC         : HGNC gene identifier
## ENSP         : Ensembl protein identifer
## HGVSc        : HGVS coding sequence name
## HGVSp        : HGVS protein sequence name
## SIFT         : SIFT prediction
## PolyPhen     : PolyPhen prediction
## Condel       : Condel SIFT/PolyPhen consensus prediction
## EXON         : Exon number
## INTRON       : Intron number
## DOMAINS      : The source and identifer of any overlapping protein domains
## MOTIF_NAME   : The source and identifier of a transcription factor binding profile (TFBP) aligned at this position
## MOTIF_POS    : The relative position of the variation in the aligned TFBP
## HIGH_INF_POS : A flag indicating if the variant falls in a high information position of the TFBP
## MOTIF_SCORE_CHANGE : The difference in motif score of the reference and variant sequences for the TFBP
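
The key in the header can be recovered programmatically, for example to label fields found in the Extra column. This is a sketch under assumptions drawn only from the sample header above: header lines start with "##", and the key section begins at the line "Extra column keys:". The function name read_extra_keys is hypothetical.

```python
from typing import Dict, Iterable


def read_extra_keys(lines: Iterable[str]) -> Dict[str, str]:
    """Collect 'KEY : description' entries from the '##' header lines."""
    keys: Dict[str, str] = {}
    in_key_section = False
    for line in lines:
        if not line.startswith("##"):
            break  # the header has ended
        text = line[2:].strip()
        if text == "Extra column keys:":
            in_key_section = True
            continue
        if in_key_section and ":" in text:
            # Split on the first colon only; descriptions may contain more.
            key, _, description = text.partition(":")
            keys[key.strip()] = description.strip()
    return keys
```

Passing an open output file to read_extra_keys returns a dictionary mapping each key, such as SIFT or CANONICAL, to its description. Lines before the key section (such as the timestamp, which also contains colons) are ignored.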