Prerequisites

VEP

The input to the pVAC-Seq pipeline is a VEP-annotated single-sample VCF. In addition to the standard VEP annotations, pVAC-Seq also requires the annotations provided by the Downstream and Wildtype VEP plugins.

To create a VCF for use with pVAC-Seq follow these steps:

  1. Download and install the VEP command line tool following these instructions.
  2. Download the VEP_plugins from their GitHub repository.
  3. Copy the Wildtype plugin provided with the pVAC-Seq package to the folder with the other VEP_plugins:
pvacseq install_vep_plugin
  4. Run VEP on the input VCF with at least the following options:
--format vcf
--vcf
--symbol
--plugin Downstream
--plugin Wildtype
--terms SO

The --dir_plugins <VEP_plugins directory> option may need to be set, depending on where the VEP_plugins were installed.

The --pick option can be used to limit the annotation to the top transcript for each variant. Without it, VEP annotates each variant with all possible transcripts, and pVAC-Seq provides predictions for every transcript in the VEP CSQ field. Running VEP without the --pick option can therefore drastically increase the runtime of pVAC-Seq.
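To get a feel for how many transcripts pVAC-Seq would have to process for a variant, the CSQ entries in a VCF INFO field can be counted. The following is a minimal sketch, assuming the usual VEP layout in which CSQ holds one comma-separated entry per transcript annotation; the function name and the example INFO strings are hypothetical and not part of pVAC-Seq.

```python
def count_csq_entries(info_field):
    """Count the transcript annotations in the CSQ entry of a VCF INFO string.

    Assumes CSQ entries are separated by commas, one per annotated transcript.
    Returns 0 when the INFO string has no CSQ entry.
    """
    for entry in info_field.split(";"):
        if entry.startswith("CSQ="):
            value = entry[len("CSQ="):]
            return len(value.split(",")) if value else 0
    return 0


# Hypothetical INFO strings, invented for illustration only.
info_without_pick = (
    "DP=50;CSQ=T|missense_variant|GENE1|ENST0001,"
    "T|missense_variant|GENE1|ENST0002,"
    "T|intron_variant|GENE1|ENST0003"
)
info_with_pick = "DP=50;CSQ=T|missense_variant|GENE1|ENST0001"

print(count_csq_entries(info_without_pick))  # 3
print(count_csq_entries(info_with_pick))     # 1
```

Summing this count over all variants in a VCF gives a rough idea of how much work --pick would save.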

Additional VEP options that might be desired can be found here.

Example VEP Command

perl variant_effect_predictor.pl \
--input_file <input VCF> --format vcf --output_file <output VCF> \
--vcf --symbol --terms SO --plugin Downstream --plugin Wildtype \
[--dir_plugins <VEP_plugins directory>]

Optional Preprocessing

Coverage and Expression Data

Coverage and expression data can be added to the pVAC-Seq processing by providing bam-readcount and/or Cufflinks output files as additional input files. These additional input files must be provided as a yaml file in the following structure:

gene_expn_file: <genes.fpkm_tracking file from Cufflinks>
transcript_expn_file: <isoforms.fpkm_tracking file from Cufflinks>
normal_snvs_coverage_file: <bam-readcount output file for normal BAM and snvs>
normal_indels_coverage_file: <bam-readcount output file for normal BAM and indels>
tdna_snvs_coverage_file: <bam-readcount output file for tumor DNA BAM and snvs>
tdna_indels_coverage_file: <bam-readcount output file for tumor DNA BAM and indels>
trna_snvs_coverage_file: <bam-readcount output file for tumor RNA BAM and snvs>
trna_indels_coverage_file: <bam-readcount output file for tumor RNA BAM and indels>

Each file in this list is optional, and its entry can be omitted. If none of these files exist, the yaml file itself can be omitted from the list of pvacseq arguments.
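Because every entry is optional, the file can be generated from whichever inputs happen to exist. The sketch below renders only the entries that were supplied, using the key names listed above. The function name and the example paths are hypothetical, and the simple "key: value" rendering assumes paths without characters that would need YAML quoting.

```python
# Key names taken from the structure shown above.
ALLOWED_KEYS = [
    "gene_expn_file",
    "transcript_expn_file",
    "normal_snvs_coverage_file",
    "normal_indels_coverage_file",
    "tdna_snvs_coverage_file",
    "tdna_indels_coverage_file",
    "trna_snvs_coverage_file",
    "trna_indels_coverage_file",
]


def render_additional_inputs(paths):
    """Render the supplied entries as 'key: value' lines, skipping missing ones."""
    unknown = set(paths) - set(ALLOWED_KEYS)
    if unknown:
        raise ValueError("Unknown keys: " + ", ".join(sorted(unknown)))
    lines = [f"{key}: {paths[key]}" for key in ALLOWED_KEYS if paths.get(key)]
    return "\n".join(lines) + ("\n" if lines else "")


# Hypothetical file names; only two of the eight entries are available here.
text = render_additional_inputs({
    "tdna_snvs_coverage_file": "tumor.snvs.readcount.tsv",
    "gene_expn_file": "genes.fpkm_tracking",
})
print(text, end="")
```

The returned text can be written to a file and passed to pvacseq. An empty result means there is nothing to pass and the file can be left out.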

bam-readcount

pVAC-Seq optionally accepts bam-readcount files as inputs to add coverage information (depth and VAF) for downstream filtering. Depth and VAF are calculated from the read counts of the reference allele and alternate allele.
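As an illustration of that calculation, the sketch below derives depth and VAF from a pair of allele read counts. It assumes that depth is the sum of the reference and alternate counts and that VAF is the alternate fraction of that depth; the exact formula and scaling used inside pVAC-Seq (for example, a fraction versus a percentage) are not stated here, and the function name is hypothetical.

```python
def depth_and_vaf(ref_count, alt_count):
    """Return (depth, vaf) from reference and alternate allele read counts.

    Assumption: depth = ref + alt, and VAF = alt / depth as a fraction.
    """
    depth = ref_count + alt_count
    vaf = alt_count / depth if depth > 0 else 0.0
    return depth, vaf


print(depth_and_vaf(30, 10))  # (40, 0.25)
```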

Follow the installation instructions on the bam-readcount GitHub page.

bam-readcount takes a BAM file and a regions file as input. A regions file may contain either SNVs or indels, but not both. Indel regions must be run in a special insertion-centric mode. Any mixed set of input regions must be split into SNVs and indels, and bam-readcount must then be run on each file individually using the same BAM.
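The split can be sketched as a simple classification by allele length. This is an illustrative assumption, not the method used by pVAC-Seq: a variant counts as an SNV when both the reference and alternate alleles are one base long, and everything else (including multi-base substitutions, as a simplification) goes into the indel group. The function name and example variants are hypothetical, and writing the resulting groups out as regions files is left to the reader.

```python
def split_snvs_and_indels(variants):
    """Split (chrom, pos, ref, alt) tuples into SNV and indel lists.

    Simplification: anything that is not a single-base substitution
    is placed in the indel list.
    """
    snvs, indels = [], []
    for variant in variants:
        _chrom, _pos, ref, alt = variant
        if len(ref) == 1 and len(alt) == 1:
            snvs.append(variant)
        else:
            indels.append(variant)
    return snvs, indels


# Hypothetical variants, invented for illustration only.
variants = [
    ("1", 1000, "A", "T"),    # single-base substitution
    ("1", 2000, "AT", "A"),   # deletion
    ("2", 3000, "G", "GCC"),  # insertion
]
snvs, indels = split_snvs_and_indels(variants)
print(len(snvs), len(indels))  # 1 2
```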

Example bam-readcount command

bam-readcount -f <reference fasta> -l <site list> <bam_file>

The -i option must be used when running bam-readcount on indel regions so that they are processed in insertion-centric mode.

A minimum base quality of 20 is recommended; it can be set with -b 20.
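When bam-readcount is run once for SNVs and once for indels, it can help to build both command lines from one place so the options stay consistent. The sketch below only assembles the argument list from the options described above (-f, -l, -i, and -b). The function name and file names are hypothetical, and actually running the command (for example with subprocess.run) is left out.

```python
def bam_readcount_command(reference_fasta, site_list, bam_file,
                          indels=False, min_base_quality=20):
    """Build a bam-readcount argument list from the options described above."""
    cmd = ["bam-readcount", "-f", reference_fasta, "-l", site_list]
    if min_base_quality is not None:
        cmd += ["-b", str(min_base_quality)]
    if indels:
        # Insertion-centric mode, required for indel regions.
        cmd.append("-i")
    cmd.append(bam_file)
    return cmd


# Hypothetical file names, for illustration only.
snv_cmd = bam_readcount_command("ref.fa", "snvs.sites", "tumor.bam")
indel_cmd = bam_readcount_command("ref.fa", "indels.sites", "tumor.bam", indels=True)
print(" ".join(snv_cmd))
print(" ".join(indel_cmd))
```

Both commands use the same BAM and differ only in the site list and the -i option.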