Setup

Fastq filename convention

The permanent filename should use the following format:

{LANE}_{DATE}_{FLOW-CELL}_{SAMPLE-ID}_{BARCODE-SEQ}_{DIRECTION 1/2}.fastq[.gz]

The following elements have a required type or format:

  • LANE = Integer

  • DATE = YYMMDD

  • BARCODE-SEQ = A, C, G, T or integer

  • DIRECTION = 1 or 2

The case_id and sample_id(s) need to be unique, and the sample id supplied should be equal to the {SAMPLE-ID} in the filename. An underscore cannot be part of any element in the filename, as it is used as the separator between elements.
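As an illustration of the convention above, the following is a minimal Python sketch that splits a filename into its elements. It is not part of MIP; the names FASTQ_NAME and parse_fastq_name are hypothetical. It assumes that the optional compression suffix is ".gz", that BARCODE-SEQ is either a string of the letters A, C, G, T or an integer, and it only checks that DATE consists of six digits, not that it is a valid calendar date.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern mirroring the convention described above. Elements are
# separated by underscores, so no element may itself contain an underscore.
FASTQ_NAME = re.compile(
    r"^(?P<lane>\d+)_"                # LANE: integer
    r"(?P<date>\d{6})_"               # DATE: YYMMDD (six digits, not validated further)
    r"(?P<flowcell>[^_]+)_"           # FLOW-CELL: any text without underscores
    r"(?P<sample_id>[^_]+)_"          # SAMPLE-ID: any text without underscores
    r"(?P<barcode>[ACGT]+|\d+)_"      # BARCODE-SEQ: bases or an integer (assumed reading)
    r"(?P<direction>[12])"            # DIRECTION: 1 or 2
    r"\.fastq(?:\.gz)?$"              # optional compression suffix, assumed to be .gz
)


def parse_fastq_name(filename: str) -> Optional[Dict[str, str]]:
    """Return the filename elements as a dict, or None if the name does not match."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None
```

With a made-up name such as 1_161011_H7YVFCCXX_sample1_TCCGGAGA_1.fastq.gz, the function returns the six elements keyed by name. A sample id containing an underscore, for example sample_1, shifts the remaining elements and makes the function return None, which is why underscores are reserved for the separator.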

However, MIP will accept filenames in other formats as long as the filename contains the sample id and the mandatory information can be collected from the fastq header within the file.

Meta-Data

MIP requires pedigree information recorded in a pedigree.yaml file and a config file.

Dependencies

MIP comes with an install application, which will install all necessary programs to execute models in MIP via conda and/or $SHELL. Make sure you have installed all dependencies via the MIP install application and that you have loaded your MIP base environment. You only need to install the dependencies that are required for the recipes that you want to run. If you have not installed a dependency for a module, MIP will tell you which dependencies you need to install and then exit.

Extra CPANM modules

You can speed up, for instance, the Readonly module by also installing the companion module Readonly::XS. No change to the code is required, and the Readonly module will call the Readonly::XS module if available.

CADD

MIP is currently unable to install the CADD binary for dynamic calculation of indels, and there is also no support for downloading the CADD reference files. If you want to use these features in MIP, you have to install and download them manually.

Programs

  • Simple Linux Utility for Resource Management (SLURM) (version: 18.08.0)

Pipeline: Rare disease

The version numbers after the software names have been tested for compatibility with MIP.

Databases/References

MIP can download many program prerequisites automatically via the mip download application: mip download [PIPELINE].

MIP will build references and meta files (if required) prior to starting an analysis pipeline with mip analyse [PIPELINE].

Automatic Build:

Human Genome Reference Meta Files:

  1. The sequence dictionary (".dict")

  2. The ".fasta.fai" file

BWA:

  1. The BWA index of the human genome

Star:

  1. Star index files of the human genome

Note

If you do not supply these parameters (Bwa/Star), MIP will create them from scratch using the supplied human reference genome as template.

Capture target files:

  1. The "infile_list" and ".pad100.infile_list" files used in picardtools_collecthsmetrics

  2. The ".pad100.interval_list" file used by some GATK recipes

Note

If you do not supply these parameters, MIP will create them from scratch using the supplied "latest" supported capture kit ".bed" file and the supplied human reference genome as template.
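To see in advance which human genome reference meta files would need to be built, a check along the following lines can be used. This is only a sketch and not MIP code: the function name missing_meta_files is hypothetical, and the naming scheme is an assumption (the sequence dictionary replaces the ".fasta" extension with ".dict", and the index appends ".fai" to the full file name). Adjust the paths if your reference follows a different scheme.

```python
from pathlib import Path
from typing import List


def missing_meta_files(reference_fasta: str) -> List[str]:
    """List the companion meta files that are absent next to the reference.

    Assumed naming: genome.fasta -> genome.dict and genome.fasta.fai.
    """
    fasta = Path(reference_fasta)
    candidates = [
        fasta.with_suffix(".dict"),   # sequence dictionary (assumed location)
        Path(str(fasta) + ".fai"),    # ".fasta.fai" index (assumed location)
    ]
    return [str(path) for path in candidates if not path.exists()]
```

An empty list means that both files are already present under the assumed names; any path returned is a file that would have to be created before or during the analysis.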

Private References

Some references are not available for download because they require a license or contain data that are not consented for sharing. These references have to be manually applied for and added to the analysis where appropriate:

SweFreq - The Swedish Frequency resource for genomics

This dataset contains whole-genome variant frequencies for 1000 Swedish individuals generated within the SweGen project. One can request data access and download files from: https://swefreq.nbis.se/

Corresponding MIP references:

  • grch37anon-swegen_str_nsphs-1000samples-.vcf.gz (Autozygosity calculation; Rhocall)

  • grch37anon_swegen_snp-2016-10-19-.tab.gz (Frequency annotation; Snpeff)

  • grch37anon-swegen_indel-1000samples-.vcf.gz (Frequency annotation; Snpeff)

  • grch37swegen_concat_sort-20170830-.vcf (Structural variant frequency annotation; Svdb)

Spidex - Splicing prediction

SPIDEX is a computational model that uses the Percentage of Spliced-In (PSI) metric to evaluate whether a certain splicing isoform is more enriched in the presence or absence of a given variant. Unfortunately, Deepgenomics, the company that used to provide SPIDEX scores, seems to have shut down. The files can be downloaded via Annovar, which, however, also requires a license.

Corresponding MIP references:

  • grch37spidex_public_noncommercial-v1_0-.tab.gz (Splicing annotation; Genmod)

Local frequency Databases

We use several local frequency databases that we unfortunately are not allowed to share, but which can be built using loqusdb (https://github.com/moonso/loqusdb) or Svdb.

Corresponding MIP references:

  • grch37loqusdb_snv_indel-2018-12-18-.vcf.gz (SNV/INDELS; Snpeff)

  • grch37mip_sv_svdb_export-2018-10-09-.vcf (SV; Svdb)

  • grch37svdb_query_clingen_ngi-v1.0.0-.vcf (SV; Svdb)

  • grch37svdb_query_decipher-v1.0.0-.vcf (local array SVs frequency annotation; Svdb)

Local clinical significance databases

Variants annotated as benign or pathogenic from array data.

Corresponding MIP references:

  • grch37svdb_query_clingen_cgh_benign-v1.0.0-.vcf

  • grch37svdb_query_clingen_cgh_pathogenic-v1.0.0-.vcf

GATK exome dataset

GATK needs more data in the variant calling for exomes than a single sample or trio. MIP adds in other previously sequenced samples in the variant calling as a supplementary dataset.

Corresponding MIP references:

  • grch37_gatk_merged_reference_samples.txt