# Bioinformatics Pipelines
Bioinformatics pipelines are an integral component of next-generation sequencing (NGS). Processing raw sequence data to detect genomic alterations has a significant impact on disease management and patient care. Because of the lack of published guidance, there is currently a high degree of variability in how members of the global molecular genetics and pathology community establish and validate bioinformatics pipelines.

## Bioinformatics Analysis of NGS Data
NGS bioinformatics pipelines are frequently platform specific and may be customizable on the basis of laboratory needs. A bioinformatics pipeline consists of the following major steps:

### Sequence Generation

> Sequence generation (signal processing and base calling) is the process that converts sensor (optical and nonoptical) data from the sequencing platform into the sequence of nucleotides for each of the short fragments of DNA in the sample prepared for analysis. For each nucleotide sequenced in these short fragments (ie, raw reads), a corresponding Phred-like quality score is assigned, which is sequencing platform specific. The read sequences, along with the Phred-like quality scores, are stored in a FASTQ file, which is a de facto standard for representing biological sequence information.
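
As a concrete illustration of the FASTQ convention described above, the following sketch parses four-line FASTQ records and decodes each per-base quality string, assuming the common Phred+33 ASCII encoding; the file name and the mean-quality printout are illustrative, not part of any specific platform's pipeline.

```python
# Minimal sketch: read a FASTQ file and decode per-base Phred quality scores.
# Assumes Phred+33 ASCII encoding; "sample_R1.fastq" is a placeholder path.

def parse_fastq(path):
    """Yield (read_id, sequence, quality_scores) for each 4-line FASTQ record."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip()
            handle.readline()                      # '+' separator line
            quality_line = handle.readline().rstrip()
            # Phred+33: quality score = ASCII code of the character minus 33
            scores = [ord(ch) - 33 for ch in quality_line]
            yield header[1:], sequence, scores     # drop the leading '@'

if __name__ == "__main__":
    for read_id, seq, quals in parse_fastq("sample_R1.fastq"):
        mean_q = sum(quals) / len(quals)
        print(read_id, len(seq), f"mean Q = {mean_q:.1f}")
```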


### Sequence Alignment

> Sequence alignment is the process of determining where each short DNA sequence read (each typically <250 bp) aligns with a reference genome (eg, the human reference genome used in clinical laboratories). This computationally intensive process assigns a Phred-scale mapping quality score to each of the short sequence reads, indicating the confidence of the alignment process. This step also provides a genomic context (location in the reference genome) to each aligned sequence read, which can be used to calculate the proportion of mapped reads and depth (coverage) of sequencing for one or more loci in the sequenced region of interest. The sequence alignment data are usually stored in a de facto standard binary alignment map (BAM) file format, which is a binary version of the sequence alignment/map format. The newer compressed representation [Compressed and Reference-oriented Alignment Map (CRAM)] or its encrypted version [Selective retrieval on Encrypted and Compressed Reference-oriented Alignment Map (SECRAM)]6 is a viable alternative that saves space and secures genetic information, although laboratories need to carefully validate variant calling impact if lossy (as opposed to lossless) compression settings are used in generating CRAM (European Nucleotide Archive, CRAM format specification version 3.0; http://samtools.github.io/hts-specs/CRAMv3.pdf, last accessed November 23, 2016) and SECRAM files.
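
The sketch below illustrates how aligned reads and their Phred-scale mapping qualities might be inspected in a BAM file with the pysam library; the file name, genomic coordinates, and mapping-quality cutoff are illustrative assumptions, and a coordinate-sorted, indexed BAM is assumed so that `fetch` can retrieve reads by region.

```python
# Minimal sketch of inspecting aligned reads in a BAM file using pysam.
# File name, locus, and MAPQ threshold below are illustrative placeholders.
import pysam

MIN_MAPQ = 20  # example Phred-scale mapping-quality threshold

with pysam.AlignmentFile("sample.bam", "rb") as bam:
    total, passing = 0, 0
    # Count reads overlapping an example locus (0-based, half-open coordinates);
    # fetch() requires a coordinate-sorted BAM with an accompanying index.
    for read in bam.fetch("chr1", 1_000_000, 1_000_200):
        if read.is_unmapped or read.is_duplicate:
            continue
        total += 1
        if read.mapping_quality >= MIN_MAPQ:
            passing += 1
    print(f"reads overlapping locus: {total}; with MAPQ >= {MIN_MAPQ}: {passing}")
```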

### Variant Calling

> Variant calling is the process of accurately identifying the differences or variations between the sample and the reference genome sequence. The typical input is a set of aligned reads in BAM or another similar format, which is traversed by the variant caller to identify sequence variants. Variant calling is a heterogeneous collection of algorithmic strategies based on the types of sequence variants, such as single-nucleotide variants (SNVs), small insertions and deletions (indels), copy number alterations, and large structural alterations (insertions, inversions, and translocations). The accuracy of variant calling is highly dependent on the quality of called bases and aligned reads. Therefore, prevariant calling processing, such as local realignment around expected indels and base quality score recalibration, is routinely used to ensure accurate and efficient variant calling. For SNVs and indels, the called variants are represented using the de facto standard variant call format (VCF; https://samtools.github.io/hts-specs/VCFv4.3.pdf, last accessed November 23, 2016). Alternative specifications exist for representing and storing variant calls [Genomic VCF Conventions, https://sites.google.com/site/gvcftools/home/about-gvcf/gvcf-conventions, last accessed November 23, 2016; The Sequence Ontology Genome Variation Format Version 1.10, https://github.com/The-Sequence-Ontology/Specifications/blob/master/gvf.md, last accessed November 23, 2016; Human Genome Variation Society (HGVS) Simple Version 15.11, 2016, http://varnomen.hgvs.org/bg-material/simple, last accessed November 23, 2016; Global Alliance for Genomics and Health (GA4GH) File Formats, https://www.ga4gh.org/ga4ghtoolkit/genomicdatatoolkit, last accessed November 27, 2017].
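
To make the VCF representation concrete, here is a minimal sketch that reads the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) from an uncompressed VCF file; it deliberately ignores genotype (FORMAT/sample) columns, bgzip/tabix handling, and multi-allelic subtleties, and the file name is a placeholder.

```python
# Minimal sketch of reading the fixed VCF columns from a plain-text VCF file.

def read_vcf(path):
    """Yield one dict per variant record, skipping '##' meta lines and the header."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
            # INFO is a semicolon-separated list of KEY=VALUE pairs (or bare flags)
            info_dict = {}
            for item in info.split(";"):
                key, _, value = item.partition("=")
                info_dict[key] = value if value else True
            yield {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
                   "alt": alt.split(","), "qual": qual, "filter": filt,
                   "info": info_dict}

for record in read_vcf("variants.vcf"):
    print(record["chrom"], record["pos"], record["ref"], ">", record["alt"])
```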

### Variant Filtering

> Variant filtering is the process by which variants representing false-positive artifacts of the NGS method are flagged or filtered from the original VCF file on the basis of several sequence alignment and variant calling associated metadata (eg, mapping quality, base-calling quality, strand bias, and others). This is usually a postvariant calling step, although some variant callers incorporate this step as part of the variant calling process. This automated process may be used as a hard filter to allow annotation and review of only the assumed true variants.
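
A hard filter of this kind can be expressed as a set of threshold checks on variant-level metadata. The sketch below uses INFO keys (MQ, QD, FS) and cutoffs that are common in practice but purely illustrative here; any production thresholds must be validated by the laboratory for its own assay.

```python
# Minimal sketch of a hard filter over parsed variant records. The metadata keys
# and thresholds are illustrative assumptions, not recommended values.

EXAMPLE_MIN = {
    "MQ": 40.0,   # minimum RMS mapping quality
    "QD": 2.0,    # minimum quality by depth
}
MAX_FS = 60.0     # maximum strand-bias (Fisher) score

def passes_hard_filter(info):
    """Return True if a variant's INFO metadata meets all example thresholds."""
    try:
        for key, minimum in EXAMPLE_MIN.items():
            if float(info.get(key, 0)) < minimum:
                return False
        if float(info.get("FS", 0)) > MAX_FS:
            return False
    except ValueError:
        return False  # malformed metadata: fail the filter rather than pass silently
    return True
```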

### Variant Annotation

> Variant annotation performs queries against multiple sequence and variant databases to characterize each called variant with a rich set of metadata, such as variant location, predicted cDNA and amino acid sequence change (HGVS nomenclature), minor allele frequencies in human populations, and prevalence in different variant databases [eg, Catalogue of Somatic Mutations in Cancer, The Cancer Genome Atlas, Single-Nucleotide Polymorphism (SNP) Database, and ClinVar]. This information is used to further prioritize or filter variants for classification and interpretation.
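
The following sketch shows the shape of an annotation step: each called variant is looked up in local tables that stand in for queries to population-frequency and clinical-significance resources such as ClinVar. The table contents, keys, and field names are hypothetical placeholders, not real database records.

```python
# Minimal sketch of annotating called variants against local lookup tables.
# The dictionaries below are illustrative stand-ins for database queries.

# hypothetical lookup tables keyed by (chrom, pos, ref, alt)
POPULATION_AF = {("chr1", 1000123, "G", "A"): 0.0001}
CLINICAL_SIGNIFICANCE = {("chr1", 1000123, "G", "A"): "Pathogenic"}

def annotate(variant):
    """Attach population frequency and clinical significance to a variant dict."""
    # For simplicity, only the first ALT allele of each record is considered.
    key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"][0])
    variant["population_af"] = POPULATION_AF.get(key)                # None if absent
    variant["clinical_significance"] = CLINICAL_SIGNIFICANCE.get(key, "Unknown")
    return variant
```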

### Variant Prioritization

> Variant prioritization uses variant annotations to identify clinically insignificant variants (eg, synonymous, deep intronic variants, and established benign polymorphisms), thereby presenting the remaining variants (known or unknown clinical significance) for further review and interpretation. Clinical laboratories often develop variant knowledge bases to facilitate this process.
> Some clinical laboratories choose to apply hard filters on called variants on the basis of variant call metadata or from a data dictionary (variant filtering) as a component of the pipeline analysis software. Because its purpose is to hide certain variants from the view of the human interpreter, it is absolutely critical that filtering algorithms be thoroughly validated to ensure that only those variants meeting strict predefined criteria are hidden from view. Otherwise, the human interpreter may miss clinically significant variants, potentially resulting in patient harm.
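
A minimal sketch of the prioritization logic described above might look like the following: variants that are synonymous, common in the population, or present in a laboratory-curated benign knowledge base are set aside, and everything else is surfaced for human review. The consequence label, frequency cutoff, and knowledge-base structure are illustrative assumptions.

```python
# Minimal sketch of variant prioritization over annotated variant dicts.
# Cutoff, consequence label, and knowledge-base contents are placeholders.

BENIGN_KNOWLEDGE_BASE = set()   # laboratory-curated (chrom, pos, ref, alt) keys
COMMON_AF_CUTOFF = 0.01         # example minor-allele-frequency threshold

def prioritize(variants):
    """Split annotated variant dicts into (for_review, set_aside) lists."""
    for_review, set_aside = [], []
    for v in variants:
        key = (v["chrom"], v["pos"], v["ref"], v["alt"][0])
        common = (v.get("population_af") or 0) >= COMMON_AF_CUTOFF
        synonymous = v.get("consequence") == "synonymous_variant"  # hypothetical field
        if key in BENIGN_KNOWLEDGE_BASE or common or synonymous:
            set_aside.append(v)      # hidden from review only after validation
        else:
            for_review.append(v)     # retained for classification and interpretation
    return for_review, set_aside
```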