Bioinformatics pipelines are an integral component of next-generation sequencing (NGS). Processing raw sequence data to detect genomic alterations has a significant impact on disease management and patient care. Because of the lack of published guidance, there is currently a high degree of variability in how members of the global molecular genetics and pathology community establish and validate bioinformatics pipelines.
## Bioinformatics Analysis of RNA-seq Data
RNA sequencing (RNA-seq) is a powerful platform for comprehensive investigation of the transcriptome. It uses the capabilities of high-throughput sequencing methods to provide insight into the transcriptome of a cell.
### The general bioinformatics workflow for the quantitative analysis of RNA-seq data:
- quality checking of the raw RNA-seq reads;
- quality trimming and adapter removal;
- mapping sequencing reads to a reference genome or transcriptome;
- quantifying expression levels of individual genes and transcripts; and
- identifying specific genes and transcripts that are differentially expressed between samples.
### BASIC WORKFLOW
#### RAW DATA - FASTQ Files
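Raw RNA-seq data are typically stored as **FASTQ** files. **FASTQ** is a text-based format that stores the sequence of each read together with its per-base sequencing quality. The file is organized in groups of four lines per read:
- line 1: a header that starts with `@`, followed by the read identifier;
- line 2: the nucleotide sequence of the read;
- line 3: a separator that starts with `+`, optionally followed by the identifier again;
- line 4: the quality string, with one ASCII-encoded quality character per base.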
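To make the four-line layout concrete, here is a minimal Python sketch that reads FASTQ records and decodes the quality string. The function names (`read_fastq`, `phred_scores`) and the example read are hypothetical, and the sketch assumes the common Phred+33 quality encoding and records that are not wrapped across multiple lines.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # read identifier, without the leading "@"
    sequence: str  # base calls
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every group of four lines."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:].strip(), sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string, assuming Phred+33 encoding by default."""
    return [ord(char) - offset for char in quality]


# Made-up single-read example held in memory instead of a file on disk.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
for record in records:
    print(record.name, record.sequence, phred_scores(record.quality))
```

For real data, an established parser (for example the one in Biopython) is usually a safer choice than a hand-written reader.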
#### Read Mapping or Alignment
Once high-quality data are obtained from pre-processing, the next step is read mapping, or alignment. When studying an organism with a reference genome, it is possible to infer which transcripts are expressed by mapping the reads to the reference genome **(genome mapping)** or transcriptome **(transcriptome mapping)**. Mapping reads to the genome requires no knowledge of the set of transcribed regions or the way in which exons are spliced together. This approach allows the discovery of new, unannotated transcripts.
**INDEX**
Before mapping, we have to prepare an index of the reference DNA sequence for the chosen alignment algorithm to use.
Like the index at the end of a book, an index of a large DNA sequence allows one to rapidly find shorter sequences embedded in it. Different tools use different approaches to genome/transcriptome indexing.
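The book-index analogy can be illustrated with a toy k-mer index in Python. This is only a sketch of the general idea of indexing: it is not the FM-index-based scheme that tools such as HISAT2 use, and the function names and the short reference string are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return dict(index)


def find_read(read: str, reference: str,
              index: Dict[str, List[int]], k: int) -> List[int]:
    """Return positions where the read matches the reference exactly.

    The first k bases of the read are looked up in the index (the "seed"),
    and each candidate position is then verified against the full read.
    """
    if len(read) < k:
        raise ValueError("read must be at least k bases long")
    candidates = index.get(read[:k], [])
    return [pos for pos in candidates if reference.startswith(read, pos)]


reference = "ACGTACGTTAGC"  # made-up toy reference
index = build_kmer_index(reference, k=4)
hits = find_read("ACGTT", reference, index, k=4)
print(index["ACGT"])  # seed positions for the k-mer ACGT
print(hits)           # positions where the whole read matches
```

The lookup avoids scanning the whole reference for every read, which is the property that makes indexing worthwhile on genome-sized sequences.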
**Splice-aware aligners to a reference genome**
These aligners can map reads to the splice junctions described in the annotation and can even detect novel ones.
Some of them can also detect gene fusions, SNPs, and RNA editing. For some of these tools, the downstream analysis requires assigning the aligned reads to a given gene or transcript.
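The following toy Python sketch shows why splice awareness matters: a read that spans an exon-exon junction does not occur contiguously in the genome, but it can be placed as two pieces separated by an intron. This is not how HISAT2 or any production aligner works internally. The sequences, the function name, the exact-match requirement, the minimum intron length of four bases, and the restriction to canonical GT...AG introns are all simplifying assumptions.

```python
from typing import Optional, Tuple


def find_spliced_match(read: str, genome: str,
                       min_anchor: int = 3) -> Optional[Tuple[int, int, int]]:
    """Place a read as two exact pieces separated by a GT...AG intron.

    Returns (read_start, intron_start, second_exon_start) as 0-based
    genome coordinates, or None if no such placement exists.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            intron_start = left_start + split
            # Assumed minimum intron length of 4 so GT and AG cannot overlap.
            exon2_start = genome.find(right, intron_start + 4)
            while exon2_start != -1:
                intron = genome[intron_start:exon2_start]
                if intron.startswith("GT") and intron.endswith("AG"):
                    return left_start, intron_start, exon2_start
                exon2_start = genome.find(right, exon2_start + 1)
            left_start = genome.find(left, left_start + 1)
    return None


# Made-up toy gene: two exons separated by one intron.
exon1, intron, exon2 = "ATGGCCAAG", "GTAAGTTTTCAG", "CTGGAATAA"
genome = exon1 + intron + exon2
read = exon1[-5:] + exon2[:5]  # read spanning the exon-exon junction

contiguous_hit = genome.find(read)  # -1: the read is not contiguous in the genome
spliced_hit = find_spliced_match(read, genome)
print(contiguous_hit, spliced_hit)
```

Checking the intron boundaries is what lets the sketch choose one junction among several split points that would otherwise fit the read equally well.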
[HISAT2](https://daehwankimlab.github.io/hisat2/) is the next-generation spliced aligner from the same group that developed TopHat. It is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Its indexing scheme is called a Hierarchical Graph FM index (HGFM).
**Wrapper**
The **hisat2**, **hisat2-build** and **hisat2-inspect** executables are actually wrapper scripts that call binary programs as appropriate. The wrappers shield users from having to distinguish between “small” and “large” index formats, discussed briefly in the following section. The hisat2 wrapper also provides some key functionality, such as the ability to handle compressed inputs and support for `--un`, `--al` and related options.
It is recommended that you always run the hisat2 wrappers and not run the binaries directly.
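As a rough sketch of calling the wrappers rather than the binaries, the Python snippet below assembles command lines for `hisat2-build` and `hisat2`. The helper function names and all file names are hypothetical placeholders. The options used (`-x`, `-1`, `-2`, `-U`, `-S`, `-p`) are described in the HISAT2 manual, which should be checked against the installed version. The commands are only printed here, not executed.

```python
import shlex
from typing import List, Optional


def hisat2_build_command(reference_fasta: str, index_base: str) -> List[str]:
    """Command that builds an index from a reference FASTA file."""
    return ["hisat2-build", reference_fasta, index_base]


def hisat2_align_command(index_base: str, sam_out: str, reads_1: str,
                         reads_2: Optional[str] = None,
                         threads: int = 1) -> List[str]:
    """Command that aligns single-end or paired-end reads to the index."""
    command = ["hisat2", "-p", str(threads), "-x", index_base]
    if reads_2 is None:
        command += ["-U", reads_1]                 # single-end reads
    else:
        command += ["-1", reads_1, "-2", reads_2]  # paired-end reads
    command += ["-S", sam_out]                     # SAM output file
    return command


# Placeholder file names; replace them with real paths.
build_cmd = hisat2_build_command("genome.fa", "genome_index")
align_cmd = hisat2_align_command("genome_index", "sample.sam",
                                 "sample_R1.fastq.gz", "sample_R2.fastq.gz",
                                 threads=4)
print(shlex.join(build_cmd))
print(shlex.join(align_cmd))
# To run a command: import subprocess; subprocess.run(align_cmd, check=True)
```

Building the command as a list of arguments, rather than a single string, avoids shell quoting problems when paths contain spaces.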
For more details on working with HISAT2, refer to the manual: [http://daehwankimlab.github.io/hisat2/manual/](http://daehwankimlab.github.io/hisat2/manual/)