
# Bioinformatics Pipelines
Bioinformatics is an interdisciplinary field focused on developing software and hardware tools and methods to support biological data storage, organization, and analysis, particularly related to genetic sequencing.

RNA-sequencing (RNA-seq) data analysis is a bioinformatics pipeline that analyses the transcriptome, indicating which of the genes encoded in our DNA are turned on or off, and to what extent.

## Bioinformatics Analysis of RNA-seq Data
RNA sequencing (RNA-seq) is a powerful platform for comprehensive investigation of the transcriptome. It uses high-throughput sequencing methods to provide insight into the transcriptome of a cell.

### The general bioinformatics workflow for the quantitative analysis of RNA-seq data
- checking the quality of the raw RNA-seq reads;
- trimming adapters and low-quality bases;
- mapping sequencing reads to a reference genome or transcriptome;
- quantifying expression levels of individual genes and transcripts; and
- identifying specific genes and transcripts that are differentially expressed between samples.

```{image} ../img/rna-seq_workflow.png
```

#### RAW DATA - FASTQ Files 
Raw RNA-seq data are typically formatted as **FASTQ** files. **FASTQ** is a text-based format storing the sequences of the reads as well as their sequencing quality. The file is organized in groups of four lines per read as shown below:

> @NB500929:247:HL2TYBGX3:1:11101:25163:1060
> GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
> +  
> !''*((((\*\*\*+))%%%++)(%%%%).1\*\*\*-+\*''))\*\*55CCF>>>>>>CCCCCCC65

The first line starts with “@” and is followed by a unique sequence identifier, which includes the instrument ID (NB500929), run number (247), and flow cell ID (HL2TYBGX3), followed by the numbers specifying the location of the DNA fragment on the flow cell. In paired-end sequencing, the two FASTQ files for read 1 and read 2 contain the same sequence identifiers plus the read number (1 or 2), which indicates whether the sequence comes from read 1 or read 2 of the DNA fragment. The second line contains the read sequence. The third line starts with a “+” character and can optionally be followed by the same sequence identifier and any additional description. The fourth line encodes the sequencing quality score of each base, with one symbol per base, according to a [coding scheme](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm).
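As a rough illustration of the fourth line, the sketch below decodes a quality string into numeric Phred scores. It assumes the Phred+33 encoding (score = ASCII code of the symbol minus 33); the function name `phred33_scores` is made up for this example and is not part of any tool used on this page.

```bash
# Convert a Phred+33 quality string into space-separated numeric scores.
# Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
phred33_scores() {
    local qual=$1 scores=() i ascii
    for (( i = 0; i < ${#qual}; i++ )); do
        # A leading single quote makes printf return the character's ASCII code.
        printf -v ascii '%d' "'${qual:i:1}"
        scores+=( $(( ascii - 33 )) )
    done
    echo "${scores[*]}"
}

# '!' is the lowest possible score (0); 'I' corresponds to 40.
phred33_scores '!5I'   # prints: 0 20 40
```

A score Q corresponds to an estimated error probability of 10^(-Q/10), so a base with score 20 has roughly a 1 in 100 chance of being wrong.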

#### QUALITY CHECK ON RAW DATA
RNA-Seq has become one of the most widely used applications based on next-generation sequencing technology. However, raw RNA-Seq data may have quality issues, which can significantly distort analytical results and lead to erroneous conclusions. Therefore, the raw data must be subjected to rigorous quality control (QC) procedures before downstream analysis. Currently, an accurate and complete QC of RNA-Seq data requires a suite of different QC tools used consecutively, which is inefficient in terms of usability, running time, file usage, and interpretability of the results.

[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

The main functions of FastQC are:

- Import of data from FASTQ files (also accepts BAM and SAM alignment files)
- Quick overview of any likely sequencing problems
- Summary graphs and tables to quickly assess your data
- Export of results as an HTML-based report

FastQC has a well-documented manual with detailed explanations of every plot in the report.

<details>
<summary>Working on FastQC</summary>

```bash
$ fastqc SRR20076358_1.fastq.gz
```
Running the above command prints the following progress messages to the console and writes two report files, `SRR20076358_1_fastqc.zip` and `SRR20076358_1_fastqc.html`:
```bash
Started analysis of SRR20076358_1.fastq.gz
    Approx 5% complete for SRR20076358_1.fastq.gz
    Approx 10% complete for SRR20076358_1.fastq.gz
    Approx 15% complete for SRR20076358_1.fastq.gz
    Approx 20% complete for SRR20076358_1.fastq.gz
    Approx 25% complete for SRR20076358_1.fastq.gz
    Approx 30% complete for SRR20076358_1.fastq.gz
    Approx 35% complete for SRR20076358_1.fastq.gz
    Approx 40% complete for SRR20076358_1.fastq.gz
    Approx 45% complete for SRR20076358_1.fastq.gz
    Approx 50% complete for SRR20076358_1.fastq.gz
    Approx 55% complete for SRR20076358_1.fastq.gz
    Approx 60% complete for SRR20076358_1.fastq.gz
    Approx 65% complete for SRR20076358_1.fastq.gz
    Approx 70% complete for SRR20076358_1.fastq.gz
    Approx 75% complete for SRR20076358_1.fastq.gz
    Approx 80% complete for SRR20076358_1.fastq.gz
    Approx 85% complete for SRR20076358_1.fastq.gz
    Approx 90% complete for SRR20076358_1.fastq.gz
    Approx 95% complete for SRR20076358_1.fastq.gz
    Analysis complete for SRR20076358_1.fastq.gz
```
Run the following command to view the quality check report, shown in _fig.2.1_:

```bash
$ firefox SRR20076358_1_fastqc.html
```
```{image} ../img/fastqc_report.png
```
<p align="center">
fig.2.1
</p>
</details>
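Besides the HTML report, the `.zip` archive contains a plain-text `summary.txt` with one PASS, WARN, or FAIL status per analysis module. The sketch below lists only the modules that did not pass. It assumes the usual tab-separated layout (status, module name, input file) and that `unzip` is available; `flagged_modules` is a hypothetical helper name.

```bash
# Print every FastQC module whose status is not PASS.
# Reads the tab-separated content of summary.txt on standard input:
#   <status> <TAB> <module name> <TAB> <input file>
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Possible usage, assuming the archive has the usual internal layout:
#   unzip -p SRR20076358_1_fastqc.zip SRR20076358_1_fastqc/summary.txt | flagged_modules

# Self-contained demonstration with made-up statuses:
printf 'PASS\tBasic Statistics\tdemo.fastq.gz\nWARN\tPer base sequence content\tdemo.fastq.gz\nFAIL\tAdapter Content\tdemo.fastq.gz\n' | flagged_modules
```

A WARN or FAIL is a prompt to look at the corresponding plot, not necessarily a sign that the data are unusable.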

#### QUALITY TRIMMING OF ADAPTERS
Trimming adaptors and low-quality bases is an important part of the analysis pipeline for sequencing data. Typically, after you isolate and fragment your RNA sample, the adaptors needed for sequencing are attached to the ends of the fragments. These adaptors need to be removed from the sequenced reads before downstream processing. An additional step is removing low-quality bases.
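To make the idea concrete, here is a deliberately simplified sketch of 3' adaptor removal followed by a length filter. It only handles exact, full-length adaptor matches, whereas real trimmers also tolerate mismatches and partial adaptors at the read end. The function names are hypothetical, and the default adaptor is the Illumina sequence that appears in the Trim Galore! log on this page.

```bash
# Remove a 3' adaptor, and everything after it, from a read.
# Simplification: exact, full-length adaptor matches only.
trim_adapter() {
    local sequence=$1 adapter=${2:-AGATCGGAAGAGC}
    # %% deletes the longest suffix, i.e. from the first adaptor occurrence onwards.
    echo "${sequence%%"$adapter"*}"
}

# Succeed only if the read is at least min_len bases long.
passes_length() {
    local sequence=$1 min_len=$2
    (( ${#sequence} >= min_len ))
}

trimmed=$(trim_adapter "GATTTGGGGTTCAAAGCAGTAGATCGGAAGAGCACACG")
echo "$trimmed"
passes_length "$trimmed" 20 && echo "kept" || echo "discarded"
```

Reads that become too short after trimming are discarded because very short sequences tend to map ambiguously.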

Quality trimming decreases the overall number of reads but increases the total number and proportion of uniquely mapped reads. Thus, you get more useful data for downstream analyses.

There are many tools for trimming reads and removing adapters, such as Trim Galore!, Trimmomatic, Cutadapt, skewer, AlienTrimmer, BBDuk, and the most recent SOAPnuke and fastp.

Trim Galore! is a wrapper script to automate quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions for RRBS sequence files (for directional, non-directional (or paired-end) sequencing).
<details>
<summary>Working on trim_galore</summary>

```bash
$ trim_galore --gzip --fastqc --max_n 2 --paired --length 50 SRR11862696_1.fastq.gz SRR11862696_2.fastq.gz
```
<details>
<summary>The above command will produce the following result in console
</summary>

```bash
# trim_galore --gzip --fastqc --max_n 2 --paired --length 50 SRR11862696_1.fastq.gz SRR11862696_2.fastq.gz                              

Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.1
single-core operation.
No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)


AUTO-DETECTING ADAPTER TYPE
===========================
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> SRR11862696_1.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type    Count   Sequence        Sequences analysed      Percentage
Illumina        4229    AGATCGGAAGAGC   1000000 0.42
Nextera 7       CTGTCTCTTATA    1000000 0.00
smallRNA        3       TGGAATTCTCGG    1000000 0.00
Using Illumina adapter for trimming (count: 4229). Second best hit was Nextera (count: 7)

Writing report to 'SRR11862696_1.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
==========================
Input filename: SRR11862696_1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.6
Cutadapt version: 4.1
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Maximum number of tolerated Ns: 2
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 50 bp
Running FastQC on the data once trimming has completed
Output file(s) will be GZIP compressed

Cutadapt seems to be fairly up-to-date (version 4.1). Setting -j 1
Writing final adapter and quality trimmed output to SRR11862696_1_trimmed.fq.gz


  >>> Now performing quality (cutoff '-q 20') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file SRR11862696_1.fastq.gz <<< 
10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
40000000 sequences processed
This is cutadapt 4.1 with Python 3.10.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC SRR11862696_1.fastq.gz
Processing single-end reads on 1 core ...
Finished in 1353.17 s (29 µs/read; 2.08 M reads/minute).

=== Summary ===

Total reads processed:              46,831,782
Reads with adapters:                15,832,779 (33.8%)
Reads written (passing filters):    46,831,782 (100.0%)

Total basepairs processed: 4,730,009,982 bp
Quality-trimmed:              45,195,419 bp (1.0%)
Total written (filtered):  4,644,800,323 bp (98.2%)

=== Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 15832779 times

Minimum overlap: 1
No. of allowed errors:
1-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 29.1%
  C: 33.1%
  G: 21.6%
  T: 15.3%
  none/other: 1.0%

Overview of removed sequences
length  count   expect  max.err error counts
1       10545866        11707945.5      0       10545866
2       3526202 2926986.4       0       3526202
3       947984  731746.6        0       947984
4       207831  182936.6        0       207831
5       70578   45734.2 0       70578
6       34326   11433.5 0       34326
7       27723   2858.4  0       27723
8       25762   714.6   0       25762
9       23820   178.6   0       23189 631
10      21148   44.7    1       19978 1170
11      18346   11.2    1       17295 1051
12      16859   2.8     1       16455 404
13      14322   0.7     1       14086 236
14      14148   0.7     1       13862 286
15      13064   0.7     1       12807 257
16      13012   0.7     1       12753 259
17      12842   0.7     1       12598 244
18      12040   0.7     1       11787 253
19      9449    0.7     1       9243 206
20      8880    0.7     1       8713 167
21      8038    0.7     1       7853 185
22      6802    0.7     1       6651 151
23      6453    0.7     1       6261 192
24      5955    0.7     1       5803 152
25      5584    0.7     1       5415 169
26      5190    0.7     1       5018 172
27      5085    0.7     1       4924 161
28      4953    0.7     1       4808 145
29      4594    0.7     1       4427 167
30      4196    0.7     1       4048 148
31      3299    0.7     1       3176 123
32      3192    0.7     1       3072 120
33      2944    0.7     1       2808 136
34      2785    0.7     1       2612 173
35      2601    0.7     1       2458 143
36      2053    0.7     1       1923 130
37      2335    0.7     1       2155 180
38      2403    0.7     1       2260 143
39      2047    0.7     1       1855 192
40      2027    0.7     1       1880 147
41      1750    0.7     1       1537 213
42      1814    0.7     1       1595 219
43      1645    0.7     1       1545 100
44      994     0.7     1       914 80
45      1238    0.7     1       1163 75
46      975     0.7     1       880 95
47      1148    0.7     1       1021 127
48      1106    0.7     1       991 115
49      958     0.7     1       864 94
50      980     0.7     1       839 141
51      924     0.7     1       825 99
52      828     0.7     1       721 107
53      958     0.7     1       742 216
54      1029    0.7     1       741 288
55      917     0.7     1       785 132
56      529     0.7     1       464 65
57      698     0.7     1       537 161
58      740     0.7     1       519 221
59      829     0.7     1       535 294
60      1114    0.7     1       576 538
61      992     0.7     1       596 396
62      1052    0.7     1       456 596
63      1551    0.7     1       483 1068
64      2633    0.7     1       533 2100
65      4064    0.7     1       619 3445
66      2119    0.7     1       542 1577
67      1983    0.7     1       503 1480
68      2846    0.7     1       449 2397
69      5405    0.7     1       485 4920
70      10363   0.7     1       662 9701
71      47660   0.7     1       968 46692
72      34144   0.7     1       2204 31940
73      15329   0.7     1       970 14359
74      8711    0.7     1       812 7899
75      3735    0.7     1       481 3254
76      3407    0.7     1       281 3126
77      3785    0.7     1       93 3692
78      2433    0.7     1       44 2389
79      1587    0.7     1       35 1552
80      964     0.7     1       17 947
81      709     0.7     1       14 695
82      469     0.7     1       8 461
83      325     0.7     1       5 320
84      295     0.7     1       5 290
85      198     0.7     1       3 195
86      161     0.7     1       6 155
87      184     0.7     1       9 175
88      143     0.7     1       3 140
89      124     0.7     1       2 122
90      142     0.7     1       1 141
91      131     0.7     1       1 130
92      164     0.7     1       3 161
93      168     0.7     1       0 168
94      188     0.7     1       1 187
95      188     0.7     1       1 187
96      267     0.7     1       0 267
97      312     0.7     1       2 310
98      390     0.7     1       0 390
99      404     0.7     1       0 404
100     767     0.7     1       0 767
101     4375    0.7     1       0 4375

RUN STATISTICS FOR INPUT FILE: SRR11862696_1.fastq.gz
=============================================
46831782 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)

Writing report to 'SRR11862696_2.fastq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS
==========================
Input filename: SRR11862696_2.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.6
Cutadapt version: 4.1
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Maximum number of tolerated Ns: 2
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 50 bp
Running FastQC on the data once trimming has completed
Output file(s) will be GZIP compressed

Cutadapt seems to be fairly up-to-date (version 4.1). Setting -j -j 1
Writing final adapter and quality trimmed output to SRR11862696_2_trimmed.fq.gz


  >>> Now performing quality (cutoff '-q 20') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file SRR11862696_2.fastq.gz <<< 
10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
40000000 sequences processed
This is cutadapt 4.1 with Python 3.10.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC SRR11862696_2.fastq.gz
Processing single-end reads on 1 core ...
Finished in 1380.06 s (29 µs/read; 2.04 M reads/minute).

=== Summary ===

Total reads processed:              46,831,782
Reads with adapters:                15,576,554 (33.3%)
Reads written (passing filters):    46,831,782 (100.0%)

Total basepairs processed: 4,730,009,982 bp
Quality-trimmed:             103,667,050 bp (2.2%)
Total written (filtered):  4,596,284,410 bp (97.2%)

=== Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 15576554 times

Minimum overlap: 1
No. of allowed errors:
1-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 29.7%
  C: 33.2%
  G: 21.6%
  T: 15.3%
  none/other: 0.2%

Overview of removed sequences
length  count   expect  max.err error counts
1       10415056        11707945.5      0       10415056
2       3549673 2926986.4       0       3549673
3       930335  731746.6        0       930335
4       203432  182936.6        0       203432
5       70152   45734.2 0       70152
6       34001   11433.5 0       34001
7       28487   2858.4  0       28487
8       25528   714.6   0       25528
9       23898   178.6   0       23109 789
10      22040   44.7    1       20613 1427
11      18086   11.2    1       16804 1282
12      17308   2.8     1       16717 591
13      15092   0.7     1       14434 658
14      15841   0.7     1       15497 344
15      11921   0.7     1       11544 377
16      12960   0.7     1       12493 467
17      14323   0.7     1       14027 296
18      9700    0.7     1       9441 259
19      11057   0.7     1       10834 223
20      7979    0.7     1       7799 180
21      7047    0.7     1       6893 154
22      6828    0.7     1       6651 177
23      7131    0.7     1       6180 951
24      6703    0.7     1       6513 190
25      5485    0.7     1       5227 258
26      8361    0.7     1       5068 3293
27      5341    0.7     1       4702 639
28      5433    0.7     1       5188 245
29      5099    0.7     1       4100 999
30      8235    0.7     1       8058 177
31      574     0.7     1       431 143
32      3205    0.7     1       3094 111
33      1589    0.7     1       1483 106
34      2116    0.7     1       2014 102
35      2219    0.7     1       2095 124
36      2189    0.7     1       1992 197
37      2105    0.7     1       1963 142
38      2127    0.7     1       1985 142
39      2108    0.7     1       1912 196
40      1853    0.7     1       1753 100
41      2786    0.7     1       1641 1145
42      2087    0.7     1       2001 86
43      974     0.7     1       885 89
44      2286    0.7     1       1231 1055
45      1723    0.7     1       1619 104
46      777     0.7     1       690 87
47      920     0.7     1       864 56
48      912     0.7     1       829 83
49      1068    0.7     1       869 199
50      1074    0.7     1       957 117
51      1098    0.7     1       1033 65
52      631     0.7     1       589 42
53      598     0.7     1       551 47
54      661     0.7     1       608 53
55      728     0.7     1       649 79
56      528     0.7     1       476 52
57      656     0.7     1       578 78
58      602     0.7     1       561 41
59      593     0.7     1       544 49
60      632     0.7     1       548 84
61      681     0.7     1       516 165
62      1995    0.7     1       637 1358
63      1889    0.7     1       1135 754
64      4483    0.7     1       1659 2824
65      13843   0.7     1       3320 10523
66      6624    0.7     1       2171 4453
67      1203    0.7     1       529 674
68      354     0.7     1       196 158
69      163     0.7     1       72 91
70      80      0.7     1       19 61
71      55      0.7     1       21 34
72      81      0.7     1       19 62
73      48      0.7     1       14 34
74      36      0.7     1       8 28
75      29      0.7     1       1 28
76      38      0.7     1       5 33
77      60      0.7     1       6 54
78      31      0.7     1       3 28
79      35      0.7     1       7 28
80      39      0.7     1       3 36
81      48      0.7     1       4 44
82      29      0.7     1       5 24
83      32      0.7     1       4 28
84      34      0.7     1       3 31
85      34      0.7     1       5 29
86      32      0.7     1       6 26
87      53      0.7     1       7 46
88      32      0.7     1       3 29
89      40      0.7     1       6 34
90      35      0.7     1       5 30
91      41      0.7     1       2 39
92      61      0.7     1       1 60
93      32      0.7     1       1 31
94      32      0.7     1       2 30
95      19      0.7     1       2 17
96      25      0.7     1       0 25
97      33      0.7     1       1 32
98      68      0.7     1       1 67
99      24      0.7     1       0 24
100     27      0.7     1       0 27
101     105     0.7     1       0 105

RUN STATISTICS FOR INPUT FILE: SRR11862696_2.fastq.gz
=============================================
46831782 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)

Validate paired-end files SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz
file_1: SRR11862696_1_trimmed.fq.gz, file_2: SRR11862696_2_trimmed.fq.gz


>>>>> Now validing the length of the 2 paired-end infiles: SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz <<<<<
Writing validated paired-end Read 1 reads to SRR11862696_1_val_1.fq.gz
Writing validated paired-end Read 2 reads to SRR11862696_2_val_2.fq.gz


Total number of sequences analysed: 46831782

Number of sequence pairs removed because at least one read was shorter than the length cutoff (50 bp): 1159884 (2.48%)
Number of sequence pairs removed because at least one read contained more N(s) than the specified limit of 2: 62468 (0.13%)


  >>> Now running FastQC on the validated data SRR11862696_1_val_1.fq.gz<<<

Started analysis of SRR11862696_1_val_1.fq.gz
Approx 5% complete for SRR11862696_1_val_1.fq.gz
Approx 10% complete for SRR11862696_1_val_1.fq.gz
Approx 15% complete for SRR11862696_1_val_1.fq.gz
Approx 20% complete for SRR11862696_1_val_1.fq.gz
Approx 25% complete for SRR11862696_1_val_1.fq.gz
Approx 30% complete for SRR11862696_1_val_1.fq.gz
Approx 35% complete for SRR11862696_1_val_1.fq.gz
Approx 40% complete for SRR11862696_1_val_1.fq.gz
Approx 45% complete for SRR11862696_1_val_1.fq.gz
Approx 50% complete for SRR11862696_1_val_1.fq.gz
Approx 55% complete for SRR11862696_1_val_1.fq.gz
Approx 60% complete for SRR11862696_1_val_1.fq.gz
Approx 65% complete for SRR11862696_1_val_1.fq.gz
Approx 70% complete for SRR11862696_1_val_1.fq.gz
Approx 75% complete for SRR11862696_1_val_1.fq.gz
Approx 80% complete for SRR11862696_1_val_1.fq.gz
Approx 85% complete for SRR11862696_1_val_1.fq.gz
Approx 90% complete for SRR11862696_1_val_1.fq.gz
Approx 95% complete for SRR11862696_1_val_1.fq.gz
Analysis complete for SRR11862696_1_val_1.fq.gz

  >>> Now running FastQC on the validated data SRR11862696_2_val_2.fq.gz<<<

Started analysis of SRR11862696_2_val_2.fq.gz
Approx 5% complete for SRR11862696_2_val_2.fq.gz
Approx 10% complete for SRR11862696_2_val_2.fq.gz
Approx 15% complete for SRR11862696_2_val_2.fq.gz
Approx 20% complete for SRR11862696_2_val_2.fq.gz
Approx 25% complete for SRR11862696_2_val_2.fq.gz
Approx 30% complete for SRR11862696_2_val_2.fq.gz
Approx 35% complete for SRR11862696_2_val_2.fq.gz
Approx 40% complete for SRR11862696_2_val_2.fq.gz
Approx 45% complete for SRR11862696_2_val_2.fq.gz
Approx 50% complete for SRR11862696_2_val_2.fq.gz
Approx 55% complete for SRR11862696_2_val_2.fq.gz
Approx 60% complete for SRR11862696_2_val_2.fq.gz
Approx 65% complete for SRR11862696_2_val_2.fq.gz
Approx 70% complete for SRR11862696_2_val_2.fq.gz
Approx 75% complete for SRR11862696_2_val_2.fq.gz
Approx 80% complete for SRR11862696_2_val_2.fq.gz
Approx 85% complete for SRR11862696_2_val_2.fq.gz
Approx 90% complete for SRR11862696_2_val_2.fq.gz
Approx 95% complete for SRR11862696_2_val_2.fq.gz
Analysis complete for SRR11862696_2_val_2.fq.gz
Deleting both intermediate output files SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz

====================================================================================================
```

</details>

Because of the `--fastqc` option, Trim Galore! runs FastQC on the trimmed reads and writes the quality control reports `SRR11862696_1_val_1_fastqc.html` and `SRR11862696_2_val_2_fastqc.html`.

You can view a quality control report using the following command:
```bash
$ firefox SRR11862696_1_val_1_fastqc.html
```

```{image} ../img/trim_galore_report.png

```

</details>
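The validation lines near the end of the Trim Galore! log can be turned into a single retention figure. The sketch below reads a saved copy of the console output and reports how many read pairs survived. It assumes the length filter and the N filter remove disjoint sets of pairs, and `pair_retention` is a hypothetical helper name.

```bash
# Summarise pair retention from a saved Trim Galore paired-end log (stdin).
# Assumption: the length and N filters remove disjoint sets of pairs.
pair_retention() {
    awk '
        /^Total number of sequences analysed:/ { total = $NF }
        # The removed count is the second-to-last field, before "(x.xx%)".
        /^Number of sequence pairs removed/    { removed += $(NF - 1) }
        END {
            if (total > 0)
                printf "%d of %d pairs kept (%.2f%%)\n", total - removed, total, 100 * (total - removed) / total
        }'
}

# Demonstration using the three relevant lines from the log above:
pair_retention <<'EOF'
Total number of sequences analysed: 46831782
Number of sequence pairs removed because at least one read was shorter than the length cutoff (50 bp): 1159884 (2.48%)
Number of sequence pairs removed because at least one read contained more N(s) than the specified limit of 2: 62468 (0.13%)
EOF
```

For this run the sketch reports 45609430 of 46831782 pairs kept (97.39%).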

#### ALIGNMENT
Once high-quality data are obtained from pre-processing, the next step is read mapping, or alignment. When studying an organism with a reference genome, it is possible to infer which transcripts are expressed by mapping the reads to the reference genome **(genome mapping)** or transcriptome **(transcriptome mapping)**. Mapping reads to the genome requires no knowledge of the set of transcribed regions or the way in which exons are spliced together. This approach allows the discovery of new, unannotated transcripts.

**INDEX**

Before doing the mapping, we have to prepare an index from the reference DNA sequence that a chosen algorithm will use.
Like the index at the end of a book, an index of a large DNA sequence allows one to rapidly find shorter sequences embedded in it. Different tools use different approaches to genome/transcriptome indexing.
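The book-index analogy can be illustrated with a toy k-mer lookup table. This is only a sketch of the general idea and not how any particular aligner stores its index; real tools use far more compact structures. It needs bash 4 or later for associative arrays, and the names `KMER_INDEX`, `build_index`, and `lookup` are hypothetical.

```bash
# Toy index: map every k-mer of a reference to its 0-based start positions.
# Requires bash 4+ (associative arrays).
declare -A KMER_INDEX

build_index() {
    local ref=$1 k=$2 i kmer
    KMER_INDEX=()
    for (( i = 0; i + k <= ${#ref}; i++ )); do
        kmer=${ref:i:k}
        # Append the position, separated by a space if the k-mer was seen before.
        KMER_INDEX[$kmer]+="${KMER_INDEX[$kmer]:+ }$i"
    done
}

# Print the positions of a k-mer, or "not found".
lookup() {
    echo "${KMER_INDEX[$1]:-not found}"
}

build_index "ACGTACGA" 3
lookup ACG    # occurs at positions 0 and 4
lookup TTT    # absent from the reference
```

Once the table is built, each lookup is a single hash access instead of a scan over the whole reference, which is why indexing is done once up front.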

**Splice-aware aligners to a reference genome**

These aligners are able to map reads across the splice junctions described in the annotation and even to detect novel ones.
Some of them can also detect gene fusions, SNPs, and RNA editing. For some of these tools, the downstream analysis requires assigning the aligned reads to a given gene/transcript.

[HISAT2](https://daehwankimlab.github.io/hisat2/) is the next-generation spliced aligner from the group that developed TopHat. It is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Its indexing scheme is called a Hierarchical Graph FM index (HGFM).

**Wrapper**

The **hisat2**, **hisat2-build** and **hisat2-inspect** executables are actually wrapper scripts that call binary programs as appropriate. The wrappers shield users from having to distinguish between “small” and “large” index formats, which are described in the HISAT2 manual. The hisat2 wrapper also provides some key functionality, such as the ability to handle compressed inputs and support for --un, --al and related options.

It is recommended that you always run the hisat2 wrappers rather than the binaries directly.
For more details on working with HISAT2, refer to the [HISAT2 manual](http://daehwankimlab.github.io/hisat2/manual/).
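A minimal sketch of the two HISAT2 steps, index building and paired-end alignment, might look like the following. The file names `genome.fa`, `genome_index`, and `sample.sam` are placeholders rather than files from this page, the thread count of 4 is arbitrary, and the function names are hypothetical. Only the basic documented options (`-p`, `-x`, `-1`, `-2`, `-S`) are used.

```bash
# Build a HISAT2 index from a reference FASTA file.
build_hisat2_index() {
    local fasta=$1 index_base=$2 threads=${3:-4}
    hisat2-build -p "$threads" "$fasta" "$index_base"
}

# Align one paired-end sample against the index and write SAM output.
align_pairs() {
    local index_base=$1 r1=$2 r2=$3 out_sam=$4 threads=${5:-4}
    hisat2 -p "$threads" -x "$index_base" -1 "$r1" -2 "$r2" -S "$out_sam"
}

# Run only when HISAT2 is installed and the placeholder reference exists.
if command -v hisat2 >/dev/null 2>&1 && [ -f genome.fa ]; then
    build_hisat2_index genome.fa genome_index
    align_pairs genome_index SRR11862696_1_val_1.fq.gz SRR11862696_2_val_2.fq.gz sample.sam
fi
```

The index is built once per reference and can then be reused for every sample aligned against it.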


#### GENE EXPRESSION QUANTIFICATION
The simplest approach to quantifying gene expression by RNA-seq is to count the number of reads that map (i.e. align) to each gene (read count). This gene-level quantification approach utilises a gene transfer format (GTF) file containing gene models, with each model representing the structure of transcripts produced by a given gene.

Raw read counts are affected by factors such as transcript length (longer transcripts have higher read counts at the same expression level) and the total number of reads. Thus, if we want to compare expression levels between samples, we need to normalise the raw read counts. The measure RPKM (reads per kilobase of exon model per million mapped reads) and its derivative FPKM (fragments per kilobase of exon model per million mapped reads) account for both gene length and library size effects.
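The RPKM definition translates into a one-line calculation: reads × 10^9 / (gene length in bp × total mapped reads). The sketch below uses made-up numbers to show the length correction; `rpkm` is a hypothetical helper name.

```bash
# RPKM = reads * 10^9 / (gene length in bp * total mapped reads)
rpkm() {
    local reads=$1 length_bp=$2 total_reads=$3
    awk -v r="$reads" -v l="$length_bp" -v n="$total_reads" \
        'BEGIN { printf "%.2f\n", r * 1e9 / (l * n) }'
}

# Two hypothetical genes with the same read count in a 10-million-read library:
rpkm 500 2000 10000000   # 2 kb gene, prints 25.00
rpkm 500 4000 10000000   # 4 kb gene, prints 12.50
```

The same 500 reads give the longer gene half the RPKM, reflecting that it needed twice the length to collect them.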

Correcting for gene length is not necessary when comparing changes in gene expression within the same gene across samples. However, it is necessary for correctly ranking gene expression levels within the sample to account for the fact that longer genes accumulate more reads (at the same expression level).


[HTSeq](https://htseq.readthedocs.io/en/release_0.11.1/) is a Python package that provides infrastructure to process data from high-throughput sequencing assays. Its `htseq-count` script counts the number of reads that map to each gene.
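A possible way to run the counting step and summarise its output is sketched below. The input names are placeholders, `-s no` assumes an unstranded library (check this for your own protocol), and both function names are hypothetical. The second helper relies on the two-column, tab-separated output of `htseq-count`, in which special counters such as `__no_feature` start with two underscores.

```bash
# Count reads per gene with htseq-count.
# Assumptions: SAM input ordered by read name, unstranded library (-s no).
count_reads() {
    local alignments=$1 gtf=$2
    htseq-count -f sam -r name -s no -t exon -i gene_id "$alignments" "$gtf"
}

# Sum the reads assigned to genes, skipping the special counters
# whose names start with "__" (for example __no_feature).
assigned_total() {
    awk -F '\t' '$1 !~ /^__/ { sum += $2 } END { print sum + 0 }'
}

# Demonstration on made-up counts in the two-column output format:
printf 'geneA\t120\ngeneB\t30\n__no_feature\t50\n__ambiguous\t5\n' | assigned_total
```

With real data the two helpers could be chained, for example `count_reads sample.sam genes.gtf | assigned_total`.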

#### DIFFERENTIAL EXPRESSION ANALYSIS

Differential expression analysis means taking the normalised read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. For example, we use statistical testing to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation.
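Before any statistical test, the size of a change is commonly expressed as a log2 fold change between group means. The sketch below computes only that effect size from made-up normalised counts; it is not a significance test and says nothing about random variation. The pseudocount of 1 is an assumption to avoid taking the logarithm of zero, and `log2_fold_change` is a hypothetical helper name.

```bash
# log2 fold change between two group means, with a pseudocount to avoid log(0).
log2_fold_change() {
    local control_mean=$1 treated_mean=$2 pseudo=${3:-1}
    awk -v c="$control_mean" -v t="$treated_mean" -v p="$pseudo" \
        'BEGIN { printf "%.2f\n", log((t + p) / (c + p)) / log(2) }'
}

log2_fold_change 7 31    # (31 + 1) / (7 + 1) = 4, prints 2.00
log2_fold_change 31 7    # the reverse comparison, prints -2.00
```

Whether such a change is significant depends on the variability between replicates, which is what dedicated statistical packages estimate.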

The Bioconductor package edgeR performs differential expression analysis of RNA-seq expression profiles with biological replication. It implements a range of statistical methodology based on the negative binomial distribution, including empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests. As well as RNA-seq, it can be applied to differential signal analysis of other types of genomic data that produce read counts, including ChIP-seq, ATAC-seq, Bisulfite-seq, SAGE and CAGE.