# Bioinformatics Pipelines Bioinformatics is an interdisciplinary field focused on developing software and hardware tools and methods to support biological data storage, organization, and analysis, particularly related to genetic sequencing. RNA-sequencing Data analysis is a Bioinformatics pipeline that deals on the analysis of transcriptome, indicating which of the genes encoded in our DNA are turned on or off and to what extent. ## Bioinformatics Analysis of RNA-seq Data RNA-seq is a powerful platform for comprehensive investigation of the transcriptome. RNA sequencing (RNA-Seq) uses the capabilities of high-throughput sequencing methods to provide insight into the transcriptome of a cell. ### The General bioinformatics workflow for the quantitative analysis of RNA-seq data: - RNA-seq Quality Check; - Quality timming of Adapters; - mapping sequencing reads to a reference genome or transcriptome; - quantifying expression levels of individual genes and transcripts; and - identifying specific genes and transcripts that are differentially expressed between samples. ```{image} ../img/rna-seq_workflow.png ``` #### RAW DATA - FASTQ Files Raw RNA-seq data are typically formatted as **FASTQ** files. **FASTQ** is a text-based format storing the sequences of the reads as well as their sequencing quality. The file is organized in groups of four lines per read as shown below: > @NB500929:247:HL2TYBGX3:1:11101:25163:1060 > GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT > + > !’’*((((\*\*\*+))%%%++)(%%%%).1\*\*\*-+\*’’))**55CCF>>>>>>CCCCCCC65 The first line starts with “@” and is followed by a unique sequence identifier, which includes instrument ID (NB500929), run number (247), and flow cell ID (HL2TYBGX3), followed by the numbers specifying the location of the DNA fragment on the flowcell. In the case of paired-end sequencing, two FASTQ files for read 1 and read 2 include the same sequence identifiers plus the read number (1 or 2), which indicates whether the sequence comes from read 1 or read 2 of the DNA fragment. The second line contains the read sequence. The third line starts with a “+” character and can optionally be followed by the same sequence identifier and any additional description. The fourth line encodes the sequencing quality scores for each base, which are coded as individual symbols according to a [coding scheme](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm) #### QUALITY CHECK ON RAW DATA RNA-Seq has become one of the most widely used applications based on next-generation sequencing technology. However, raw RNA-Seq data may have quality issues, which can significantly distort analytical results and lead to erroneous conclusions. Therefore, the raw data must be subjected to vigorous quality control (QC) procedures before downstream analysis. Currently, an accurate and complete QC of RNA-Seq data requires of a suite of different QC tools used consecutively, which is inefficient in terms of usability, running time, file usage, and interpretability of the results. [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. The main functions of FastQC are: Import of data from FASTQ files (also accepts BAM and SAM alignment files) Quick overview of any likely sequencing problems Summary graphs and tables to quickly assess your data Export of results as an HTML-based report FastQC has a really well documented manual page with detailed explanations about every plot in the report. <details> <summary>Working on FastQC</summary> ```bash $ fastqc SRR20076358_1.fastq.gz ``` Running the above command produces the following output in console and Results in making reports as two files `SRR20076358_1_fastqc.zip` and `SRR20076358_1_fastqc.html` ```bash Started analysis of SRR20076358_1.fastq.gz Approx 5% complete for SRR20076358_1.fastq.gz Approx 10% complete for SRR20076358_1.fastq.gz Approx 15% complete for SRR20076358_1.fastq.gz Approx 20% complete for SRR20076358_1.fastq.gz Approx 25% complete for SRR20076358_1.fastq.gz Approx 30% complete for SRR20076358_1.fastq.gz Approx 35% complete for SRR20076358_1.fastq.gz Approx 40% complete for SRR20076358_1.fastq.gz Approx 45% complete for SRR20076358_1.fastq.gz Approx 50% complete for SRR20076358_1.fastq.gz Approx 55% complete for SRR20076358_1.fastq.gz Approx 60% complete for SRR20076358_1.fastq.gz Approx 65% complete for SRR20076358_1.fastq.gz Approx 70% complete for SRR20076358_1.fastq.gz Approx 75% complete for SRR20076358_1.fastq.gz Approx 80% complete for SRR20076358_1.fastq.gz Approx 85% complete for SRR20076358_1.fastq.gz Approx 90% complete for SRR20076358_1.fastq.gz Approx 95% complete for SRR20076358_1.fastq.gz Analysis complete for SRR20076358_1.fastq.gz ``` Run the following command to view the result of the Quality Check as shown in _fig.2.1_ ```bash $ firefox SRR20076358_1.fastq.gz ``` ```{image} ../img/fastqc_report.png ``` <p align="center"> fig.2.1 </p> </details> #### QUALITY TRIMMING OF ADAPTERS Trimming for adaptors and low quality bases is important part of the analysis pipeline for sequencing data. Typically, after you isolate and fragment your RNA sample, adaptors are attached to the ends of the sequences that are needed for sequencing .These adaptors need to be removed from the sequenced reads before downstream processing. An additional step that needs to be taken is removing low quality bases. Quality trimming decreases the overall number of reads, but increases to the total and proportion of uniquely mapped reads. Thus, you get more useful data for downstream analyses. There are many tools for trimming reads and removing adapters, such as Trim Galore!, Trimmomatic, Cutadapt, skewer, AlienTrimmer, BBDuk, and the most recent SOAPnuke and fastp. Trim Galore! is a wrapper script to automate quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions for RRBS sequence files (for directional, non-directional (or paired-end) sequencing). <details> <summary>Working on trim_galore</summary> ```bash $ trim_galore --gzip --fastqc --max_n 2 --paired --length 50 SRR11862696_1.fastq.gz SRR11862696_2.fastq.gz ``` <details> <summary>The above command will produce the following result in console </summary> ```bash # trim_galore --gzip --fastqc --max_n 2 --paired --length 50 SRR11862696_1.fastq.gz SRR11862696_2.fastq.gz Multicore support not enabled. Proceeding with single-core trimming. Path to Cutadapt set as: 'cutadapt' (default) Cutadapt seems to be working fine (tested command 'cutadapt --version') Cutadapt version: 4.1 single-core operation. No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default) AUTO-DETECTING ADAPTER TYPE =========================== Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> SRR11862696_1.fastq.gz <<) Found perfect matches for the following adapter sequences: Adapter type Count Sequence Sequences analysed Percentage Illumina 4229 AGATCGGAAGAGC 1000000 0.42 Nextera 7 CTGTCTCTTATA 1000000 0.00 smallRNA 3 TGGAATTCTCGG 1000000 0.00 Using Illumina adapter for trimming (count: 4229). Second best hit was Nextera (count: 7) Writing report to 'SRR11862696_1.fastq.gz_trimming_report.txt' SUMMARISING RUN PARAMETERS ========================== Input filename: SRR11862696_1.fastq.gz Trimming mode: paired-end Trim Galore version: 0.6.6 Cutadapt version: 4.1 Number of cores used for trimming: 1 Quality Phred score cutoff: 20 Quality encoding type selected: ASCII+33 Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected) Maximum trimming error rate: 0.1 (default) Maximum number of tolerated Ns: 2 Minimum required adapter overlap (stringency): 1 bp Minimum required sequence length for both reads before a sequence pair gets removed: 50 bp Running FastQC on the data once trimming has completed Output file(s) will be GZIP compressed Cutadapt seems to be fairly up-to-date (version 4.1). Setting -j 1 Writing final adapter and quality trimmed output to SRR11862696_1_trimmed.fq.gz >>> Now performing quality (cutoff '-q 20') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file SRR11862696_1.fastq.gz <<< 10000000 sequences processed 20000000 sequences processed 30000000 sequences processed 40000000 sequences processed This is cutadapt 4.1 with Python 3.10.6 Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC SRR11862696_1.fastq.gz Processing single-end reads on 1 core ... Finished in 1353.17 s (29 µs/read; 2.08 M reads/minute). === Summary === Total reads processed: 46,831,782 Reads with adapters: 15,832,779 (33.8%) Reads written (passing filters): 46,831,782 (100.0%) Total basepairs processed: 4,730,009,982 bp Quality-trimmed: 45,195,419 bp (1.0%) Total written (filtered): 4,644,800,323 bp (98.2%) === Adapter 1 === Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 15832779 times Minimum overlap: 1 No. of allowed errors: 1-9 bp: 0; 10-13 bp: 1 Bases preceding removed adapters: A: 29.1% C: 33.1% G: 21.6% T: 15.3% none/other: 1.0% Overview of removed sequences length count expect max.err error counts 1 10545866 11707945.5 0 10545866 2 3526202 2926986.4 0 3526202 3 947984 731746.6 0 947984 4 207831 182936.6 0 207831 5 70578 45734.2 0 70578 6 34326 11433.5 0 34326 7 27723 2858.4 0 27723 8 25762 714.6 0 25762 9 23820 178.6 0 23189 631 10 21148 44.7 1 19978 1170 11 18346 11.2 1 17295 1051 12 16859 2.8 1 16455 404 13 14322 0.7 1 14086 236 14 14148 0.7 1 13862 286 15 13064 0.7 1 12807 257 16 13012 0.7 1 12753 259 17 12842 0.7 1 12598 244 18 12040 0.7 1 11787 253 19 9449 0.7 1 9243 206 20 8880 0.7 1 8713 167 21 8038 0.7 1 7853 185 22 6802 0.7 1 6651 151 23 6453 0.7 1 6261 192 24 5955 0.7 1 5803 152 25 5584 0.7 1 5415 169 26 5190 0.7 1 5018 172 27 5085 0.7 1 4924 161 28 4953 0.7 1 4808 145 29 4594 0.7 1 4427 167 30 4196 0.7 1 4048 148 31 3299 0.7 1 3176 123 32 3192 0.7 1 3072 120 33 2944 0.7 1 2808 136 34 2785 0.7 1 2612 173 35 2601 0.7 1 2458 143 36 2053 0.7 1 1923 130 37 2335 0.7 1 2155 180 38 2403 0.7 1 2260 143 39 2047 0.7 1 1855 192 40 2027 0.7 1 1880 147 41 1750 0.7 1 1537 213 42 1814 0.7 1 1595 219 43 1645 0.7 1 1545 100 44 994 0.7 1 914 80 45 1238 0.7 1 1163 75 46 975 0.7 1 880 95 47 1148 0.7 1 1021 127 48 1106 0.7 1 991 115 49 958 0.7 1 864 94 50 980 0.7 1 839 141 51 924 0.7 1 825 99 52 828 0.7 1 721 107 53 958 0.7 1 742 216 54 1029 0.7 1 741 288 55 917 0.7 1 785 132 56 529 0.7 1 464 65 57 698 0.7 1 537 161 58 740 0.7 1 519 221 59 829 0.7 1 535 294 60 1114 0.7 1 576 538 61 992 0.7 1 596 396 62 1052 0.7 1 456 596 63 1551 0.7 1 483 1068 64 2633 0.7 1 533 2100 65 4064 0.7 1 619 3445 66 2119 0.7 1 542 1577 67 1983 0.7 1 503 1480 68 2846 0.7 1 449 2397 69 5405 0.7 1 485 4920 70 10363 0.7 1 662 9701 71 47660 0.7 1 968 46692 72 34144 0.7 1 2204 31940 73 15329 0.7 1 970 14359 74 8711 0.7 1 812 7899 75 3735 0.7 1 481 3254 76 3407 0.7 1 281 3126 77 3785 0.7 1 93 3692 78 2433 0.7 1 44 2389 79 1587 0.7 1 35 1552 80 964 0.7 1 17 947 81 709 0.7 1 14 695 82 469 0.7 1 8 461 83 325 0.7 1 5 320 84 295 0.7 1 5 290 85 198 0.7 1 3 195 86 161 0.7 1 6 155 87 184 0.7 1 9 175 88 143 0.7 1 3 140 89 124 0.7 1 2 122 90 142 0.7 1 1 141 91 131 0.7 1 1 130 92 164 0.7 1 3 161 93 168 0.7 1 0 168 94 188 0.7 1 1 187 95 188 0.7 1 1 187 96 267 0.7 1 0 267 97 312 0.7 1 2 310 98 390 0.7 1 0 390 99 404 0.7 1 0 404 100 767 0.7 1 0 767 101 4375 0.7 1 0 4375 RUN STATISTICS FOR INPUT FILE: SRR11862696_1.fastq.gz ============================================= 46831782 sequences processed in total The length threshold of paired-end sequences gets evaluated later on (in the validation step) Writing report to 'SRR11862696_2.fastq.gz_trimming_report.txt' SUMMARISING RUN PARAMETERS ========================== Input filename: SRR11862696_2.fastq.gz Trimming mode: paired-end Trim Galore version: 0.6.6 Cutadapt version: 4.1 Number of cores used for trimming: 1 Quality Phred score cutoff: 20 Quality encoding type selected: ASCII+33 Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected) Maximum trimming error rate: 0.1 (default) Maximum number of tolerated Ns: 2 Minimum required adapter overlap (stringency): 1 bp Minimum required sequence length for both reads before a sequence pair gets removed: 50 bp Running FastQC on the data once trimming has completed Output file(s) will be GZIP compressed Cutadapt seems to be fairly up-to-date (version 4.1). Setting -j -j 1 Writing final adapter and quality trimmed output to SRR11862696_2_trimmed.fq.gz >>> Now performing quality (cutoff '-q 20') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file SRR11862696_2.fastq.gz <<< 10000000 sequences processed 20000000 sequences processed 30000000 sequences processed 40000000 sequences processed This is cutadapt 4.1 with Python 3.10.6 Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC SRR11862696_2.fastq.gz Processing single-end reads on 1 core ... Finished in 1380.06 s (29 µs/read; 2.04 M reads/minute). === Summary === Total reads processed: 46,831,782 Reads with adapters: 15,576,554 (33.3%) Reads written (passing filters): 46,831,782 (100.0%) Total basepairs processed: 4,730,009,982 bp Quality-trimmed: 103,667,050 bp (2.2%) Total written (filtered): 4,596,284,410 bp (97.2%) === Adapter 1 === Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 15576554 times Minimum overlap: 1 No. of allowed errors: 1-9 bp: 0; 10-13 bp: 1 Bases preceding removed adapters: A: 29.7% C: 33.2% G: 21.6% T: 15.3% none/other: 0.2% Overview of removed sequences length count expect max.err error counts 1 10415056 11707945.5 0 10415056 2 3549673 2926986.4 0 3549673 3 930335 731746.6 0 930335 4 203432 182936.6 0 203432 5 70152 45734.2 0 70152 6 34001 11433.5 0 34001 7 28487 2858.4 0 28487 8 25528 714.6 0 25528 9 23898 178.6 0 23109 789 10 22040 44.7 1 20613 1427 11 18086 11.2 1 16804 1282 12 17308 2.8 1 16717 591 13 15092 0.7 1 14434 658 14 15841 0.7 1 15497 344 15 11921 0.7 1 11544 377 16 12960 0.7 1 12493 467 17 14323 0.7 1 14027 296 18 9700 0.7 1 9441 259 19 11057 0.7 1 10834 223 20 7979 0.7 1 7799 180 21 7047 0.7 1 6893 154 22 6828 0.7 1 6651 177 23 7131 0.7 1 6180 951 24 6703 0.7 1 6513 190 25 5485 0.7 1 5227 258 26 8361 0.7 1 5068 3293 27 5341 0.7 1 4702 639 28 5433 0.7 1 5188 245 29 5099 0.7 1 4100 999 30 8235 0.7 1 8058 177 31 574 0.7 1 431 143 32 3205 0.7 1 3094 111 33 1589 0.7 1 1483 106 34 2116 0.7 1 2014 102 35 2219 0.7 1 2095 124 36 2189 0.7 1 1992 197 37 2105 0.7 1 1963 142 38 2127 0.7 1 1985 142 39 2108 0.7 1 1912 196 40 1853 0.7 1 1753 100 41 2786 0.7 1 1641 1145 42 2087 0.7 1 2001 86 43 974 0.7 1 885 89 44 2286 0.7 1 1231 1055 45 1723 0.7 1 1619 104 46 777 0.7 1 690 87 47 920 0.7 1 864 56 48 912 0.7 1 829 83 49 1068 0.7 1 869 199 50 1074 0.7 1 957 117 51 1098 0.7 1 1033 65 52 631 0.7 1 589 42 53 598 0.7 1 551 47 54 661 0.7 1 608 53 55 728 0.7 1 649 79 56 528 0.7 1 476 52 57 656 0.7 1 578 78 58 602 0.7 1 561 41 59 593 0.7 1 544 49 60 632 0.7 1 548 84 61 681 0.7 1 516 165 62 1995 0.7 1 637 1358 63 1889 0.7 1 1135 754 64 4483 0.7 1 1659 2824 65 13843 0.7 1 3320 10523 66 6624 0.7 1 2171 4453 67 1203 0.7 1 529 674 68 354 0.7 1 196 158 69 163 0.7 1 72 91 70 80 0.7 1 19 61 71 55 0.7 1 21 34 72 81 0.7 1 19 62 73 48 0.7 1 14 34 74 36 0.7 1 8 28 75 29 0.7 1 1 28 76 38 0.7 1 5 33 77 60 0.7 1 6 54 78 31 0.7 1 3 28 79 35 0.7 1 7 28 80 39 0.7 1 3 36 81 48 0.7 1 4 44 82 29 0.7 1 5 24 83 32 0.7 1 4 28 84 34 0.7 1 3 31 85 34 0.7 1 5 29 86 32 0.7 1 6 26 87 53 0.7 1 7 46 88 32 0.7 1 3 29 89 40 0.7 1 6 34 90 35 0.7 1 5 30 91 41 0.7 1 2 39 92 61 0.7 1 1 60 93 32 0.7 1 1 31 94 32 0.7 1 2 30 95 19 0.7 1 2 17 96 25 0.7 1 0 25 97 33 0.7 1 1 32 98 68 0.7 1 1 67 99 24 0.7 1 0 24 100 27 0.7 1 0 27 101 105 0.7 1 0 105 RUN STATISTICS FOR INPUT FILE: SRR11862696_2.fastq.gz ============================================= 46831782 sequences processed in total The length threshold of paired-end sequences gets evaluated later on (in the validation step) Validate paired-end files SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz file_1: SRR11862696_1_trimmed.fq.gz, file_2: SRR11862696_2_trimmed.fq.gz >>>>> Now validing the length of the 2 paired-end infiles: SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz <<<<< Writing validated paired-end Read 1 reads to SRR11862696_1_val_1.fq.gz Writing validated paired-end Read 2 reads to SRR11862696_2_val_2.fq.gz Total number of sequences analysed: 46831782 Number of sequence pairs removed because at least one read was shorter than the length cutoff (50 bp): 1159884 (2.48%) Number of sequence pairs removed because at least one read contained more N(s) than the specified limit of 2: 62468 (0.13%) >>> Now running FastQC on the validated data SRR11862696_1_val_1.fq.gz<<< Started analysis of SRR11862696_1_val_1.fq.gz Approx 5% complete for SRR11862696_1_val_1.fq.gz Approx 10% complete for SRR11862696_1_val_1.fq.gz Approx 15% complete for SRR11862696_1_val_1.fq.gz Approx 20% complete for SRR11862696_1_val_1.fq.gz Approx 25% complete for SRR11862696_1_val_1.fq.gz Approx 30% complete for SRR11862696_1_val_1.fq.gz Approx 35% complete for SRR11862696_1_val_1.fq.gz Approx 40% complete for SRR11862696_1_val_1.fq.gz Approx 45% complete for SRR11862696_1_val_1.fq.gz Approx 50% complete for SRR11862696_1_val_1.fq.gz Approx 55% complete for SRR11862696_1_val_1.fq.gz Approx 60% complete for SRR11862696_1_val_1.fq.gz Approx 65% complete for SRR11862696_1_val_1.fq.gz Approx 70% complete for SRR11862696_1_val_1.fq.gz Approx 75% complete for SRR11862696_1_val_1.fq.gz Approx 80% complete for SRR11862696_1_val_1.fq.gz Approx 85% complete for SRR11862696_1_val_1.fq.gz Approx 90% complete for SRR11862696_1_val_1.fq.gz Approx 95% complete for SRR11862696_1_val_1.fq.gz Analysis complete for SRR11862696_1_val_1.fq.gz >>> Now running FastQC on the validated data SRR11862696_2_val_2.fq.gz<<< Started analysis of SRR11862696_2_val_2.fq.gz Approx 5% complete for SRR11862696_2_val_2.fq.gz Approx 10% complete for SRR11862696_2_val_2.fq.gz Approx 15% complete for SRR11862696_2_val_2.fq.gz Approx 20% complete for SRR11862696_2_val_2.fq.gz Approx 25% complete for SRR11862696_2_val_2.fq.gz Approx 30% complete for SRR11862696_2_val_2.fq.gz Approx 35% complete for SRR11862696_2_val_2.fq.gz Approx 40% complete for SRR11862696_2_val_2.fq.gz Approx 45% complete for SRR11862696_2_val_2.fq.gz Approx 50% complete for SRR11862696_2_val_2.fq.gz Approx 55% complete for SRR11862696_2_val_2.fq.gz Approx 60% complete for SRR11862696_2_val_2.fq.gz Approx 65% complete for SRR11862696_2_val_2.fq.gz Approx 70% complete for SRR11862696_2_val_2.fq.gz Approx 75% complete for SRR11862696_2_val_2.fq.gz Approx 80% complete for SRR11862696_2_val_2.fq.gz Approx 85% complete for SRR11862696_2_val_2.fq.gz Approx 90% complete for SRR11862696_2_val_2.fq.gz Approx 95% complete for SRR11862696_2_val_2.fq.gz Analysis complete for SRR11862696_2_val_2.fq.gz Deleting both intermediate output files SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz ==================================================================================================== ``` </details> Produces quality control reports after trimming adapters from the raw data using FastQC as `SRR11862696_1_val_1_fastqc.html` and `SRR11862696_2_val_2_fastqc.html` You can view quality control report using the following command: ```bash $ firefox SRR11862696_1_val_1_fastqc.html ``` ```{image} ../img/trim_galore_report.png ``` </details> #### ALIGNMENT Once high-quality data are obtained from pre-processing, the next step is the read mapping or alignment.When studying an organism with a reference genome, it is possible to infer which transcripts are expressed by mapping the reads to the reference genome **(genome mapping)** or transcriptome **(transcriptome mapping)**. Mapping reads to the genome requires no knowledge of the set of transcribed regions or the way in which exons are spliced together. This approach allows the discovery of new, unannotated transcripts. **INDEX** Before doing the mapping, we have to prepare an index from the reference DNA sequence that a chosen algorithm will use. Like the index at the end of a book, an index of a large DNA sequence allows one to rapidly find shorter sequences embedded in it. Different tools use different approaches at genome/transcriptome indexing. **Splice-aware aligners to a reference genome** These aligners are able to map to the splicing junctions described in the annotation and even to detect novel ones. Some of them can detect gene fusions and SNPs and also RNA editing. For some of these tools, the downstream analysis requires the assignation of the aligned reads to a given gene/transcript. [HISAT2](https://daehwankimlab.github.io/hisat2/) is the next generation of spliced aligner from the same group that have developed TopHat. It is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). The indexing scheme is called a Hierarchical Graph FM index (HGFM). Wrapper The **hisat2**, **hisat2-build** and **hisat2-inspect** executables are actually wrapper scripts that call binary programs as appropriate. The wrappers shield users from having to distinguish between “small” and “large” index formats, discussed briefly in the following section. Also, the hisat2 wrapper provides some key functionality, like the ability to handle compressed inputs, and the functionality for --un, --al and related options. It is recommended that you always run the hisat2 wrappers and not run the binaries directly. For more understanding about working on Hisat2 refer [http://daehwankimlab.github.io/hisat2/manual/](http://daehwankimlab.github.io/hisat2/manual/) #### GENE EXPRESSION QUANTIFICATION The simplest approach to quantifying gene expression by RNA-seq is to count the number of reads that map (i.e. align) to each gene (read count). This gene-level quantification approach utilises a gene transfer format (GTF) file containing gene models, with each model representing the structure of transcripts produced by a given gene. Raw read counts are affected by factors such as transcript length (longer transcripts have higher read counts, at the same expression level) and total number of reads. Thus, if we want to compare expression levels between samples, we need to normalise the raw read counts. The measure RPKM (reads per kilobase of exon model per million reads) and its derivative FPKM (fragments per kilobase of exon model per million reads mapped) account for both gene length and library size effects Correcting for gene length is not necessary when comparing changes in gene expression within the same gene across samples. However, it is necessary for correctly ranking gene expression levels within the sample to account for the fact that longer genes accumulate more reads (at the same expression level). [HTseq-count](https://htseq.readthedocs.io/en/release_0.11.1/) Analyses high-throughput sequencing data with Python **HTSeq** is a Python package that provides infrastructure to process data from high-throughput sequencing assays. #### DIFFERENTIAL EXPRESSION ANALYSIS Differential expression analysis means taking the normalised read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. For example, we use statistical testing to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation. Differential expression analysis of RNA-seq expression profiles with biological replication. Implements a range of statistical methodology based on the negative binomial distributions, including empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests. As well as RNA-seq, it be applied to differential signal analysis of other types of genomic data that produce read counts, including ChIP-seq, ATAC-seq, Bisulfite-seq, SAGE and CAGE.