Newer
Older
Bioinformatics is an interdisciplinary field focused on developing software and hardware tools and methods to support biological data storage, organization, and analysis, particularly related to genetic sequencing.
RNA-sequencing Data analysis is a Bioinformatics pipeline that deals on the analysis of transcriptome, indicating which of the genes encoded in our DNA are turned on or off and to what extent.
RNA-seq is a powerful platform for comprehensive investigation of the transcriptome. RNA sequencing (RNA-Seq) uses the capabilities of high-throughput sequencing methods to provide insight into the transcriptome of a cell.
### The General bioinformatics workflow for the quantitative analysis of RNA-seq data:
- RNA-seq Quality Check;
- Quality timming of Adapters;
- mapping sequencing reads to a reference genome or transcriptome;
- quantifying expression levels of individual genes and transcripts; and
- identifying specific genes and transcripts that are differentially expressed between samples.
```{image} ../img/rna-seq_workflow.png
```
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
Raw RNA-seq data are typically formatted as **FASTQ** files. **FASTQ** is a text-based format storing the sequences of the reads as well as their sequencing quality. The file is organized in groups of four lines per read as shown below:
> @NB500929:247:HL2TYBGX3:1:11101:25163:1060
> GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
> +
> !’’*((((\*\*\*+))%%%++)(%%%%).1\*\*\*-+\*’’))**55CCF>>>>>>CCCCCCC65
The first line starts with “@” and is followed by a unique sequence identifier, which includes instrument ID (NB500929), run number (247), and flow cell ID (HL2TYBGX3), followed by the numbers specifying the location of the DNA fragment on the flowcell. In the case of paired-end sequencing, two FASTQ files for read 1 and read 2 include the same sequence identifiers plus the read number (1 or 2), which indicates whether the sequence comes from read 1 or read 2 of the DNA fragment. The second line contains the read sequence. The third line starts with a “+” character and can optionally be followed by the same sequence identifier and any additional description. The fourth line encodes the sequencing quality scores for each base, which are coded as individual symbols according to a [coding scheme](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm)
#### QUALITY CHECK ON RAW DATA
RNA-Seq has become one of the most widely used applications based on next-generation sequencing technology. However, raw RNA-Seq data may have quality issues, which can significantly distort analytical results and lead to erroneous conclusions. Therefore, the raw data must be subjected to vigorous quality control (QC) procedures before downstream analysis. Currently, an accurate and complete QC of RNA-Seq data requires of a suite of different QC tools used consecutively, which is inefficient in terms of usability, running time, file usage, and interpretability of the results.
[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
The main functions of FastQC are:
Import of data from FASTQ files (also accepts BAM and SAM alignment files)
Quick overview of any likely sequencing problems
Summary graphs and tables to quickly assess your data
Export of results as an HTML-based report
FastQC has a really well documented manual page with detailed explanations about every plot in the report.
<details>
<summary>Working on FastQC</summary>
```bash
$ fastqc SRR20076358_1.fastq.gz
```
Running the above command produces the following output in console and Results in making reports as two files `SRR20076358_1_fastqc.zip` and `SRR20076358_1_fastqc.html`
```bash
Started analysis of SRR20076358_1.fastq.gz
Approx 5% complete for SRR20076358_1.fastq.gz
Approx 10% complete for SRR20076358_1.fastq.gz
Approx 15% complete for SRR20076358_1.fastq.gz
Approx 20% complete for SRR20076358_1.fastq.gz
Approx 25% complete for SRR20076358_1.fastq.gz
Approx 30% complete for SRR20076358_1.fastq.gz
Approx 35% complete for SRR20076358_1.fastq.gz
Approx 40% complete for SRR20076358_1.fastq.gz
Approx 45% complete for SRR20076358_1.fastq.gz
Approx 50% complete for SRR20076358_1.fastq.gz
Approx 55% complete for SRR20076358_1.fastq.gz
Approx 60% complete for SRR20076358_1.fastq.gz
Approx 65% complete for SRR20076358_1.fastq.gz
Approx 70% complete for SRR20076358_1.fastq.gz
Approx 75% complete for SRR20076358_1.fastq.gz
Approx 80% complete for SRR20076358_1.fastq.gz
Approx 85% complete for SRR20076358_1.fastq.gz
Approx 90% complete for SRR20076358_1.fastq.gz
Approx 95% complete for SRR20076358_1.fastq.gz
Analysis complete for SRR20076358_1.fastq.gz
```
Run the following command to view the result of the Quality Check as shown in _fig.2.1_
```bash
$ firefox SRR20076358_1.fastq.gz
```
```{image} ../img/fastqc_report.png
```
<p align="center">
fig.2.1
</p>
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
#### QUALITY TRIMMING OF ADAPTERS
Trimming for adaptors and low quality bases is important part of the analysis pipeline for sequencing data. Typically, after you isolate and fragment your RNA sample, adaptors are attached to the ends of the sequences that are needed for sequencing .These adaptors need to be removed from the sequenced reads before downstream processing. An additional step that needs to be taken is removing low quality bases.
Quality trimming decreases the overall number of reads, but increases to the total and proportion of uniquely mapped reads. Thus, you get more useful data for downstream analyses.
There are many tools for trimming reads and removing adapters, such as Trim Galore!, Trimmomatic, Cutadapt, skewer, AlienTrimmer, BBDuk, and the most recent SOAPnuke and fastp.
Trim Galore! is a wrapper script to automate quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions for RRBS sequence files (for directional, non-directional (or paired-end) sequencing).
<details>
<summary>Working on trim_galore</summary>
```bash
$ trim_galore --gzip --fastqc --max_n 2 --paired --length 50 SRR11862696_1.fastq.gz SRR11862696_2.fastq.gz
```
<details>
<summary>The above command will produce the following result in console
</summary>
```bash
# trim_galore --gzip --fastqc --max_n 2 --paired --length 50 SRR11862696_1.fastq.gz SRR11862696_2.fastq.gz
Multicore support not enabled. Proceeding with single-core trimming.
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.1
single-core operation.
No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)
AUTO-DETECTING ADAPTER TYPE
===========================
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> SRR11862696_1.fastq.gz <<)
Found perfect matches for the following adapter sequences:
Adapter type Count Sequence Sequences analysed Percentage
Illumina 4229 AGATCGGAAGAGC 1000000 0.42
Nextera 7 CTGTCTCTTATA 1000000 0.00
smallRNA 3 TGGAATTCTCGG 1000000 0.00
Using Illumina adapter for trimming (count: 4229). Second best hit was Nextera (count: 7)
Writing report to 'SRR11862696_1.fastq.gz_trimming_report.txt'
SUMMARISING RUN PARAMETERS
==========================
Input filename: SRR11862696_1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.6
Cutadapt version: 4.1
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Maximum number of tolerated Ns: 2
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 50 bp
Running FastQC on the data once trimming has completed
Output file(s) will be GZIP compressed
Cutadapt seems to be fairly up-to-date (version 4.1). Setting -j 1
Writing final adapter and quality trimmed output to SRR11862696_1_trimmed.fq.gz
>>> Now performing quality (cutoff '-q 20') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file SRR11862696_1.fastq.gz <<<
10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
40000000 sequences processed
This is cutadapt 4.1 with Python 3.10.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC SRR11862696_1.fastq.gz
Processing single-end reads on 1 core ...
Finished in 1353.17 s (29 µs/read; 2.08 M reads/minute).
=== Summary ===
Total reads processed: 46,831,782
Reads with adapters: 15,832,779 (33.8%)
Reads written (passing filters): 46,831,782 (100.0%)
Total basepairs processed: 4,730,009,982 bp
Quality-trimmed: 45,195,419 bp (1.0%)
Total written (filtered): 4,644,800,323 bp (98.2%)
=== Adapter 1 ===
Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 15832779 times
Minimum overlap: 1
No. of allowed errors:
1-9 bp: 0; 10-13 bp: 1
Bases preceding removed adapters:
A: 29.1%
C: 33.1%
G: 21.6%
T: 15.3%
none/other: 1.0%
Overview of removed sequences
length count expect max.err error counts
1 10545866 11707945.5 0 10545866
2 3526202 2926986.4 0 3526202
3 947984 731746.6 0 947984
4 207831 182936.6 0 207831
5 70578 45734.2 0 70578
6 34326 11433.5 0 34326
7 27723 2858.4 0 27723
8 25762 714.6 0 25762
9 23820 178.6 0 23189 631
10 21148 44.7 1 19978 1170
11 18346 11.2 1 17295 1051
12 16859 2.8 1 16455 404
13 14322 0.7 1 14086 236
14 14148 0.7 1 13862 286
15 13064 0.7 1 12807 257
16 13012 0.7 1 12753 259
17 12842 0.7 1 12598 244
18 12040 0.7 1 11787 253
19 9449 0.7 1 9243 206
20 8880 0.7 1 8713 167
21 8038 0.7 1 7853 185
22 6802 0.7 1 6651 151
23 6453 0.7 1 6261 192
24 5955 0.7 1 5803 152
25 5584 0.7 1 5415 169
26 5190 0.7 1 5018 172
27 5085 0.7 1 4924 161
28 4953 0.7 1 4808 145
29 4594 0.7 1 4427 167
30 4196 0.7 1 4048 148
31 3299 0.7 1 3176 123
32 3192 0.7 1 3072 120
33 2944 0.7 1 2808 136
34 2785 0.7 1 2612 173
35 2601 0.7 1 2458 143
36 2053 0.7 1 1923 130
37 2335 0.7 1 2155 180
38 2403 0.7 1 2260 143
39 2047 0.7 1 1855 192
40 2027 0.7 1 1880 147
41 1750 0.7 1 1537 213
42 1814 0.7 1 1595 219
43 1645 0.7 1 1545 100
44 994 0.7 1 914 80
45 1238 0.7 1 1163 75
46 975 0.7 1 880 95
47 1148 0.7 1 1021 127
48 1106 0.7 1 991 115
49 958 0.7 1 864 94
50 980 0.7 1 839 141
51 924 0.7 1 825 99
52 828 0.7 1 721 107
53 958 0.7 1 742 216
54 1029 0.7 1 741 288
55 917 0.7 1 785 132
56 529 0.7 1 464 65
57 698 0.7 1 537 161
58 740 0.7 1 519 221
59 829 0.7 1 535 294
60 1114 0.7 1 576 538
61 992 0.7 1 596 396
62 1052 0.7 1 456 596
63 1551 0.7 1 483 1068
64 2633 0.7 1 533 2100
65 4064 0.7 1 619 3445
66 2119 0.7 1 542 1577
67 1983 0.7 1 503 1480
68 2846 0.7 1 449 2397
69 5405 0.7 1 485 4920
70 10363 0.7 1 662 9701
71 47660 0.7 1 968 46692
72 34144 0.7 1 2204 31940
73 15329 0.7 1 970 14359
74 8711 0.7 1 812 7899
75 3735 0.7 1 481 3254
76 3407 0.7 1 281 3126
77 3785 0.7 1 93 3692
78 2433 0.7 1 44 2389
79 1587 0.7 1 35 1552
80 964 0.7 1 17 947
81 709 0.7 1 14 695
82 469 0.7 1 8 461
83 325 0.7 1 5 320
84 295 0.7 1 5 290
85 198 0.7 1 3 195
86 161 0.7 1 6 155
87 184 0.7 1 9 175
88 143 0.7 1 3 140
89 124 0.7 1 2 122
90 142 0.7 1 1 141
91 131 0.7 1 1 130
92 164 0.7 1 3 161
93 168 0.7 1 0 168
94 188 0.7 1 1 187
95 188 0.7 1 1 187
96 267 0.7 1 0 267
97 312 0.7 1 2 310
98 390 0.7 1 0 390
99 404 0.7 1 0 404
100 767 0.7 1 0 767
101 4375 0.7 1 0 4375
RUN STATISTICS FOR INPUT FILE: SRR11862696_1.fastq.gz
=============================================
46831782 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)
Writing report to 'SRR11862696_2.fastq.gz_trimming_report.txt'
SUMMARISING RUN PARAMETERS
==========================
Input filename: SRR11862696_2.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.6
Cutadapt version: 4.1
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Maximum number of tolerated Ns: 2
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 50 bp
Running FastQC on the data once trimming has completed
Output file(s) will be GZIP compressed
Cutadapt seems to be fairly up-to-date (version 4.1). Setting -j -j 1
Writing final adapter and quality trimmed output to SRR11862696_2_trimmed.fq.gz
>>> Now performing quality (cutoff '-q 20') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file SRR11862696_2.fastq.gz <<<
10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
40000000 sequences processed
This is cutadapt 4.1 with Python 3.10.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC SRR11862696_2.fastq.gz
Processing single-end reads on 1 core ...
Finished in 1380.06 s (29 µs/read; 2.04 M reads/minute).
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
Total reads processed: 46,831,782
Reads with adapters: 15,576,554 (33.3%)
Reads written (passing filters): 46,831,782 (100.0%)
Total basepairs processed: 4,730,009,982 bp
Quality-trimmed: 103,667,050 bp (2.2%)
Total written (filtered): 4,596,284,410 bp (97.2%)
=== Adapter 1 ===
Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 15576554 times
Minimum overlap: 1
No. of allowed errors:
1-9 bp: 0; 10-13 bp: 1
Bases preceding removed adapters:
A: 29.7%
C: 33.2%
G: 21.6%
T: 15.3%
none/other: 0.2%
Overview of removed sequences
length count expect max.err error counts
1 10415056 11707945.5 0 10415056
2 3549673 2926986.4 0 3549673
3 930335 731746.6 0 930335
4 203432 182936.6 0 203432
5 70152 45734.2 0 70152
6 34001 11433.5 0 34001
7 28487 2858.4 0 28487
8 25528 714.6 0 25528
9 23898 178.6 0 23109 789
10 22040 44.7 1 20613 1427
11 18086 11.2 1 16804 1282
12 17308 2.8 1 16717 591
13 15092 0.7 1 14434 658
14 15841 0.7 1 15497 344
15 11921 0.7 1 11544 377
16 12960 0.7 1 12493 467
17 14323 0.7 1 14027 296
18 9700 0.7 1 9441 259
19 11057 0.7 1 10834 223
20 7979 0.7 1 7799 180
21 7047 0.7 1 6893 154
22 6828 0.7 1 6651 177
23 7131 0.7 1 6180 951
24 6703 0.7 1 6513 190
25 5485 0.7 1 5227 258
26 8361 0.7 1 5068 3293
27 5341 0.7 1 4702 639
28 5433 0.7 1 5188 245
29 5099 0.7 1 4100 999
30 8235 0.7 1 8058 177
31 574 0.7 1 431 143
32 3205 0.7 1 3094 111
33 1589 0.7 1 1483 106
34 2116 0.7 1 2014 102
35 2219 0.7 1 2095 124
36 2189 0.7 1 1992 197
37 2105 0.7 1 1963 142
38 2127 0.7 1 1985 142
39 2108 0.7 1 1912 196
40 1853 0.7 1 1753 100
41 2786 0.7 1 1641 1145
42 2087 0.7 1 2001 86
43 974 0.7 1 885 89
44 2286 0.7 1 1231 1055
45 1723 0.7 1 1619 104
46 777 0.7 1 690 87
47 920 0.7 1 864 56
48 912 0.7 1 829 83
49 1068 0.7 1 869 199
50 1074 0.7 1 957 117
51 1098 0.7 1 1033 65
52 631 0.7 1 589 42
53 598 0.7 1 551 47
54 661 0.7 1 608 53
55 728 0.7 1 649 79
56 528 0.7 1 476 52
57 656 0.7 1 578 78
58 602 0.7 1 561 41
59 593 0.7 1 544 49
60 632 0.7 1 548 84
61 681 0.7 1 516 165
62 1995 0.7 1 637 1358
63 1889 0.7 1 1135 754
64 4483 0.7 1 1659 2824
65 13843 0.7 1 3320 10523
66 6624 0.7 1 2171 4453
67 1203 0.7 1 529 674
68 354 0.7 1 196 158
69 163 0.7 1 72 91
70 80 0.7 1 19 61
71 55 0.7 1 21 34
72 81 0.7 1 19 62
73 48 0.7 1 14 34
74 36 0.7 1 8 28
75 29 0.7 1 1 28
76 38 0.7 1 5 33
77 60 0.7 1 6 54
78 31 0.7 1 3 28
79 35 0.7 1 7 28
80 39 0.7 1 3 36
81 48 0.7 1 4 44
82 29 0.7 1 5 24
83 32 0.7 1 4 28
84 34 0.7 1 3 31
85 34 0.7 1 5 29
86 32 0.7 1 6 26
87 53 0.7 1 7 46
88 32 0.7 1 3 29
89 40 0.7 1 6 34
90 35 0.7 1 5 30
91 41 0.7 1 2 39
92 61 0.7 1 1 60
93 32 0.7 1 1 31
94 32 0.7 1 2 30
95 19 0.7 1 2 17
96 25 0.7 1 0 25
97 33 0.7 1 1 32
98 68 0.7 1 1 67
99 24 0.7 1 0 24
100 27 0.7 1 0 27
101 105 0.7 1 0 105
RUN STATISTICS FOR INPUT FILE: SRR11862696_2.fastq.gz
=============================================
46831782 sequences processed in total
The length threshold of paired-end sequences gets evaluated later on (in the validation step)
Validate paired-end files SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz
file_1: SRR11862696_1_trimmed.fq.gz, file_2: SRR11862696_2_trimmed.fq.gz
>>>>> Now validing the length of the 2 paired-end infiles: SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz <<<<<
Writing validated paired-end Read 1 reads to SRR11862696_1_val_1.fq.gz
Writing validated paired-end Read 2 reads to SRR11862696_2_val_2.fq.gz
Total number of sequences analysed: 46831782
Number of sequence pairs removed because at least one read was shorter than the length cutoff (50 bp): 1159884 (2.48%)
Number of sequence pairs removed because at least one read contained more N(s) than the specified limit of 2: 62468 (0.13%)
>>> Now running FastQC on the validated data SRR11862696_1_val_1.fq.gz<<<
Started analysis of SRR11862696_1_val_1.fq.gz
Approx 5% complete for SRR11862696_1_val_1.fq.gz
Approx 10% complete for SRR11862696_1_val_1.fq.gz
Approx 15% complete for SRR11862696_1_val_1.fq.gz
Approx 20% complete for SRR11862696_1_val_1.fq.gz
Approx 25% complete for SRR11862696_1_val_1.fq.gz
Approx 30% complete for SRR11862696_1_val_1.fq.gz
Approx 35% complete for SRR11862696_1_val_1.fq.gz
Approx 40% complete for SRR11862696_1_val_1.fq.gz
Approx 45% complete for SRR11862696_1_val_1.fq.gz
Approx 50% complete for SRR11862696_1_val_1.fq.gz
Approx 55% complete for SRR11862696_1_val_1.fq.gz
Approx 60% complete for SRR11862696_1_val_1.fq.gz
Approx 65% complete for SRR11862696_1_val_1.fq.gz
Approx 70% complete for SRR11862696_1_val_1.fq.gz
Approx 75% complete for SRR11862696_1_val_1.fq.gz
Approx 80% complete for SRR11862696_1_val_1.fq.gz
Approx 85% complete for SRR11862696_1_val_1.fq.gz
Approx 90% complete for SRR11862696_1_val_1.fq.gz
Approx 95% complete for SRR11862696_1_val_1.fq.gz
Analysis complete for SRR11862696_1_val_1.fq.gz
>>> Now running FastQC on the validated data SRR11862696_2_val_2.fq.gz<<<
Started analysis of SRR11862696_2_val_2.fq.gz
Approx 5% complete for SRR11862696_2_val_2.fq.gz
Approx 10% complete for SRR11862696_2_val_2.fq.gz
Approx 15% complete for SRR11862696_2_val_2.fq.gz
Approx 20% complete for SRR11862696_2_val_2.fq.gz
Approx 25% complete for SRR11862696_2_val_2.fq.gz
Approx 30% complete for SRR11862696_2_val_2.fq.gz
Approx 35% complete for SRR11862696_2_val_2.fq.gz
Approx 40% complete for SRR11862696_2_val_2.fq.gz
Approx 45% complete for SRR11862696_2_val_2.fq.gz
Approx 50% complete for SRR11862696_2_val_2.fq.gz
Approx 55% complete for SRR11862696_2_val_2.fq.gz
Approx 60% complete for SRR11862696_2_val_2.fq.gz
Approx 65% complete for SRR11862696_2_val_2.fq.gz
Approx 70% complete for SRR11862696_2_val_2.fq.gz
Approx 75% complete for SRR11862696_2_val_2.fq.gz
Approx 80% complete for SRR11862696_2_val_2.fq.gz
Approx 85% complete for SRR11862696_2_val_2.fq.gz
Approx 90% complete for SRR11862696_2_val_2.fq.gz
Approx 95% complete for SRR11862696_2_val_2.fq.gz
Analysis complete for SRR11862696_2_val_2.fq.gz
Deleting both intermediate output files SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz
====================================================================================================
```
</details>
Produces quality control reports after trimming adapters from the raw data using FastQC as `SRR11862696_1_val_1_fastqc.html` and `SRR11862696_2_val_2_fastqc.html`
You can view quality control report using the following command:
```bash
$ firefox SRR11862696_1_val_1_fastqc.html
```
```{image} ../img/trim_galore_report.png
```
</details>
#### ALIGNMENT
Once high-quality data are obtained from pre-processing, the next step is the read mapping or alignment.When studying an organism with a reference genome, it is possible to infer which transcripts are expressed by mapping the reads to the reference genome **(genome mapping)** or transcriptome **(transcriptome mapping)**. Mapping reads to the genome requires no knowledge of the set of transcribed regions or the way in which exons are spliced together. This approach allows the discovery of new, unannotated transcripts.
**INDEX**
Before doing the mapping, we have to prepare an index from the reference DNA sequence that a chosen algorithm will use.
Like the index at the end of a book, an index of a large DNA sequence allows one to rapidly find shorter sequences embedded in it. Different tools use different approaches at genome/transcriptome indexing.
**Splice-aware aligners to a reference genome**
These aligners are able to map to the splicing junctions described in the annotation and even to detect novel ones.
Some of them can detect gene fusions and SNPs and also RNA editing. For some of these tools, the downstream analysis requires the assignation of the aligned reads to a given gene/transcript.
[HISAT2](https://daehwankimlab.github.io/hisat2/) is the next generation of spliced aligner from the same group that have developed TopHat. It is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). The indexing scheme is called a Hierarchical Graph FM index (HGFM).
Wrapper
The **hisat2**, **hisat2-build** and **hisat2-inspect** executables are actually wrapper scripts that call binary programs as appropriate. The wrappers shield users from having to distinguish between “small” and “large” index formats, discussed briefly in the following section. Also, the hisat2 wrapper provides some key functionality, like the ability to handle compressed inputs, and the functionality for --un, --al and related options.
It is recommended that you always run the hisat2 wrappers and not run the binaries directly.
For more understanding about working on Hisat2 refer [http://daehwankimlab.github.io/hisat2/manual/](http://daehwankimlab.github.io/hisat2/manual/)
#### GENE EXPRESSION QUANTIFICATION
The simplest approach to quantifying gene expression by RNA-seq is to count the number of reads that map (i.e. align) to each gene (read count). This gene-level quantification approach utilises a gene transfer format (GTF) file containing gene models, with each model representing the structure of transcripts produced by a given gene.
Raw read counts are affected by factors such as transcript length (longer transcripts have higher read counts, at the same expression level) and total number of reads. Thus, if we want to compare expression levels between samples, we need to normalise the raw read counts. The measure RPKM (reads per kilobase of exon model per million reads) and its derivative FPKM (fragments per kilobase of exon model per million reads mapped) account for both gene length and library size effects
Correcting for gene length is not necessary when comparing changes in gene expression within the same gene across samples. However, it is necessary for correctly ranking gene expression levels within the sample to account for the fact that longer genes accumulate more reads (at the same expression level).
[HTseq-count](https://htseq.readthedocs.io/en/release_0.11.1/) Analyses high-throughput sequencing data with Python
**HTSeq** is a Python package that provides infrastructure to process data from high-throughput sequencing assays.
#### DIFFERENTIAL EXPRESSION ANALYSIS
Differential expression analysis means taking the normalised read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. For example, we use statistical testing to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation.
Differential expression analysis of RNA-seq expression profiles with biological replication. Implements a range of statistical methodology based on the negative binomial distributions, including empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests. As well as RNA-seq, it be applied to differential signal analysis of other types of genomic data that produce read counts, including ChIP-seq, ATAC-seq, Bisulfite-seq, SAGE and CAGE.