diff --git a/docs/source/getting_started/bioinformatics.md b/docs/source/getting_started/bioinformatics.md deleted file mode 100644 index 54bbf7ab0eb49ef0933e2752f8ede386bef67dd3..0000000000000000000000000000000000000000 --- a/docs/source/getting_started/bioinformatics.md +++ /dev/null @@ -1,586 +0,0 @@ -# Bioinformatics Pipelines -Bioinformatics is an interdisciplinary field focused on developing software and hardware tools and methods to support biological data storage, organization, and analysis, particularly related to genetic sequencing. - -RNA-sequencing Data analysis is a Bioinformatics pipeline that deals on the analysis of transcriptome, indicating which of the genes encoded in our DNA are turned on or off and to what extent. - -## Bioinformatics Analysis of RNA-seq Data -RNA-seq is a powerful platform for comprehensive investigation of the transcriptome. RNA sequencing (RNA-Seq) uses the capabilities of high-throughput sequencing methods to provide insight into the transcriptome of a cell. - -### The General bioinformatics workflow for the quantitative analysis of RNA-seq data: -- RNA-seq Quality Check; -- Quality timming of Adapters; -- mapping sequencing reads to a reference genome or transcriptome; -- quantifying expression levels of individual genes and transcripts; and -- identifying specific genes and transcripts that are differentially expressed between samples. - -```{image} ../img/rna-seq_workflow.png -``` - -#### RAW DATA - FASTQ Files -Raw RNA-seq data are typically formatted as **FASTQ** files. **FASTQ** is a text-based format storing the sequences of the reads as well as their sequencing quality. 
The file is organized in groups of four lines per read as shown below: - -> @NB500929:247:HL2TYBGX3:1:11101:25163:1060 -> GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT -> + -> !’’*((((\*\*\*+))%%%++)(%%%%).1\*\*\*-+\*’’))**55CCF>>>>>>CCCCCCC65 - -The first line starts with “@” and is followed by a unique sequence identifier, which includes instrument ID (NB500929), run number (247), and flow cell ID (HL2TYBGX3), followed by the numbers specifying the location of the DNA fragment on the flowcell. In the case of paired-end sequencing, two FASTQ files for read 1 and read 2 include the same sequence identifiers plus the read number (1 or 2), which indicates whether the sequence comes from read 1 or read 2 of the DNA fragment. The second line contains the read sequence. The third line starts with a “+” character and can optionally be followed by the same sequence identifier and any additional description. The fourth line encodes the sequencing quality scores for each base, which are coded as individual symbols according to a [coding scheme](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm) - -#### QUALITY CHECK ON RAW DATA -RNA-Seq has become one of the most widely used applications based on next-generation sequencing technology. However, raw RNA-Seq data may have quality issues, which can significantly distort analytical results and lead to erroneous conclusions. Therefore, the raw data must be subjected to vigorous quality control (QC) procedures before downstream analysis. Currently, an accurate and complete QC of RNA-Seq data requires of a suite of different QC tools used consecutively, which is inefficient in terms of usability, running time, file usage, and interpretability of the results. - -[FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) provides a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. 
It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. - -The main functions of FastQC are: - - Import of data from FASTQ files (also accepts BAM and SAM alignment files) - Quick overview of any likely sequencing problems - Summary graphs and tables to quickly assess your data - Export of results as an HTML-based report - -FastQC has a really well documented manual page with detailed explanations about every plot in the report. - -<details> -<summary>Working on FastQC</summary> - -```bash - $ fastqc SRR20076358_1.fastq.gz -``` -Running the above command produces the following output in console and Results in making reports as two files `SRR20076358_1_fastqc.zip` and `SRR20076358_1_fastqc.html` -```bash -Started analysis of SRR20076358_1.fastq.gz - Approx 5% complete for SRR20076358_1.fastq.gz - Approx 10% complete for SRR20076358_1.fastq.gz - Approx 15% complete for SRR20076358_1.fastq.gz - Approx 20% complete for SRR20076358_1.fastq.gz - Approx 25% complete for SRR20076358_1.fastq.gz - Approx 30% complete for SRR20076358_1.fastq.gz - Approx 35% complete for SRR20076358_1.fastq.gz - Approx 40% complete for SRR20076358_1.fastq.gz - Approx 45% complete for SRR20076358_1.fastq.gz - Approx 50% complete for SRR20076358_1.fastq.gz - Approx 55% complete for SRR20076358_1.fastq.gz - Approx 60% complete for SRR20076358_1.fastq.gz - Approx 65% complete for SRR20076358_1.fastq.gz - Approx 70% complete for SRR20076358_1.fastq.gz - Approx 75% complete for SRR20076358_1.fastq.gz - Approx 80% complete for SRR20076358_1.fastq.gz - Approx 85% complete for SRR20076358_1.fastq.gz - Approx 90% complete for SRR20076358_1.fastq.gz - Approx 95% complete for SRR20076358_1.fastq.gz - Analysis complete for SRR20076358_1.fastq.gz -``` -Run the following command to view the result of the Quality Check as shown in _fig.2.1_ - -```bash -$ firefox 
SRR20076358_1.fastq.gz -``` -```{image} ../img/fastqc_report.png -``` -<p align="center"> -fig.2.1 -</p> -</details> - -#### QUALITY TRIMMING OF ADAPTERS -Trimming for adaptors and low quality bases is important part of the analysis pipeline for sequencing data. Typically, after you isolate and fragment your RNA sample, adaptors are attached to the ends of the sequences that are needed for sequencing .These adaptors need to be removed from the sequenced reads before downstream processing. An additional step that needs to be taken is removing low quality bases. - -Quality trimming decreases the overall number of reads, but increases to the total and proportion of uniquely mapped reads. Thus, you get more useful data for downstream analyses. - -There are many tools for trimming reads and removing adapters, such as Trim Galore!, Trimmomatic, Cutadapt, skewer, AlienTrimmer, BBDuk, and the most recent SOAPnuke and fastp. - -Trim Galore! is a wrapper script to automate quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions for RRBS sequence files (for directional, non-directional (or paired-end) sequencing). -<details> -<summary>Working on trim_galore</summary> - -```bash -$ trim_galore --gzip --fastqc --max_n 2 --paired --length 50 SRR11862696_1.fastq.gz SRR11862696_2.fastq.gz -``` -<details> -<summary>The above command will produce the following result in console -</summary> - -```bash -# trim_galore --gzip --fastqc --max_n 2 --paired --length 50 SRR11862696_1.fastq.gz SRR11862696_2.fastq.gz - -Multicore support not enabled. Proceeding with single-core trimming. -Path to Cutadapt set as: 'cutadapt' (default) -Cutadapt seems to be working fine (tested command 'cutadapt --version') -Cutadapt version: 4.1 -single-core operation. -No quality encoding type selected. 
Assuming that the data provided uses Sanger encoded Phred scores (default) - - - -AUTO-DETECTING ADAPTER TYPE -=========================== -Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> SRR11862696_1.fastq.gz <<) - -Found perfect matches for the following adapter sequences: -Adapter type Count Sequence Sequences analysed Percentage -Illumina 4229 AGATCGGAAGAGC 1000000 0.42 -Nextera 7 CTGTCTCTTATA 1000000 0.00 -smallRNA 3 TGGAATTCTCGG 1000000 0.00 -Using Illumina adapter for trimming (count: 4229). Second best hit was Nextera (count: 7) - -Writing report to 'SRR11862696_1.fastq.gz_trimming_report.txt' - -SUMMARISING RUN PARAMETERS -========================== -Input filename: SRR11862696_1.fastq.gz -Trimming mode: paired-end -Trim Galore version: 0.6.6 -Cutadapt version: 4.1 -Number of cores used for trimming: 1 -Quality Phred score cutoff: 20 -Quality encoding type selected: ASCII+33 -Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected) -Maximum trimming error rate: 0.1 (default) -Maximum number of tolerated Ns: 2 -Minimum required adapter overlap (stringency): 1 bp -Minimum required sequence length for both reads before a sequence pair gets removed: 50 bp -Running FastQC on the data once trimming has completed -Output file(s) will be GZIP compressed - -Cutadapt seems to be fairly up-to-date (version 4.1). Setting -j 1 -Writing final adapter and quality trimmed output to SRR11862696_1_trimmed.fq.gz - - - >>> Now performing quality (cutoff '-q 20') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file SRR11862696_1.fastq.gz <<< -10000000 sequences processed -20000000 sequences processed -30000000 sequences processed -40000000 sequences processed -This is cutadapt 4.1 with Python 3.10.6 -Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC SRR11862696_1.fastq.gz -Processing single-end reads on 1 core ... 
-Finished in 1353.17 s (29 µs/read; 2.08 M reads/minute). - -=== Summary === - -Total reads processed: 46,831,782 -Reads with adapters: 15,832,779 (33.8%) -Reads written (passing filters): 46,831,782 (100.0%) - -Total basepairs processed: 4,730,009,982 bp -Quality-trimmed: 45,195,419 bp (1.0%) -Total written (filtered): 4,644,800,323 bp (98.2%) - -=== Adapter 1 === - -Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 15832779 times - -Minimum overlap: 1 -No. of allowed errors: -1-9 bp: 0; 10-13 bp: 1 - -Bases preceding removed adapters: - A: 29.1% - C: 33.1% - G: 21.6% - T: 15.3% - none/other: 1.0% - -Overview of removed sequences -length count expect max.err error counts -1 10545866 11707945.5 0 10545866 -2 3526202 2926986.4 0 3526202 -3 947984 731746.6 0 947984 -4 207831 182936.6 0 207831 -5 70578 45734.2 0 70578 -6 34326 11433.5 0 34326 -7 27723 2858.4 0 27723 -8 25762 714.6 0 25762 -9 23820 178.6 0 23189 631 -10 21148 44.7 1 19978 1170 -11 18346 11.2 1 17295 1051 -12 16859 2.8 1 16455 404 -13 14322 0.7 1 14086 236 -14 14148 0.7 1 13862 286 -15 13064 0.7 1 12807 257 -16 13012 0.7 1 12753 259 -17 12842 0.7 1 12598 244 -18 12040 0.7 1 11787 253 -19 9449 0.7 1 9243 206 -20 8880 0.7 1 8713 167 -21 8038 0.7 1 7853 185 -22 6802 0.7 1 6651 151 -23 6453 0.7 1 6261 192 -24 5955 0.7 1 5803 152 -25 5584 0.7 1 5415 169 -26 5190 0.7 1 5018 172 -27 5085 0.7 1 4924 161 -28 4953 0.7 1 4808 145 -29 4594 0.7 1 4427 167 -30 4196 0.7 1 4048 148 -31 3299 0.7 1 3176 123 -32 3192 0.7 1 3072 120 -33 2944 0.7 1 2808 136 -34 2785 0.7 1 2612 173 -35 2601 0.7 1 2458 143 -36 2053 0.7 1 1923 130 -37 2335 0.7 1 2155 180 -38 2403 0.7 1 2260 143 -39 2047 0.7 1 1855 192 -40 2027 0.7 1 1880 147 -41 1750 0.7 1 1537 213 -42 1814 0.7 1 1595 219 -43 1645 0.7 1 1545 100 -44 994 0.7 1 914 80 -45 1238 0.7 1 1163 75 -46 975 0.7 1 880 95 -47 1148 0.7 1 1021 127 -48 1106 0.7 1 991 115 -49 958 0.7 1 864 94 -50 980 0.7 1 839 141 -51 924 0.7 1 825 99 -52 828 0.7 1 721 107 -53 958 0.7 1 742 216 
-54 1029 0.7 1 741 288 -55 917 0.7 1 785 132 -56 529 0.7 1 464 65 -57 698 0.7 1 537 161 -58 740 0.7 1 519 221 -59 829 0.7 1 535 294 -60 1114 0.7 1 576 538 -61 992 0.7 1 596 396 -62 1052 0.7 1 456 596 -63 1551 0.7 1 483 1068 -64 2633 0.7 1 533 2100 -65 4064 0.7 1 619 3445 -66 2119 0.7 1 542 1577 -67 1983 0.7 1 503 1480 -68 2846 0.7 1 449 2397 -69 5405 0.7 1 485 4920 -70 10363 0.7 1 662 9701 -71 47660 0.7 1 968 46692 -72 34144 0.7 1 2204 31940 -73 15329 0.7 1 970 14359 -74 8711 0.7 1 812 7899 -75 3735 0.7 1 481 3254 -76 3407 0.7 1 281 3126 -77 3785 0.7 1 93 3692 -78 2433 0.7 1 44 2389 -79 1587 0.7 1 35 1552 -80 964 0.7 1 17 947 -81 709 0.7 1 14 695 -82 469 0.7 1 8 461 -83 325 0.7 1 5 320 -84 295 0.7 1 5 290 -85 198 0.7 1 3 195 -86 161 0.7 1 6 155 -87 184 0.7 1 9 175 -88 143 0.7 1 3 140 -89 124 0.7 1 2 122 -90 142 0.7 1 1 141 -91 131 0.7 1 1 130 -92 164 0.7 1 3 161 -93 168 0.7 1 0 168 -94 188 0.7 1 1 187 -95 188 0.7 1 1 187 -96 267 0.7 1 0 267 -97 312 0.7 1 2 310 -98 390 0.7 1 0 390 -99 404 0.7 1 0 404 -100 767 0.7 1 0 767 -101 4375 0.7 1 0 4375 - -RUN STATISTICS FOR INPUT FILE: SRR11862696_1.fastq.gz -============================================= -46831782 sequences processed in total -The length threshold of paired-end sequences gets evaluated later on (in the validation step) - -Writing report to 'SRR11862696_2.fastq.gz_trimming_report.txt' - -SUMMARISING RUN PARAMETERS -========================== -Input filename: SRR11862696_2.fastq.gz -Trimming mode: paired-end -Trim Galore version: 0.6.6 -Cutadapt version: 4.1 -Number of cores used for trimming: 1 -Quality Phred score cutoff: 20 -Quality encoding type selected: ASCII+33 -Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected) -Maximum trimming error rate: 0.1 (default) -Maximum number of tolerated Ns: 2 -Minimum required adapter overlap (stringency): 1 bp -Minimum required sequence length for both reads before a sequence pair gets removed: 50 bp -Running FastQC on the data once trimming 
has completed -Output file(s) will be GZIP compressed - -Cutadapt seems to be fairly up-to-date (version 4.1). Setting -j -j 1 -Writing final adapter and quality trimmed output to SRR11862696_2_trimmed.fq.gz - - - >>> Now performing quality (cutoff '-q 20') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file SRR11862696_2.fastq.gz <<< -10000000 sequences processed -20000000 sequences processed -30000000 sequences processed -40000000 sequences processed -This is cutadapt 4.1 with Python 3.10.6 -Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC SRR11862696_2.fastq.gz -Processing single-end reads on 1 core ... -Finished in 1380.06 s (29 µs/read; 2.04 M reads/minute). - -=== Summary === - -Total reads processed: 46,831,782 -Reads with adapters: 15,576,554 (33.3%) -Reads written (passing filters): 46,831,782 (100.0%) - -Total basepairs processed: 4,730,009,982 bp -Quality-trimmed: 103,667,050 bp (2.2%) -Total written (filtered): 4,596,284,410 bp (97.2%) - -=== Adapter 1 === - -Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 15576554 times - -Minimum overlap: 1 -No. 
of allowed errors: -1-9 bp: 0; 10-13 bp: 1 - -Bases preceding removed adapters: - A: 29.7% - C: 33.2% - G: 21.6% - T: 15.3% - none/other: 0.2% - -Overview of removed sequences -length count expect max.err error counts -1 10415056 11707945.5 0 10415056 -2 3549673 2926986.4 0 3549673 -3 930335 731746.6 0 930335 -4 203432 182936.6 0 203432 -5 70152 45734.2 0 70152 -6 34001 11433.5 0 34001 -7 28487 2858.4 0 28487 -8 25528 714.6 0 25528 -9 23898 178.6 0 23109 789 -10 22040 44.7 1 20613 1427 -11 18086 11.2 1 16804 1282 -12 17308 2.8 1 16717 591 -13 15092 0.7 1 14434 658 -14 15841 0.7 1 15497 344 -15 11921 0.7 1 11544 377 -16 12960 0.7 1 12493 467 -17 14323 0.7 1 14027 296 -18 9700 0.7 1 9441 259 -19 11057 0.7 1 10834 223 -20 7979 0.7 1 7799 180 -21 7047 0.7 1 6893 154 -22 6828 0.7 1 6651 177 -23 7131 0.7 1 6180 951 -24 6703 0.7 1 6513 190 -25 5485 0.7 1 5227 258 -26 8361 0.7 1 5068 3293 -27 5341 0.7 1 4702 639 -28 5433 0.7 1 5188 245 -29 5099 0.7 1 4100 999 -30 8235 0.7 1 8058 177 -31 574 0.7 1 431 143 -32 3205 0.7 1 3094 111 -33 1589 0.7 1 1483 106 -34 2116 0.7 1 2014 102 -35 2219 0.7 1 2095 124 -36 2189 0.7 1 1992 197 -37 2105 0.7 1 1963 142 -38 2127 0.7 1 1985 142 -39 2108 0.7 1 1912 196 -40 1853 0.7 1 1753 100 -41 2786 0.7 1 1641 1145 -42 2087 0.7 1 2001 86 -43 974 0.7 1 885 89 -44 2286 0.7 1 1231 1055 -45 1723 0.7 1 1619 104 -46 777 0.7 1 690 87 -47 920 0.7 1 864 56 -48 912 0.7 1 829 83 -49 1068 0.7 1 869 199 -50 1074 0.7 1 957 117 -51 1098 0.7 1 1033 65 -52 631 0.7 1 589 42 -53 598 0.7 1 551 47 -54 661 0.7 1 608 53 -55 728 0.7 1 649 79 -56 528 0.7 1 476 52 -57 656 0.7 1 578 78 -58 602 0.7 1 561 41 -59 593 0.7 1 544 49 -60 632 0.7 1 548 84 -61 681 0.7 1 516 165 -62 1995 0.7 1 637 1358 -63 1889 0.7 1 1135 754 -64 4483 0.7 1 1659 2824 -65 13843 0.7 1 3320 10523 -66 6624 0.7 1 2171 4453 -67 1203 0.7 1 529 674 -68 354 0.7 1 196 158 -69 163 0.7 1 72 91 -70 80 0.7 1 19 61 -71 55 0.7 1 21 34 -72 81 0.7 1 19 62 -73 48 0.7 1 14 34 -74 36 0.7 1 8 28 -75 29 0.7 1 1 28 -76 38 
0.7 1 5 33 -77 60 0.7 1 6 54 -78 31 0.7 1 3 28 -79 35 0.7 1 7 28 -80 39 0.7 1 3 36 -81 48 0.7 1 4 44 -82 29 0.7 1 5 24 -83 32 0.7 1 4 28 -84 34 0.7 1 3 31 -85 34 0.7 1 5 29 -86 32 0.7 1 6 26 -87 53 0.7 1 7 46 -88 32 0.7 1 3 29 -89 40 0.7 1 6 34 -90 35 0.7 1 5 30 -91 41 0.7 1 2 39 -92 61 0.7 1 1 60 -93 32 0.7 1 1 31 -94 32 0.7 1 2 30 -95 19 0.7 1 2 17 -96 25 0.7 1 0 25 -97 33 0.7 1 1 32 -98 68 0.7 1 1 67 -99 24 0.7 1 0 24 -100 27 0.7 1 0 27 -101 105 0.7 1 0 105 - -RUN STATISTICS FOR INPUT FILE: SRR11862696_2.fastq.gz -============================================= -46831782 sequences processed in total -The length threshold of paired-end sequences gets evaluated later on (in the validation step) - -Validate paired-end files SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz -file_1: SRR11862696_1_trimmed.fq.gz, file_2: SRR11862696_2_trimmed.fq.gz - - ->>>>> Now validing the length of the 2 paired-end infiles: SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz <<<<< -Writing validated paired-end Read 1 reads to SRR11862696_1_val_1.fq.gz -Writing validated paired-end Read 2 reads to SRR11862696_2_val_2.fq.gz - - - - - - - -Total number of sequences analysed: 46831782 - -Number of sequence pairs removed because at least one read was shorter than the length cutoff (50 bp): 1159884 (2.48%) -Number of sequence pairs removed because at least one read contained more N(s) than the specified limit of 2: 62468 (0.13%) - - - >>> Now running FastQC on the validated data SRR11862696_1_val_1.fq.gz<<< - -Started analysis of SRR11862696_1_val_1.fq.gz -Approx 5% complete for SRR11862696_1_val_1.fq.gz -Approx 10% complete for SRR11862696_1_val_1.fq.gz -Approx 15% complete for SRR11862696_1_val_1.fq.gz -Approx 20% complete for SRR11862696_1_val_1.fq.gz -Approx 25% complete for SRR11862696_1_val_1.fq.gz -Approx 30% complete for SRR11862696_1_val_1.fq.gz -Approx 35% complete for SRR11862696_1_val_1.fq.gz -Approx 40% complete for SRR11862696_1_val_1.fq.gz -Approx 45% 
complete for SRR11862696_1_val_1.fq.gz -Approx 50% complete for SRR11862696_1_val_1.fq.gz -Approx 55% complete for SRR11862696_1_val_1.fq.gz -Approx 60% complete for SRR11862696_1_val_1.fq.gz -Approx 65% complete for SRR11862696_1_val_1.fq.gz -Approx 70% complete for SRR11862696_1_val_1.fq.gz -Approx 75% complete for SRR11862696_1_val_1.fq.gz -Approx 80% complete for SRR11862696_1_val_1.fq.gz -Approx 85% complete for SRR11862696_1_val_1.fq.gz -Approx 90% complete for SRR11862696_1_val_1.fq.gz -Approx 95% complete for SRR11862696_1_val_1.fq.gz -Analysis complete for SRR11862696_1_val_1.fq.gz - - >>> Now running FastQC on the validated data SRR11862696_2_val_2.fq.gz<<< - -Started analysis of SRR11862696_2_val_2.fq.gz -Approx 5% complete for SRR11862696_2_val_2.fq.gz -Approx 10% complete for SRR11862696_2_val_2.fq.gz -Approx 15% complete for SRR11862696_2_val_2.fq.gz -Approx 20% complete for SRR11862696_2_val_2.fq.gz -Approx 25% complete for SRR11862696_2_val_2.fq.gz -Approx 30% complete for SRR11862696_2_val_2.fq.gz -Approx 35% complete for SRR11862696_2_val_2.fq.gz -Approx 40% complete for SRR11862696_2_val_2.fq.gz -Approx 45% complete for SRR11862696_2_val_2.fq.gz -Approx 50% complete for SRR11862696_2_val_2.fq.gz -Approx 55% complete for SRR11862696_2_val_2.fq.gz -Approx 60% complete for SRR11862696_2_val_2.fq.gz -Approx 65% complete for SRR11862696_2_val_2.fq.gz -Approx 70% complete for SRR11862696_2_val_2.fq.gz -Approx 75% complete for SRR11862696_2_val_2.fq.gz -Approx 80% complete for SRR11862696_2_val_2.fq.gz -Approx 85% complete for SRR11862696_2_val_2.fq.gz -Approx 90% complete for SRR11862696_2_val_2.fq.gz -Approx 95% complete for SRR11862696_2_val_2.fq.gz -Analysis complete for SRR11862696_2_val_2.fq.gz -Deleting both intermediate output files SRR11862696_1_trimmed.fq.gz and SRR11862696_2_trimmed.fq.gz - -==================================================================================================== -``` - -</details> - -Produces quality control 
reports after trimming adapters from the raw data using FastQC as `SRR11862696_1_val_1_fastqc.html` and `SRR11862696_2_val_2_fastqc.html` - -You can view quality control report using the following command: -```bash -$ firefox SRR11862696_1_val_1_fastqc.html -``` - -```{image} ../img/trim_galore_report.png - -``` - -</details> - -#### ALIGNMENT -Once high-quality data are obtained from pre-processing, the next step is the read mapping or alignment.When studying an organism with a reference genome, it is possible to infer which transcripts are expressed by mapping the reads to the reference genome **(genome mapping)** or transcriptome **(transcriptome mapping)**. Mapping reads to the genome requires no knowledge of the set of transcribed regions or the way in which exons are spliced together. This approach allows the discovery of new, unannotated transcripts. - -**INDEX** - -Before doing the mapping, we have to prepare an index from the reference DNA sequence that a chosen algorithm will use. -Like the index at the end of a book, an index of a large DNA sequence allows one to rapidly find shorter sequences embedded in it. Different tools use different approaches at genome/transcriptome indexing. - -**Splice-aware aligners to a reference genome** - -These aligners are able to map to the splicing junctions described in the annotation and even to detect novel ones. -Some of them can detect gene fusions and SNPs and also RNA editing. For some of these tools, the downstream analysis requires the assignation of the aligned reads to a given gene/transcript. - -[HISAT2](https://daehwankimlab.github.io/hisat2/) is the next generation of spliced aligner from the same group that have developed TopHat. It is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). The indexing scheme is called a Hierarchical Graph FM index (HGFM). 
- -Wrapper - -The **hisat2**, **hisat2-build** and **hisat2-inspect** executables are actually wrapper scripts that call binary programs as appropriate. The wrappers shield users from having to distinguish between “small” and “large” index formats, discussed briefly in the following section. Also, the hisat2 wrapper provides some key functionality, like the ability to handle compressed inputs, and the functionality for --un, --al and related options. - -It is recommended that you always run the hisat2 wrappers and not run the binaries directly. -For more understanding about working on Hisat2 refer [http://daehwankimlab.github.io/hisat2/manual/](http://daehwankimlab.github.io/hisat2/manual/) - - -#### GENE EXPRESSION QUANTIFICATION -The simplest approach to quantifying gene expression by RNA-seq is to count the number of reads that map (i.e. align) to each gene (read count). This gene-level quantification approach utilises a gene transfer format (GTF) file containing gene models, with each model representing the structure of transcripts produced by a given gene. - -Raw read counts are affected by factors such as transcript length (longer transcripts have higher read counts, at the same expression level) and total number of reads. Thus, if we want to compare expression levels between samples, we need to normalise the raw read counts. The measure RPKM (reads per kilobase of exon model per million reads) and its derivative FPKM (fragments per kilobase of exon model per million reads mapped) account for both gene length and library size effects - -Correcting for gene length is not necessary when comparing changes in gene expression within the same gene across samples. However, it is necessary for correctly ranking gene expression levels within the sample to account for the fact that longer genes accumulate more reads (at the same expression level). 
- - -[HTseq-count](https://htseq.readthedocs.io/en/release_0.11.1/) Analyses high-throughput sequencing data with Python - -**HTSeq** is a Python package that provides infrastructure to process data from high-throughput sequencing assays. - -#### DIFFERENTIAL EXPRESSION ANALYSIS - -Differential expression analysis means taking the normalised read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. For example, we use statistical testing to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation. - -Differential expression analysis of RNA-seq expression profiles with biological replication. Implements a range of statistical methodology based on the negative binomial distributions, including empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests. As well as RNA-seq, it be applied to differential signal analysis of other types of genomic data that produce read counts, including ChIP-seq, ATAC-seq, Bisulfite-seq, SAGE and CAGE. \ No newline at end of file diff --git a/docs/source/index.md b/docs/source/index.md index bca789aa12283def893224d4e71b24a895e15649..fec1ab18429da72f3432d4634d3a50809afd96d8 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -37,7 +37,6 @@ BiTIA CLI creates a zip file of the user input(pipeline) containing a unique has :hidden: true getting_started/Installation -getting_started/bioinformatics getting_started/Working ``` @@ -103,7 +102,8 @@ TODO: Order of searching configuration file. 
<details> <summary><b>GETTING_STARTED</b></summary> -- To Learn about Bio-Informatics pipelines and RNA-seq analysis refer [Bio-Informatics pipelines](getting_started/bioinformatics.md) +- To Learn about Bio-Informatics pipelines for RNA-seq analysis refer +[Bioinformatics pipelines](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/) - To Understand working with BiTIA refer [Working with BiTIA CLI](getting_started/Working.md)