@@ -11,6 +11,8 @@ RNA-seq is a powerful platform for comprehensive investigation of the transcript
- quantifying expression levels of individual genes and transcripts; and
- identifying specific genes and transcripts that are differentially expressed between samples.
```{image} ../img/rna-seq_workflow.png
```
#### RAW DATA - FASTQ Files
Raw RNA-seq data are typically formatted as **FASTQ** files. **FASTQ** is a text-based format storing the sequences of the reads as well as their sequencing quality. The file is organized in groups of four lines per read as shown below:
...
...
@@ -560,4 +562,23 @@ Wrapper
The **hisat2**, **hisat2-build** and **hisat2-inspect** executables are actually wrapper scripts that call binary programs as appropriate. The wrappers shield users from having to distinguish between “small” and “large” index formats, discussed briefly in the following section. The hisat2 wrapper also provides some key functionality, such as the ability to handle compressed inputs and support for `--un`, `--al` and related options.
It is recommended that you always run the hisat2 wrappers and not run the binaries directly.
For more details on working with HISAT2, refer to the manual: [http://daehwankimlab.github.io/hisat2/manual/](http://daehwankimlab.github.io/hisat2/manual/)
#### GENE EXPRESSION QUANTIFICATION
The simplest approach to quantifying gene expression by RNA-seq is to count the number of reads that map (i.e. align) to each gene (read count). This gene-level quantification approach utilises a gene transfer format (GTF) file containing gene models, with each model representing the structure of transcripts produced by a given gene.
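The counting idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool implements it: the gene intervals, the read coordinates and the `count_reads` helper are all hypothetical, and a real pipeline would read exon records from the GTF file and alignments from a SAM/BAM file.

```python
from collections import Counter

# Hypothetical gene models: gene_id -> (chromosome, start, end), 1-based inclusive.
# In practice these intervals would come from the exon records of a GTF file.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}


def count_reads(alignments, genes):
    """Count reads whose aligned interval overlaps exactly one gene."""
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, start, end in alignments:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if chrom == g_chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:  # skip reads with no gene or an ambiguous assignment
            counts[hits[0]] += 1
    return dict(counts)


# Toy alignments: (chromosome, start, end) of each mapped read.
READS = [
    ("chr1", 120, 170),
    ("chr1", 480, 530),
    ("chr1", 900, 950),
    ("chr1", 600, 650),  # falls between the two genes, so it is not counted
]

read_counts = count_reads(READS, GENES)
print(read_counts)
```

Reads that overlap no gene, or more than one, are left out of the per-gene totals here; how such reads are handled is a design choice that differs between counting tools and modes.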
Raw read counts are affected by factors such as transcript length (longer transcripts have higher read counts, at the same expression level) and total number of reads. Thus, if we want to compare expression levels between samples, we need to normalise the raw read counts. The measure RPKM (reads per kilobase of exon model per million mapped reads) and its derivative FPKM (fragments per kilobase of exon model per million mapped reads) account for both gene length and library size effects.
Correcting for gene length is not necessary when comparing changes in gene expression within the same gene across samples. However, it is necessary for correctly ranking gene expression levels within the sample to account for the fact that longer genes accumulate more reads (at the same expression level).
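As a worked example of the normalisation described above, the sketch below computes RPKM for two genes in one sample. The gene names, counts, lengths and library size are made-up numbers chosen only to show the effect of gene length; the `rpkm` helper is illustrative and not part of any library.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical sample: gene -> (raw read count, summed exon length in bp)
sample = {
    "short_gene": (500, 1_000),
    "long_gene": (1_500, 5_000),
}
total_reads = 20_000_000  # assumed number of mapped reads in the library

rpkm_values = {
    gene: rpkm(count, length, total_reads)
    for gene, (count, length) in sample.items()
}

for gene, value in rpkm_values.items():
    print(f"{gene}: {value:.1f} RPKM")
```

In this toy sample the long gene has three times as many raw reads as the short gene, yet its RPKM is lower (15 versus 25), which is why length correction matters when ranking genes within a sample.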
**HTSeq** is a Python package that provides infrastructure to process data from high-throughput sequencing assays.
Its [htseq-count](https://htseq.readthedocs.io/en/release_0.11.1/) script takes the aligned reads and a GTF file and counts how many reads map to each gene.
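Once counts have been produced, they usually need to be loaded for downstream steps. The sketch below assumes the count table is a two-column, tab-separated text file (feature ID, count) followed by summary rows whose names start with a double underscore, such as `__no_feature`. The `parse_htseq_counts` helper and the example rows are hypothetical and only illustrate the parsing.

```python
import io


def parse_htseq_counts(handle):
    """Split a two-column count table into gene counts and summary counters."""
    gene_counts, summary = {}, {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        name, value = line.split("\t")
        # Summary rows such as __no_feature start with a double underscore.
        target = summary if name.startswith("__") else gene_counts
        target[name] = int(value)
    return gene_counts, summary


# Made-up table contents standing in for a real output file.
example = io.StringIO(
    "geneA\t120\n"
    "geneB\t0\n"
    "__no_feature\t35\n"
    "__ambiguous\t4\n"
)

gene_counts, summary = parse_htseq_counts(example)
print(gene_counts)
print(summary)
```

Keeping the summary counters separate from the gene rows avoids accidentally treating them as genes in later normalisation or testing steps.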
#### DIFFERENTIAL EXPRESSION ANALYSIS
Differential expression analysis means taking the normalised read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. For example, we use statistical testing to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation.
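To make the idea of "greater than expected from random variation" concrete, here is a small sketch of an exact permutation test for a single gene. It is a generic illustration only: dedicated differential expression tools use count-based statistical models rather than this test, and the expression values and the `permutation_p_value` helper are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    # Enumerate every way of relabelling the samples into two groups.
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical normalised expression values for one gene, four replicates each.
control = [10.1, 11.4, 9.8, 10.6]
treated = [20.3, 19.1, 22.0, 21.2]

p_value = permutation_p_value(control, treated)
print(f"p = {p_value:.4f}")
```

The p-value is the fraction of relabellings that produce a difference at least as large as the observed one. With only a few replicates the number of relabellings is small, which limits how small the p-value can get; this is one reason dedicated tools borrow information across genes instead.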
**edgeR** performs differential expression analysis of RNA-seq expression profiles with biological replication. It implements a range of statistical methods based on the negative binomial distribution, including empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests. As well as RNA-seq, it can be applied to differential signal analysis of other types of genomic data that produce read counts, including ChIP-seq, ATAC-seq, Bisulfite-seq, SAGE and CAGE.