Commit 2baed570 authored by dilawar

Merge branch '5-enable-docs' into 'main'

Update Documentation

See merge request !8
parents 45b610dc 133d69dd
Pipeline #3948 passed with stages in 3 minutes and 59 seconds
sphinx
myst_parser
sphinx-tabs
\ No newline at end of file
@@ -35,7 +35,7 @@ release = "0.0.1"
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ["myst_parser", "sphinx.ext.autodoc", "sphinx_tabs.tabs"]
napoleon_google_docstring = False
@@ -53,9 +53,17 @@ exclude_patterns = []
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
# html_theme = "furo"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
sphinx_tabs_valid_builders = ['linkcheck']
sphinx_tabs_disable_tab_closing = True
sphinx_tabs_disable_css_loading = True
\ No newline at end of file
# BiTIA
BiTIA CLI is the command-line interface tool that collects user inputs such as commands, directories, and required files. It sends tasks to the server, where bitia-runner processes them and generates results.
**To work with BiTIA, we need to install bitia-cli and submit tasks to the server.**
## Installing bitia
BiTIA CLI is available on [PyPi] as [BiTIA].
Installing bitia is simple; you only need [python-pip] installed on your system.
### Ensure you have a working pip
As a first step, you should check that you have a working Python with pip
installed. This can be done by running the following commands and making
sure that the output looks similar.
``` bash
$ python --version
Python 3.N.N
$ pip --version
pip X.Y.Z from ... (python 3.N.N)
```
If that worked, congratulations! You have a working pip in your environment.
To install **BiTIA** using pip, run this command:
```{eval-rst}
.. tabs::

   .. group-tab:: Unix

      .. code-block:: bash

         $ python3 -m pip install bitia

   .. group-tab:: Windows

      .. code-block:: bash

         $ python3 -m pip install bitia
```
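If the install succeeded, pip itself can confirm it; this check relies only on standard pip behaviour:

```bash
# Print the installed package's metadata; pip exits non-zero
# if bitia is not installed in this environment.
$ python3 -m pip show bitia
```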
**Configuration**

BiTIA searches for a configuration file in the following order:

```{eval-rst}
.. tabs::

   .. group-tab:: Unix

      .. code-block:: bash

         1. ./bitia.toml
         2. ~/.bitia.toml
         3. $HOME/.config/bitia.toml
         4. /etc/bitia.toml

   .. group-tab:: Windows

      .. code-block:: bash

         1. bitia.toml
         2. %APPDATA%\bitia.toml
         3. %PROGRAMDATA%\bitia.toml
```
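As a minimal sketch (Unix shell only, assuming the first existing file wins), you can check which of these locations BiTIA would pick up:

```bash
# Walk the documented Unix search order and report the first config found.
for f in ./bitia.toml ~/.bitia.toml "$HOME/.config/bitia.toml" /etc/bitia.toml; do
    if [ -f "$f" ]; then
        echo "BiTIA would use: $f"
        break
    fi
done
```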
[BiTIA]: https://pypi.org/project/bitia/
[PyPi]: https://pypi.org/
[install Python]: https://realpython.com/installing-python/
[python-pip]: https://pypi.org/project/pip/
\ No newline at end of file
# Working with BiTIA CLI
The three commands at the core of BiTIA's functionality are:
- run
- submit
- logs
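Only `run` is documented in detail below. As a hedged sketch (an assumption from the command names above, not a documented interface), `submit` and `logs` might be invoked as:

```bash
# Hypothetical invocations; the actual arguments of submit/logs are
# not documented here and may differ in the real CLI.
$ bitia submit pipeline.sh   # submit a task to the server
$ bitia logs                 # fetch logs for a submitted task
```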
### Bitia run can be employed in the following ways:
#### 1. Running commands
```bash
$ bitia run "samtools convert foo.fa#http://example.com/foo.fa"
```
The above command will:
- Add the given command `samtools convert foo.fa#http://example.com/foo.fa` to a pipeline file.
- Archive the current working directory together with the pipeline into a zip file identified by a unique hash and send it to the server `public.bitia.link`.
- The BiTIA runner on the server will execute the pipeline by:
  - Installing the given command `samtools`
  - Downloading the input file from the link `http://example.com/foo.fa` and saving it as `foo.fa`
  - Executing the command `samtools convert foo.fa`
- Display the resulting output on the user's console.
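For intuition, the auto-generated pipeline file might look like the sketch below; this is an assumption, since the exact layout is an internal detail of bitia:

```bash
#!/usr/bin/env bash
# Hypothetical contents of the pipeline file bitia generates from the
# one-liner above; the runner resolves `foo.fa#URL` inputs before this runs.
samtools convert foo.fa
```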
#### 2. Running Pipelines
```bash
$ bitia run pipeline.sh
```
where `pipeline.sh` contains:
```bash
#!/usr/bin/env bash
# Quality-control report for every compressed FASTQ file in the directory.
fastqc *.gz
# Trim adapters from each read pair; --paired expects the _1/_2 mates together.
for file in *_1.fastq.gz
do
    trim_galore --cores 8 --gzip --fastqc --max_n 2 --output_dir Trimmed_Data/ --paired $file ${file%_1.fastq.gz}_2.fastq.gz
done
# Build a HISAT2 alignment index for the hg38 reference genome.
hisat2-build -p 8 ./Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa index
```
The above command will:
- Archive the current working directory together with `pipeline.sh` into a zip file identified by a unique hash and send it to the server `public.bitia.link`.
- The BiTIA runner on the server will execute the pipeline by:
  - Installing the given commands `fastqc`, `trim_galore`, and `hisat2`.
  - Executing the commands given in the pipeline.
- Display the resulting output on the user's console.
# Bioinformatics Pipelines
Bioinformatics pipelines are an integral component of next-generation sequencing (NGS). Processing raw sequence data to detect genomic alterations has significant impact on disease management and patient care. Because of the lack of published guidance, there is currently a high degree of variability in how members of the global molecular genetics and pathology community establish and validate bioinformatics pipelines.
## Bioinformatics Analysis of NGS Data
NGS bioinformatics pipelines are frequently platform specific and may be customizable on the basis of laboratory needs. A bioinformatics pipeline consists of the following major steps:
### Sequence Generation
> Sequence generation (signal processing and base calling) is the process that converts sensor (optical and nonoptical) data from the sequencing platform and identifies the sequence of nucleotides for each of the short fragments of DNA in the sample prepared for analysis. For each nucleotide sequenced in these short fragments (ie, raw reads), a corresponding Phred-like quality score is assigned, which is sequencing platform specific. The read sequences along with the Phred-like quality scores are stored in a FASTQ file, which is a de facto standard for representing biological sequence information
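Each FASTQ record is four lines: an `@` header with the read identifier, the base calls, a `+` separator, and one quality character per base. An illustrative record (placeholder data):

```
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```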
### Sequence Alignment
> Sequence alignment is the process of determining where each short DNA sequence read (each typically <250 bp) aligns with a reference genome (eg, the human reference genome used in clinical laboratories). This computationally intensive process assigns a Phred-scale mapping quality score to each of the short sequence reads, indicating the confidence of the alignment process. This step also provides a genomic context (location in the reference genome) to each aligned sequence read, which can be used to calculate the proportion of mapped reads and depth (coverage) of sequencing for one or more loci in the sequenced region of interest. The sequence alignment data are usually stored in a de facto standard binary alignment map (BAM) file format, which is a binary version of the sequence alignment/map format. The newer compressed representation [Compressed and Reference-oriented Alignment Map (CRAM)] or its encrypted version [Selective retrieval on Encrypted and Compressed Reference-oriented Alignment Map (SECRAM)]6 is a viable alternative that saves space and secures genetic information, although laboratories need to carefully validate variant calling impact if lossy (as opposed to lossless) compression settings are used in generating CRAM (European Nucleotide Archive, CRAM format specification version 3.0; http://samtools.github.io/hts-specs/CRAMv3.pdf, last accessed November 23, 2016) and SECRAM files.
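A hedged sketch of these storage formats with samtools (file names are placeholders; this is background material, not a BiTIA command):

```bash
# Sort alignments into the standard BAM container, then convert to the
# more compact CRAM (lossless by default; needs the reference FASTA).
samtools sort -O bam -o aln.sorted.bam aln.sam
samtools view -C -T ref.fa -o aln.cram aln.sorted.bam
```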
### Variant Calling
> Variant calling is the process of accurately identifying the differences or variations between the sample and the reference genome sequence. The typical input is a set of aligned reads in BAM or another similar format, which is traversed by the variant caller to identify sequence variants. Variant calling is a heterogeneous collection of algorithmic strategies based on the types of sequence variants, such as single-nucleotide variants (SNVs), small insertions and deletions (indels), copy number alterations, and large structural alterations (insertions, inversions, and translocations). The accuracy of variant calling is highly dependent on the quality of called bases and aligned reads. Therefore, prevariant calling processing, such as local realignment around expected indels and base quality score recalibration, is routinely used to ensure accurate and efficient variant calling. For SNVs and indels, the called variants are represented using the de facto standard variant call format (VCF; https://samtools.github.io/hts-specs/VCFv4.3.pdf, last accessed November 23, 2016). Alternative specifications exist for representing and storing variant calls [Genomic VCF Conventions, https://sites.google.com/site/gvcftools/home/about-gvcf/gvcf-conventions, last accessed November 23, 2016; The Sequence Ontology Genome Variation Format Version 1.10, https://github.com/The-Sequence-Ontology/Specifications/blob/master/gvf.md, last accessed November 23, 2016; The Human Genome Variation Society, Human Genome Variation Society (HGVS) Simple Version 15.11. 2016, http://varnomen.hgvs.org/bg-material/simple, last accessed November 23, 2016; Health GAfGa File Formats, https://www.ga4gh.org/ga4ghtoolkit/genomicdatatoolkit, last accessed November 27, 2017].
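As a minimal illustration of this step with bcftools (placeholder file names; not part of BiTIA itself):

```bash
# Compute genotype likelihoods from aligned reads, then call SNVs and
# indels (-m: multiallelic caller, -v: emit variant sites only).
bcftools mpileup -f ref.fa aln.sorted.bam | bcftools call -mv -Ov -o variants.vcf
```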
### Variant Filtering
> Variant filtering is the process by which variants representing false-positive artifacts of the NGS method are flagged or filtered from the original VCF file on the basis of several sequence alignment and variant calling associated metadata (eg, mapping quality, base-calling quality, strand bias, and others). This is usually a postvariant calling step, although some variant callers incorporate this step as part of the variant calling process. This automated process may be used as a hard filter to allow annotation and review of only the assumed true variants.
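A small example of such a filter using bcftools; the thresholds are illustrative only:

```bash
# Soft-filter variants with low call quality or low read depth by tagging
# them LOWQUAL in the FILTER column instead of deleting them outright.
bcftools filter -e 'QUAL<20 || DP<10' -s LOWQUAL -o filtered.vcf variants.vcf
```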
### Variant Annotation
> Variant annotation performs queries against multiple sequence and variant databases to characterize each called variant with a rich set of metadata, such as variant location, predicted cDNA and amino acid sequence change (HGVS nomenclature), minor allele frequencies in human populations, and prevalence in different variant databases [eg, Catalogue of Somatic Mutations in Cancer, The Cancer Genome Atlas, Single-Nucleotide Polymorphism (SNP) Database, and ClinVar]. This information is used to further prioritize or filter variants for classification and interpretation.
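For example, dbSNP identifiers can be attached with bcftools; the annotation source here is a placeholder for any indexed variant database:

```bash
# Copy the ID column (rsIDs) from an indexed dbSNP VCF onto matching records.
bcftools annotate -a dbsnp.vcf.gz -c ID -o annotated.vcf filtered.vcf
```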
### Variant Prioritization
> Variant prioritization uses variant annotations to identify clinically insignificant variants (eg, synonymous, deep intronic variants, and established benign polymorphisms), thereby presenting the remaining variants (known or unknown clinical significance) for further review and interpretation. Clinical laboratories often develop variant knowledge bases to facilitate this process.
> Some clinical laboratories choose to apply hard filters on called variants on the basis of variant call metadata or from a data dictionary (variant filtering) as a component of the pipeline analysis software. Because its purpose is to hide certain variants from the view of the human interpreter, it is absolutely critical that filtering algorithms be thoroughly validated to ensure that only those variants meeting strict predefined criteria are being hidden from view. Otherwise, the human interpreter may miss clinically significant variants that may result in harming the patient.
docs/source/img/bitia.png (new image, 146 KiB)
# BioInformatics Tool for Infrastructure Automation (BiTIA)
<p align="center">
@@ -12,32 +14,24 @@
```
</p>
Welcome to the BiTIA documentation!
BiTIA is a tool that simplifies the infrastructure required to run complex bioinformatics pipelines. BiTIA plays well with existing pipeline solutions such as snakemake.
BiTIA v1.0 comes with the client-facing bitia-cli and bitia-runner, which manages things at the server end.
BiTIA has two components: **bitia-cli** and **bitia-runner**.
BiTIA CLI is on the client side and allows clients to submit tasks to the BiTIA server.
BiTIA runner is on the server side and processes the submitted tasks to produce results.
## **bitia-cli**
Most users only need the `bitia-cli` tool to submit tasks.
BiTIA CLI creates a zip file of the user input (pipeline), tags it with a unique hash, and ships it to the server.
[see some examples]: examples
```{toctree}
:caption: Getting started
:hidden: true
:name: getting_started
getting_started/Installation
getting_started/bioinformatics
getting_started/Working
```
@@ -52,69 +46,67 @@ If you want to learn how to use BiTIA and installation, check out the following
> - Python 3.8+
</details>
<details>
<summary><b>INSTALLATION & CONFIGURATION</b></summary>
To install BiTIA using pip, run this command:

```{eval-rst}
.. tabs::

   .. group-tab:: Unix

      .. code-block:: bash

         $ python3 -m pip install bitia

   .. group-tab:: Windows

      .. code-block:: bash

         $ python3 -m pip install bitia
```

**Configuration**

BiTIA searches for a configuration file in the following order:

```{eval-rst}
.. tabs::

   .. group-tab:: Unix

      .. code-block:: bash

         1. ./bitia.toml
         2. ~/.bitia.toml
         3. $HOME/.config/bitia.toml
         4. /etc/bitia.toml

   .. group-tab:: Windows

      .. code-block:: bash

         1. bitia.toml
         2. %APPDATA%\bitia.toml
         3. %PROGRAMDATA%\bitia.toml
```

</details>
<details>
<summary><b>GETTING_STARTED</b></summary>

- To learn about bioinformatics pipelines for RNA-seq analysis, see [Bioinformatics pipelines](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/)
- To understand how to work with BiTIA, see [Working with BiTIA CLI](getting_started/Working.md)

</details>
<!--
```{eval-rst}
.. automodule:: bitia
...
@@ -22,6 +22,7 @@ myst-parser = "^0.18.1"
pylint = "^2.15.3"
mypy = "^0.981"
twine = "^4.0.1"
sphinx-tabs = "^3.4.1"
[build-system]
requires = ["poetry-core"]
...