Commit 19142044 authored by dilawar

Merge branch '5-enable-docs' into 'main'

Resolve "Enable docs"

Closes #5

See merge request !4
parents 769b9ec7 f8ee91db
Tags v0.1.2
Pipeline #3808 passed with stages
in 4 minutes and 40 seconds
@@ -33,3 +33,15 @@ build:windows:
- python -m pip install .
- python -m bitia --help
- python -m bitia run "ls -ltrh /"
pages:
tags:
- linux
stage: deploy
script:
- apt update && apt install -y graphviz make
- python3 -m pip install poetry
- make docs && cp -r docs/build/html public
artifacts:
paths:
- public
@@ -8,12 +8,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [0.2.0] -
### Added
- dev: Session support. The same connection is reused multiple times.
- Support for `create`, `logs` and `submit` endpoint.
- Support for `BITIA_SERVER` environment variable.
- Setting log level from the command line
### Fixed
- Session support. The same connection is reused multiple times during a session.
## [0.1.3] - 2022-09-29
### Added
@@ -32,9 +32,12 @@ release:
rm -rf dist/*.whl
bash ./.ci/realese.sh
doc html:
docs doc html:
poetry install
cd docs && make html
.PHONY : copr fix test install lint build \
doc docs html \
all check \
runner gitlab-runner image image_upload
runner gitlab-runner \
image image_upload
"""
BioInformatics Tool for Infrastructure Automation (BiTIA)
"""
from importlib.metadata import version as _version
import os
import logging
# Getting Started
# BiTIA Installation
BiTIA CLI is the command-line interface that collects user input: a command plus any required files or directories.
It submits jobs to the server, where bitia-runner executes the job and produces the results.
## Installation
To get started with using BiTIA, you should [install Python] and make sure [python-pip] is installed with it on your system.
**To work with BiTIA, install bitia-cli and use it to submit jobs to the server.**
## Installing bitia
Installing bitia is simple; you only need [python-pip] installed on your system.
[install Python]: https://realpython.com/installing-python/
[python-pip]: https://pypi.org/project/pip/
BiTIA CLI is available on [PyPi] as [BiTIA].
### Ensure you have a working pip
As a first step, check that you have a working Python installation with pip available.
@@ -24,7 +26,12 @@ If that worked, congratulations! You have a working pip in your environment.
To install **BiTIA** using pip:
```bash
$ python3 -m pip install bitia
```
[BiTIA]: https://pypi.org/project/bitia/
[Pypi]: https://pypi.org/
[install Python]: https://realpython.com/installing-python/
[python-pip]: https://pypi.org/project/pip/
# Working with BiTIA CLI
BiTIA provides two commands:
- run
- submit
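For instance, `run` accepts the pipeline as a quoted string; the invocations below mirror the ones exercised in this merge request's CI configuration:

```console
$ python3 -m bitia --help
$ python3 -m bitia run "ls -ltrh /"
```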
### `bitia run` can be employed in three ways:
#### Example 1 - Input as String
- to be filled
#### Example 2 - Input as Pipeline File
- to be filled
#### Example 3 - Input as a Directory Path
- to be filled
# Bioinformatics Pipelines
Bioinformatics pipelines are an integral component of next-generation sequencing (NGS). Processing raw sequence data to detect genomic alterations has significant impact on disease management and patient care. Because of the lack of published guidance, there is currently a high degree of variability in how members of the global molecular genetics and pathology community establish and validate bioinformatics pipelines.
## Bioinformatics Analysis of NGS Data
NGS bioinformatics pipelines are frequently platform specific and may be customizable on the basis of laboratory needs. A bioinformatics pipeline consists of the following major steps:
### Sequence Generation
> Sequence generation (signal processing and base calling) is the process that converts sensor (optical and nonoptical) data from the sequencing platform and identifies the sequence of nucleotides for each of the short fragments of DNA in the sample prepared for analysis. For each nucleotide sequenced in these short fragments (ie, raw reads), a corresponding Phred-like quality score is assigned, which is sequencing platform specific. The read sequences along with the Phred-like quality scores are stored in a FASTQ file, which is a de facto standard for representing biological sequence information.
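The FASTQ encoding described above can be illustrated with a minimal sketch; the record below is invented for illustration, and Phred+33 is assumed (the common encoding, though some older platforms used Phred+64):

```python
# A FASTQ record is four lines: header, sequence, separator, quality string.
record = ["@read1", "ACGT", "+", "II#5"]  # hypothetical record
header, seq, sep, qual = record

# Phred+33 encoding: quality score Q = ASCII code - 33.
phred_scores = [ord(ch) - 33 for ch in qual]

# The error probability for a Phred score Q is 10 ** (-Q / 10).
error_probs = [10 ** (-q / 10) for q in phred_scores]
```

Here the base called with quality `#` (Q=2) is far less trustworthy than the two bases scored `I` (Q=40).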
### Sequence Alignment
> Sequence alignment is the process of determining where each short DNA sequence read (each typically <250 bp) aligns with a reference genome (eg, the human reference genome used in clinical laboratories). This computationally intensive process assigns a Phred-scale mapping quality score to each of the short sequence reads, indicating the confidence of the alignment process. This step also provides a genomic context (location in the reference genome) to each aligned sequence read, which can be used to calculate the proportion of mapped reads and depth (coverage) of sequencing for one or more loci in the sequenced region of interest. The sequence alignment data are usually stored in a de facto standard binary alignment map (BAM) file format, which is a binary version of the sequence alignment/map format. The newer compressed representation [Compressed and Reference-oriented Alignment Map (CRAM)] or its encrypted version [Selective retrieval on Encrypted and Compressed Reference-oriented Alignment Map (SECRAM)]6 is a viable alternative that saves space and secures genetic information, although laboratories need to carefully validate variant calling impact if lossy (as opposed to lossless) compression settings are used in generating CRAM (European Nucleotide Archive, CRAM format specification version 3.0; http://samtools.github.io/hts-specs/CRAMv3.pdf, last accessed November 23, 2016) and SECRAM files.
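The Phred-scale mapping quality mentioned above has a simple mathematical form; this sketch (helper names are ours, not from any aligner's API) converts between MAPQ and the probability that a read is mapped to the wrong location:

```python
import math

def mapq_to_error_prob(mapq: float) -> float:
    """Probability that the alignment is wrong, from a Phred-scale MAPQ:
    p = 10 ** (-Q / 10)."""
    return 10 ** (-mapq / 10)

def error_prob_to_mapq(p: float) -> float:
    """Inverse relation: Q = -10 * log10(p)."""
    return -10 * math.log10(p)
```

A MAPQ of 30, for example, corresponds to a 1-in-1000 chance that the read is misplaced.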
### Variant Calling
> Variant calling is the process of accurately identifying the differences or variations between the sample and the reference genome sequence. The typical input is a set of aligned reads in BAM or another similar format, which is traversed by the variant caller to identify sequence variants. Variant calling is a heterogeneous collection of algorithmic strategies based on the types of sequence variants, such as single-nucleotide variants (SNVs), small insertions and deletions (indels), copy number alterations, and large structural alterations (insertions, inversions, and translocations). The accuracy of variant calling is highly dependent on the quality of called bases and aligned reads. Therefore, prevariant calling processing, such as local realignment around expected indels and base quality score recalibration, is routinely used to ensure accurate and efficient variant calling. For SNVs and indels, the called variants are represented using the de facto standard variant call format (VCF; https://samtools.github.io/hts-specs/VCFv4.3.pdf, last accessed November 23, 2016). Alternative specifications exist for representing and storing variant calls [Genomic VCF Conventions, https://sites.google.com/site/gvcftools/home/about-gvcf/gvcf-conventions, last accessed November 23, 2016; The Sequence Ontology Genome Variation Format Version 1.10, https://github.com/The-Sequence-Ontology/Specifications/blob/master/gvf.md, last accessed November 23, 2016; The Human Genome Variation Society, Human Genome Variation Society (HGVS) Simple Version 15.11. 2016, http://varnomen.hgvs.org/bg-material/simple, last accessed November 23, 2016; Health GAfGa File Formats, https://www.ga4gh.org/ga4ghtoolkit/genomicdatatoolkit, last accessed November 27, 2017].
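The VCF layout described above is tab-separated with eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); a minimal parsing sketch, using an invented record for illustration:

```python
# One VCF data line (record invented for illustration).
vcf_line = "chr7\t140453136\t.\tA\tT\t60\tPASS\tDP=152;AF=0.48"

chrom, pos, var_id, ref, alt, qual, filt, info = vcf_line.split("\t")

# The INFO column is a semicolon-separated list of KEY=VALUE pairs.
info_dict = dict(kv.split("=", 1) for kv in info.split(";"))
```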
### Variant Filtering
> Variant filtering is the process by which variants representing false-positive artifacts of the NGS method are flagged or filtered from the original VCF file on the basis of several sequence alignment and variant calling associated metadata (eg, mapping quality, base-calling quality, strand bias, and others). This is usually a postvariant calling step, although some variant callers incorporate this step as part of the variant calling process. This automated process may be used as a hard filter to allow annotation and review of only the assumed true variants.
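A hard filter of the kind described above can be sketched as a threshold check on per-variant metadata; the thresholds and records below are illustrative only, not recommendations:

```python
# Hypothetical hard filter on quality (QUAL) and read depth (DP).
MIN_QUAL, MIN_DP = 30, 10  # illustrative thresholds

variants = [
    {"id": "v1", "qual": 60, "dp": 152},
    {"id": "v2", "qual": 12, "dp": 9},   # fails both thresholds
    {"id": "v3", "qual": 45, "dp": 30},
]

# Keep only variants passing every threshold; the rest are flagged out.
passed = [v for v in variants if v["qual"] >= MIN_QUAL and v["dp"] >= MIN_DP]
```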
### Variant Annotation
> Variant annotation performs queries against multiple sequence and variant databases to characterize each called variant with a rich set of metadata, such as variant location, predicted cDNA and amino acid sequence change (HGVS nomenclature), minor allele frequencies in human populations, and prevalence in different variant databases [eg, Catalogue of Somatic Mutations in Cancer, The Cancer Genome Atlas, Single-Nucleotide Polymorphism (SNP) Database, and ClinVar]. This information is used to further prioritize or filter variants for classification and interpretation.
### Variant Prioritization
> Variant prioritization uses variant annotations to identify clinically insignificant variants (eg, synonymous, deep intronic variants, and established benign polymorphisms), thereby presenting the remaining variants (known or unknown clinical significance) for further review and interpretation. Clinical laboratories often develop variant knowledge bases to facilitate this process.
> Some clinical laboratories choose to apply hard filters on called variants on the basis of variant call metadata or from a data dictionary (variant filtering) as a component of the pipeline analysis software. Because its purpose is to hide certain variants from the view of the human interpreter, it is absolutely critical that filtering algorithms be thoroughly validated to ensure that only those variants meeting strict predefined criteria are being hidden from view. Otherwise, the human interpreter may miss clinically significant variants that may result in harming the patient.
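In code, the prioritization step above amounts to suppressing variants whose annotated consequence appears in a benign dictionary; the consequence labels and records here are hypothetical:

```python
# Hypothetical "clinically insignificant" consequence classes.
BENIGN_CONSEQUENCES = {"synonymous", "deep_intronic", "known_benign_polymorphism"}

annotated = [
    {"id": "v1", "consequence": "missense"},
    {"id": "v2", "consequence": "synonymous"},
    {"id": "v3", "consequence": "frameshift"},
]

# Present only the remaining variants for manual review and interpretation.
for_review = [v for v in annotated if v["consequence"] not in BENIGN_CONSEQUENCES]
```

As the text stresses, any such filter must be validated carefully, since hidden variants never reach the human interpreter.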
docs/source/img/bitia.png (146 KiB)
# Welcome to BiTIA's documentation!
## BioInformatics Tool for Infrastructure Automation (BiTIA)
BiTIA makes it easy to orchestrate the infrastructure required to run
complex bioinformatics pipelines. BiTIA plays well with existing pipeline
solutions such as Snakemake. [See some examples]
<p align="center">
```{image} https://img.shields.io/pypi/v/bitia.svg
:target: https://pypi.python.org/pypi/bitia
```
```{image} https://img.shields.io/pypi/pyversions/bitia.svg
:target: https://www.python.org
```
</p>
BiTIA is a tool that simplifies the infrastructure required to run complex bioinformatics pipelines and plays well with existing pipeline solutions such as Snakemake.
BiTIA v1.0 ships with two components: a client-facing **bitia-cli** and a **bitia-runner** that manages things on the server.
- **bitia-cli**
  Creates a zip file of the user input (pipeline) with a unique hash and ships it to the server.
- **bitia-runner**
  Runs the following tasks on the server:
  - Creates a Docker container with the user input
  - Runs the pipeline
  - Sends logs to the user
  - Sends a link to the results (artifacts) to the user
  - Interacts with a common cache for reading data/reference files
[see some examples]: examples
```{toctree}
:hidden:

getting_started
developer
```

```{toctree}
:caption: Getting started
:hidden: true
:name: getting_started

getting_started/Installation
getting_started/bioinformatics
getting_started/Working
```
***
To learn how to install and use BiTIA, check out the following resources:
<details >
<summary><b>REQUIREMENTS</b></summary>
> - Python 3.8+
</details>
<details>
<summary><b>INSTALLATION</b></summary>
<div class="termy">
```console
$ python3 -m pip install bitia
```
</div>
</details>
<details>
<summary><b>GETTING_STARTED</b></summary>
If you want to learn how to use BiTIA, check out the following resources:
### Bioinformatics Pipelines
Bioinformatics pipelines are an integral component of next-generation sequencing (NGS). Processing raw sequence data to detect genomic alterations has significant impact on disease management and patient care. Because of the lack of published guidance, there is currently a high degree of variability in how members of the global molecular genetics and pathology community establish and validate bioinformatics pipelines.
- [Getting Started](getting_started)
- [Features](features)
#### Bioinformatics Analysis of NGS Data
NGS bioinformatics pipelines are frequently platform specific and may be customizable on the basis of laboratory needs. A bioinformatics pipeline consists of the following major steps:
<details>
<summary><b>Sequence Generation</b></summary>
> Sequence generation (signal processing and base calling) is the process that converts sensor (optical and nonoptical) data from the sequencing platform and identifies the sequence of nucleotides for each of the short fragments of DNA in the sample prepared for analysis. For each nucleotide sequenced in these short fragments (ie, raw reads), a corresponding Phred-like quality score is assigned, which is sequencing platform specific. The read sequences along with the Phred-like quality scores are stored in a FASTQ file, which is a de facto standard for representing biological sequence information.
</details>
<details>
<summary><b>Sequence Alignment</b></summary>
> Sequence alignment is the process of determining where each short DNA sequence read (each typically <250 bp) aligns with a reference genome (eg, the human reference genome used in clinical laboratories). This computationally intensive process assigns a Phred-scale mapping quality score to each of the short sequence reads, indicating the confidence of the alignment process. This step also provides a genomic context (location in the reference genome) to each aligned sequence read, which can be used to calculate the proportion of mapped reads and depth (coverage) of sequencing for one or more loci in the sequenced region of interest. The sequence alignment data are usually stored in a de facto standard binary alignment map (BAM) file format, which is a binary version of the sequence alignment/map format. The newer compressed representation [Compressed and Reference-oriented Alignment Map (CRAM)] or its encrypted version [Selective retrieval on Encrypted and Compressed Reference-oriented Alignment Map (SECRAM)]6 is a viable alternative that saves space and secures genetic information, although laboratories need to carefully validate variant calling impact if lossy (as opposed to lossless) compression settings are used in generating CRAM (European Nucleotide Archive, CRAM format specification version 3.0; http://samtools.github.io/hts-specs/CRAMv3.pdf, last accessed November 23, 2016) and SECRAM files.
</details>
<details>
<summary><b>Variant Calling</b></summary>
> Variant calling is the process of accurately identifying the differences or variations between the sample and the reference genome sequence. The typical input is a set of aligned reads in BAM or another similar format, which is traversed by the variant caller to identify sequence variants. Variant calling is a heterogeneous collection of algorithmic strategies based on the types of sequence variants, such as single-nucleotide variants (SNVs), small insertions and deletions (indels), copy number alterations, and large structural alterations (insertions, inversions, and translocations). The accuracy of variant calling is highly dependent on the quality of called bases and aligned reads. Therefore, prevariant calling processing, such as local realignment around expected indels and base quality score recalibration, is routinely used to ensure accurate and efficient variant calling. For SNVs and indels, the called variants are represented using the de facto standard variant call format (VCF; https://samtools.github.io/hts-specs/VCFv4.3.pdf, last accessed November 23, 2016). Alternative specifications exist for representing and storing variant calls [Genomic VCF Conventions, https://sites.google.com/site/gvcftools/home/about-gvcf/gvcf-conventions, last accessed November 23, 2016; The Sequence Ontology Genome Variation Format Version 1.10, https://github.com/The-Sequence-Ontology/Specifications/blob/master/gvf.md, last accessed November 23, 2016; The Human Genome Variation Society, Human Genome Variation Society (HGVS) Simple Version 15.11. 2016, http://varnomen.hgvs.org/bg-material/simple, last accessed November 23, 2016; Health GAfGa File Formats, https://www.ga4gh.org/ga4ghtoolkit/genomicdatatoolkit, last accessed November 27, 2017].
</details>
<details>
<summary><b>Variant Filtering</b></summary>
> Variant filtering is the process by which variants representing false-positive artifacts of the NGS method are flagged or filtered from the original VCF file on the basis of several sequence alignment and variant calling associated metadata (eg, mapping quality, base-calling quality, strand bias, and others). This is usually a postvariant calling step, although some variant callers incorporate this step as part of the variant calling process. This automated process may be used as a hard filter to allow annotation and review of only the assumed true variants.
</details>
<details>
<summary><b>Variant Annotation</b></summary>
> Variant annotation performs queries against multiple sequence and variant databases to characterize each called variant with a rich set of metadata, such as variant location, predicted cDNA and amino acid sequence change (HGVS nomenclature), minor allele frequencies in human populations, and prevalence in different variant databases [eg, Catalogue of Somatic Mutations in Cancer, The Cancer Genome Atlas, Single-Nucleotide Polymorphism (SNP) Database, and ClinVar]. This information is used to further prioritize or filter variants for classification and interpretation.
</details>
<details>
<summary><b>Variant Prioritization</b></summary>
> Variant prioritization uses variant annotations to identify clinically insignificant variants (eg, synonymous, deep intronic variants, and established benign polymorphisms), thereby presenting the remaining variants (known or unknown clinical significance) for further review and interpretation. Clinical laboratories often develop variant knowledge bases to facilitate this process.
> Some clinical laboratories choose to apply hard filters on called variants on the basis of variant call metadata or from a data dictionary (variant filtering) as a component of the pipeline analysis software. Because its purpose is to hide certain variants from the view of the human interpreter, it is absolutely critical that filtering algorithms be thoroughly validated to ensure that only those variants meeting strict predefined criteria are being hidden from view. Otherwise, the human interpreter may miss clinically significant variants that may result in harming the patient.
</details>
#### Usage Examples
</details>
<!--
```{eval-rst}
.. automodule:: bitia
``` -->
@@ -9,8 +9,8 @@ def test_sanity(capsys):
assert len(version) >= 3, version
def almost_equal(s1, s2, threshold=0.9) -> bool:
return SequenceMatcher(a=s1, b=s2).ratio() > threshold
def assert_almost_equal(s1, s2, threshold=0.9):
assert SequenceMatcher(a=s1, b=s2).ratio() > threshold
def test_run_repeat(capsys):
@@ -19,7 +19,7 @@ def test_run_repeat(capsys):
l1 = capsys.readouterr().out # reset the internal buffer.
bitia.__main__.run_user_input("ls -ltr /", rerun=False)
l2 = capsys.readouterr().out
assert almost_equal(l1, l2, 0.95)
assert_almost_equal(l1, l2, 0.85)
def test_run_simple(capsys):
@@ -31,4 +31,4 @@ def test_run_simple(capsys):
bitia.__main__.run_user_input("ls -ltr /", rerun=True)
captured = capsys.readouterr()
l2 = captured.out
assert almost_equal(l1, l2, 0.88)
assert_almost_equal(l1, l2, 0.85)