# Week 6 Variant Database
<p align="justify"> Statement before class: all the contents marked here belong to the course [<font color=orange>Bioinformatics: Introduction and Methods</font>](https://www.coursera.org/learn/bioinformatics-pku/home/welcome). If you like this high-quality course content, please turn to the official course for more details. Here I recorded what I learnt for convenience. If any copyright problem exists, please tell me and I will delete everything immediately. Thanks!</p>
<br>
## Overview of the Problem
<br>
### 1. Genetic Variations

- chromosomal aneuploidy;
- structural variations (SVs);
- indels;
- SNVs (Single Nucleotide Variations: about 3 million SNVs in one person's genome, equivalent to a frequency of ~1/1000);
- SNVs within coding regions (stop-codon gain -> nonsense, non-synonymous -> missense, synonymous -> same-sense/silent, splicing-affecting, stop-codon loss)
<br>
### 2. Summary
- <font color=orange>`Nonsense SNVs are usually considered deleterious`</font>
- even though this is not always the case...
- <font color=orange>`Synonymous, intronic, and intergenic variations are often ignored.`</font>
- GWAS show that 88% of trait-associated variants of weak effect are non-coding
- these remain understudied and new methods are needed
- <font color=orange>`Most studies so far have focused on missense SNVs`</font>
- <font color=orange>`Known deleterious mutations are enriched in missense mutations`</font>
- ~50% of all known mutations in Mendelian disorders are missense mutations
- <font color=orange>`Although missense SNVs change the protein sequence, many do not cause phenotypic changes`</font>
- On average, a healthy individual has
| Class | Number |
| :-----------------------: | :-------: |
| SNP | 3,019,909 |
| Indel | 361,669 |
| Deletions | 15,893 |
| Duplications | 407 |
| Mobile element insertions | 4,775 |
- While in protein-coding regions
| Class | Number |
| :--------------------------------: | :------: |
| Genes disrupted by large deletions | 147 |
| Stop-introducing SNPs | 1,057 |
| Stop losses | 77 |
| Small frameshift indels | 954 |
| Small in-frame indels | 714 |
| Non-synonymous SNPs | 68,300 |
| Synonymous SNPs | 60,157 |
<br>
## Variant Database
<br>
### 1. The development of common variant DBs

<br>
## <font color=orange>`Functional prediction of genetic variants`</font>
<br>
### <font color=orange>`1. Conservation-based method: SIFT`</font>
SIFT: Sorting Intolerant From Tolerant substitutions (http://sift.jcvi.org/). <font color=orange>`① Mutations at important positions (e.g. active sites) tend to be conserved and tend to be deleterious; ② Mutations at positions that show a high degree of diversity across species tend to be neutral.`</font>
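The conservation idea behind SIFT can be sketched in a few lines. The alignment, the `tolerance` function, and the threshold below are made up for illustration; they are not the real SIFT algorithm, which works on normalized position-specific probabilities from a curated alignment:

```python
# Toy sketch of conservation-based scoring (NOT the real SIFT algorithm):
# estimate how tolerated a substitution is from its frequency in an
# alignment column, and call it deleterious below an arbitrary threshold.
alignment = [
    "MKTAY",
    "MKTCY",
    "MKSAY",
    "MKTAY",
]

def tolerance(pos, aa):
    """Fraction of aligned sequences carrying amino acid `aa` at column `pos`."""
    column = [seq[pos] for seq in alignment]
    return column.count(aa) / len(column)

THRESHOLD = 0.05  # SIFT uses a 0.05 cutoff on its normalized probabilities

# Column 0 is fully conserved (all M), so any substitution there scores 0.0
# and would be predicted deleterious; column 3 is variable, so a C there
# is predicted tolerated.
print(tolerance(0, "L"))   # -> 0.0
print(tolerance(3, "C"))   # -> 0.25
```

The real tool also weights sequences and normalizes by the most frequent residue, but the core signal is exactly this column conservation.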



<font color=orange>`Accuracy of SIFT: false negative rate -> 31%, false positive rate -> 20%, coverage -> 60%`</font>
<br>
### <font color=orange>`2. The definition of accuracy`</font>
<font color=orange>`The figure below explains accuracy well.`</font>
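The rates can also be written out from a 2×2 confusion matrix. The counts below are invented so that the resulting rates match SIFT's quoted 31% false negative rate and 20% false positive rate:

```python
# Confusion-matrix rates (counts are invented for illustration):
# TP = deleterious variants predicted deleterious, FN = deleterious predicted tolerated,
# FP = tolerated predicted deleterious,            TN = tolerated predicted tolerated.
tp, fn, fp, tn = 69, 31, 20, 80

fnr = fn / (tp + fn)                        # false negative rate = FN / (TP + FN)
fpr = fp / (fp + tn)                        # false positive rate = FP / (FP + TN)
accuracy = (tp + tn) / (tp + fn + fp + tn)  # overall fraction predicted correctly

print(fnr, fpr, accuracy)  # -> 0.31 0.2 0.745
```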

<br>
### <font color=orange>`3. A rule-based method: PolyPhen`</font>
<br>
PolyPhen: Polymorphism Phenotyping (http://genetics.bwh.harvard.edu/pph2/), which predicts the impact of AA variants based on both <font color=orange>`multiple-sequence alignment and protein 3D structure features`</font>
- AA variants at <font color=orange>`conserved positions`</font> are more likely to cause <font color=orange>`functional changes`</font>
- AA variants that affect <font color=orange>`active sites, interaction sites, solubility, or stability of a protein`</font> are likely to affect protein structure.
- Changes in protein structure are likely to cause changes in protein function, which in turn are likely to cause changes in phenotype.
- Empirically derived rules predict whether a variant is damaging or benign
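Such empirically derived rules amount to a simple decision chain. The sketch below uses invented feature names and cutoffs; it only illustrates the shape of a rule-based predictor, not PolyPhen's actual rules:

```python
def predict(variant):
    """Toy rule-based classifier in the spirit of PolyPhen.

    The features and cutoffs here are invented for illustration;
    they are not PolyPhen's actual rules.
    """
    if variant["hits_active_site"]:
        return "probably damaging"
    if variant["conservation"] > 0.9 and variant["buried_in_core"]:
        return "possibly damaging"
    return "benign"

# A conserved, buried substitution away from any active site:
v = {"hits_active_site": False, "conservation": 0.95, "buried_in_core": True}
print(predict(v))  # -> possibly damaging
```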
<br>
### <font color=orange>`4. Classifier-based method: SAPRED`</font>
<br>
SAPRED: Single Amino Acid Polymorphisms disease-associated Predictor (www.spared.cbi.pku.edu.cn).

- PDB - get 3D protein structures
- Homology modeling - predict structure
- Biologically intuitive features
- <font color=orange>`Structure neighbor profile: vector`</font>
- <font color=orange>`Nearby functional sites`</font>
- <font color=orange>`Hydrogen bond change`</font>
<br>
<table><tr><td bgcolor=#ebffeb> <font color=orange>`Wherever there are challenges, there are opportunities.`</font> </td></tr></table>
<br>
## Supplementary
<br>
### Introduction to SVM
<br>
<font color=orange>`Supervised machine learning: training data -> features -> a machine learning method (e.g. SVM, HMM...) -> model; new data -> model -> prediction`</font>

<br>
<font color=orange>`Classifying data is a common task in machine learning.`</font>
<br>
SVM: Support Vector Machines (regression analysis + pattern recognition).
Three typical characteristics:
- A supervised learning model -> <font color=orange>`classification and regression analysis.`</font>
- Selects a small number of critical boundary instances (<font color=orange>`support vectors`</font>) -> builds <font color=orange>`linear discriminant functions`</font> -> separates the classes as widely as possible.
- <font color=orange>`Kernel trick`</font> -> performs <font color=orange>`non-linear classification`</font> by mapping inputs into <font color=orange>`high-dimensional feature spaces.`</font>
<br>
1. **Decision Boundary**
Three key points determine the most suitable line separating the two spaces; this <font color=orange>`most suitable line is called the decision boundary.`</font>

<br>
2. **Support Vector**
The <font color=orange>`three key points`</font> above, which determine the unique decision boundary separating the two classes, are <font color=orange>`called support vectors.`</font> The resulting boundary is unaffected by points away from the margin.

<br>
3. **Mathematics**
<center>
The hyperplane is
$$
w^Tx + b = 0
$$
</center>
So, the classification function is
<center>
$$
f(x) = w^Tx + b \\\\
\\\\
y = \begin{cases}
-1,\quad f(x)\leq 0 \\\\
+1,\quad f(x)>0
\end{cases}
$$
</center>
① Solve with <font color=orange>`Quadratic Programming`</font>
<center>
$$
\min \frac{1}{2}||w||^2 \quad\quad \text{s.t.}\quad y_i(w^Tx_i+b) \geq 1, \quad\quad i = 1,2,...,n.
$$
</center>
② Solve with <font color=orange>`Lagrange multipliers`</font>
<center>
$$
L(w,b,\alpha) = \frac{1}{2}||w||^2-\sum_{i=1}^{n}\alpha_i[y_i(w^Tx_i+b)-1] \\\\
\frac{\partial L}{\partial w} = 0 \quad\Rightarrow\quad w = \sum_{i=1}^{n}\alpha_iy_ix_i \\\\
\frac{\partial L}{\partial b} = 0 \quad\Rightarrow\quad \sum_{i=1}^{n}\alpha_iy_i = 0
$$
</center>
<font color=orange>`Finally, the classification function can be rewritten as`</font>
<center>
$$
f(x) = \left(\sum_{i=1}^{n} \alpha_iy_ix_i\right)^Tx+b \\\\
= \sum_{i=1}^{n}\alpha_iy_i(x_i,x)+b
$$
</center>
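The dual form above can be checked numerically once the α_i, y_i, support vectors, and b are known. The values below are toy numbers chosen for illustration, not the output of an actual training run:

```python
# Toy evaluation of the dual-form SVM decision function
# f(x) = sum_i alpha_i * y_i * <x_i, x> + b.
# svs, alphas, ys and b are made-up values, not trained parameters.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decision(x, svs, alphas, ys, b):
    """Dual-form decision function: sum over support vectors plus bias."""
    return sum(a * y * dot(sv, x) for sv, a, y in zip(svs, alphas, ys)) + b

svs = [(1.0, 1.0), (2.0, 3.0)]   # hypothetical support vectors
alphas = [0.5, 0.5]              # hypothetical Lagrange multipliers
ys = [+1, -1]                    # class labels of the support vectors
b = 1.0                          # hypothetical bias

# At the origin every inner product is 0, so only b contributes.
print(decision((0.0, 0.0), svs, alphas, ys, b))  # -> 1.0
print(decision((1.0, 1.0), svs, alphas, ys, b))  # -> -0.5
```

Note that prediction only touches the support vectors, never the full training set, which is why SVMs can be compact at test time.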
<br>
4. **Kernel function**
Non-linear cases need a more flexible <font color=orange>`hypothesis space.`</font> We can use <font color=orange>`φ`</font> to map <font color=orange>`x`</font> into a <font color=orange>`higher-dimensional space, in which all the points become linearly separable.`</font>

<br>
So the classification function can be extended as:
<center>
$$
f(x) = \sum_{i=1}^{n}\alpha_iy_i(\Phi(x_i),\Phi(x))+b
$$
</center>
This gives the <font color=orange>`kernel function:`</font>
<center>
$$
K(x,z) = (\Phi(x),\Phi(z))
$$
</center>

<table><tr><td bgcolor=#ebffeb> <font color=orange>`Commonly used kernel functions:`</font></td></tr></table>
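For the homogeneous polynomial kernel of degree 2 on 2D inputs, the explicit map Φ(x) = (x₁², √2·x₁x₂, x₂²) is known, so the identity K(x,z) = (Φ(x),Φ(z)) can be verified directly; the input values are arbitrary:

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel(x, z):
    """Polynomial kernel K(x, z) = (x . z)^2, computed without ever mapping to 3D."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
# The kernel trick: both expressions give the same number, but kernel()
# never constructs the higher-dimensional vectors.
print(kernel(x, z), dot(phi(x), phi(z)))  # -> 16.0 16.0...
```

This is why SVMs can work in very high-dimensional (even infinite-dimensional, for the RBF kernel) feature spaces at the cost of an inner product in the original space.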

<br>
5. **Applying an SVM model in R**
```R
> data <- read.table("var.tsv", header=T)
> head(data)
> tail(data)
## Split features and labels
> x <- subset(data, select = -class)
> y <- factor(data$class)
> library(e1071)
## Train an RBF-kernel SVM with 10-fold cross-validation
> model <- svm(x, y, cost=2, gamma=0.5, probability=T, scale=F, cross=10)
> summary(model) ### check the final "Total Accuracy" in the output
## Inspect support vectors and predicted probabilities
> model$SV
> pred <- predict(model, x, probability=T)
> prob <- attr(pred, "probabilities")[,2]
> head(prob)
> prob
## Plot a smoothed ROC curve
> library(pROC)
> roc <- roc(y, prob, smooth=T)
> plot(roc, col="blue", print.thres=T, print.auc=T, grid=c(0.2,0.2))
## 3D visualization of a linear SVM on three features
> library(rgl)
> x <- subset(x, select=c(a1,a2,a3))
> model <- svm(x, y, kernel="linear", cost=2, scale=F, cross=10)
> palette(c("green", "blue"))
> plot3d(x, col=as.integer(y), type="s", radius=0.05)
> plot3d(model$SV, col=as.integer(y[model$index]), type="s", radius=0.15, add=T)
## Draw the decision surface on a 100x100x100 grid
> library(misc3d)
> every.list <- lapply(x, function(x){seq(min(x), max(x), len=100)})
> every <- expand.grid(every.list)
> pred <- predict(model, every, decision.values=T)
> every.dv <- attr(pred, "decision.values")
> every.dv <- array(every.dv, dim=rep(100,3))
> contour3d(every.dv, level=0, x=every.list$a1, y=every.list$a2, z=every.list$a3, alpha=0.5, add=T)
> play3d(spin3d())
## Repeat with the non-linear RBF kernel
> model <- svm(x, y, cost=2, gamma=0.5, scale=F, cross=10)
> plot3d(x, col=as.integer(y), type="s", radius=0.15)
> plot3d(model$SV, col=as.integer(y[model$index]), type="s", radius=0.15, add=T)
> pred <- predict(model, every, decision.values=T)
> every.dv <- attr(pred, "decision.values")
> every.dv <- array(every.dv, dim=rep(100,3))
> contour3d(every.dv, level=0, x=every.list$a1, y=every.list$a2, z=every.list$a3, alpha=0.5, add=T)
> play3d(spin3d())
```

<br>
## Presentation
<br>
### Comparative protein structure modeling
<br>

<br>
The steps:
- Fold assignment and template selection
- Target-template alignment
- Model building
- Model evaluation
<br>
1. Fold assignment and template selection

Template selection:
- higher sequence similarity
- family of proteins
- quality of the template structure
- solvent, pH, ligands...
<br>
2. Target-template alignment
Once templates have been selected, a specialized method should be used to align the target sequence with the template structures. Alignment becomes difficult in the "twilight zone" of less than 30% sequence identity (BLOSUM62 is built from sequence blocks clustered at 62% identity; BLOSUM45 and BLOSUM80 are also used).
<br>
3. Model Building

<br>
4. Model Evaluation
Typical errors in comparative methods:
- Errors in side-chain packing
- Distortions and shifts in correctly aligned regions
- Errors in regions without a template
- Errors due to misalignments
- Incorrect template
<br>
Evaluation criteria:
- Having the correct fold or not
- The target-template sequence similarity
- The environment
- Having good stereochemistry or not
- Distributions of many spatial features
<br>
5. Application of comparative modeling




6. Final Conclusion

<br>
<center>**Related**
<br>
[Bioinformatics: Introduction and Methods (Week 1 Bioinformatics Introduction)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethosweek1md)
[Bioinformatics: Introduction and Methods (Week 2 Sequence Alignment)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethosweek2md)
[Bioinformatics: Introduction and Methods (Week 3 Seq DB and BLAST Algorithm)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek3md)
[Bioinformatics: Introduction and Methods (Week 4 Markov Model)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek4md)
[Bioinformatics: Introduction and Methods (Week 5 From Sequencing to NGS)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek5md)
[Bioinformatics: Introduction and Methods (Week 6 Variant Database)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek6md)
[Bioinformatics: Introduction and Methods (Week 7 Transcriptome Analysis and RNA-Seq)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek7md)
[Bioinformatics: Introduction and Methods (Week 8 Prediction and Analysis of Noncoding RNA)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek8md)
[Bioinformatics: Introduction and Methods (Week 9 Ontology, Gene Ontology and KEGG Pathway Database)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek9md)
[Bioinformatics: Introduction and Methods (Week 10 Bioinformatics Database and Resources)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek10md)
[Bioinformatics: Introduction and Methods (Week 11 New Gene Evolution Detected by Genomic Computation: Basic Concepts and Examples)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek11md)
[Bioinformatics: Introduction and Methods (Week 12 From Dry to Wet, an Evolutionary Story: Evolutionary Functional Analysis of DNA Methyltransferase)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek12md)</center>

Bioinformatics: Introduction and Methods (Week 6)