# Week 9 Ontology, Gene Ontology and KEGG Pathway Database
<br>
<p align="justify"> Statement before class: All the content here belongs to the course [<font color=orange>Bioinformatics: Introduction and Methods</font>](https://www.coursera.org/learn/bioinformatics-pku/home/welcome). If you like this high-quality course content, please turn to the official course for more details. I recorded what I learned here for convenience. If there is any copyright problem, please tell me and I will delete everything immediately. Thanks!</p>
<br>
## Ontology and Gene Ontology
<br>
<table><tr><td bgcolor=#ebffeb><font color=orange>`A computer is only as smart as we make it to be!`</font> </td></tr></table>

<br>
1. The Ontology Concept


2. <font color=orange>`Three categories of Gene Ontology`</font>
- **Molecular Function (MF)**: element activity or task; the tasks performed by individual gene products. Examples are carbohydrate binding and ATPase activity.
- **Biological Process (BP)**: biological goal or objective; a broad biological goal, such as mitosis or purine metabolism, that is accomplished by ordered assemblies of molecular functions.
- **Cellular Component (CC)**: location or complex; subcellular structures, locations, and macromolecular complexes. Examples include the nucleus, telomere, and RNA polymerase II holoenzyme.
<br>
<font color=orange>`Each of the three categories of the Gene Ontology can be represented as a Directed Acyclic Graph. `</font>

3. <u>OBO File Format</u>: [Term]: id, name, namespace, def, synonym

<u>XML File Format</u>: go:term: go:accession, go:name, go:synonym, go:definition, go:is_a, go:dbxref
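
To make the OBO layout concrete, here is a minimal sketch of reading `[Term]` stanzas into dictionaries. The function name `parse_obo_terms` is hypothetical, and the sample stanza is a hand-written, abbreviated illustration rather than a verbatim copy from the GO release file.

```python
def parse_obo_terms(text):
    """Collect [Term] stanzas into dicts; each tag maps to a list of values."""
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            current.setdefault(tag, []).append(value)
    return terms


# Illustrative, abbreviated stanza (assumption: simplified for this example).
SAMPLE = """\
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria into daughter cells." [GOC:mcc]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308
"""

terms = parse_obo_terms(SAMPLE)
print(terms[0]["id"], terms[0]["name"], terms[0]["namespace"])
```

Storing every tag as a list keeps repeated tags such as `synonym` or `is_a` without special cases.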

<br>
4. GO relationship
Type1: **<u>is a</u>**
“B is a A” means that B is a subtype of A. A is sometimes called the mother or parent node, and B is sometimes called the child node.

Type2: <u>**part of**</u>
In the Directed Acyclic Graph the “part of” relationship is shown with a letter “P” on the edge, with the arrow pointing to the mother node.

Type3: **regulates**
<p align="justify"> For example, “B regulates A”. There are two sub-relationships: “positively regulates” and “negatively regulates”. In the graph, an “R” on a black background on an edge specifies that “regulation of pigmentation during development” “regulates” “pigmentation during development”. An “R” on a green background specifies that “positive regulation of pigmentation during development” “positively regulates” “pigmentation during development”. An “R” on a red background specifies that “negative regulation of pigmentation during development” “negatively regulates” “pigmentation during development”. </p>
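
The relationship types can be treated as labeled edges in a directed acyclic graph. The sketch below is a toy illustration: the `EDGES` dictionary is hand-written from the pigmentation example and is an assumption, not an extract of the real ontology, and `ancestors` is a hypothetical helper name.

```python
# Toy graph (assumption): child term -> list of (relationship, parent term).
EDGES = {
    "pigmentation during development": [("is_a", "pigmentation")],
    "regulation of pigmentation during development": [
        ("regulates", "pigmentation during development"),
    ],
    "positive regulation of pigmentation during development": [
        ("is_a", "regulation of pigmentation during development"),
        ("positively_regulates", "pigmentation during development"),
    ],
    "negative regulation of pigmentation during development": [
        ("is_a", "regulation of pigmentation during development"),
        ("negatively_regulates", "pigmentation during development"),
    ],
}

ALL_RELATIONS = ("is_a", "part_of", "regulates",
                 "positively_regulates", "negatively_regulates")


def ancestors(term, edges, relations=("is_a", "part_of")):
    """Return every term reachable from `term` via the chosen relationship types."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for relation, parent in edges.get(node, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


start = "positive regulation of pigmentation during development"
strict = ancestors(start, EDGES)                  # follows is_a / part_of only
broad = ancestors(start, EDGES, ALL_RELATIONS)    # also follows regulates edges
print(sorted(strict))
print(sorted(broad))
```

Keeping the relationship type on each edge lets the caller decide which edges count as ancestry, which matters because a term that regulates a process is not a subtype of that process.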

Relationships


<br>
## KEGG Pathway Database
<br>
1. What is a biological pathway?
A series of actions among molecules in a cell that leads to a certain product or a change in the cell.
Main types of biological pathways:
- Metabolic pathways: like a factory's production assembly line
- Gene regulation pathways: like the production management
- Signal transduction pathways: like monitoring signals in cells and transmitting the information to the supply manager and product manager.
<br>
2. Commonly Used Pathway Databases
<u>**KEGG,**</u> BioCarta, BioCyc, Protein Analysis Through Evolutionary Relationships (PANTHER), Pathway Interaction Database (PID), Reactome
<br>
3. KEGG Database
<p align="justify"> KEGG pathways are divided into seven categories: metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, and drug development. Some of these categories, such as human diseases, were relatively new and not yet comprehensive, while categories like metabolism are extremely useful.</p>
<u>**Interactions:**</u>


<br>
4. KEGG pathway entry
Type1: <u>**flat file format**</u>

Type2: <u>**KEGG Markup Language (KGML) format**</u>
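
As a rough illustration of the KGML format, the sketch below parses a small hand-written fragment with Python's standard XML parser. The fragment is an assumption written for this example (real KGML files carry many more attributes and elements), and it only shows the general `pathway` / `entry` / `graphics` nesting.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed KGML-like fragment (assumption, for illustration).
KGML_SAMPLE = """\
<pathway name="path:hsa00010" org="hsa" number="00010"
         title="Glycolysis / Gluconeogenesis">
  <entry id="1" name="hsa:3101" type="gene">
    <graphics name="HK3" type="rectangle"/>
  </entry>
  <entry id="2" name="cpd:C00031" type="compound">
    <graphics name="C00031" type="circle"/>
  </entry>
</pathway>
"""

root = ET.fromstring(KGML_SAMPLE)
pathway_title = root.get("title")

# Collect (KEGG identifier, display label) for every gene entry.
genes = []
for entry in root.findall("entry"):
    if entry.get("type") == "gene":
        graphics = entry.find("graphics")
        label = graphics.get("name") if graphics is not None else None
        genes.append((entry.get("name"), label))

print(pathway_title)
print(genes)
```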



<br>
5. Search and color pathways
Input a list of genes to highlight, specifying each gene ID together with its background and foreground colors, and the genes will be highlighted in the pathway map.

6. KO
KEGG Orthology (KO), like Gene Ontology (GO), describes gene functions in a hierarchical controlled vocabulary.

## Annotations in Gene Ontology
<br>
Three types of GO annotations
- Annotation through manually-reviewed experimental evidence
- Annotation through manually-reviewed computational analysis evidence
- Annotation by electronically-generated computational analysis evidence
<br>
1. Manually-reviewed experimental
| Abbreviation | Full Name (Inferred from) | Examples |
| :----------: | :-----------------------: | :----------------------------------------------------------: |
| IDA | Direct Assay | Enzymatic assays, in vitro reconstitution, immunofluorescence, salt fractionation |
| IPI | Physical Interaction | 2-hybrid interactions, co-purification, co-immunoprecipitation, and ion/protein binding experiments |
| IMP | Mutant Phenotype | Polymorphisms or allelic variations, any procedure that disturbs the expression or function of the gene, overexpression or ectopic expression of a wild-type or mutant gene, and so on |
| IGI | Genetic Interaction | Traditional genetic interactions, functional complementation, rescue experiments, and interference from a different gene |
| IEP | Expression Pattern | Transcriptional levels or timing, protein expression levels |
| EXP | Experiment | Experimental evidence not covered by the codes above |
<br>
2. Manually-reviewed computational analysis
| Abbreviation | Full Name (Inferred from) | Examples |
| :-------------: | :-----------------------------: | :----------------------------------------------------------: |
| ISO | Sequence orthology | Orthology, established experimentally or phylogenetically, to a gene that already has a GO annotation |
| ISA | Sequence alignment | Pairwise or multiple alignment with tools such as BLAST or ClustalW |
| ISM | Sequence model | Prediction methods for protein domains, protein features, motifs, and non-coding RNAs |
| ISS | Sequence or structural similarity | Similarity evidence not covered above |
| IGC | Genomic context | Operon structure, syntenic regions, pathway analysis, and genome-scale analysis of processes |
| IBA | Biological aspect of ancestor | Phylogenetic evidence |
| IBD | Biological aspect of descendant | Defined in GO, but for some reason, no gene is associated with it |
| RCA | Reviewed computational analysis | Different from the above in that it refers specifically to analysis of large-scale experimental datasets or integration of datasets of several types |
| IKR (not in GO) | Key residues | The gene is homologous to a particular protein family but has lost essential residues |
| IRD (not in GO) | Rapid divergence | Phylogenetic evidence of rapid divergence from the ancestral sequence |
<br>
3. Electronically-generated computational analysis
<p align="justify"> The speed at which genes and genomes are sequenced far exceeds the speed at which experimental biologists can study them in the lab, or at which any curator team can review the evidence. As a result, many genes are annotated by completely electronically generated computational analysis without manual review. These annotations are usually considered weaker evidence than manually reviewed ones, but they are invaluable in covering orders of magnitude more genes. They are labeled IEA, for Inferred from Electronic Annotation.</p>

The table below lists the remaining evidence codes, which cover author statements and curator statements:

| Abbreviation | Full Name (Inferred from) | Examples |
| :----------: | :----------------------------: | :-----------------------------------------------: |
| TAS | Traceable author statement | The statement is justified with a citation |
| NAS | Non-traceable author statement | The statement is not justified with a citation |
| IC | Inferred by curator | Inferred by a curator without specifying the source of evidence |
| ND | No biological data available | No Biological Data Available |
| NR | Not recorded | Not recorded |
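
The evidence codes in the tables above can be grouped programmatically, which is handy when filtering annotations by evidence strength. This is a minimal sketch: the group labels, the `group_of` helper, and the sample annotation tuples are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Grouping taken from the three tables above; the group labels are assumptions.
EVIDENCE_GROUPS = {
    "experimental": {"IDA", "IPI", "IMP", "IGI", "IEP", "EXP"},
    "reviewed computational": {"ISO", "ISA", "ISM", "ISS", "IGC",
                               "IBA", "IBD", "RCA", "IKR", "IRD"},
    "electronic": {"IEA"},
    "author or curator statement": {"TAS", "NAS", "IC", "ND", "NR"},
}


def group_of(code):
    """Return the evidence group for a code, or 'unknown' if it is not listed."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return "unknown"


# Made-up (gene, GO term, evidence code) records, for illustration only.
annotations = [
    ("geneA", "GO:0000001", "IDA"),
    ("geneA", "GO:0000002", "IEA"),
    ("geneB", "GO:0000001", "IEA"),
    ("geneC", "GO:0000003", "ISS"),
]

counts = Counter(group_of(code) for _, _, code in annotations)
print(counts)
```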
<br>
4. Percentage of annotations

<br>
## Pathway identification
<br>
1. KOBAS

- <p align="justify"> For well-studied model organisms, KOBAS just needs the GenBank GI (or Entrez Gene ID/Ensembl Gene ID/UniProtKB AC) to perform KEGG pathway enrichment </p>
- <p align="justify"> If the organism is not a well-studied model organism, KOBAS can perform sequence similarity mapping (BLAST with E-value < 10^-5, rank ≤ 5) for the query sequences, such as newly discovered genes or genes in a poorly annotated species.</p>
- <p align="justify"> Most genes in KEGG are linked to KOs, and the KOs are linked to pathways. By following the links from your query sequence to its BLAST hits in KEGG, then to the KOs, and then to the pathways, you can annotate your query genes with their pathways </p>
- <font color=orange>`Evaluation of pathway annotation by seq similarity`</font>
<center>
$$
\text{precision} = \frac{TP}{TP+FP}
$$
</center>
<center>
$$
\text{coverage} = \frac{TP}{N}
$$
</center>
- Over 89% coverage and 93% precision in the worst case
- "New" annotation of some genes
- <font color=orange>`The most significant pathway?`</font>
- Null hypothesis: the pathway shows up just by chance, i.e. it is not special for the experiment
- <font color=orange>`P-value: `</font> the probability that an observation at least as extreme as yours would occur just by chance, assuming that the null hypothesis is true. If the <u>**p-value is very small**</u>, say less than 1 in 100 or less than 1 in 20, then your observation is <u>**unlikely to have occurred just by chance**</u>.
- It is then likely that this particular pathway is special for your experiment.
<font color=orange>`You reject the null hypothesis; the smaller the p-value, the better.`</font>
- **<u>Hypergeometric distribution</u>**: with N genes in the background, M of them in the pathway, n genes in the query list, and m of the query genes in the pathway,
<center>
$$
\text{p-value} = \sum_{i=m}^{M}\frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}
= 1-\sum_{i=0}^{m-1}\frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}
$$
</center>
<table><tr><td bgcolor=#ebffeb><font color=orange>`Good statistics keeps us honest!`</font></td></tr></table>
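
The hypergeometric p-value above can be computed directly with exact integer binomials. This is a minimal sketch and not KOBAS's own code; the function name `enrichment_p_value` is hypothetical, and the symbols are read as N background genes, M background genes in the pathway, n query genes, and m query genes in the pathway (an interpretation inferred from the shape of the formula).

```python
from math import comb


def enrichment_p_value(N, M, n, m):
    """Probability of seeing at least m pathway genes among n genes drawn from N.

    N: genes in the background, M: background genes in the pathway,
    n: genes in the query list, m: query genes in the pathway.
    """
    total = comb(N, n)
    # Terms with i > n contribute nothing, so the sum can stop at min(M, n).
    upper = min(M, n)
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / total


# Made-up numbers, for illustration only.
p = enrichment_p_value(N=20000, M=100, n=200, m=8)
print(p)
```

Using `math.comb` keeps the numerator and denominator as exact integers until the final division, which avoids the rounding problems of multiplying many small floating-point probabilities.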



<br>
## Brief Introduction to Database
<br>
<u>**Database**</u> is a collection of data. A **<u>database management system (DBMS)</u>** is a collection of interrelated data and a set of programs to access those data, providing efficient, reliable, convenient, and safe multi-user storage of and access to massive amounts of persistent data.
Data models:
- Relational model
- Entity-relationship model
- Object-based data model
- Semistructured data model

Database Languages:
- Data-Definition Language (DDL)
- Data-Manipulation Language (DML)
<font color=orange>`SQL for DDL`</font>
```sql
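-- One row per pathway; pid is the table's primary key.
CREATE TABLE Pathways
(
    pid  INTEGER PRIMARY KEY,
    db   TEXT,  -- source database, e.g. KEGG
    id   TEXT,  -- pathway identifier within that database
    name TEXT   -- human-readable pathway name
);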
```
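
To complement the DDL statement, here is a small sketch of DML (`INSERT` and `SELECT`) against a `Pathways` table, run through Python's built-in `sqlite3` module with an in-memory database. The choice of SQLite and the two sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# DDL: define the table.
conn.execute(
    "CREATE TABLE Pathways ("
    "pid INTEGER PRIMARY KEY, db TEXT, id TEXT, name TEXT)"
)

# DML: insert rows; pid is filled in automatically by SQLite.
rows = [
    ("KEGG", "hsa00010", "Glycolysis / Gluconeogenesis"),
    ("KEGG", "hsa04110", "Cell cycle"),
]
conn.executemany("INSERT INTO Pathways (db, id, name) VALUES (?, ?, ?)", rows)
conn.commit()

# DML: query the rows back.
result = conn.execute(
    "SELECT id, name FROM Pathways WHERE db = ? ORDER BY id", ("KEGG",)
).fetchall()
print(result)
```

The `?` placeholders let the driver handle quoting, which is safer than building SQL strings by hand.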

<br>
**<center>Related**
<br>
[Bioinformatics: Introduction and Methods (Week 1 Bioinformatics Introduction)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethosweek1md)
[Bioinformatics: Introduction and Methods (Week 2 Sequence Alignment)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethosweek2md)
[Bioinformatics: Introduction and Methods (Week 3 Seq DB and BLAST Algorithm)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek3md)
[Bioinformatics: Introduction and Methods (Week 4 Markov Model)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek4md)
[Bioinformatics: Introduction and Methods (Week 5 From Sequencing to NGS)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek5md)
[Bioinformatics: Introduction and Methods (Week 6 Variant Database)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek6md)
[Bioinformatics: Introduction and Methods (Week 7 Transcriptome Analysis and RNA-Seq)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek7md)
[Bioinformatics: Introduction and Methods (Week 8 Prediction and Analysis of Noncoding RNA)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek8md)
[Bioinformatics: Introduction and Methods (Week 9 Ontology, Gene Ontology and KEGG Pathway Database)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek9md)
[Bioinformatics: Introduction and Methods (Week 10 Bioinformatics Database and Resources)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek10md)
[Bioinformatics: Introduction and Methods (Week 11 New Gene Evolution Detected by Genomic Computation: Basic Concepts and Examples)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek11md)
[Bioinformatics: Introduction and Methods (Week 12 From Dry to Wet, an Evolutionary Story. Evolution function analysis of DNA methyltransferase)](https://www.haoxi.info/archives/bioinformaticsintroductionandmethodsweek12md)</center>
