We take bioinformatics to mean the emerging field of science that grows out of applying mathematics, statistics, and information technology, including computers and the theory surrounding them, to the study and analysis of very large biological, and particularly genetic, data sets. The field has been fueled by the rapid growth in DNA sequence data, both already generated and yet to come, in particular from the Human Genome Project and other genome projects. Bioinformatics does not aim to lay down fundamental mathematical laws that govern biological systems.
Instead, mathematics is used in the field to create tools that investigators can use to analyze data. One of the most important of these uses is the statistical analysis of the similarity between two or more DNA or protein sequences.

Background Biology

Deoxyribonucleic acid (DNA) is the basic information macromolecule of life. It consists of a string of nucleotides, in which each nucleotide is made up of a standard deoxyribose sugar and phosphate group unit, connected to a nitrogenous base of one of four types: adenine, guanine, cytosine, or thymine (abbreviated as A, G, C, and T respectively).
The sequence in which the different bases occur in a particular strand of DNA represents the genetic information encoded on that strand. In the cell, DNA is organized into chromosomes, each of which is a continuous length of double-stranded DNA that can be hundreds of millions of base pairs long. A human chromosome consists mostly of “junk DNA,” whose function, if any, is not well understood. Interspersed in this junk DNA are genes, the classic unit of genetic information.
Many genes encode proteins. A protein is composed of a sequence of amino acids, each of which is denoted here by a letter (a, b, c, and so on). There are twenty amino acids that commonly appear in proteins. Proteins go on to perform a variety of functions in the cell, covering all aspects of cellular activity from metabolism to growth to division.

Basic Probability and Probabilistic Models

Some basic results from probability are needed to analyze sequences. A probabilistic model is one that produces different outcomes with different probabilities.
A probabilistic model can simulate a whole class of objects, assigning each an associated probability. In bioinformatics the objects are often sequences, and a model may describe a family of related sequences. Consider an extremely simple model of any protein or DNA sequence. Biological sequences are strings from a finite alphabet of residues, generally either four nucleotides or twenty amino acids. Assume that residue a occurs at random with probability $q_a$, independent of all other residues in the sequence.
If the protein or DNA sequence is denoted $x_1 \ldots x_n$, the probability of the whole sequence is then the product

$$P(x_1 \ldots x_n) = \prod_{i=1}^{n} q_{x_i},$$

where $q_{x_i}$ is the probability of the residue found at position $i$.

Maximum Likelihood Estimation

The parameters for a probabilistic model are typically estimated from large sets of trusted examples, often called a training set. For instance, the probability $q_a$ for amino acid a can be estimated as the observed frequency of that residue in a database of known protein sequences, such as SWISS-PROT, where the frequencies for the twenty amino acids are obtained from counting up some twenty million individual residues in the database.
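As a minimal sketch of the two ideas above (the independent-residue model and parameter estimation by counting), the following Python snippet estimates residue probabilities from a training set and then scores a sequence. The toy training sequences and the function names are illustrative assumptions and are not taken from any database or library. The product of probabilities is computed as a sum of logarithms, a common way to avoid numerical underflow on long sequences.

```python
import math
from collections import Counter

def estimate_frequencies(training_sequences):
    """Estimate each residue probability as its observed frequency:
    (count of the residue) / (total number of residues)."""
    counts = Counter()
    for seq in training_sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

def log_probability(sequence, q):
    """Log of the product of per-residue probabilities.
    Returns -inf if any residue has estimated probability 0."""
    logp = 0.0
    for residue in sequence:
        p = q.get(residue, 0.0)
        if p == 0.0:
            return float("-inf")
        logp += math.log(p)
    return logp

# Toy training set (made up for illustration only).
training = ["ACGTACGT", "AACCGGTT", "ATATGCGC"]
q = estimate_frequencies(training)

print(q)                           # estimated probability of each residue
print(log_probability("ACGT", q))  # log-probability of a query sequence
```

Exponentiating the returned value recovers the probability of the sequence under the model.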
As long as the training sequences are not systematically biased towards a peculiar residue composition, the frequencies are expected to be reasonable estimates of the underlying probabilities of our model. This way of estimating a model's parameters is called maximum likelihood estimation. When estimating parameters for a model from a limited amount of data, there is a danger of overfitting, which means that the model becomes very well adapted to the training data but will not generalize well to new data.
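The overfitting danger can be illustrated with a small sketch, assuming a deliberately tiny, made-up training set of two short peptide strings. Residues are written with the standard one-letter amino acid codes, and the function names are hypothetical. With so little data, most amino acids are never observed, so their estimated probability is zero, and any new sequence containing one of them is assigned probability zero by the model.

```python
from collections import Counter

# The twenty standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ml_frequencies(sequences):
    """Observed frequency of each amino acid in the training sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    # Counter returns 0 for residues that never appear.
    return {a: counts[a] / total for a in AMINO_ACIDS}

def probability(sequence, q):
    """Product of per-residue probabilities under the independence model."""
    p = 1.0
    for a in sequence:
        p *= q[a]
    return p

# Tiny training set (made up for illustration only).
small_training = ["MKVLA", "MKLLG"]
q = ml_frequencies(small_training)

unseen = [a for a in AMINO_ACIDS if q[a] == 0.0]
print(len(unseen), "of 20 amino acids have estimated probability 0")

print(probability("MKVLA", q))  # a training sequence: positive probability
print(probability("MKWLA", q))  # contains unseen W: probability 0.0
```

The model fits the training sequences well but rules out a new sequence that differs by a single unobserved residue, which is the failure to generalize described above.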