John L. Houle (a), Wanda Cadigan (b), Sylvain Henry (b), Anu Pinnamaneni (b) and Sonny Lundahl (c)
(a) Scientific Author, (b) Scientific Reviewer, (c) Senior Manager
Bio-databases.com, Amita Corporation, 1420 Blair Place, Suite 500, Ottawa, Ontario, Canada, K1J 9L8
info@bio-databases.com
Abstract
The Human Genome Initiative is an international research program for the creation of detailed genetic and physical maps of the human genome. Genome research projects generate enormous quantities of data. Database mining is the process of finding and extracting useful information from raw datasets. Computational genomics distinguishes three successive levels for the management and analysis of genetic data in scientific databases:
- Genomics.
- Gene expression.
- Proteomics.
Genome database mining is the identification of the protein-encoding regions of a genome and the assignment of functions to these genes on the basis of sequence similarity to other genes of known function. Gene expression database mining is the identification of intrinsic patterns and relationships in transcriptional expression data generated by large-scale gene expression experiments. Proteome database mining is the identification of intrinsic patterns and relationships in translational expression data generated by large-scale proteomics experiments. Improvements in genome, gene expression and proteome database mining algorithms will enable the prediction of protein function in the context of higher-order processes such as the regulation of gene expression, metabolic pathways and signalling cascades. Thus, the final objective of such higher-level functional analysis will be the elucidation of high-resolution structural and functional maps of the human genome.
Contents
- 1. Human Genome Initiative
- 2. Computational Molecular Biology and Scientific Databases
- 3. Genomics
- 3.1 Genome Databases
- 3.2 Genome Database Mining
- 3.2.1 Computational Gene Discovery
- 3.2.2 Sequence Similarity Searching
- 4. Gene Expression
- 4.1 Gene Expression Databases
- 4.2 Gene Expression Database Mining
- 5. Proteomics
- 5.1 Proteome Databases
- 5.2 Proteome Database Mining
- 6. Conclusions
- References
1. Human Genome Initiative
The Human Genome Initiative is an international research program for the creation of detailed genetic and physical maps for each of the twenty-four different human chromosomes and the elucidation of the complete deoxyribonucleic acid (DNA) sequence of the human genome. A genetic map depicts the linear arrangement of genes or genetic marker sites along a chromosome. Two types of map are distinguished: genetic linkage maps and physical maps. Genetic linkage maps are based on the frequency with which genetic markers are coinherited. Physical maps give the actual physical distances between genes on a chromosome, measured in base pairs.
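The idea behind a linkage map can be illustrated with a short calculation. The following is a minimal sketch, not a method taken from the survey cited here: it assumes hypothetical offspring counts, estimates the recombination fraction between two markers (the share of offspring in which the markers were not coinherited), and converts it to a map distance with Haldane's mapping function, which is one common choice among several. The function names are illustrative only.

```python
import math


def recombination_fraction(recombinant: int, total: int) -> float:
    """Share of offspring in which two markers were NOT coinherited."""
    if total <= 0 or not 0 <= recombinant <= total:
        raise ValueError("need total > 0 and 0 <= recombinant <= total")
    return recombinant / total


def haldane_distance_cm(r: float) -> float:
    """Map distance in centimorgans under Haldane's mapping function.

    Haldane's function assumes crossovers occur independently
    (no interference): d = -50 * ln(1 - 2r) centimorgans.
    """
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


# Hypothetical data: 12 recombinant offspring out of 200 scored.
r = recombination_fraction(recombinant=12, total=200)
distance = haldane_distance_cm(r)
print(f"recombination fraction = {r:.3f}, distance = {distance:.2f} cM")
```

Markers that are almost always coinherited give a small recombination fraction and therefore a short map distance. A physical map, by contrast, would report the separation of the same markers in base pairs.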
As described in the survey of Pearson and Soll,(1) the Human Genome Initiative has six scientific objectives:
- Construction of a high-resolution genetic map of the human genome.
- Production of a variety of physical maps of the human genome.
- Determination of the complete sequence of human DNA.
- Parallel analysis of the genomes of a selected number of well-characterized nonhuman model organisms.
- Creation of instrumentation technologies to automate genetic mapping, physical mapping and DNA sequencing for the large-scale analysis of complete genomes.
- Development of computational tools such as algorithms, software and databases for the collection, interpretation and dissemination of the vast quantities of complex mapping and sequencing data that are generated by human genome research.
Genetic maps serve as resources in the search for genes responsible for genetically mediated diseases as well as for the further study of gene structure, function and expression. Thus, the advent of a high-resolution genetic map of the human genome will generate advances in six areas of medicine:
- Genetic counseling.
- Prediction of genetic disease susceptibility.
- Diagnostic tests.
- Gene therapy.
- Rational drug design.
- Pharmacogenomic drug customization.
The Human Genome Initiative has been reviewed extensively elsewhere.(2-15)
2. Computational Molecular Biology and Scientific Databases
Genome research projects generate enormous quantities of data. GenBank is the National Institutes of Health (NIH) molecular database, an annotated collection of all publicly available DNA sequences.(16) The February 2000 release of GenBank contained 5,691,000 DNA sequences comprising approximately 5,805,000,000 deoxyribonucleotides.(17)
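To put the February 2000 figures in perspective, the mean length of a database entry follows directly from the two totals quoted above. The short calculation below is only a back-of-the-envelope sketch; the variable names are illustrative, and the mean says nothing about how entry lengths are distributed.

```python
# Totals quoted in the text for the February 2000 release.
SEQUENCES = 5_691_000
NUCLEOTIDES = 5_805_000_000

# Mean number of deoxyribonucleotides per sequence entry.
mean_length = NUCLEOTIDES / SEQUENCES
print(f"mean entry length: about {mean_length:.0f} nucleotides")
```

The result is roughly one thousand nucleotides per entry.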
A major objective of the Human Genome Initiative is the development of more advanced DNA sequencing technologies. Concerted genome sequencing using these technologies will further increase the rate at which DNA sequence data are generated. GenBank statistics show that the number of curated DNA sequences is growing exponentially.(18)
Computational molecular biology is defined as the mathematical and computational analysis of biological macromolecules.(19-21) Computational genomics refers to the applications of computational molecular biology in large-scale genome research.(22-28) On the basis of the central dogma of molecular biology, in which DNA is transcribed into RNA and RNA is translated into protein, computational genomics distinguishes three successive levels for the management and analysis of genetic data in scientific databases:
- Genomics.
- Gene expression.
- Proteomics.
These application domains are discussed in the sections that follow. The objective of human genome database analysis is the elucidation of structural and functional maps of the human genome. Database mining is defined as the process of finding and extracting useful information from raw datasets.(29-33) Large-scale genome database mining is an open research problem that could be addressed by the application of supercomputer technologies. Thus, human genome mapping has been identified as a Grand Challenge problem in medical supercomputing.(34-36)
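The three levels named above mirror the flow of information described by the central dogma: a DNA sequence (genomics) is transcribed into messenger RNA (gene expression), which is translated into protein (proteomics). The following minimal sketch walks a toy coding sequence through both steps. It assumes the standard genetic code and a made-up 12-base sequence, ignores real-world details such as introns and reading-frame detection, and uses illustrative function names.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA from its first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy gene: start codon, two codons, stop codon.
dna = "ATGGCCAAGTAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(dna, "->", mrna, "->", protein)
```

Each intermediate product in this sketch corresponds to the kind of data held at one of the three database levels: sequence, transcript and protein.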